They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO problem writers):
> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...
The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs perform much better on benchmarks created before they were trained than on ones created after. There are countless papers showing significant leakage between training and test sets for these models.
This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.
In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.
There's a lot of behind-the-scenes talk about unethical teams that collect data which doesn't technically overlap the test sets but is extremely close to them. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.
Or they use the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.
Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.
Why surprisingly?
2028 is twice as far out as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!
4 years is a lot of time. It's kind of silly to assume LLM capabilities have already plateaued.
The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.
What matters is the data structures that underlie the problem space - graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges based on a set of rules.
Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.
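The claimed failure mode is easy to probe: hand the model a full edge list and ask for a route longer than four hops, then check its answer against a reference solver. A minimal sketch of such a reference solver (a hypothetical test setup, not something from this thread; plain BFS over an undirected edge list):

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Return the shortest path from start to goal as a list of nodes,
    or None if unreachable. BFS guarantees minimality in edge count."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# A five-edge "journey" -- just past the point where, per the claim
# above, current LLMs start to fail even with the full itinerary given.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
print(shortest_path(edges, "A", "F"))
```

Comparing a model's proposed route against this kind of oracle, at increasing path lengths, would make the "four edges" claim directly testable.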
This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?
We should evaluate LLMs on text from beyond their knowledge cutoff date, by computing their per-byte perplexity or per-byte compression ratio. There's a deep theoretical connection between compression and learning.
The intuition here is that being able to predict the future of science (or any topic, really) is indicative of true understanding. Slightly more formally: When ICLR 2025 announces and publishes the accepted papers, Yoshua Bengio is less surprised/perplexed by what's new than a fresh PhD student. And Terence Tao is less surprised/perplexed by what will be proven in math in the next 10 years than a graduate student in a related field.
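To make the compression framing concrete, here is a toy sketch of a per-byte compression ratio. It uses zlib purely as a stand-in predictor; a real evaluation would use the LLM's own next-token log-probabilities (per-byte log loss) on post-cutoff text instead:

```python
import zlib

def per_byte_compression_ratio(text: str) -> float:
    """Compressed bytes per input byte: lower means the text is more
    predictable to the compressor. With an LLM, the analogue is its
    average per-byte negative log-probability on held-out text."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, level=9)
    return len(compressed) / len(raw)

# Highly predictable text compresses well (low ratio).
print(per_byte_compression_ratio("the cat sat on the mat. " * 50))
```

The proposal above amounts to computing this quantity, with the model as the compressor, on papers published after the training cutoff: a model that truly understands a field should be less "surprised" by its future results.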
This work has it right: https://ar5iv.labs.arxiv.org/html//2402.00861
> [Not even 2%]
> Abstract: We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
- "TheoremQA: A Theorem-driven [STEM] Question Answering dataset" (2023) https://github.com/TIGER-AI-Lab/TheoremQA
Put a bit more poetically: a Prolog benchmark can adequately test an LLM’s ability to create proofs in Euclidean geometry. But it will never test an LLM’s ability to reason whether a given axiomatization of geometry is actually a reasonable abstraction of physical space. And if our LLMs can do novel Euclidean proofs but are not able to meta-mathematically reason about novel axioms, then they aren’t really using intelligence. Formal logical puzzles are only a small subset of logical reasoning.
Likewise, when Euclidean proofs were a fun pastime among European upper-classes, the real work was being done by mathematicians who built new tools for projective and analytic geometry. In some sense our LLM benchmarks are focusing on the pastime and not the work. But in another sense LLMs are focusing on the tricksy and annoying sides of actually proving things, leaving humans free to think about deeper problems. So I’m not skeptical of LLMs’ utility in mathematical research, but rather the overinflated (and investor-focused) claims that this stuff is a viable path to AGI.
I don't think it's a requirement that a system claiming to be AGI be able to solve these problems; 99.99% of humans can't either.
I’d say these problems strongly encourage that sort of behavior.
I’m also someone who thinks building abilities like this into LLMs would broadly benefit the LLMs and the world, because I think this stuff generalizes. But even if not, it would be hard to argue that an LLM scoring 80% on this benchmark would not be useful to a research mathematician. Terence Tao’s dream is something like this hooked up to Lean, leaving research mathematicians as editors and advisors who occasionally work on the really hard parts while the rest is automated and provably correct. There’s no doubt in my mind that a high-scoring LLM on this benchmark would be helpful toward that vision.
Also, coming up with good problems is an art in its own right; the Soviet Union was famous for institutionalizing anti-Semitism via special math puzzles given to Jewish applicants in Moscow University entrance exams. Those questions were constructed to be very hard to solve yet to admit elementary solutions, so as to divert criticism.