They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO problem writers):
> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...
The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs perform much better on benchmarks created before they were trained than on ones created after. There are countless papers showing significant leakage between training and test sets for these models.
This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.
In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.
There's a lot of behind-the-scenes talk about unethical teams that collect data which doesn't technically overlap the test sets but is extremely close to them. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.
Or they use the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.
Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.
Why surprisingly?
2028 is twice as far out as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!
4 years is a lot of time. It's kind of silly to assume LLM capabilities have already plateaued.
The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.
What matters is the data structures that underlie the problem space - graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges based on a set of rules.
Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.
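The claimed failure mode is easy to probe: hand the model a full edge list and ask for a route longer than four hops, then check its answer against a reference solver. A minimal sketch of such a reference solver (a hypothetical test setup, not something from this thread; plain BFS over an undirected edge list):

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Return the shortest path from start to goal as a list of nodes,
    or None if unreachable. BFS guarantees minimality in edge count."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# A five-edge "journey" -- just past the point where, per the claim
# above, current LLMs start to fail even with the full itinerary given.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
print(shortest_path(edges, "A", "F"))
```

Comparing a model's proposed route against this kind of oracle, at increasing path lengths, would make the "four edges" claim directly testable.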
This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?
We should evaluate LLMs on text from beyond their knowledge cutoff date, by computing their per-byte perplexity or per-byte compression ratio. There's a deep theoretical connection between compression and learning.
The intuition here is that being able to predict the future of science (or any topic, really) is indicative of true understanding. Slightly more formally: When ICLR 2025 announces and publishes the accepted papers, Yoshua Bengio is less surprised/perplexed by what's new than a fresh PhD student. And Terence Tao is less surprised/perplexed by what will be proven in math in the next 10 years than a graduate student in a related field.
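To make the compression framing concrete, here is a toy sketch of a per-byte compression ratio. It uses zlib purely as a stand-in predictor; a real evaluation would use the LLM's own next-token log-probabilities (per-byte log loss) on post-cutoff text instead:

```python
import zlib

def per_byte_compression_ratio(text: str) -> float:
    """Compressed bytes per input byte: lower means the text is more
    predictable to the compressor. With an LLM, the analogue is its
    average per-byte negative log-probability on held-out text."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, level=9)
    return len(compressed) / len(raw)

# Highly predictable text compresses well (low ratio).
print(per_byte_compression_ratio("the cat sat on the mat. " * 50))
```

The proposal above amounts to computing this quantity, with the model as the compressor, on papers published after the training cutoff: a model that truly understands a field should be less "surprised" by its future results.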
This work has it right: https://ar5iv.labs.arxiv.org/html//2402.00861
> [Not even 2%]
> Abstract: We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
- "TheoremQA: A Theorem-driven [STEM] Question Answering dataset" (2023) https://github.com/TIGER-AI-Lab/TheoremQA
Put a bit more poetically: a Prolog benchmark can adequately test an LLM’s ability to create proofs in Euclidean geometry. But it will never test an LLM’s ability to reason whether a given axiomatization of geometry is actually a reasonable abstraction of physical space. And if our LLMs can do novel Euclidean proofs but are not able to meta-mathematically reason about novel axioms, then they aren’t really using intelligence. Formal logical puzzles are only a small subset of logical reasoning.
Likewise, when Euclidean proofs were a fun pastime among European upper-classes, the real work was being done by mathematicians who built new tools for projective and analytic geometry. In some sense our LLM benchmarks are focusing on the pastime and not the work. But in another sense LLMs are focusing on the tricksy and annoying sides of actually proving things, leaving humans free to think about deeper problems. So I’m not skeptical of LLMs’ utility in mathematical research, but rather the overinflated (and investor-focused) claims that this stuff is a viable path to AGI.
I don't think it's a requirement that a system claiming to be AGI be able to solve these problems; 99.99% of humans can't either.
I’d say these problems strongly encourage that sort of behavior.
I’m also someone who thinks building abilities like this into LLMs would broadly benefit the LLMs and the world, because I think this stuff generalizes. But even if not, it would be hard to argue that an LLM scoring 80% on this benchmark would not be useful to a research mathematician. Terence Tao’s dream is something like this hooked up to Lean, leaving research mathematicians as editors and advisors who occasionally work on the really hard parts while the rest is automated and provably correct. There’s no doubt in my mind that a high-scoring LLM on this benchmark would be helpful toward that vision.
Also, coming up with good problems is an art in its own right; the Soviet Union was famous for institutionalizing anti-Semitism via special math puzzles given to Jewish applicants in Moscow University entrance exams. Those questions were constructed to be very hard to solve yet to admit elementary solutions, so as to divert criticism.