It seems likely that PhD students in the authors' subfields are capable of solving these problems. What makes the problems interesting is that they seem to require fairly deep research-level context to make real progress.
It’s a test of whether the LLMs can really synthesize results from knowledge that requires several years of postgraduate preparation in a specific research area.
If the paper had not had the AI spin, would those 10 questions still have been interesting?
It seems to me that we have here a paper that is interesting solely because of the AI spin -- while at the same time this AI spin is poorly executed from the standpoint of AI research, where this should be a blog post at most, not an arXiv preprint.
The fact that you find it odd that this landed on arXiv is maybe a cultural thing… mathematicians kinda reflexively throw work up there that they think should be taken seriously. I doubt that they intend to publish it in a peer reviewed journal.
A surfboard is also an amazing tool, but there's more to operating one than telling it which way to go.
Many people want self-driving cars so they can drink in the back seat watching movies. They'll find their jobs replaced by AI, with a poor quality of life, because we're a selfish species. In contrast, Niki Lauda trusted fellow Formula 1 driver James Hunt to race centimeters apart. Some people want AI to help them drive that well. They'll have great jobs as AI evolves.
Garry Kasparov pioneered "freestyle" chess tournaments after his defeat by Deep Blue, where the best human players were paired with computers, coining the "centaur" model of human-machine cooperation. This is frequently cited in the finance literature, where it is recognized that AI-guided human judgement can outperform either humans or machines alone.
Any math professor knows how to help graduate students confidently complete a PhD thesis, or how to humiliate students in an oral exam. It’s a choice. To accomplish more work than one can complete alone, choose the former. This is the arc of human evolution: we develop tools to enhance our abilities. We meld with an abacus or a slide rule, and it makes us smarter. We learn to anticipate computations, like we’re playing a musical instrument in our heads. Or we pull out a calculator that makes us dumber. The role we see for our tools matters.
Programmers who actually write better code using AI know this. These HN threads are filled with despair over the poor quality of vibe coding. At the same time, Anthropic is successfully coding Claude using Claude.
You can surf the wave, but sooner or later, the wave will come crashing down.
Chess is relatively simple in comparison, as complex as it is.
There is definitely a gap in academic tooling, where an "association engine" would be very useful for a variety of fields (and for encouraging cross-pollination of ideas between fields), but I don't think LLMs are anywhere near the frontier of what can be accomplished with a given amount of computing power. I would expect simpler algorithms operating over more explicit ontologies to be much more useful; the main issue is that people haven't built those yet, whereas people have built LLMs.

That said, there's still a lot of credit due to the unreasonable effectiveness of literature searches: it usually takes me only 10 minutes a day, for a couple of days, to find the appropriate jargon, at which point I gain access to more papers than I know what to do with. LLM sessions that substitute for literature review tend to take more than 20 minutes; the main advantage is that people actually engage with (addictive, gambling-like) LLMs in a way that they don't with (boring, database-like) literature searches.
I think developing the habit of "I'm at a loose end, so I'll idly type queries into my literature search engine" would produce much better outcomes than developing the habit of "I'm at a loose end, so I'll idly type queries into ChatGPT", and that's despite the state-of-the-art of literature search engines being extremely naïve, compared to what we can accomplish with modern technology.
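For what it's worth, the kind of "association engine" I have in mind could start out very simple. Here is a toy sketch in Python (my own illustration, not an existing tool; the hardcoded keyword sets stand in for a real ontology) that ranks terms co-occurring with a query term across paper keyword lists:

    # Toy "association engine": rank terms that co-occur with a query term
    # across paper keyword lists. A deliberately naive sketch of the idea,
    # not an existing tool; a real ontology would replace the raw keywords.
    from collections import Counter
    from itertools import combinations

    papers = [  # hypothetical keyword lists extracted from papers
        {"tensor decomposition", "preconditioning", "least squares"},
        {"preconditioning", "conjugate gradient", "sketching"},
        {"sketching", "least squares", "randomized algorithms"},
    ]

    # Count how often each (sorted) pair of keywords appears together.
    cooccur = Counter()
    for kws in papers:
        for a, b in combinations(sorted(kws), 2):
            cooccur[(a, b)] += 1

    def associations(term):
        """Return terms associated with `term`, ranked by co-occurrence."""
        scores = Counter()
        for (a, b), n in cooccur.items():
            if term == a:
                scores[b] += n
            elif term == b:
                scores[a] += n
        return scores.most_common()

    print(associations("preconditioning"))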
I also agree that neural-net LLMs are not the inevitable way to implement AI. I'm most intrigued by the theoretical underpinnings of mathematical proof assistants such as Lean 4. Computer scientists know the word problem for strings to be undecidable. The word problem for typed trees with an intrinsic notion of induction is harder, but constructing proofs amounts to finding paths in this tree space. Just as mechanical computers failed in base ten even though Boole had already developed base-two logic, I see these efforts merging. Neural nets struggle to simulate recursion; for proof assistants, recursion is baked in. Stare at these tree paths and one sees thought at the atomic level, begging to be incorporated into AI. For now the river runs the other way, using AI to find proofs. That river will reverse flow.
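To make the "recursion is baked in" point concrete, here is a minimal Lean 4 example (my own toy, not taken from any of the linked work): a recursive definition and an inductive proof about it, both primitive notions in the system rather than something simulated:

    -- A recursive function: recursion is a primitive, not a simulation.
    def double : Nat → Nat
      | 0     => 0
      | n + 1 => double n + 2

    -- An inductive proof about it: induction walks the same tree structure.
    theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
      induction n with
      | zero => rfl
      | succ n ih => simp only [double, ih]; omega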
Instead, here you get questions that extremely famous mathematicians (Hairer, Spielman) are telling you (a) are solvable in <5 pages and (b) do not have known solutions in the literature. This means that AI solutions to these problems would give a clearer signal about what AI is actually doing when it works on research math.
Claude is one of the buggiest pieces of shit I have ever used. They had to BUY the creators of Bun to fix the damn thing. It is not a good example of your thesis.
Typing out solutions to problems was only part of the job description because there was no other way to code. Now we have a far better way.
Is that why everyone keeps complaining about the quality getting worse?
Can you share more about your architecture & process? I'm also a researcher involved in math research (though not strictly speaking a mathematician, but I digress). I've often thought about using AI on my notes, but they are messy, and even then I can't quite figure out what to ask: prioritization, connecting ideas, lit search, etc.
I'd love to hear what you do.
> the answers are known to the authors of the questions but will remain encrypted for a short time.
Ok. But humans may be able to solve the problems too. What prevents Anthropic or OpenAI from hiring mathematicians, having them write the proof, and passing it off as LLM-written? I'm not saying that's what they'll do. But shouldn't the paper say something about how they're going to validate that this doesn't happen?
Honest question here, not trying to start a flame war. I'm honestly confused about how this is going to test what it wants to test. Or maybe I'm just plain confused. Someone help me understand this?
Even if it's not Anthropic or OpenAI paying for the solutions, maybe it'll be someone solving them "for fun" because the paper got popular and posting them online.
It's a futile exercise.
Yep. "possible but unlikely" was my take too. As another person commented, this isn't really a benchmark, and as long as that's clear, it seems fair. My only fear is that some submissions may be AI-assisted rather than fully AI-generated, with crucial insights coming from experienced mathematicians. That's still a real achievement even if it's human + AI collaboration. But I fear that the nuance would be lost on news media and they'll publish news about the dawn of fully autonomous math reasoning.
Is this known?
[1] https://github.com/MizarProject/system
[2] https://github.com/digama0/mizar-rs
[3] https://arxiv.org/pdf/2304.08391v2
[4] https://link.springer.com/article/10.1007/s10817-018-9479-z
This is what is special about them:
> a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now;
I.e. these are problems of some practical interest, not just performative/competitive maths.
And this is what is known about the solutions:
> the answers are known to the authors of the questions but will remain encrypted for a short time.
I.e. a solution is known, but is guaranteed to not be in the training set for any AI.
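The "encrypted for a short time" part works like a standard commit-and-reveal scheme. A minimal sketch of the idea in Python, assuming SHA-256 commitments (my illustration of the general mechanism; the authors' actual scheme may differ):

    # Commit-and-reveal in miniature: publish a hash of the solution now,
    # reveal the solution later; anyone can check the hash matches.
    # (An illustration of the general idea, not the paper's actual scheme.)
    import hashlib, os

    solution = b"Proof: ... (the authors' write-up)"
    salt = os.urandom(16)  # prevents brute-forcing short or guessable texts
    commitment = hashlib.sha256(salt + solution).hexdigest()
    print("published now:", commitment)

    # Later, at reveal time, publish (salt, solution); verifiers recompute:
    assert hashlib.sha256(salt + solution).hexdigest() == commitment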
Not a mathematician, and obviously you guys understand this better than I do. One thing I can't understand is how they're going to judge whether a solution was AI-written or human-written. I mean, a human could also potentially solve the problem and pass it off as AI work? You might say, why would a human want to do that? Normal mathematicians might not. But mathematicians hired by Anthropic or OpenAI might, to pass it off as an AI achievement.
FrontierMath did this a year ago. Where is the novelty here?
> a solution is known, but is guaranteed to not be in the training set for any AI.
Wrong, as the questions were posed to commercial AI models and they can solve them.
This paper violates basic benchmarking principles.
But as a non-mathematician I'm not following any of it. How many people are willing to check the generated results? And how much effort is it for a human to check these? How quickly can you even identify math slop?
Here's the generated proof:
https://github.com/w-m/firstproof_problem_10/blob/2acd1cea85...
I asked Opus 4.6 to look at all the problems and guess which it might be able to solve. It was, coincidentally, most keen on problem 10.
I asked it to try. (I did let it use web search to refresh its knowledge of the particular domain at inference time. Pretty sure that's not unfair compared to how a human expert acts.)
It expressed confidence it had solved it OK after a few minutes' thought.
The solution was way beyond my pay-grade.
So I asked if we could verify it - maybe the invented method is simple enough to implement, so we can check its correctness and time complexity on real examples?
It went off and did that.
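For readers unfamiliar with the machinery: problem 10 asks for an efficient preconditioned conjugate gradient (PCG) method, and the verification step amounts to a harness along these lines. This is my own generic sketch with a Jacobi preconditioner as a placeholder assumption, not the code Claude actually generated (that is in the linked repo):

    # Generic sanity check for a preconditioned conjugate gradient solver.
    # My own minimal sketch, NOT the generated code from the linked repo;
    # the Jacobi (diagonal) preconditioner is a placeholder assumption.
    import numpy as np

    def pcg(A, b, M_inv, tol=1e-10, max_iter=500):
        """Solve Ax = b for SPD A, given preconditioner action M_inv(r)."""
        x = np.zeros_like(b)
        r = b - A @ x
        z = M_inv(r)
        p = z.copy()
        rz = r @ z
        for k in range(max_iter):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                return x, k + 1
            z = M_inv(r)
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x, max_iter

    # Random SPD test system; check residual and iteration count.
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((200, 200))
    A = Q @ Q.T + 200 * np.eye(200)   # well-conditioned SPD matrix
    b = rng.standard_normal(200)
    x, iters = pcg(A, b, lambda r: r / np.diag(A))
    print(iters, np.linalg.norm(A @ x - b))  # few iterations, tiny residual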
""" Net assessment: I'd now raise Problem 10 confidence from 85% to 90%.
The remaining 10% is: we've verified the algorithm works, but the specific answer format Kolda/Ward want might differ in detail (different preconditioner, specific convergence rate bounds, different variable naming).
The mathematical substance is solid.
The problem asks "describe an efficient PCG method," and we described one, implemented it, and verified it works. """
It's being very demanding of itself, and it expressed other reasonable caveats about how far our brief back-and-forth was from simply asking it to one-shot each problem.
""" The 8 problems I declined would have produced nonsense. Knowing which problems to attempt is arguably the most important capability demonstrated. """
(It reckoned problem 6 was worth attempting too, we didn't try it.)
Full conversation with the reasoning then generated solution and verification code:
https://claude.ai/public/artifacts/c3401a11-b5a8-4dc6-a72a-9...
Not going to spend too many more tokens on this.
As it should. Good.
This is a totally independent test, neither conducted by nor done in collaboration with any of the AI companies or their employees, so that no bias is introduced at all[0].
[0] Unless the researchers are not disclosing ownership of shares in private AI companies.
The field of AI4Math has so many benchmarks that are well executed -- based on the related work section, it seems the authors are not familiar with AI4Math at all.
My belief is that this paper is even being discussed solely because a Fields Medalist, Martin Hairer, is on it.
The authors themselves literally state: "Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form"
Sounds to me like a benchmark in all but name. And they failed pretty badly at achieving what they set out to do.