To clear up a few misconceptions:
- Aristotle uses modern AI techniques heavily, including language modeling.
- Aristotle can be guided by an informal (English) proof. If the proof is correct, Aristotle has a good chance at translating it into Lean (which is a strong vote of confidence that your English proof is solid). I believe that's what happened here.
- Once a proof is formalized into Lean (assuming you have formalized the statement correctly), there is no doubt that the proof is correct. This is the core of our approach: you can do a lot of (AI-driven) search, and once you find the answer you are certain it's correct no matter how complex the solution is.
Happy to answer any questions!
Given a formal statement of what you want, Lean can validate that the steps in a (tedious) machine-readable purported proof are valid and imply the result from accepted axioms. This is not AI, but a tiny, well reviewed kernel that only accepts correct formal logic arguments.
So, if you have a formal statement that you've verified to represent what you are interested in by some other means, Lean can tell you whether the proof created by genAI is correct. Basically, there is a nigh infallible checker that won't accept incorrect hallucinations.
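To make this concrete, here is a toy illustration (mine, not from the article) of what it means for the kernel to accept only correct arguments: you write the statement formally, supply a proof term, and Lean accepts the file only if that term genuinely proves that statement:
-- Toy example: the kernel accepts this only because `Nat.add_comm a b`
-- really is a proof of `a + b = b + a`.
theorem my_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b
-- Change the statement (e.g. to `a + b = b * a`) without changing the proof
-- term, and Lean rejects the file with a type error.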
The natural-language statement of the problem is (from https://www.erdosproblems.com/728):
> Let C>0 and ϵ>0 be sufficiently small. Are there infinitely many integers a, b, n with a ≥ ϵn and b ≥ ϵn such that a!b! ∣ n!(a+b−n)! and a + b > n + C log n?
The Lean-language statement of the problem (which can be done either by hand or by AI) is (from https://github.com/plby/lean-proofs/blob/f44d8c0e433ab285541...):
∀ᶠ ε : ℝ in [>] 0, ∀ C > (0 : ℝ), ∀ C' > C,
∃ a b n : ℕ,
0 < n ∧
ε * n < a ∧
ε * n < b ∧
a ! * b ! ∣ n ! * (a + b - n)! ∧
a + b > n + C * log n ∧
a + b < n + C' * log n
Yes, on the one hand, one needs to know enough about Lean to be sure that this formulation matches what we intend and isn't stating something trivial. But on the other hand, this is not as hard as finding an error on some obscure line of a long proof. (There's also an older formulation at https://github.com/google-deepmind/formal-conjectures/blob/f... but the new one is more in the spirit of what was intended: see the discussion starting at https://www.erdosproblems.com/forum/thread/728#post-2196 which gives a clear picture, as of course does Tao's thread in the OP that summarizes this discussion.)
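One concrete example of the kind of subtlety to watch for (a toy illustration of my own, not a problem with the statement above): Lean's subtraction on ℕ truncates at zero, so a term like a + b - n silently means max(a + b − n, 0), and a careless formalization built on such a term can end up stating something weaker, or more trivial, than intended:
-- Truncated natural-number subtraction: provable by computation.
example : (2 : Nat) - 5 = 0 := rfl
(In the statement above this is harmless, since the other conjuncts force a + b > n.)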
However, there are some good heuristics. If you expect a problem to be hard and the proof is very short, you've probably missed something!
This is also a big risk when trying to prove code correctness: "prove this algo works" means you gotta define "works" along certain axes, and if you're very unlucky you might have a proof that exploits the uncertainty around a certain axis.
If you and the AI agree on the translation of the problem, and Lean agrees with the solution, then you're done.
Sometimes when I'm using new LLMs I'm not sure if it’s a step forward or just benchmark hacking, but formalized math results always show that the progress is real and huge.
When do you think Harmonic will reach formalizing most (even hard) human written math?
I saw an interview with Christian Szegedy (your competitor I guess) that he believes it will be this year.
In other fields (topology, probability, linear algebra), many key definitions are not in Mathlib yet, so you will struggle to write down the theorem itself. (But in some cases, Aristotle can define the structure you are talking about on the fly!)
This is not an intrinsic limitation of Lean; it's just that nobody has taken the time to formalize much of those fields yet. We hope to dramatically accelerate this process by making it trivial to prove lemmas, which make up much of the work. For now, I still think humans should write the key definitions and statements of "central theorems" in a field, to ensure they are compatible with the rest of the library.
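As a minimal sketch of that division of labor (the definition here is hypothetical, not from Mathlib): a human pins down the key definition and the statement they actually care about, and the routine lemma is the part that can be delegated to a prover:
-- Hypothetical, human-written definition capturing the intended notion.
def IsIdempotent (f : Nat → Nat) : Prop := ∀ n, f (f n) = f n
-- A routine lemma whose proof is the kind of work that can be delegated.
theorem id_isIdempotent : IsIdempotent id := fun _ => rfl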
My limited understanding of Rust is that it applies a fixed set of rules to guarantee memory safety. The rules are somewhat simple and limiting, for ease of understanding and implementation, but also because of undecidability.
Programmers run into situations where they know that their code won't cause memory errors, but it doesn't follow the rules. Wouldn't it be cool if something like Aristotle were integrated into the compiler? Any code for which a proof of correctness could be written would pass/compile, without having to add more and more rules.
Formal methods are cool because, by contrast to tools like the borrow checker, you can prove some very "nonlocal" properties: this system does not deadlock, or it makes progress at least every N steps, etc.
If the proof is rooted in the understanding of the programmer, who can give proof hints to the prover, then any modification of the program can be accompanied by a modification of the hints, still allowing automatic proofs. But if the human has no clue, then the automatic system can get stuck without the human having a chance to help it along.
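A minimal Lean-flavored sketch of what those hints look like (toy goal, my own example, assuming the omega tactic bundled with recent Lean): the have line is the human-supplied stepping stone, and the automation only has to bridge the remaining gap; if the statement changes, the human adjusts the hint and re-runs the automation:
-- Toy example: a human-provided intermediate fact guides the automated finish.
example (a b c : Nat) (h1 : a ≤ b) (h2 : b ≤ c) : a ≤ c + 1 := by
  have hac : a ≤ c := Nat.le_trans h1 h2  -- the "hint"
  omega  -- automation closes the small remaining gap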
Second, when you say language modeling support, does that mean it can better understand code representations (ASTs), or something else? I am just an AI user, not very knowledgeable in the field. My main interest is whether it would be great for static analysis oriented toward computer security (SAST).
How does this depend on the area of mathematics of the proof? I was under the impression that it was still difficult to formalize most research areas, even for a human. How close is Aristotle to this frontier?
That's a pretty big assumption, though, isn't it? As we saw with the Navier-Stokes psychosis episode over the New Year holiday, formalizing correctly really isn't guaranteed.
It's another reformulation rather than a true proof. Instead of wanting a proof of a theorem, we now just need to prove that this proof actually proves the theorem. The proof itself is so incomprehensible that it can't be trusted on its own, but if it can be shown that it can be trusted, then the theorem must be true.
Claude Code can write Lean, but we do a heck of a lot of RL on theorem proving, so Aristotle winds up being much better at writing Lean than other coding agents are.
Do you have any links to reading about how often Lean core has soundness bugs or Mathlib has correctness bugs?
Translate an informal description of the proof into this Lean?
I don’t think it can really be said to have occurred autonomously then?
Looks more like a 50/50 partnership with a super-expert human on the one side, which makes this way more vague in my opinion - and in line with my own AI tests, i.e. they are pretty stupid, even Opus 4.5 or whatever, unless you're already an expert and are doing boilerplate.
EDIT: I can see the title has now been fixed from "solved" to "more or less solved", which I still think is a big stretch.
Seems exactly like the tests at my company, where even frontier models are revealed to be very expensive rubber ducks, but completely fail with non-experts or anything novel or math-heavy.
I.e., they mirror the intellect of the user but give you big dopamine hits that'll lead you astray.
> There seems to be some confusion on this so let me clear this up. No, after the model gave its original response, I then proceeded to ask it if it could solve the problem with C=k/logN arbitrarily large. It then identified for itself what both I and Tao noticed about it throwing away k!, and subsequently repaired its proof. I did not need to provide that observation.
so it was literally "yo, your proof is weak!" - "naah, watch this! [proceeds to give full proof all on its own]"
I'd say that counts
"solved more or less autonomously by AI" were Tao's exact words, so I think we can trust his judgment about how much work he or the AI did, and how this indicates a meaningful increase in capabilities.
Or someone like Terence Tao might figure out how to wield AI better than anyone else, even the labs, and use the tools to take a bunch down at once. I'm excited to see what's coming this year.
For all topics that can be expressed with language, the value of LLMs is shuffling things around to tease out a different perspective from the humans reading the output. This is the only realistic way to understand AI enough to make it practical and see it gain traction.
As much as I respect Tao, I feel like his comments about AI usage can be misleading without carefully reading what he is saying in the linked posts.
A large amount of Tao's work is around using AI to assist in creating Lean proofs.
I'm generally on the more skeptical side of things regarding LLMs and grand visions, but assisting in the creation of Lean proofs is a huge area of opportunity for LLMs and really could change mathematics in fundamental ways.
One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case. We have proofs that are gigabytes (I believe even terabytes in some cases) in size, but we know they are correct because they check in Lean.
This particular pattern of using state-of-the-art work in two different areas (LLMs and theorem proving) absolutely has the potential to fundamentally change how mathematics is done. There's a great picture on p. 381 of Type Theory and Formal Proof where you can easily see how LLMs can be slotted into two of the trickiest parts of that diagram.
Because the work is formally verified we can throw out entire classes of LLM problems (like hallucinations).
Personally I think strongly typed languages with powerful type systems are also the long-term ideal for coding with LLMs (but I'm less optimistic about devs following this path).
To a layman, that doesn't sound like very AI-like? Surely there must be a dozen algorithms to effectively search this space already, given that mathematics is pretty logical?
Isabelle has had the "Sledgehammer" tool for quite a while [1]. It uses solvers like Z3 to search and apply a catalog of proof strategies and then tries to construct a proof for your main goal or any remaining subgoals you have to complete. It's not perfect, but it's remarkably useful (even if it does sometimes give you proofs that import like ten different libraries and are hard to read).
I think Coq has Coqhammer but I haven't played with that one yet.
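Lean's ecosystem has analogous (if less integrated) automation: Mathlib and the bundled tooling provide tactics like exact?, aesop, and omega that play the same "find the boring steps for me" role. A toy example of the workflow (mine, assuming recent Lean with the bundled omega tactic):
-- `omega` discharges linear-arithmetic goals outright; `exact?` (from the
-- standard tooling) would instead search the library and print a term to paste in.
example (a b : Nat) (h : a ≤ b) : a ≤ b + 1 := by omega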
Logic solvers are useful, but not tractable as a general way to approach mathematics.
The way you use the term AI tells me that you are a representative layman, so what precisely are you trying to define?
It might be helpful to understand the term artificial intelligence first:
What happens if we reason upwards but change some universal constants? What happens if we use tau instead of pi everywhere? These kinds of fun questions would otherwise require an enormous intellectual effort, whereas with the mechanisation and automation of thought, we might be able to run them and see!
Google Scholar was a huge step forward for doing meta-analysis vs a physical library.
But agents scanning the vastness of PDFs to find correlations and insights that are far beyond human context capacity will, I hope, find a lot of knowledge that we have technically already collected but remain ignorant of.
Literally nothing other than mild convenience. It’s just 2pi.
The "intellectual effort" this requires is about 0.
Maybe you meant Euler's number? Since it also relates to PI, it can be used and might actually change the framework in an "interesting way" (making it more awkward in most cases - people picked PI for a reason).
https://ddcolrs.wordpress.com/2018/01/17/maxwells-equations-...
"I doubt that anything resembling genuine "artificial general intelligence" is within reach of current #AI tools. However, I think a weaker, but still quite valuable, type of "artificial general cleverness" is becoming a reality in various ways.
By "general cleverness", I mean the ability to solve broad classes of complex problems via somewhat ad hoc means. These means may be stochastic or the result of brute force computation; they may be ungrounded or fallible; and they may be either uninterpretable, or traceable back to similar tricks found in an AI's training data. So they would not qualify as the result of any true "intelligence". And yet, they can have a non-trivial success rate at achieving an increasingly wide spectrum of tasks, particularly when coupled with stringent verification procedures to filter out incorrect or unpromising approaches, at scales beyond what individual humans could achieve.
This results in the somewhat unintuitive combination of a technology that can be very useful and impressive, while simultaneously being fundamentally unsatisfying and disappointing - somewhat akin to how one's awe at an amazingly clever magic trick can dissipate (or transform to technical respect) once one learns how the trick was performed.
But perhaps this can be resolved by the realization that while cleverness and intelligence are somewhat correlated traits for humans, they are much more decoupled for AI tools (which are often optimized for cleverness), and viewing the current generation of such tools primarily as a stochastic generator of sometimes clever - and often useful - thoughts and outputs may be a more productive perspective when trying to use them to solve difficult problems."
This comment was made on Dec. 15, so I'm not entirely confident he still holds it?
While I quickly noticed that my pre-ChatGPT-3.5 use of the term was satisfied by ChatGPT-3.5, this turned out to be completely useless for 99% of discussions, as everyone turned out to have different boolean cut-offs for not only the generality, but also the artificiality and the intelligence, and also what counts as "intelligence" in the first place.
That everyone can pick a different boolean cut-off for each initial, means they're not really booleans.
Therefore, consider that this can't drive a car, so it's not fully general. And even those AI which can drive a car, can't do so in genuinely all conditions expected of a human, just most of them. Stuff like that.
Are you getting the same value in your work, in your field?
For me, deep research tools have been essential for getting caught up with a quick lit review about research ideas I have now that I'm transitioning fields. They have also been quite helpful with some routine math that I'm not as familiar with but is relatively established (like standard random matrix theory results from ~5 years ago).
It does feel like the spectrum of utility is pretty aligned with what you might expect: routine programming > applied ML research > stats/applied math research > pure math research.
I will say ~1 year ago they were still useless for my math research area, but things have been changing quickly.
So yes, there is value here, and quite a bit, but it requires a lot of forethought in how you structure your prompts, and you need to be super skeptical about the output as well as able to check that output minutely.
If you would just plug in a bunch of data and formulate a query and would then use the answer in an uncritical way you're setting yourself up for a world of hurt and lost time by the time you realize you've been building your castle on quicksand.
I also found it can be helpful when exploring your mathematical intuitions on something, e.g. how a dropout layer might affect learned weights and matrix properties, etc. Sometimes it will find some obscure rigorous math that can be very enlightening or relevant to correcting clumsy intuitions.
Overall: useful, but not yet particularly "accelerating" for me.
But maybe that is just me. I have read some of Terence Tao's transcripts, and the questions he asks LLMs are higher complexity than what I ask. Yet, he often gets reasonable answers. I don't yet know how I can get these tools to do better.
The difference between free ChatGPT, GPT-5.2 Thinking, and GPT-5.2 Pro is enormous for areas like logic and math. Often the answer to bad results is just to use a better model.
Additionally, sometimes when I get bad results I just ask the question again with a slightly rephrased prompt. Often this is enough to nudge the models in the right direction (and perhaps get a luckier response in the process). However, if you are just looking at a link to a chat transcript, this may not be clear.
I have wondered if he has access to a better model than I, the way some people get promotional merchandise. A year or two ago he was saying the models were as good as an average math grad student when to me they were like a bad undergrad. In the current models I don't get solutions to new problems. I guess we could do some debugging and try prompting our models with this Erdos problem and see how far we get. (edit: Or maybe not; I guess LLMs search the web now.)
- It took me a minute or two of clicking around to figure out that the (only?) way to use it is to create an API key, then start aristotle in the terminal and interact with it there. It could be more obvious I think.
- Your profile links to http://www.cs.stanford.edu/~tachim/ which doesn't work; should be http://cs.stanford.edu/~tachim/ (without the www) (I think Stanford broke something recently for the former not to work.)
Edit: I just realized from https://news.ycombinator.com/item?id=46296801 that you're the CEO! - in that case maybe you, or whoever you think most appropriate from your organization, could submit it along with a text description of what it is, and what is the easiest and/or most fun way to try it out?
EDIT: Look at all the people below just reacting to the headline and clearly not reading the posts. Aristotle (https://arxiv.org/abs/2510.01346) is key here folks.
EDIT2: It is clear that many of the people below don't even understand basic terminology. Something being a transformer doesn't make it an LLM (vision transformers, anyone?), and if you aren't training on language (e.g. AlphaFold, or Aristotle on Lean stuff), it isn't a "language" model.
Do you mean that in this case, it was not a LLM?
I think it's because it comes off as if you are saying that we should move off of GenAI, and a lot of people use LLM when they mean GenAI.
This is a really hopeful result for GenAI (fitting deep models tuned by gradient descent on large amounts of data), and IMO this is possible because of specific domain knowledge and approaches that aren't there in the usual LLM approaches.
1. "The search algorithm is a highly parallel Monte Carlo Graph Search (MCGS) using a large transformer as its policy and value functon." ... "We use a generative policy to take progressively widened [7] samples from the large action space of Lean tactics, conditioning on the Lean proof state, proof history, and, if available, an informal proof. We use the same model and prompt (up to a task token) to compute the value function which guides the search."
See that 'large transformer' phrase? That's where the LLM is involved.
2. "A lemma-based informal reasoning system which generates informal proofs of mathematical state-ments, breaks these proofs down into lemmas, formalizes each lemma into Lean, and iterates this process based on formal feedback" ... "First, the actions it generates consist of informal comments in addition to Lean tactics. Second, it uses a hidden chain of thought with a dynamically set thinking budget before predicting an action."
Unless you're proposing that this team solved AGI, "chain of thought" is a specific term of art in LLMs.
3. "A geometry solver which solves plane geometry problems outside of Lean using an approach based on AlphaGeometry [45]." ... following the reference: "AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. "
AlphaGeometry, like all of Deepmind's Alpha tools, is an LLM.
Instead of accusing people of not reading the paper, perhaps you should put some thought into what the things in the paper actually represent.
Please remember that this is a theorem about integers that admits a fairly elementary proof, one well supported by the existing Mathlib infrastructure. It seems that the AI relies on the symbolic proof checker, and the proofs it is checking don't use very complex definitions in this result. In my experience, proofs like this, which are one step removed from existing infra, are much, much more likely to work.
Again though, this is really insanely cool!!
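For a sense of what "one step removed from existing infrastructure" means, here is a toy example of my own (not taken from the actual proof) where a Mathlib factorial lemma does essentially all the work:
import Mathlib.Data.Nat.Factorial.Basic
-- Mathlib already provides `Nat.factorial_dvd_factorial : m ≤ n → m ! ∣ n !`,
-- so this toy corollary sits one step above the existing infrastructure.
example (a n : Nat) (h : a ≤ n) : Nat.factorial a ∣ Nat.factorial n :=
  Nat.factorial_dvd_factorial h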
People are still going to be moving the goalposts on this and claiming it's not all that impressive or that the solution must have been in the training data or something, but at this point that's kind of dubiously close to arguing that Terence Tao doesn't know what he's talking about, which to say the least is a rather perilous position.
At this point, I think I'm making a belated New Year's resolution to stop arguing with people who are still saying that LLMs are stochastic parrots that just remix their training data and can never come up with anything novel. I think that discussion is now dead. There are lots of fascinating issues to work out with how we can best apply LLMs to interesting problems (or get them to write good code), but to even start solving those issues you have to accept that they are at least somewhat capable of doing novel things.
In 2023 I would have bet hard against us getting to this point ("there's no way chatbots can actually reason their way through novel math!"), but here we are, three years later. I wonder what comes next?
Or are you just saying that solving novel problems involves remixing ideas? Well, that's true for human problem solving too.
Though I suppose if the problem statement in Lean is human-generated and there is no way to "cheat" in a Lean proof, the proof could be trusted without understanding it.
> ... to me, the more interesting capability revealed by these events is the ability to rapidly write and rewrite new versions of a text as needed, even if one was not the original author of the argument.
> This is sharp contrast to existing practice where the effort required to produce even one readable manuscript is quite time-consuming, and subsequent revisions (in response to referee reports, for instance) are largely confined to local changes (e.g., modifying the proof of a single lemma), with large-scale reworking of the paper often avoided due both to the work required and the large possibility of introducing new errors. However, the combination of reasonably competent AI text generation and modification capabilities, paired with the ability of formal proof assistants to verify the informal arguments thus generated, allows for a much more dynamic and high-multiplicity conception of what a writeup of an argument is, with the ability for individual participants to rapidly create tailored expositions of the argument at whatever level of rigor and precision is desired.
Of course this implies that the math works, which is the Aristotle part, and that's great ... but it also rebuts the "this isn't AI by itself, this is AI plus a bunch of experts working hard, nothing to see here" objection: right, but even "experts working hard" fail to iterate on the paper, which significantly hinders research progress.
1. This result is very far from showing something like "human mathematicians are no longer needed to advance mathematics".
2. Even if it did show that, as long as we need humans trained in understanding maths, since "professional mathematicians" are mostly educators, they probably aren't going anywhere.
The educator business has survived so far only because it provided in-person interactive knowledge transfer and credentials, neither of which was possible from static sources of knowledge such as libraries and the internet. But now all of that is possible without the involvement of human teachers.
Edit: Found it here https://github.com/plby/lean-proofs/blob/main/src/v4.24.0/Er...
The METR institute predicts that the length of tasks AI agents can complete doubles every 7 months.
We should expect it to take until 2033 before AI solves Clay Institute-level problems with 50% reliability.
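For intuition on where a date like that comes from (my back-of-the-envelope numbers, not METR's): under a 7-month doubling law, going from a current horizon h to a target horizon T takes about 7 · log₂(T/h) months. If today's 50%-reliability horizon is on the order of an hour and a Clay-level problem corresponds to something like 10,000 hours of expert work, that's 7 · log₂(10000) ≈ 93 months, i.e. roughly eight years — which is how you land in the early 2030s, assuming the trend actually holds that long.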
And once we have that formalized corpus, it's all set up as training data for moving forward.
https://github.com/teorth/erdosproblems/wiki/AI-contribution...
It classifies the advancements based on the level of AI input. In particular, the entry in Table 1 related to the original post has both a green and yellow light, reflecting the skepticism from others.
Meanwhile I can’t get Claude code to fix its own shit to save my life.
If you think about it, it's also what a lot of other intellectual activity looks like, at least in STEM.
Maybe this should give you some hint to that you're trying to use it in a different way than others?