That's when A.I. starts advancing itself and no longer needs humans in the loop.
Like the inverse kinematics required for your arm and fingers to move.
The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.
I use helper functions to easily render from the sympy code to latex, etc.
A lot of the math behind this quantum eraser experiment was done this way.
https://github.com/paul-gauthier/entangled-pair-quantum-eras...
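A minimal sketch of that workflow (helper name and the specific expression are my own invention, not from the linked repo): keep the math as SymPy code, guard it with assertions, and render to LaTeX from the same source of truth.

```python
import sympy as sp

def to_latex(expr):
    """Hypothetical helper in the spirit of the comment:
    simplify a SymPy expression and render it to LaTeX for notes."""
    return sp.latex(sp.simplify(expr))

x = sp.symbols('x')
deriv = sp.diff(sp.sin(x)**2, x)  # 2*sin(x)*cos(x)

# A tiny "the math remains correct" check, the kind that lives in tests:
# d/dx sin^2(x) should simplify to sin(2x).
assert sp.simplify(deriv - sp.sin(2*x)) == 0

print(to_latex(deriv))
```

Because the LaTeX is generated from the verified expression, an edit by either the human or the LLM that breaks the math fails the assertion before it ever reaches the write-up.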
Probably the main deficiency is confusion as the context grows (and therefore as task complexity grows).
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html
[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax
[1] https://arxiv.org/abs/2505.20047 [2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
Calculators are good AI: they rarely lie (aside from floating-point rounding). And yes, Wikipedia counts calculators as AI tech, since a "computer" was once a person, and now it is a tool that shows the intelligent trait of doing math with numbers, or even functions, variables, and equations.
Querying a calculator or a Wolfram Alpha-style symbolic AI system seems like the only use for LLMs, aside from text refactoring, that should be feasible.
Thinking LLMs know anything on their own is a huge fallacy.
My team has been prototyping something very similar with encoding business operations policies with LEAN. We have some internal knowledge bases (google docs / wiki pages) that we first convert to LEAN using LLMs.
Then we run the solver to verify consistency.
When a wiki page is changed, the process is run again and it's essentially a linter for process.
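As a rough illustration of the encoding step (all names here are invented; real policies are obviously richer), a wiki rule can become a Lean proposition, and "consistency" amounts to showing the rules admit at least one model:

```lean
-- Hypothetical sketch: a wiki-derived policy rule encoded in Lean 4.
-- Names (Order, refundApproved, managerSignoff) are made up for illustration.

structure Order where
  refundApproved : Bool
  managerSignoff : Bool

-- Policy rule: every approved refund requires manager sign-off.
def compliesWithPolicy (o : Order) : Prop :=
  o.refundApproved = true → o.managerSignoff = true

-- Consistency check: exhibit a model satisfying the policy,
-- so the rules cannot be mutually contradictory.
example : ∃ o : Order, compliesWithPolicy o :=
  ⟨{ refundApproved := false, managerSignoff := false }, by decide⟩
```

When the wiki changes, regenerating the definitions and re-running the checker is what makes it behave like a linter: a contradictory edit leaves no model, and the proof fails.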
Can't say it moved beyond the prototyping stage though, since the LEAN conversion does still require some engineers to at least look through it.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
If you ever feel like chatting and discussing more details, happy to chat!
I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
Although that work is not public, you can play with the generally available product here!
[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Some LLMs are more consistent between text and SMT than others (Table 1, Figures 14-15).
You can do uncertainty quantification with selective verification to reduce the risk, e.g. measured as the Area Under the Risk-Coverage Curve in Table 4.
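To make the metric concrete, here is a toy, stdlib-only sketch of the risk-coverage idea (function names and data are mine, not from the paper): keep only the most confident answers, and at each coverage level measure the error rate ("risk") among the answers kept.

```python
def risk_coverage_points(confidences, correct):
    """Sort predictions by confidence (descending); at each coverage
    level k/n, report the error rate among the k answers retained."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / len(order), errors / k))  # (coverage, risk)
    return points

def aurc(points):
    # One common approximation: mean risk across coverage levels.
    return sum(r for _, r in points) / len(points)

pts = risk_coverage_points([0.9, 0.8, 0.6, 0.3], [True, True, False, True])
print(round(aurc(pts), 4))  # 0.1458
```

Lower AURC is better: a model whose confidence actually tracks correctness concentrates its errors at low coverage, which is exactly what selective verification exploits.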
And let me be clear that this is a major limitation that fundamentally breaks whatever you are trying to achieve. You start with some LLM-generated text that is, by construction, unrelated to any notion of truth or factuality, and you push it through a verifier. Now you are verifying hot air.
It's like research into the efficacy of homeopathic medicine, and there's a lot of that indeed, very carefully performed and with great attention to detail. Except all of that research is trying to prove whether doing nothing at all (i.e. homeopathy) has some kind of measurable effect or not. Obviously the answer is no. So what can change that? Only making homeopathy do something instead of nothing. But that's impossible, because homeopathy is, by construction, doing nothing.
It's the same thing with LLMs. Unless you find a way to make an LLM generate text conditioned on some measure of factuality, you can verify the output all you like; the whole thing will remain meaningless.
https://www.reddit.com/r/healthIT/comments/1n81e8g/comment/n...
Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.
Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at logical models or keeps creating unsoundnesses. Sadly, the evaluation is a bit thin on the ground about how this stacks up, and what causes it to fall short.
E.g. in https://arxiv.org/pdf/2505.20047, Table 1, we compare performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.
An illustration of this can be seen in Figures 14 and 15 on page 29.
In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.
This lets them give you a 99%+ soundness guarantee: if the service says the Q/A pair is valid or guaranteed w.r.t. the policy, it is right more than 99% of the time.
https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
"Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice's brother have?"
But the generated program is not very useful:
  {
    "sorts": [],
    "functions": [],
    "constants": {},
    "variables": [
      {"name": "num_brothers_of_alice", "sort": "IntSort"},
      {"name": "num_sisters_of_alice", "sort": "IntSort"},
      {"name": "sisters_of_alice_brother", "sort": "IntSort"}
    ],
    "knowledge_base": [
      "num_brothers_of_alice == 60",
      "num_sisters_of_alice == 212",
      "sisters_of_alice_brother == num_sisters_of_alice + 1"
    ],
    "rules": [],
    "verifications": [
      {
        "name": "Alice's brother has 213 sisters",
        "constraint": "sisters_of_alice_brother == 213"
      }
    ],
    "actions": ["verify_conditions"]
  }
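The problem is circularity: the knowledge base already asserts the very relation (+1 for Alice herself) that the "verification" is supposed to establish, so the solver can only agree with the LLM's own reasoning. Stripped of the solver, the program reduces to this:

```python
# Plain-Python restatement of the generated program above,
# showing why its "verification" is tautological.
num_sisters_of_alice = 212
# The +1 step (counting Alice among her brother's sisters) is the whole
# puzzle, yet it is asserted as a fact rather than derived by the solver.
sisters_of_alice_brother = num_sisters_of_alice + 1

print(sisters_of_alice_brother == 213)  # prints: True, but proves nothing new
```

A useful encoding would instead model the family (children, sexes, sibling relations) and let the solver derive the count, so a wrong +1 would surface as a failed check rather than being baked into the premises.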
When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047), we are able to do this with open-source LLMs.
Imagine somebody in 2007: "It's so funny to me that people are still adamant about mortgage default risk after it's become a completely moot point because nobody cares in this housing market."
But this is an incredibly interesting problem!
You'll also see why their applications are limited compared to what you probably hoped for.
Imagine having a bunch of 2D matrices with a combined 1.8 trillion numbers, from which you pick out blocks of numbers in a loop and finally merge and combine them to form a token.
Good luck figuring out what number represents what.
My paper-and-pen version of the latest LLM (quite a large bit of paper, and certainly a lot of ink, I might add) also does not think.
I am surprised so many in the HN community have so quickly taken it as fact that LLMs think or reason, even anthropomorphising LLMs to this end.
For a group so inclined to call out 'God of the gaps', they have quite quickly invented their very own 'emergence'.
It's just shorthand for "that's an extraordinary claim and nobody has provided any remotely extraordinary evidence to support it."