"That's cheating. You have custom code in the loop." But that's what an LLM does too: it feeds the input tokens through the model, then feeds each output token back in, one by one. So it's no more cheating than an LLM is.
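For anyone who hasn't seen that loop written out, here's a rough sketch. Everything in it is made up for illustration: `generate`, `EOS`, and `toy_counter` are hypothetical names, and the "model" is a trivial counting function standing in for a real forward pass.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker

def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int = 16) -> List[str]:
    """Autoregressive decoding: every output token is appended to the
    context and fed back in to produce the next one."""
    context = list(prompt)
    output: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(context)  # one pass through the "model"
        if token == EOS:
            break
        output.append(token)
        context.append(token)        # feed the output back in
    return output

def toy_counter(context: List[str]) -> str:
    """Stand-in for a real model (an assumption): counts up from the
    last token and stops once it has reached 5."""
    last = int(context[-1])
    return EOS if last >= 5 else str(last + 1)

result = generate(toy_counter, ["1", "2"])
print(result)
```

The `for` loop here is exactly the kind of hand-written code the objection complains about; it just normally lives inside the inference library rather than in your own script.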
Now, as far as a realistic LLM goes: no, there's no way to prove that it will always get even 1+1=2 correct. There's always a chance that something in the context will throw it off. Generally, LLMs are better at interpreting the question, finding some code that maps to the answer, executing that code, and spitting out the result. As a case in point, try asking one to solve a sudoku. It will grab some code off GitHub, run it, and give you the answer. Now ask it to solve the puzzle by pure step-by-step reasoning. It'll get hopelessly lost, tell you numbers are in the wrong places, tell you that eliminating 7 from {2,7} leaves only {3,8}, etc. (And then it finally gives you the correct answer anyway. Now _that's_ cheating!)
So, if not LLMs, and not handwritten loops, the only other option is single-shot. Can a NN be trained to do math in a single run? The answer is: not really. At least, not efficiently. If you think about it, a single forward pass through a NN performs only a fixed number of sequential steps, so it's limited in what it can compute. If your computation requires more steps than that, all your NN can do is guess.
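A loose analogy for the step budget, not a real NN: treat each layer as one step of Euclid's gcd algorithm and allow only a fixed number of them. `euclid_step` and `fixed_depth_gcd` are hypothetical names, and the mapping from layers to algorithm steps is an assumption made purely for illustration.

```python
from typing import Optional, Tuple

def euclid_step(state: Tuple[int, int]) -> Tuple[int, int]:
    """One sequential step of Euclid's algorithm.
    A finished state (b == 0) is passed through unchanged."""
    a, b = state
    return (a, b) if b == 0 else (b, a % b)

def fixed_depth_gcd(a: int, b: int, depth: int) -> Optional[int]:
    """Apply exactly `depth` steps, like a network with `depth` layers.
    Returns the gcd if the computation finished within the budget,
    otherwise None (the point where a real network could only guess)."""
    state = (a, b)
    for _ in range(depth):
        state = euclid_step(state)
    return state[0] if state[1] == 0 else None

easy = fixed_depth_gcd(12, 8, depth=4)  # finishes after 2 steps
hard = fixed_depth_gcd(13, 8, depth=4)  # needs 6 steps, budget is 4
print(easy, hard)
```

Consecutive Fibonacci numbers like 13 and 8 are the slow case for Euclid, so whatever depth you pick, there's always a larger pair that runs out of budget.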
So no, there's really no perfect "pure" AI for math. AI tools for math are generally a combination of NNs that make guesses and hand-written code that checks or uses those guesses to generate feedback and ask for next steps. Which isn't too different from how humans do it either: make a guess, try it out, look up references, look for tools, create a tool or modify an existing one, and so on until you get it right.
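The guess-plus-checker combination might look roughly like this. It's a minimal sketch under stated assumptions: `guess_and_check`, `toy_proposer`, and `is_root` are hypothetical names, and the proposer is a fixed enumeration standing in for whatever a NN would suggest.

```python
from typing import Callable, List, Optional

def guess_and_check(propose: Callable[[List[int]], int],
                    check: Callable[[int], bool],
                    max_rounds: int = 100) -> Optional[int]:
    """Ask the proposer for a guess, verify it with exact hand-written
    code, and feed the list of rejected guesses back as feedback."""
    rejected: List[int] = []
    for _ in range(max_rounds):
        guess = propose(rejected)
        if check(guess):          # deterministic verification
            return guess
        rejected.append(guess)    # feedback for the next round
    return None

def toy_proposer(rejected: List[int]) -> int:
    """Stand-in for the NN (an assumption): proposes 0, 1, -1, 2, -2, ...
    based on how many guesses have been rejected so far."""
    n = len(rejected)
    return (n + 1) // 2 if n % 2 else -(n // 2)

def is_root(x: int) -> bool:
    """Exact check: is x an integer root of x^2 + x - 6?"""
    return x * x + x - 6 == 0

root = guess_and_check(toy_proposer, is_root)
print(root)
```

The proposer is allowed to be wrong as often as it likes; correctness comes entirely from `check`, which is ordinary code.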
The LLM could invoke command-line programs, including calculators or anything else where a deterministic answer is desirable. Take structured outputs, for example: people usually mean JSON, but any schema, such as XML or HTML, could be enforced by command-line tools. When validation fails, the LLM should double-check its own output and hopefully fix it.
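A rough sketch of that validate-and-retry idea, using Python's built-in `json` module in place of an external command-line validator. The names `validate`, `generate_validated`, and `flaky_model` are hypothetical, the `"answer"` schema is invented for the example, and the "model" is a stub that returns malformed output first and fixes it once it sees an error message.

```python
import json
from typing import Callable, Optional

def validate(text: str) -> Optional[str]:
    """Return None if `text` is a JSON object with an integer "answer"
    field (an invented schema), otherwise an error message."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or not isinstance(data.get("answer"), int):
        return 'expected an object with an integer "answer" field'
    return None

def generate_validated(model: Callable[[str, Optional[str]], str],
                       prompt: str,
                       max_attempts: int = 3) -> dict:
    """Call the model, validate its output, and pass any validation
    error back to it as feedback until the output is accepted."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        text = model(prompt, feedback)
        feedback = validate(text)
        if feedback is None:
            return json.loads(text)
    raise ValueError(f"no valid output after {max_attempts} attempts: {feedback}")

def flaky_model(prompt: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM (an assumption): drops the closing brace on
    the first attempt, then corrects it once it receives feedback."""
    return '{"answer": 2}' if feedback else '{"answer": 2'

parsed = generate_validated(flaky_model, "What is 1+1?")
print(parsed)
```

Swapping `validate` for a call out to a real schema checker wouldn't change the shape of the loop, only where the error message comes from.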
I don't think this follows, since they are trying to replace humans, who are also not perfect at arithmetic.