Every result is explainable as having come from the training data. That's the null hypothesis.
The alternative hypothesis is that it's not explainable as having come from training data. That's a hard-to-believe, hard-to-prove negative.
You don't get anything out of any computational process that you didn't put in.
Similarly, LLMs do not invent new ways of reasoning about problems or language. They do, however, apply the ways they've learned to unseen problems.
LLMs are one level of abstraction up, but it's a very interesting level of abstraction.
Are you implying models that classify hand-written digits don’t generalize and only work on training data?
I'm saying that this is a strawman version of "not in the training data". The newly handwritten digit is squarely the same sort of stuff that is in the training data: an interpolation.
We are not surprised when we fit a curve to a bunch of points and then find points on the curve that are not exactly any of those points, but are located among the points.
Go too far outside the cluster of points, though, and the curve is a hallucination.
This is the intuition behind interpolate vs extrapolate.
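The curve-fitting intuition above can be sketched concretely. This is a toy illustration (not from the thread, and the specific function and polynomial degree are my own choices): fit a flexible curve to sampled points, then compare its error at a point between the samples versus a point far outside their range.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training data": noisy samples of sin(x) on the interval [0, 3].
x_train = np.linspace(0, 3, 30)
y_train = np.sin(x_train) + rng.normal(0, 0.01, size=x_train.shape)

# Fit a degree-7 polynomial -- flexible enough to track the data closely.
model = np.poly1d(np.polyfit(x_train, y_train, deg=7))

# Interpolation: a point located among the training points.
x_in = 1.5
err_in = abs(model(x_in) - np.sin(x_in))

# Extrapolation: a point well outside the sampled range.
x_out = 9.0
err_out = abs(model(x_out) - np.sin(x_out))

print(f"interpolation error: {err_in}")
print(f"extrapolation error: {err_out}")
```

Inside the sampled range the fit is close to the true function; far outside it, the high-degree terms dominate and the "prediction" diverges wildly, which is the sense in which the curve hallucinates.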
Dumb question but anything like this that’s written about on the internet will ultimately end up as training fodder, no?
https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
We have no idea what the training data is though, so you can't say that.
> and despite their shortcomings they have become extremely useful for a wide variety of tasks.
That seems like a separate question.
O3 pro (but not O3) was successfully able to apply reasoning and math to this domain in interesting ways, much like an expert researcher in these areas would.
Again, the field and the problem are, with 100% certainty, OOD (out of distribution) relative to the training data.
However, the techniques and reasoning methods are of course learned from data. But that's the point, right?
I don't even know that this is possible without seeing the training data. Hence the difficulty in describing how good at "reasoning" O3 Pro is.
The most novel problem would presumably be something only a Martian could understand, written in an alien language; the least novel problem would be a basic question taught in preschool, like "What color is the sky?"
Your research falls somewhere between those extremes.