The question is not whether it can work right now, but whether it is possible in the future (i.e. whether it's possible in principle).
I think the concern about out-of-distribution is overstated. If we train it on predicting machine learning papers, writing machine learning papers is not out-of-distribution.
You might say "but writing NOVEL papers" would be OOD; but there's no sharp boundary between old and new. A model's behavior is usually smooth, so it's not like it will output random bs if you ask it to predict 2025 papers. And predicting 2025 papers in 2024 is all we need for "recursive self-improvement". (There are also many ways to shift the distribution towards where you want it to be, e.g. aesthetics tuning, guidance in diffusion models, etc. Midjourney does not faithfully replicate the distribution of its training set; it's specifically tuned to create more pleasing outputs. So I don't see "oh but we don't have 2025 papers in the training set yet!" being an insurmountable problem.)
But more generally, seeing models as interpolators is useful only to some extent. We use statistical language when training the models, but that doesn't mean that all output should be interpreted as statistics. E.g. suppose I trained a model which generates plausible proofs. I can combine it with a proof checker (checking a proof is much easier than generating one), and wrap the two into a single function `generate_proof` which only ever returns a correct proof (it loops until a plausible proof checks out). Now the statistics do not matter much. It's just a function.
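To make the `generate_proof` idea concrete, here is a minimal sketch of the generate-and-check loop. Everything except the name `generate_proof` is an assumption for illustration: the "proof" is a toy one (a factor pair for `n`), `propose_factorization` is a random guesser standing in for a trained model, and `check_factorization` stands in for a real proof checker. Note the attempt budget: the loop only guarantees that whatever it returns has been verified, not that it terminates quickly.

```python
import random
from typing import Callable, Tuple

Proof = Tuple[int, int]  # toy "proof": a pair (a, b) claimed to satisfy a * b == n


def propose_factorization(n: int, rng: random.Random) -> Proof:
    """Hypothetical stand-in for a learned model: a plausible but often wrong guess."""
    a = rng.randint(2, n - 1)
    return a, n // a


def check_factorization(n: int, proof: Proof) -> bool:
    """Stand-in for a proof checker: cheap, deterministic verification."""
    a, b = proof
    return a > 1 and b > 1 and a * b == n


def generate_proof(
    n: int,
    propose: Callable[[int, random.Random], Proof],
    check: Callable[[int, Proof], bool],
    rng: random.Random,
    max_attempts: int = 10_000,
) -> Proof:
    """Loop until a proposed proof passes the checker; never return an unverified one."""
    for _ in range(max_attempts):
        candidate = propose(n, rng)
        if check(n, candidate):
            return candidate
    raise RuntimeError("no verified proof found within the attempt budget")


proof = generate_proof(91, propose_factorization, check_factorization, random.Random(0))
print(proof)  # a verified factor pair of 91
```

From the caller's point of view the randomness of the proposer is invisible: the result is either a checked proof or an exception.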
If there's such a thing as a general reasoning step, then all we need is a function which performs it. Then we just add an outer loop to explore a tree of possibilities using these steps. Further improvements would then come from making these steps faster and better.
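The "step function plus outer loop" structure can be sketched like this. It's a toy under stated assumptions, not a claim about how any real system does it: `reasoning_step` here is a hypothetical stand-in that just expands an integer state into two successors, and `search` is a plain breadth-first outer loop with a node budget. The point is only that the outer loop doesn't care what the step is, so a better or faster step can be swapped in without touching it.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def reasoning_step(state: int) -> List[int]:
    """Hypothetical stand-in for a general reasoning step: propose next states."""
    return [state + 1, state * 2]


def search(
    start: int,
    is_goal: Callable[[int], bool],
    step: Callable[[int], Iterable[int]],
    max_nodes: int = 100_000,
) -> Optional[List[int]]:
    """Outer loop: breadth-first exploration of the tree generated by `step`."""
    frontier = deque([start])
    parent: Dict[int, Optional[int]] = {start: None}
    while frontier and len(parent) <= max_nodes:
        state = frontier.popleft()
        if is_goal(state):
            # Walk parent links back to the start to recover the chain of steps.
            path: List[int] = []
            current: Optional[int] = state
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in step(state):
            if nxt not in parent:  # skip states we have already reached
                parent[nxt] = state
                frontier.append(nxt)
    return None  # budget exhausted or nothing left to explore


print(search(1, lambda s: s == 10, reasoning_step))  # [1, 2, 4, 5, 10]
```

Breadth-first order is just the simplest choice here; a best-first or sampled search would fit the same interface.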
Does reasoning generalize? I'd say everything points to "yes". Math is used in a variety of fields. We have yet to find a field where math doesn't work. If you take somebody educated in mathematical modeling and give them a new field to model, they won't complain about math being out-of-distribution.
If you look at LLMs today, they struggle with outputting JSON. That's clearly not an out-of-distribution problem, it's a problem with training: the dataset was too noisy, with too many examples where somebody requests JSON but gets JSON wrapped in Markdown. It's just an annoying data cleanup problem, nothing fundamental. I think it's reasonable to assume that within 5 years OpenAI, Google, etc. will manage to clean up their datasets and train more capable, reliable models which demonstrate good reasoning capabilities.
FWIW I believe that if we hit a wall on the road towards AGI, that might actually be good: it would buy more time to research what we actually want out of AGI. But I doubt that any wall will last more than 5 years, as AGI already seems almost within reach...