Science starts with a guess, and then you run experiments to test it.
I honestly wish this paper actually showed what it claims, since understanding CoT reasoning relative to the underlying training set is a significant open problem.
What would be your argument against:
1. CoT models performing way better on benchmarks than normal models
2. People choosing to use CoT models in day-to-day life because they actually find that they give better performance
I know what you're saying here, and I know it is primarily a critique of my phrasing, but establishing something like this is the objective of in-context learning theory and of mathematical approaches to deep learning. It is possible to prove that sufficiently well-trained models will generalize to certain classes of unseen patterns, e.g. results showing that a transformer can act like gradient descent in context. There is still a long way to go in the theory; it is difficult research!
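To make the gradient descent example a bit more concrete, here is a minimal sketch of the simplest special case I know of. It assumes softmax-free (linear) attention, in-context linear regression, and a weight vector initialized at zero; the function names and the learning rate are made up for illustration. Under those assumptions, one gradient step on the squared loss and one linear attention readout give the same prediction for the query. This is only the toy identity, not the full theory.

```python
import random

def gd_one_step_prediction(xs, ys, x_query, lr):
    # One gradient step on L(w) = 0.5 * sum_i (w . x_i - y_i)^2, starting at w = 0.
    # The gradient at w = 0 is -sum_i y_i * x_i, so w_1 = lr * sum_i y_i * x_i.
    dim = len(x_query)
    w = [lr * sum(y * x[d] for x, y in zip(xs, ys)) for d in range(dim)]
    return sum(wd * qd for wd, qd in zip(w, x_query))

def linear_attention_prediction(xs, ys, x_query, lr):
    # Attention without softmax: keys are x_i, values are y_i, the query is x_query.
    # Each score is a plain dot product; lr plays the role of an output scale.
    scores = [sum(xd * qd for xd, qd in zip(x, x_query)) for x in xs]
    return lr * sum(y * s for y, s in zip(ys, scores))

# Hypothetical toy data: noiseless linear targets in 3 dimensions.
rng = random.Random(0)
true_w = [rng.gauss(0, 1) for _ in range(3)]
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
ys = [sum(w * xd for w, xd in zip(true_w, x)) for x in xs]
x_query = [rng.gauss(0, 1) for _ in range(3)]

pred_gd = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(abs(pred_gd - pred_attn))  # differs only by floating-point rounding
```

The two functions are algebraically the same sum, just grouped differently, which is the whole point: the attention layer does not need to "know" about gradient descent to implement one step of it.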
> performance collapses under modest distribution shift
The problem is that the notion of "modest" depends on scale. With enough varied data and/or enough parameters, what was once out-of-distribution can become in-distribution. The paper deliberately ignores this. Yes, the claims hold for tiny models, but I don't think anyone ever doubted that.