You're treating each list as unique, but all the lists have a distribution of digits in common... I'm at a loss to even understand what you're saying here, really -- this is why you need to actually state, formally, what you think the "LLMs are just stats" hypothesis amounts to.
It seems you think it amounts to saying LLMs sample from a combinatorial space, naively construed -- but that isn't the claim.
The claim is, rather, that they sample from a statistical distribution over tokens.
Take each position in the input vector, 0...127. It needs to "learn":
P(x_0 | y, x_1...x_127), P(x_1 | y, x_2...x_127), P(x_2 | y, x_3...x_127), etc.
Which is a family of 127 conditional distributions that seem trivial to learn.
I really don't know why you think the size of a combinatorial space is relevant here?
All the sorted lists share basically the same tiny family of conditional distributions { P(x_i | x_(i+1)...x_127) }.
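
To make the size comparison concrete, here's a rough Python sketch. Everything in it is my own assumption for illustration: digits 0-9, lists of length 128, and a deliberately crude simplification where each position is conditioned only on its immediate neighbour rather than on all the later positions. The function names are made up. It just shows that every sorted list obeys the same small pairwise constraint, while the raw space of sequences is astronomically larger.

```python
import random
from collections import defaultdict

LENGTH = 128   # assumed list length (positions 0...127)
DIGITS = 10    # assumed vocabulary: the digits 0-9


def sample_sorted_list(length, rng):
    """Draw `length` random digits and return them in ascending order."""
    return sorted(rng.randrange(DIGITS) for _ in range(length))


def conditional_support(lists):
    """For each digit, collect the digits observed immediately before it.

    This is a pairwise (adjacent-element) simplification, not the full
    conditional on all remaining positions.
    """
    support = defaultdict(set)
    for xs in lists:
        for prev, nxt in zip(xs, xs[1:]):
            support[nxt].add(prev)
    return support


rng = random.Random(0)
lists = [sample_sorted_list(LENGTH, rng) for _ in range(2000)]
support = conditional_support(lists)

# In a sorted list, a digit is never preceded by a larger digit.
violations = [(p, n) for n, ps in support.items() for p in ps if p > n]

table_entries = DIGITS * DIGITS    # size of one pairwise conditional table
sequence_space = DIGITS ** LENGTH  # number of possible length-128 digit lists

print("violations:", len(violations))
print("pairwise table entries:", table_entries)
print("digits in the size of the sequence space:", len(str(sequence_space)))
```

The point isn't that an LLM literally builds this table, only that the thing to be learned is closer to the 100-entry table than to the 10^128 sequences.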