
jbay808 2y ago

> The (sequential) distribution of digits amongst sorted numbers is tiny

This is why the 10^80 random lists get reduced to only 10^36 sorted lists. However, 10^36 is still very large relative to the size of the model.
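The counting behind figures like these can be sketched as follows. This comment does not restate the exact setup, so the parameters here are an assumption: lists of 40 numbers, each in 0..99, picked only because 100^40 = 10^80. The function names are hypothetical. A sorted list is determined by how many times each value appears, so the number of distinct sorted lists is a multiset count, and the result need not match the 10^36 figure exactly.

```python
from math import comb

def count_all_lists(num_values: int, length: int) -> int:
    """Number of ordered lists: each position takes any of num_values values."""
    return num_values ** length

def count_sorted_lists(num_values: int, length: int) -> int:
    """Number of distinct sorted lists, i.e. multisets of size `length`
    drawn from `num_values` values (stars and bars)."""
    return comb(num_values + length - 1, length)

# Assumed setup (not stated in the comment): 40 numbers, each in 0..99.
total = count_all_lists(100, 40)
sorted_total = count_sorted_lists(100, 40)

# Report orders of magnitude via the number of decimal digits.
print(f"all lists:    about 10^{len(str(total)) - 1}")
print(f"sorted lists: about 10^{len(str(sorted_total)) - 1}")
```

Under any such setup the sorted count is far smaller than the raw count, but it is still astronomically larger than the parameter count of a small model.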


mjburgess 2y ago

You're treating each list as unique, but all the lists have a distribution of digits in common... I'm at a loss to even understand what you're saying here, really -- this is why you need to actually state, formally, what you think the "LLMs are just stats" hypothesis amounts to.

It seems you think it amounts to saying LLMs sample from a combinatorial space, naively construed -- but that isn't the claim?

The claim is, rather, that they sample from a statistical distribution of tokens.

Take each position in the input vector, 1...127. It needs to "learn":

P(x_0 | y, x_1...x_127), P(x_1 | y, x_2...x_127), P(x_2 | y, x_3...x_127), etc.

Which is a family of 127 conditional distributions that seem trivial to learn.

I really don't know why you think the size of a combinatorial space is relevant here?

All the sorted lists share basically the same tiny family of conditional distributions { P(x_i | x_(i+1)...x_127) }.
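A rough way to see what such a shared family looks like is to tabulate it empirically. The sketch below is an illustration under simplifying assumptions, not the setup from the thread: it uses short lists of single digits, ignores the input y, and conditions each position only on the next one rather than on all later positions. The function name and parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict

def conditional_counts(num_lists=2000, length=8, num_values=10, seed=0):
    """Tabulate how often value `left` sits directly before value `right`
    across many randomly generated, then sorted, lists."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)  # counts[right][left] -> occurrences
    for _ in range(num_lists):
        xs = sorted(rng.randrange(num_values) for _ in range(length))
        for left, right in zip(xs, xs[1:]):
            counts[right][left] += 1
    return counts

counts = conditional_counts()

# Print an empirical estimate of P(x_i = left | x_(i+1) = right).
for right in sorted(counts):
    total = sum(counts[right].values())
    probs = {left: round(c / total, 2) for left, c in sorted(counts[right].items())}
    print(right, probs)
```

Every sampled list feeds the same small table, and the table only ever has mass on left <= right, which is the sense in which the lists share one small family of distributions even though the number of distinct lists is huge.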

