I think the interesting thing here is the genuinely surprising result that LLMs can abstract the ideas in the second-to-last paragraph from amalgamated written human descriptions of experience.
It's the thing most people, even in this thread, don't seem to realize has emerged from research in the past year.
Give a Markov chain a lot of text about fishing and it will tell you about fish. Give GPT a lot of text about fishing and it turns out that it will probably learn how to fish.
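To make the contrast concrete, here's a minimal sketch of what the Markov chain side of that comparison actually does (the toy corpus and function names are my own illustration): it only records which word follows which, so it can parrot fishing vocabulary but carries no representation of fishing itself.

```python
import random
from collections import defaultdict

def train_bigram_chain(text):
    """Build a bigram table: each word maps to the words seen following it."""
    words = text.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, start, n=8):
    """Walk the chain by repeatedly sampling a recorded successor word."""
    out = [start]
    for _ in range(n):
        successors = table.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "cast the line and the fish took the bait and the line went tight"
table = train_bigram_chain(corpus)
print(generate(table, "the"))
```

Everything it emits is surface statistics of the training text; the claim about GPT is that something qualitatively beyond this is being learned.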
World model representations are occurring in GPT. People really need to start realizing there's already published research demonstrating that, as it goes a long way toward explaining why the multimodal parts work.