I believe the best way to understand these large language models is as models of patterns of text. To the extent that patterns of text are congruent with patterns in the world, they appear to function well. But in the end, I think they are statistical models of text, not of the world, and that substantially limits their capabilities.
I do think multi-modal models will be interesting, but text is a very special sort of thing. It is widely available, semantically rich, and informationally pretty dense. I'm not sure any other modality has such a nice set of properties. Consider that we have already almost reached training-data exhaustion with text, and it is, by far, the most voluminous and dense training modality there is.