Yeah, this part is what makes the high performance even more surprising to me. The fact that LLMs do so well on visual tasks (also seen in their ability to draw an image purely through textual output:
https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/) implies not only that they actually have some "world model", but that they have it despite the disadvantage of having to fit a round peg in a square hole. It's like trying to map out the entire world using only the orderly left brain, without the more holistic, spatial right brain.
I wonder if anyone has experimented with some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.