So much rides on your implicit notion of a semantic relationship, but that dependence needs to be demonstrated. The fact that some pattern of signals on my perceptual apparatus is caused by an apple in the real world does not mean that I have knowledge or understanding of an apple
in virtue of this causal relation. That my sensory signals are caused by apples is an accident of this world, one we are completely blind to. If every apple in the world were swapped with a fapple (fake apple), such that all sensory experiences that have up to now been caused by apples were now caused by fapples, we would be none the wiser. The semantics (i.e. wide content) of our perceptual experiences is irrelevant to literally everything we know and to how we interact with the world. Our knowledge of the world is limited to our sensory experiences and the deductions, inferences, etc., derived from those experiences. Our situatedness in the world is relevant only insofar as it determines the space of our possible sensory experiences.
>the model would need the ability to frame hypotheses about what this sensor data means, and test them by interacting with the world and seeing what the results were.
Why do we need to actively test our models to come to an understanding of the world? Yes, that is how we biological organisms happen to learn about the world. But it is not clear that active testing is required. Language models learn by developing internal models that predict the next token. But succeeding at this prediction requires implicitly representing the processes that generated the tokens. There is no in-principle limit to the resolution of this model given a sufficiently large and diverse training set.
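The point that prediction alone forces internal modeling can be illustrated with a deliberately crude sketch (a bigram counter over a made-up corpus, purely hypothetical): even the simplest next-token predictor, trained only to guess what comes next, ends up encoding statistical regularities of whatever process generated its training text.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count next-character frequencies: the simplest next-token predictor."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, ch):
    """Return the most frequently observed character following `ch`."""
    if ch not in counts:
        return None
    return counts[ch].most_common(1)[0][0]

# Hypothetical toy corpus for illustration.
corpus = "the cat sat on the mat. the cat ate."
model = train_bigram(corpus)

# Trained only to predict, the model has nonetheless absorbed a
# regularity of the generating process: here, 'h' is followed by 'e'.
print(predict_next(model, "h"))  # → e
```

A large language model does the same thing at vastly greater scale and resolution; the representations it builds are richer, but the learning signal is the same purely predictive one, with no active intervention in the world.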