But that doesn't mean we can't, in theory, give the LLM a battery of tests that it should perform well on (though not perfectly) if it has a world model, and poorly on (though not fail entirely) if it doesn't.
It's inherently a probabilistic system, so testing it in a probabilistic manner seems perfectly apt. Again: no, this will not produce a definitive result, due to that probabilistic nature—but it can produce an indicative one, and running the same test on multiple related LLMs, or similar tests on the same LLM, should help to smooth out noise in the results.
(...of course, this only works if the tests are designed well, and I don't have enough specific understanding of LLMs to know how one would go about doing that in a rigorous manner!)
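Even without knowing how to design the probes rigorously, the aggregation machinery itself is simple to sketch. Here's a toy illustration of the "repeat the test to smooth out noise" idea — everything in it (the probe names, the stand-in "models," the binary pass/fail scoring) is invented for illustration, with random functions standing in for actual LLM calls:

```python
import random
import statistics

def score_battery(llm, probes, repeats=20, seed=0):
    """Run each probe `repeats` times against a (hypothetical) model and
    return the per-probe pass rates plus their overall mean. Repetition
    smooths out single-run noise from a probabilistic system."""
    random.seed(seed)
    per_probe = []
    for prompt in probes:
        # `llm` is a stand-in: any callable mapping a prompt to pass/fail.
        passes = sum(llm(prompt) for _ in range(repeats))
        per_probe.append(passes / repeats)
    return per_probe, statistics.mean(per_probe)

# Toy stand-ins for models: one passes probes ~80% of the time
# (imitating "well, though not perfectly"), one ~30% of the time
# (imitating "poorly, though not failing totally").
strong = lambda prompt: random.random() < 0.8
weak = lambda prompt: random.random() < 0.3

probes = [f"probe-{i}" for i in range(10)]
_, strong_score = score_battery(strong, probes)
_, weak_score = score_battery(weak, probes)
```

Neither score settles anything on its own — the point is only that, aggregated over enough repeats and enough probes, the gap between the two regimes becomes visible through the noise, and the same comparison could be run across multiple related models.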