But LLMs achieve both your condition and mine. The attention layers make the causal connections you speak of, while the multi-layer perceptrons store and extract facts in response to that attention-weighted mix.
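For concreteness, here is a minimal sketch of a single decoder block in plain PyTorch, with made-up sizes (d_model=64, n_heads=4), not any specific model's implementation. The attention sublayer mixes information across token positions; the MLP sublayer transforms each position independently, which is where much of the stored knowledge is thought to live.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Attention mixes information *across* token positions:
        # the "causal connections" between tokens in context.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        # The MLP acts on each position independently, transforming
        # whatever the attention step has gathered there.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each token attends only to earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.ln1(x + a)            # residual + norm around attention
        x = self.ln2(x + self.mlp(x))  # residual + norm around the MLP
        return x

# Usage: a batch of 2 sequences, 8 tokens, 64-dim embeddings.
x = torch.randn(2, 8, 64)
print(TransformerBlock()(x).shape)  # torch.Size([2, 8, 64])
```

A GPT-style model is just dozens of these blocks stacked, so the "connect" and "store/extract" steps alternate many times per token.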
It is not commonly described as such, but I think “common sense engine” is a far better description of what a GPT-based LLM is doing than mere next-word prediction.