> I assumed all of the models were doing that, using at least Web Search tools.
Sometimes. The other week I was asking ChatGPT about the UK PM, and had to stop the generation early because it started ~"Prime Minister Rishi Sunak…"
The unreliability is also why techniques as simple as "ask it 5 times and take a majority vote over its own answers" boost performance. Or "thinking" modes, which amount to roughly replacing the end-of-generation token with "Wait." and continuing for ten more rounds.
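The ask-5-times trick (often called self-consistency) is simple enough to sketch. A minimal version, where `ask` stands in for whatever model call you actually have (hypothetical here, shown with a fake stub):

```python
from collections import Counter

def self_consistency(ask, question, n=5):
    """Sample the model n times and return the most common answer.
    `ask` is any callable question -> answer (a hypothetical
    stand-in for a real model/API call)."""
    answers = [ask(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Fake "model" that answers correctly only 3 times out of 5.
responses = iter(["Keir Starmer", "Rishi Sunak", "Keir Starmer",
                  "Rishi Sunak", "Keir Starmer"])
fake_model = lambda q: next(responses)

print(self_consistency(fake_model, "Who is the UK PM?"))  # → Keir Starmer
```

Even though each individual sample is wrong 40% of the time here, the majority vote lands on the right answer; that's the whole effect in miniature.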