Yes, that's what I'm saying, to reiterate: the likeliest token does not always lead to the best-performing result. Otherwise temperature wouldn't even be an option. I would imagine things like plain word frequency in the language push a token's probability around a lot while having nothing to do with the task at hand beyond producing a correctly formatted answer, but that's probably not the whole story.
OpenAI (and others who know what they're doing) always run their benchmarks multi-sampled, e.g. 5 or 20 runs at a suitable temperature. A wrapper that collects those samples and then makes another pass to judge self-consistency for a final answer can reliably get a question right that greedy decoding (temperature zero) gets wrong every single time.
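The simplest version of that wrapper is just sample-and-vote; the judging pass I mentioned can be a second model call over the sampled answers, but a plain majority vote already captures the idea. Rough sketch below: `sample_completion` is a stand-in for whatever API you're actually calling, and the answer extraction assumes each completion ends with an "Answer: ..." line.

```python
# Sketch of self-consistency: sample several completions at nonzero
# temperature, extract each one's final answer, and take the majority vote.
# `sample_completion(prompt, temperature)` is a placeholder for your real
# model call, not any particular library's API.

import re
from collections import Counter
from typing import Callable, Optional

def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of a completion (naive 'Answer: ...' regex)."""
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

def self_consistent_answer(
    prompt: str,
    sample_completion: Callable[..., str],
    n: int = 20,
    temperature: float = 0.7,
) -> Optional[str]:
    """Sample n completions at nonzero temperature and majority-vote the answers."""
    answers = []
    for _ in range(n):
        completion = sample_completion(prompt, temperature=temperature)
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # The answer that shows up most often across samples wins.
    return Counter(answers).most_common(1)[0][0]
```

With temperature at zero every sample would be identical, so the vote buys you nothing; the whole point is that sampling lets the model land on different reasoning paths, and the answer they converge on is usually the right one.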