You’re just twisting what “best” means to suit your bias.
That is not a measure of how sophisticated and capable a model is.
GPT4 is a more sophisticated, more capable mode than mistral.
If that doesn’t make it the “better” for you, that’s fine; but any attempt to argue about the capabilities of the models is misguided.
Restrictions placed on a model are an orthogonal concern to its capabilities.
…but sure, you can invent some benchmarks to score models on other criteria, which is entirely valid.
It’s perfectly fair to say that GPT4 doesn’t top all possible metrics… only the meaningful ones about model capabilities.