I wasn't trying to invent anything. I was just describing what you would obviously have to do to take a "scientific" or "objective" approach: sound, reproducible experiments, free of financial incentives.
As far as I can tell, no one is doing that at a significant scale. Everything is buried in hype and marketing.
Now for that broad set of benchmarks (PRs to GitHub, Putnam, Zelda): there is something to that, but it depends on the model. A lot of what is out there is a "mixture of experts", either by implicit or explicit design. So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess? Boot up the chess program. Is it poetry? Boot up the poetry generator.
That sort of thing doesn't show broad intelligence, any more than a person who knows both a chess player and a poet has broad intelligence.
DeepSeek is, as far as I can tell, the leading open-source model, and in some ways that makes it the leading model. I don't think you can fairly compare a model that you can run locally with something running behind a server-side API, because who knows what is really going on behind the API.
DeepSeek being Chinese makes it political and even harder to have a sane conversation about. But I am sure that if it were China doing mostly closed models and the US doing open ones, we would hold that against them, big time.