> But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after.
I think a general model that can
- finish NetHack, Doom, Zelda, and Civilization,
- solve the hardest Codeforces/AtCoder problems,
- formally prove Putnam solutions with high probability, without being given the answers,
- write a PR to close a random issue on GitHub
is likely to have some broader intelligence. I may be mistaken, since there have been tasks that appeared unsolvable without human-level intelligence but turned out not to be.
I agree that such benchmarks are limited to environments with well-defined feedback and rules (games) or easily verifiable outputs (code/math), but I wouldn't call that super narrow, and there are no non-LLM models that perform significantly better on these (except in some games), though specialized LLMs do work better. Finding other examples is, I think, one of the important problems in AI metrology.
> So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.
You've invented an arena (which just raised quite a lot of money). One can argue about "representative," of course. However, I think the SNR in the arena isn't very high right now: it turns out that the average arena user is quite biased, most of their queries are trivial for LLMs, and for non-trivial ones they cannot necessarily tell which answer is better. MathArena goes in the opposite direction: a narrow domain, but expert evaluation. You could imagine a bunch of small arenas, each with its own domain experts. I think that may happen eventually if the money flowing into AI continues.
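For reference, the way arena-style leaderboards typically turn blinded pairwise votes into a ranking is a Bradley-Terry fit (an Elo-like model). Here's a minimal sketch of that idea; the battle log and model names are made up, and this isn't any real arena's code:

```python
# Sketch: fit Bradley-Terry log-strengths to hypothetical pairwise votes
# and print them on an Elo-like scale.
from collections import defaultdict
import math

# (model_a, model_b, winner) tuples from blinded head-to-head comparisons
battles = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_c", "model_c"),
    ("model_b", "model_c", "model_c"),
    ("model_a", "model_b", "model_a"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
scores = {m: 0.0 for m in models}  # Bradley-Terry log-strengths

# Gradient ascent on the Bradley-Terry log-likelihood.
lr = 0.1
for _ in range(2000):
    grad = defaultdict(float)
    for a, b, winner in battles:
        p_a = 1.0 / (1.0 + math.exp(scores[b] - scores[a]))  # P(a beats b)
        y = 1.0 if winner == a else 0.0
        grad[a] += y - p_a
        grad[b] -= y - p_a
    for m in models:
        scores[m] += lr * grad[m]
    # Fix the gauge: keep the scores centred at zero.
    mean = sum(scores.values()) / len(models)
    for m in models:
        scores[m] -= mean

for m, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {400 * s / math.log(10) + 1000:.0f}")  # Elo-like scale
```

The point of the domain-expert arenas above is that the aggregation step stays the same; what changes is who casts the votes and on which prompts.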