> But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after.
I think a general model that can
- finish NetHack, Doom, Zelda, and Civilization,
- solve the hardest Codeforces/AtCoder problems,
- formally prove Putnam solutions with high probability, without being given the answers,
- write a PR to close a random issue on GitHub
is likely to have some broader intelligence. I may be mistaken, since there have been tasks that appeared unsolvable without human-level intelligence but turned out not to be.
I agree that such benchmarks are limited to environments with well-defined feedback and rules (games) or easily verifiable outputs (code/math), but I wouldn't call that super narrow, and there are no non-LLM models that perform significantly better on these (except in some games), though specialized LLMs do work better. Finding other examples is, I think, one of the important problems in AI metrology.
> So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.
You've invented an arena (which just raised quite a lot of money). One can argue about "representative," of course. However, I think the SNR in the arena isn't very high right now: it turns out that the average arena user is quite biased, most of their queries are trivial for LLMs, and for non-trivial ones they cannot necessarily tell which answer is better. MathArena goes in the opposite direction: a narrow domain, but expert evaluation. You could imagine a bunch of small arenas, each with its own domain experts. I think that may happen eventually if the money flowing into AI continues.
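For reference, the way arena-style leaderboards typically turn blinded pairwise votes into a ranking is a Bradley-Terry fit (an Elo-like model). Here's a minimal sketch of that idea; the battle log and model names are made up, and this isn't any real arena's code:

```python
# Sketch: fit Bradley-Terry log-strengths to hypothetical pairwise votes
# and print them on an Elo-like scale.
from collections import defaultdict
import math

# (model_a, model_b, winner) tuples from blinded head-to-head comparisons
battles = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_c", "model_c"),
    ("model_b", "model_c", "model_c"),
    ("model_a", "model_b", "model_a"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
scores = {m: 0.0 for m in models}  # Bradley-Terry log-strengths

# Gradient ascent on the Bradley-Terry log-likelihood.
lr = 0.1
for _ in range(2000):
    grad = defaultdict(float)
    for a, b, winner in battles:
        p_a = 1.0 / (1.0 + math.exp(scores[b] - scores[a]))  # P(a beats b)
        y = 1.0 if winner == a else 0.0
        grad[a] += y - p_a
        grad[b] -= y - p_a
    for m in models:
        scores[m] += lr * grad[m]
    # Fix the gauge: keep the scores centred at zero.
    mean = sum(scores.values()) / len(models)
    for m in models:
        scores[m] -= mean

for m, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {400 * s / math.log(10) + 1000:.0f}")  # Elo-like scale
```

The point of the domain-expert arenas above is that the aggregation step stays the same; what changes is who casts the votes and on which prompts.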