sosodev 6mo ago

How can you be sure that your benchmark is meaningful and well designed?

Is publicity the only thing that prevents a benchmark from being meaningful?

prodigycorp 6mo ago

I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.

I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar as it encapsulates my coding preferences and communication style, it's the proper benchmark for me.

gregsadetsky 6mo ago

I asked a semi-related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?

I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that LLMs tend to (I don't want to say cheat, but) overfit on known problems, and then do poorly, generally speaking, on anything they haven't seen?

Thanks

[0] https://news.ycombinator.com/item?id=45968665

adastra22 6mo ago

> if it's not public, presumably LLMs would never get better at them.

Why? This is not obvious to me at all.

gregsadetsky 6mo ago

You're correct, of course. LLMs may get better at any task, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at the task, provided the eval is actually picked up and used in the training loop.
