rgbrgb 2y ago

i don’t buy that premise. in practice we’re seeing a lot of evidence that you can’t trust the open evals because of contamination (maybe accidental, though there’s definitely incentive to cheat and move up the leaderboards).

closed/subjective ranking and evaluation has been around since there were critics. yes it’s hard to bootstrap trust, but i can’t see a way around it because the open evals can’t really be trusted either.

godelski 2y ago

I find this argument weird. I'm not saying you can trust the open evals, I'm just saying you can know their limits. With closed evals you're a lot more blind.
