i don’t buy that premise. in practice we’re seeing a lot of evidence that you can’t trust the open evals because of contamination (maybe accidental, though there’s definitely incentive to cheat and move up the leaderboards).
closed/subjective ranking and evaluation has been around since there were critics. yes it’s hard to bootstrap trust, but i can’t see a way around it because the open evals can’t really be trusted either.