If the models run on OpenAIs servers then surely they could still see the questions being put into it if they wanted to cheat? That could only be prevented by making the evaluation a one-time deal that can't be repeated, or by having OpenAI distribute their models for evaluators to run themselves, which I doubt they're inclined to do.