I really only think the game is worth playing when it's against a fixed version of a specific model. The amount of variance we observe between different releases of the same model is enough to require us to update our prompts and re-test. I don't envy anyone who has to try and find some median text that performs okay on every model.