I’m collecting data to benchmark different models as both players and judges (OpenAI / Anthropic / Gemini / Mistral / DeepSeek), but I only have ~45 games so far and need many more before publishing comparisons. (With 5 AI players and 4 judges assigned at random, that's 5 × 4 = 20 distinct game setups to evaluate.)
It's fully free (I pay for all the tokens), and no signup is required for the first game: https://turingduel.com
Questions and criticism welcome! I'll share aggregated results once there's enough signal.