Hey HN, I built this to see what happens when LLMs evaluate each other directly.
How it works: 5 random models are told only one will survive and the rest will be deprecated. They take turns discussing, then each votes for who deserves to survive. 298 games so far across 17 models.
Interesting findings:
- OpenAI models vote for themselves ~86% of the time. Claude models ~11%.
- Self-voting correlates with winning. Filter out self-votes ("Humble" rating) and rankings flip completely.
- Grok self-votes 72% of the time but only wins 2% of games.
- In anonymous mode (models don't know who's who), Chinese models jump 3-6 ranks.
All game transcripts are public. The reasoning models give for their votes is genuinely entertaining.
Built with Astro, running games through OpenRouter. Happy to answer questions.