It has "Group 3" correct. It should be marked as having 1/4 groups correct.
Same thing happened on #535, Gemini actually got "Group 1" correct but was marked 0/4 correct.
I believe Claude or even Gemini can succeed if system prompt is improved e.g. tell it to re-evaluate it's answer before finalising, can even tell it to do "thinking" within <thinking> tags. I use claude like that and it often goes over it's answer and corrects itself within same reply. On the other hand it can also incorrectly assume it made a mistake and can sometimes uncorrect itself.
Edit: Using o1's step by step problem solving example from OpenAI blog post made Claude go step by step in similar depth too. Could even do that here to get better success rate in non-o1 models.
https://chatgpt.com/share/67570ab1-b2c0-8006-b5d2-d3fa7132de...
Going to try to feed this into some other LLMs like qwq and see if they can solve them.
This seems highly subjective. We should not care about this. The game is to connect the words, not find the connection. For human players, it doesn't matter if you get the connection or not.
Cool benchmark nonetheless!