AI Does NYT Connections

(mikehearn.notion.site)

32 points · mikehearn · 1y ago · 20 comments

tianshuo 1y ago

Hmm, isn't there like five lives before "ending"? So instead of doing a "perfect run" there should be chances, and feedback, such as "one missing", like a real human player?

tantalor 1y ago

Why is Gemini given 4 Xs for #547?

It has "Group 3" correct. It should be marked as having 1/4 groups correct.

Same thing happened on #535, Gemini actually got "Group 1" correct but was marked 0/4 correct.
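The per-group marking described above can be sketched as a set comparison. This is only an illustration, not the benchmark author's scoring code: the function name `count_correct_groups` and the puzzle words are made up, and it assumes a group counts as correct when its four words match a solution group regardless of order or letter case.

```python
from typing import Iterable


def _normalise(group: Iterable[str]) -> frozenset:
    # Ignore word order, letter case and stray whitespace within a group.
    return frozenset(word.strip().upper() for word in group)


def count_correct_groups(guess, solution) -> int:
    """Count guessed groups that exactly match some solution group."""
    solution_sets = {_normalise(g) for g in solution}
    guess_sets = {_normalise(g) for g in guess}
    return len(guess_sets & solution_sets)


# Hypothetical puzzle, invented for this example (not a real NYT board).
solution = [
    ["BASS", "TROUT", "SALMON", "PIKE"],
    ["RED", "BLUE", "GREEN", "PINK"],
    ["MARS", "VENUS", "SATURN", "EARTH"],
    ["APPLE", "PEAR", "PLUM", "FIG"],
]
guess = [
    ["trout", "bass", "pike", "salmon"],  # matches the first solution group
    ["RED", "BLUE", "GREEN", "PLUM"],     # one word off
    ["MARS", "VENUS", "SATURN", "PINK"],  # one word off
    ["APPLE", "PEAR", "EARTH", "FIG"],    # one word off
]

score = count_correct_groups(guess, solution)
print(f"{score}/4 groups correct")
```

Under this rule a run with one matching group is marked 1/4 rather than 0/4, even though the other three groups are each one word off.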

tantalor 1y ago

Looks like this was fixed, thanks!

ravedave5 1y ago

Argh put a spoiler cover over today's at least!

troelsSteegin 1y ago

This is nicely presented. I would like to see the prompts to the respective services, however. Did I miss them? The "side peek" would be a natural place for them.

smusamashah 1y ago

These kinds of tests (or maybe all tests) should show *success rate* instead of a single pass/fail.

I believe Claude or even Gemini can succeed if the system prompt is improved, e.g. by telling it to re-evaluate its answer before finalising, or even to do its "thinking" within <thinking> tags. I use Claude like that and it often goes over its answer and corrects itself within the same reply. On the other hand, it can also incorrectly assume it made a mistake and sometimes uncorrect itself.

Edit: Using o1's step by step problem solving example from the OpenAI blog post made Claude go step by step in similar depth too. Could even do that here to get a better success rate in non-o1 models.
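A success rate is simply the fraction of repeated attempts that succeed. The sketch below shows the idea with a hypothetical `attempt` callable standing in for "ask the model to solve one puzzle and check the result"; no real model API is called here, and the outcomes are scripted for illustration.

```python
from typing import Callable


def success_rate(attempt: Callable[[], bool], trials: int = 20) -> float:
    """Run `attempt` `trials` times and return the fraction that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    wins = sum(1 for _ in range(trials) if attempt())
    return wins / trials


# Scripted outcomes standing in for four independent model runs (made up).
outcomes = iter([True, False, True, True])
rate = success_rate(lambda: next(outcomes), trials=4)
print(f"success rate: {rate:.0%}")
```

With sampled models the same puzzle can pass on one run and fail on the next, so a rate over several trials says more than a single pass/fail.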

alexarena 1y ago

This is very cool. It seems like the prompt is asking the LLM to one-shot an answer. Have you tried asking it to make a group, confirm whether it's correct, and repeat with the remaining words? (like a human would)
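The guess, confirm, repeat loop suggested here could look roughly like the sketch below. Everything in it is an assumption made for illustration: `play_with_feedback` and `choose_group` are hypothetical names, `choose_group` stands in for a call to a model, the number of lives is a parameter rather than a claim about the real rules, and the "one missing" hint is modelled as a guess that shares three words with a remaining group.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[frozenset, str]]


def play_with_feedback(
    solution: Sequence[Sequence[str]],
    choose_group: Callable[[List[str], History], Sequence[str]],
    lives: int = 4,  # assumption: set this to whatever rule you want to test
):
    """Ask for one group at a time, giving feedback after every guess."""
    remaining = [frozenset(g) for g in solution]
    history: History = []
    solved = 0
    while remaining and lives > 0:
        # Only the words of still-unsolved groups are offered to the guesser.
        words = sorted(set().union(*remaining))
        guess = frozenset(choose_group(words, history))
        if guess in remaining:
            remaining.remove(guess)
            solved += 1
            feedback = "correct"
        else:
            lives -= 1
            best_overlap = max(len(guess & g) for g in remaining)
            feedback = "one missing" if best_overlap == 3 else "wrong"
        history.append((guess, feedback))
    return solved, lives, history


# Scripted guesser standing in for a model: one near miss, then two hits.
toy_solution = [["A1", "A2", "A3", "A4"], ["B1", "B2", "B3", "B4"]]
script = iter([
    ["A1", "A2", "A3", "B1"],
    ["A1", "A2", "A3", "A4"],
    ["B1", "B2", "B3", "B4"],
])
solved, lives_left, history = play_with_feedback(
    toy_solution, lambda words, past: next(script)
)
print(solved, lives_left, [feedback for _, feedback in history])
```

Passing `history` back to the guesser is what lets a model use the feedback, the way a human player adjusts after a near miss.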

deskamess 1y ago

Connections is a great game to test AI. It really relies on the ambiguity and loosely connected aspects of culture and language. I am shocked at how well o1-pro does.

KaoruAoiShiho 1y ago

Beyond being able to solve Connections, can an LLM generate (good/challenging/solvable) connections? Would be pretty cool to be able to generate a test set.

KaoruAoiShiho 1y ago

Took a bit of prompting and a few awful ones, but ultimately it's not bad.

https://chatgpt.com/share/67570ab1-b2c0-8006-b5d2-d3fa7132de...

Going to try to feed this into some other LLMs like qwq and see if they can solve them.

KaoruAoiShiho 1y ago

QwQ gets it wrong, Gemini gets it wrong. o1 gets it right, R1 gets a pretty good, not originally intended set of 4... tempted to give it partial credit. 4o gets it wrong. Will update with Claude once my usage limits are up lol.

jacobsimon 1y ago

I’d be surprised if they’re not using AI or some sort of rule-based generator at this point.

wiether 1y ago

But creating one must be even more fun than trying to solve it, right?

tantalor 1y ago

> Correct group with the wrong connection

This seems highly subjective. We should not care about this. The game is to connect the words, not find the connection. For human players, it doesn't matter if you get the connection or not.

Workaccount2 1y ago

This is completely unsurprising as the latent space that LLMs rely on basically is a giant web of Connections.

world2vec 1y ago

Quite shocked that O1-Pro isn't orders of magnitude better than O1 despite being 10x the price.

Cool benchmark nonetheless!

empath75 1y ago

I'd love to see them try "Only Connect" puzzles which are _much_ harder.

ditto664 1y ago

Spoiler alert

wiether 1y ago

I do them in the morning and didn't even recognize the one from today at first!

zeroonetwothree 1y ago

LLMs don’t have “intelligence”.
