I'm not. I'm referencing the paper, not some hypothetical abstract word problem. Imagine walking into a room where the chess pieces are slowly morphing from staid Staunton shapes into amorphous blobs of lava-lamp Cthulhu nightmares. If a locomotive steam train from Denver passes within 15 meters of the room, how many passengers paid for their tickets with a cashier's check?
Nobody's arguing that humans never take logical shortcuts, or that those shortcuts can't cause us to make errors.
Some of the rebuttals in this thread are ridiculous. Like, what if I forced you to stare at the surface of the sun, waterboarded you for several hours, and then asked you to evaluate 1000 different chess boards? Are you sure you wouldn't make a mistake?
In the paper, the various VLLMs are asked to double-check their answers, which still doesn't make a difference. The argument is more that VLLMs (and multimodal LLMs) aren't really thinking the way humans do.
And if you REALLY need an example, albeit a bit of a tangential one, try this out. Ask any SOTA image model (multimodal or otherwise) such as gpt-image-1, Kontext, Imagen4, etc. for a five-leaf clover. It'll get it right about 50% of the time.
Now go and ask any kindergartener for the same thing.
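If you want to check the 50% figure yourself, here's a minimal sketch using the OpenAI Python SDK against gpt-image-1 (the prompt wording, file names, and run count are my own assumptions; Kontext or Imagen4 would go through their own APIs):

```python
# Sketch: generate the same prompt several times, save the images,
# and count the leaves by hand. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

for i in range(10):  # 10 runs is arbitrary; more gives a better estimate
    result = client.images.generate(
        model="gpt-image-1",
        prompt="A single five-leaf clover with exactly five leaves, "
               "on a plain white background",
        size="1024x1024",
        n=1,
    )
    # gpt-image-1 returns base64-encoded image data
    with open(f"clover_{i}.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))
    # Now count the leaves yourself -- which is the point: a kindergartener can.
```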