I'm not. I'm referencing the paper, not some hypothetical abstract word problem. Imagine walking into a room where the chess pieces are slowly morphing from staid Staunton shapes into amorphous blobs of lava-lamp Cthulhu nightmares. If a locomotive steam train from Denver passes within 15 meters of the room, how many passengers paid for their tickets with a cashier's check?
Nobody's arguing that humans never take logical shortcuts, or that those shortcuts can't cause us to make errors.
Some of the rebuttals in this thread are ridiculous. Like, what if I forced you to stare at the surface of the sun, waterboarded you for several hours, and then asked you to evaluate 1000 different chess boards? Are you sure you wouldn't make a mistake?
In the paper, the various VLLMs are asked to double-check their answers, which still doesn't make a difference. The argument is more that VLLMs (and multimodal LLMs) aren't really thinking the way humans do.
And if you REALLY need an example, albeit a bit of a tangential one, try this out. Ask any SOTA image model (multimodal or otherwise) such as gpt-image-1, Kontext, Imagen4, etc. for a five-leaf clover. It'll get it right about 50% of the time.
Now go and ask any kindergartener for the same thing.
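If you want to check the 50% figure yourself, here's a minimal sketch using the OpenAI Python SDK against gpt-image-1 (the prompt wording, file names, and run count are my own assumptions; Kontext or Imagen4 would go through their own APIs):

```python
# Sketch: generate the same prompt several times, save the images,
# and count the leaves by hand. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

for i in range(10):  # 10 runs is arbitrary; more gives a better estimate
    result = client.images.generate(
        model="gpt-image-1",
        prompt="A single five-leaf clover with exactly five leaves, "
               "on a plain white background",
        size="1024x1024",
        n=1,
    )
    # gpt-image-1 returns base64-encoded image data
    with open(f"clover_{i}.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))
    # Now count the leaves yourself -- which is the point: a kindergartener can.
```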