undefined | Better HN

0 pointsLerc1y ago0 comments

I disagree. I think the title, abstract, and conclusion not only misrepresents the state of the models but it misrepresents Thier own findings.

They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.

I don't think that models have been portrayed as equivalent to humans. Like most AI in it has been shown as vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portayals by the AI Field in general.

0 comments

youssefabdelm1y ago

Exactly... I've found GPT-4o to be good at OCR for instance... doesn't seem "blind" to me.

prmoustache1y ago

Well maybe not blind but the analogy with myopia might stand.

For exemple in the case of OCR, a person with myopia will usually be able to make up letters and words even without his glasses based on his expectation (similar to vlm training) of seeing letters and words in, say, a sign. He might not see them all clearly and do some errors but might recognize some letters easily and make up the rest based on context, words recognition, etc. Basically experience.

I also have a funny anecdote about my partner, which has sever myopia, who once found herself outside her house without her glasses on, and saw something on the grass right in front. She told her then brother in law "look, a squirrel" Only for the "squirrel" to take off while shouting its typical caws. It was a crow. This is typical of VLM's hallucinations.

Foobar85681y ago

I know that GPT-4o is fairly poor to recognize music sheets and notes. Totally off the marks, more often than not, even the first note is not recognize on a first week solfège book.

So unless I missed something but as far as I am concerned, they are optimized for benchmarks.

So while I enjoy gen AI, image-to-text is highly subpart.

stavros1y ago

Most adults with 20/20 vision will also fail to recognize the first note on a first week solfege book.

youssefabdelm1y ago

Useful to know, thank you!

spookie1y ago

You don't really need a LLM for OCR. Hell, I suppose they just run a python script in its VM and rephrase the output.

At least that's what I would do. Perhaps the script would be a "specialist model" in a sense.

Spivak1y ago

It's not that you need an LLM for OCR but the fact that an LLM can do OCR (and handwriting recognition which is much harder) despite not being made specifically for that purpose is indicative of something. The jump from knowing "this is a picture of a paper with writing on it" like what you get with CLIP to being able to reproduce what's on the paper is, to me, close enough to seeing that the difference isn't meaningful anymore.

1 more reply

subroutine1y ago

I think the conclusion of the paper is far more mundane. It's curious that VLM can recognize complex novel objects in a trained category, but cannot perform basic visual tasks that human toddlers can perform (e.g. recognizing when two lines intersect or when two circles overlap). Nevertheless I'm sure these models can explain in great detail what intersecting lines are, and even what they look like. So while LLMs might have image processing capabilities, they clearly do not see the way humans see. That, I think, would be a more apt title for their abstract.

j / k navigate · click thread line to collapse

0 comments

youssefabdelm1y ago

Exactly... I've found GPT-4o to be good at OCR for instance... doesn't seem "blind" to me.

prmoustache1y ago

Well maybe not blind but the analogy with myopia might stand.

Foobar85681y ago

I know that GPT-4o is fairly poor to recognize music sheets and notes. Totally off the marks, more often than not, even the first note is not recognize on a first week solfège book.

So unless I missed something but as far as I am concerned, they are optimized for benchmarks.

So while I enjoy gen AI, image-to-text is highly subpart.

stavros1y ago

Most adults with 20/20 vision will also fail to recognize the first note on a first week solfege book.

youssefabdelm1y ago

Useful to know, thank you!

spookie1y ago

You don't really need a LLM for OCR. Hell, I suppose they just run a python script in its VM and rephrase the output.

At least that's what I would do. Perhaps the script would be a "specialist model" in a sense.

Spivak1y ago

1 more reply

subroutine1y ago

j / k navigate · click thread line to collapse