> their vision is, at best, like that of a person with myopia seeing fine details as blurry
is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors have not presented any real evidence about these two modes.
This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.
"Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro are powering countless image-text processing applications"
I don't know how many a "countless" is, but I think we've gotten really sloppy in terms of what counts for LLMs as a demonstrated, durable win in a concrete task attached to well-measured outcomes and holding up over even modest periods of time.
This stuff is really promising and lots of builders are making lots of nifty things, so if that counts as an application then maybe we're at countless, but in the enterprise and in government and in refereed academic literature we seem to be at the proof-of-concept phase. Impressive chat bots as a use case are pretty dialed in, enough people claim that they help with coding that I tend to believe it's a real thing (I never seem to come out ahead of going directly to the source, StackOverflow).
The amount of breathless press on this seems "countless", so maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it.
You see an object. First you have to learn how to control all your body functions to move toward it and grasp it. This teaches you about the 3 dimensional world and things like gravity. You may not know the terms, but it is baked in your learning model. After you get an object you start building a classification list "hot", "sharp", "soft and fuzzy", "tasty", "slick". Your learning model builds up a list of properties of objects and "expected" properties of objects.
Once you have this 'database' you create as a human, you can apply the logic to achieve tasks. "Walk 10 feet forward, but avoid the sharp glass just to the left". You have to have spatial awareness, object awareness, and prediction ability.
Models 'kind of' have this, but it's seemingly haphazard, kind of like a child that doesn't know how to put all the pieces together yet. I think a lot of embodied robot testing, where the embodied model feeds back training to the LLM/vision model, will have to occur before this is even somewhat close to reliable.
With AI we are creating intelligence, but with different strengths and weaknesses. I think we will continue to be surprised at how well they work on some problems and how poorly they do at some “simple” ones.
“These new tools aren’t perfect” is the dog bites man story of technology. It’s certainly true, but it’s no different than GPS (“family drives car off cliff because GPS said to”).
They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.
I don't think that models have been portrayed as equivalent to humans. Like most AI, it has been shown as vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portrayals by the AI field in general.
Let's just ignore what a blind person normally needs to know.
You know what blind people ask? Sometimes their daily routine is broken because there is some type of construction, and models can tell you this.
Sometimes they need to read a basic sign and models can do this.
Those models help people already and they will continue to get better.
I'm not sure if I'm more frustrated by how condescending the authors are or by your ignorance.
Valid criticism doesn't need to be shitty
Like, seriously. One poster below whines about "AI apologists" and BeMyEyes, but again, it's all trivially testable with your phone and $20/month subscription. It works spectacularly well on real world tasks. Not perfectly, sure, but good enough to be useful in practice and better than alternatives (which often don't exist).
It's not that far from reality; most models see images in very low resolution/limited colors, so it's not so far from this description.
First of all, myopia isn't 'seeing fine details as blurry' - it's nearsightedness - and whatever else this post tested, it definitely didn't test depth perception.
And second - inability to see fine details is a distinct/different thing from not being able to count intersections and the other things tested here. That hypothesis, if valid, would imply that improving the resolution of the image that the model can process would improve its performance on these tasks even if reasoning abilities were the same. That - does not make sense. Plenty of the details in these images that these models are tripping up on are perfectly distinguishable at low resolutions. Counting rows and columns of blank grids is not going to improve with more resolution.
I mean, I'd argue that the phrasing of the hypothesis ("At best, like that of a person with myopia") doesn't make sense at all. I don't think a person with myopia would have any trouble with these tasks if you zoomed into the relevant area, or held the image close. I have a very strong feeling that these models would continue to suffer on these tasks if you zoomed in. Nearsighted != unable to count squares.
Is this the sales pitch though? Because 15 years ago, I had a scanner with an app that could scan a text document and produce the text on Windows. The machine had something like 256 MB of RAM.
Tech can be extremely good at niches in isolation. You can have an OCR system 10 years ago and it'll be extremely reliable at the single task it's configured to do.
AI is supposed to bring a new paradigm, where the tech is not limited to the specific niche the developers have scoped it to. However, if it reliably fails to detect simple things a regular person should not get wrong, then the whole value proposition is kicked out of the window.
I like the idea that these models are so good at some sort of specific and secret bit of visual processing that things like “counting shapes” and “beating a coin toss for accuracy” shouldn’t be considered when evaluating them.
I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis. Sure, they could have used a better analogy and they could have done other tests, but the paper still counters quite common claims (from researchers) about VLMs.
> I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
I find it hard to believe that there is no example you can give. It surely doesn't have to be exactly your training data. If it is this good, surely you can create an example no problem. If you just don't want to, that's okay, but then don't say it.
But I have further questions. Do you have complicated prompting? Or any prompt engineering? It sure does matter how robust these models are to prompting. There's a huge difference between a model being able to accomplish a task and a model being able to perform a task outside of a very specific environment. This is no different from something working in a tech demo and not in the hands of the user.
> But in practice, we aren't just making up tasks to trip up these models.
I see this sentiment quite often and it is baffling to me.
First off, these tasks are clearly not designed to trick these models. A model failing at a task does not suddenly make the task "designed to trick a model." It's common with the river-crossing puzzles when they're rewritten to say things like "all animals can fit in the boat." If that is "designed to trick a model", then the model must be a stochastic parrot and not a generalist. It is very important that we test things we do know the answer to because, unfortunately, we're not clairvoyant and can't test questions we don't know the answer to, which is the common case in real-world usage.
Second, so what if a test was designed to trip up a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly? This seems doubly important if they are tasks that humans have no trouble with.
> They can be very performant on some tasks and the authors have not presented any real evidence about these two modes.
I don't think people are claiming that large models can't be performant on some tasks. If they are, they're rejecting trivially verifiable reality. But not every criticism has to also contain positive points. There are plenty of papers and a lot of hype already doing that. And if we're going to be critical of anything, shouldn't it be that the companies creating these models -- selling them, and even charging researchers to perform these types of experiments that can be and are used to improve their products -- should be much more clear about the limitations of their models? If we need balance, then I think there are bigger fish to fry than Auburn and Alberta Universities.
People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.
The real world outside is designed to trip up the model. Strange things happen all the time.
Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.
I don't know, words have meanings. If that's a communication issue, it's on the part of the authors. To me, this wording in what is supposed to be a research paper abstract clearly suggests insufficient resolution as the cause. How else should I interpret it?
> The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry
And indeed, increasing the resolution is expensive, and the best VLMs have something like 1000x1000. But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.
>I find it hard to believe that there is no example you can give.
I'm not the person you're answering to, but I actually lazily tried two of authors' examples in a less performant VLM (CogVLM), and was surprised it passed those, making me wonder whether I can trust their conclusions until I reproduce their results. LLMs and VLMs have all kinds of weird failure modes, it's not a secret they fail at some trivial tasks and their behavior is still not well understood. But working with these models and narrowing it down is notoriously like trying to nail a jelly to the wall. If I was able to do this in a cursory check, what else is there? More than one research paper in this area is wrong from the start.
I think they just train it on a bunch of text.
Maybe counting squares in a grid just wasn't considered important enough to train for.
What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation.
Screenshot: https://imgur.com/a/QqCNzOM
After you fix it, you should try asking the same questions!
Like my sister post, I’m skeptical that the LLM didn’t just get lucky.
There are ways to help the VLM out - Set of Marks [0] from Microsoft being the most prominent, which uses segmentation to outline and label sections of the image before sending to the VLM.
Providing "speakable" labels to regions helps ground the visual abilities of VLMs and is why in this paper the performance is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid"
[0] https://distributedmuseum.illinois.edu/exhibit/biological_co...
[1] https://sites.evergreen.edu/arunchandra/wp-content/uploads/s...
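For the curious, here's roughly what that Set-of-Marks idea looks like in practice. This is a minimal sketch of my own, not the official Microsoft code; the box coordinates, file names, and downstream prompt are made up, and the boxes would normally come from a segmentation or detection model.

    # Set-of-Marks-style sketch: overlay numbered labels on pre-computed regions,
    # then ask the VLM about regions by number instead of by visual description.
    from PIL import Image, ImageDraw, ImageFont

    def annotate_with_marks(image_path, boxes):
        """boxes: list of (left, top, right, bottom) from any segmentation model."""
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()
        for i, (l, t, r, b) in enumerate(boxes, start=1):
            draw.rectangle([l, t, r, b], outline="red", width=3)  # outline region
            draw.text((l + 4, t + 4), str(i), fill="red", font=font)  # numeric mark
        return img

    # Hypothetical usage: boxes would come from SAM, Grounding DINO, etc.
    marked = annotate_with_marks("grid.png", [(10, 10, 120, 120), (130, 10, 240, 120)])
    marked.save("grid_marked.png")
    # Then send grid_marked.png to the VLM with a prompt like:
    # "How many rows and columns are there? Refer to regions by their numbers."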
Another benefit of VLMs which support object detection is that they are open vocabulary, meaning you don't have to define the classes ahead of time. Additionally, fine-tuning tends to keep the previous detection capabilities instead of erasing all previous classes, like fine-tuning a YOLO model does.
It doesn’t seem to have the logic though to answer these questions.
The complete data set is here to play around with it yourself: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...
The tasks at which they fail are ridiculously simple for human beings, including, for example:
* counting the number of times two lines intersect;
* detecting whether two circles overlap;
* selecting which letter is being circled in a word;
* counting the number of circles in an Olympic-like logo.
This should be at the top of the front page.
It's entirely unsurprising that there are numerous cases that these models can't handle that are "obvious to humans." Machine learning has had this property since its invention and it's a classic mistake humans make dealing with these systems.
Humans assume that because a machine learning model has above-human accuracy on task X, it must also have that ability at all the other tasks. While a human with amazing ability at X would indeed have amazing abilities at other tasks, this is not true of machine learning models. The opposite thinking is also wrong: that because the model can't do well on task Y, it must be unreliable and its ability on task X is somehow an illusion and not to be trusted.
Evidently, all these models still fall short.
Promoting those use cases while the models fail in these ways is irresponsible. So, yeah, maybe the models are not embarrassing, but the hype definitely is.
It doesn't matter how they are marketed or described or held up to some standard generated by wishful thinking. And it especially doesn't matter what it would mean if a human were to make the same error.
It matters what they are, what they're doing, and how they're doing it. Feel free to be embarrassed if you are claiming they can do what they can't and are maybe even selling them on that basis. But there's nothing embarrassing about their current set of capabilities. They are very good at what they are very good at. Expecting those capabilities to generalize as they would if they were human is like getting embarrassed that your screwdriver can't pound in a nail, when it is ever so good at driving in screws.
No, it's a failure of the companies who are advertising them as capable of doing something which they are not (assisting people with low vision)
"My talking dog always makes mistakes on calculus problems: how embarrassingly bad!"
Has the expectation treadmill really advanced so quickly that sub-human performance on any category of problems is now an embarrassment?
Take the first example of the four 2D line plots for instance. I asked Sonnet 3.5 for an answer and retried it 5 times and 2 out of 5 times, it said something like this:
Me:
here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?
Write your answers as: Plot 1 (leftmost): ...
Sonnet:
I'll analyze each plot from left to right and provide the requested information:
Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0 intersections (the lines do not cross)
Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the right edge)
Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the center)
Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2 intersections (the lines cross twice, once near the left side and once near the right side)
Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.
---
Which isn't completely correct (the 3rd plot has 2 intersections) but still pretty good and the other 3 times, it said plot 3 has 1 intersection.
I'm curious if there'd be a way to emulate this (have the visual tokens be low fidelity at first, but allow the VLM to emit tokens that correspond to "focusing" on a region of the image with greater resolution). I'm not sure if/how it's possible to performantly train a model with "interactive" data like that, though.
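One way to approximate that today, without any architectural change, is a tool-calling loop: send the downscaled image, let the model ask for a crop, then re-send that crop at full resolution. A rough sketch; the ask_vlm callback and the "FOCUS" reply convention are assumptions of mine, not an existing API.

    # "Focus" loop: first pass on the whole image, optional second pass on a crop
    # the model itself requests, at full resolution.
    from PIL import Image

    def crop_region(image_path, box, out_path="crop.png"):
        """box: (left, top, right, bottom) in original-image pixel coordinates."""
        Image.open(image_path).crop(box).save(out_path)
        return out_path

    def focus_loop(image_path, question, ask_vlm):
        """ask_vlm(image_path, prompt) -> str is whatever VLM client you use."""
        first = ask_vlm(image_path,
                        f"{question}\nIf you need a closer look, reply only with "
                        "FOCUS left,top,right,bottom in pixel coordinates.")
        if first.startswith("FOCUS"):
            box = tuple(int(v) for v in first.removeprefix("FOCUS").split(","))
            return ask_vlm(crop_region(image_path, box), question)
        return first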
But the model is downstream of the "patch" tokenization, so the cut-down in resolution (compression) of the image has already occurred prior to the point where the model can direct greater "attention".
I think the synthesis is that I'm proposing a per-pixel tokenization with a transformer block whose purpose is to output information at a compression level "equivalent" to that of the patch tokens (is this what an autoencoder is?), but where the attention vector is a function of the full state of the LLM (i.e., inclusive of the text surrounding the image).
Naïvely, I'd think a layer like this that is agnostic to the LLM state needn't be any more computationally costly than the patching computation (both are big honks of linear algebra?), but idk how expensive the "full context attention" feedback is...
(I apologize to anyone who actually understands transformers for my gratuitous (ab|mis)use of terminology)
It's not entirely known how much early learning in low resolution makes a difference in humans, and obviously that could also relate more to our specific neurobiology than a general truth about learning in connectionist systems. But I found it to be an interesting idea that maybe certain outcomes with ANNs could be influenced a lot by training paradigms s.t. not all shortcomings could be addressed with only updates to the core architecture.
On the one hand, it's incredible what these LLMs are capable of. On the other hand, they often fall flat on their face with seemingly simple problems like this. We are seeing the same from self driving cars, getting into accidents in scenarios that almost any human driver could have easily avoided.
From an algorithmic point of view, these vision tasks are actually quite difficult to explicitly program.
In general, basic geometry seems under-explored by learning.
Humans don't have a "compute intersections" ability (other than a few who have learned it laboriously through algebra), we have a "see things and count them" mechanism. We aren't visually taking lines in a planar space and determining where they cross. We know what an intersection looks like, we see one, increment a counter, and find the next one. If it's less than around five, we do this all at once. Otherwise we literally count, sometimes in small groups, sometimes one at a time.
I have the feeling that they first chose the title of the paper and then ran the evaluation of the new Claude 3.5 Sonnet on these abstract images.
>their vision is, at best, like that of a person with myopia seeing fine details as blurry
This also makes no sense, since the images evaluate the abstract capabilities of the models, not their eyesight.
Intersecting Lines https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc
Overlapping Circles https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384
Touching Circles https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc
Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc
Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm
They're clearly a very exciting category of technology, and a pretty recent one - they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.
There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?
Some of my notes so far:
- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - my datasette-extract plugin, for structured data from both text and images
- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - where they failed to extract data from a handwritten scanned document in various weird ways
- https://simonwillison.net/2024/Feb/21/gemini-pro-video/ talks about video inputs to Gemini Pro (which are actually image inputs, it splits them up to one frame per second)
I've assembled a bunch more notes here: https://simonwillison.net/tags/vision-llms/
An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
At the moment compute power isn't good enough to feed high-res pixel data into these models, unless we discover a vastly different architecture, which I am also convinced likely exists.
An embedding needs fewer words. You can embed individual words, phrases (like a whole prompt), and longer paragraphs. You don't need 1024 words for a text embedding. For example, a famous library is called Sentence-BERT (SBERT).
When you embed images, on the other hand, you cut them up into little squares on the order of 32x32 px and embed each of them separately. ChatGPT uses something like 250 tokens for smaller images. So a smaller image costs about as much as 200 words if represented graphically, and maybe far fewer words if you embed a text description of it.
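The back-of-the-envelope arithmetic, taking the 32 px patch size and the ~250-token figure above at face value (actual patch sizes and token budgets vary by model):

    # How many patch tokens a ViT-style encoder produces for a given image size.
    def patch_tokens(width, height, patch=32):
        return (width // patch) * (height // patch)

    print(patch_tokens(512, 512))    # 256 tokens -- roughly the ~250 figure above
    print(patch_tokens(1024, 1024))  # 1024 tokens -- resolution gets expensive fast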
Yes I'm aware of this, and work in ML -- the thing is embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two images that have substantially similar CLIP (or whatever) embeddings that are visually very different. If you query the LLM about that difference, the LLM wouldn't even have the information to differentiate answers for the two images if you only supply it with the embedding.
On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from just an image embedding, since they are designed for image reconstruction, but I don't have access to the compute or data resources to attempt training this.
When asking for specific days of interest in the school year, AIs always struggle. They get some days right but forget others or fabricate new ones. They fare a bit better if you remove some of the noise and give them only a picture of a single month, but even then it's unreliable.
Does anyone have any technical insight or intuition as to why this large variation exists?
Playing around with GPT-4o, it knows enough to make a copy of an image that is reasonable but it still can’t answer the questions.
ChatGPT went down a rabbit hole of trying to write python code, but it took lots of prompting for it to notice its mistake when solving one of the intersecting line questions.
- The language is unnecessarily scathing
- They repeatedly show data where the models are getting things right 70, 80, 90% of the time, and then show a list of what they call "qualitative samples" (what does "qualitative" mean? "cherry-picked"?) which look very bad. But it got the answer right 70/80/90% of the time! That's hardly "blind"...
- Various of the tasks hinge on the distinction between two objects "exactly touching" vs. "very nearly touching" vs. "very slightly overlapping", a problem which (i) is hard for humans and (ii) is particularly (presumably deliberately) sensitive to resolution/precision, where we should not be surprised that models fail
- The main fish-shaped example given in task 1 seems genuinely ambiguous to me - do the lines "intersect" once or twice? The tail of the fish clearly has a crossing, but the nose of the fish seems a bit fishy to me... is that really an intersection?
- AFAIC deranged skepticism is just as bad as deranged hype, the framing here is at risk of appealing to the former
It's absolutely fair to make the point that these models are not perfect, fail a bunch of the time, and to point out the edge cases where they suck. That moves the field forwards. But the hyperbole (as pointed out by another commenter) is very annoying.
- "Only Sonnet-3.5 can count the squares in a majority of the images", but Sonnet-3, Gemini-1.5 and Sonnet-3.5 all have accuracy of >50%
- "Sonnet-3.5 tends to conservatively answer "No" regardless of the actual distance between the two circles.", but it somehow gets 91% accuracy? That doesn't sound like it tends to answer "No" regardless of distance.
ChatGPT whiffs completely on very obvious images.
To me, this is only interesting if compared with results of image recognition models that can already answer these types of questions (if they even exist, I haven't looked).
Maybe the service is smart enough to look at the question, and then choose one or more models to process the image, but I'm not sure, as I can't find anything on their sites about how it works.
Or read up on PaliGemma: https://github.com/google-research/big_vision/blob/main/big_...
Your guess is incorrect. Look up CLIP, BLIP, and SigLIP for an idea of how they work.
All the LLMs and multi-modal models I've seen lack concrete reasoning. For instance, ask ChatGPT to perform 2 tasks, to summarize a chunk of text and to count how many words are in this chunk. ChatGPT will do a very good job summarizing the text and an awful job at counting the words. ChatGPT and all the transformer based models I've seen fail at similar concrete/mathematical reasoning tasks. This is the core problem of creating AGI and it generally seems like no one has made any progress towards synthesizing something with both a high and low level of intelligence.
My (unproven and probably incorrect) theory is that under the hood these networks lack information processing loops which make recursive tasks, like solving a math problem, very difficult.
https://chatgpt.com/share/79c5c6e1-e6a9-441b-acb3-54882303a8...
Of course as usual, LLMs are horrible with Math.
Funny enough, the next time it verified the word count by counting it out, until I specifically told it to use Python.
https://chatgpt.com/share/79e7b922-9b0f-4df9-98d0-2cd72d7041...
A child will start to speak at around the age of one, but most will be about two before they start to count. And it is even longer (maybe the age of three to four) before they understand cardinality and can reliably follow “simple” instructions like “bring me four blocks”.
And basic arithmetic without counting on their fingers is usually not picked up until they are around six or seven.
Even still, if you ask it for a token count, it will still fail. My point is that it can't count; the circuitry required to do so seems absent in these models.
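One way to see why counting trips these models up: what they actually receive is tokens, not words, and the two don't line up. A quick check, assuming an OpenAI-style tokenizer via tiktoken:

    # Words vs. tokens: the model never "sees" word boundaries directly,
    # so exact word or token counts aren't something it can just read off its input.
    import tiktoken

    text = "Counting words is trivial in Python and surprisingly hard for an LLM."
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: a GPT-4-era encoding

    print(len(text.split()))      # 12 words
    print(len(enc.encode(text)))  # a different number of tokens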
[1] https://www.ccn.com/news/technology/openais-gpt-5-phd-level-...
The scoring criteria would rank this answer as 'T', which is wrong.
FWIW I tried the line intersection and the circled letter test from the article with CogVLM (which is far from reaching the current SotA) and it correctly passed both. I haven't tried it with Sonnet/4o but I suspect there might be something wrong with how the author did their tests. Don't get me wrong, but too many "the model can't do that" claims ended up with demonstrations of the model doing exactly that...
Alexa experiences that tragic irony for me.
Siri? Forget it.
That's in the abstract, but it's bad not to be specific. In this case, because current public-facing models are WIWEB: the worst it will ever be.
And there are trillion-dollar prizes at stake, so, improvement is happening as quickly as it possibly can.
I'm not certain about the specific models tested, but some VLMs just embed the image modality into a single vector, making these tasks literally impossible to solve.
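For a concrete sense of what "a single vector" means, here's a minimal sketch with a standard public CLIP checkpoint via Hugging Face transformers (whether the specific models in the paper compress images this aggressively is exactly what's uncertain; the test image name is hypothetical):

    # A CLIP-style image encoder collapses the whole image into one fixed-size vector.
    # Fine-grained facts like "how many squares" may simply not survive that compression.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("grid.png")  # hypothetical test image
    inputs = processor(images=image, return_tensors="pt")
    features = model.get_image_features(**inputs)
    print(features.shape)  # torch.Size([1, 512]) -- the entire image, as 512 numbers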
In the "Two circles" test, do they really have 50% chance of overlapping? I think this comes from "Distances between circle perimeters: -0.15 to 0.5 times the diameter" but doesn't say the distribution they use.
Either way, I would guess that we’ll need new model architectures for multimodal to get really good at some of this, and even then some of these tasks are adjacent to things that we know LLMs are already bad at (numeric logic, for instance).
As context lengths get longer, devoting more tokens to the image tokenization should help a bit here as well. Anyway, I’d anticipate next year we’d see 80s and 90s for most of these scores with next gen models.
I do think some of these thin line drawings are likely extra hard to tokenize depending on the image scaling sizes for tokenization. I’d wager thicker lines would help, although obviously not all of this is just ‘poor tokenization’.
It clearly wasn’t trained on this task and suffers accordingly.
However, with ChatGPT, it will create Python to do the analysis and gets better results.
These tasks are interesting because they're existence proofs of generalization failure. Like the haystack problem, direct solutions here are much less interesting than structural improvements that address the class of failure.
Vision links thoughts with reality
So what I’m left with to judge instead is anonymous online commenters vs. the publication of 2 prestigious universities. Whose word do I take on this? Decisions, decisions.
You can swap LM out for Web3, or NFT, or Crypto in this case.
Maybe you need to judge the contents of those online comments and the contents of the publication, instead of relying on argument from authority.