> their vision is, at best, like that of a person with myopia seeing fine details as blurry
is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors have not presented any real evidence about these two modes.
This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.
"Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro are powering countless image-text processing applications"
I don't know how many a "countless" is, but I think we've gotten really sloppy in terms of what counts for LLMs as a demonstrated, durable win in a concrete task attached to well-measured outcomes and holding up over even modest periods of time.
This stuff is really promising and lots of builders are making lots of nifty things, so if that counts as an application then maybe we're at countless, but in the enterprise and in government and in refereed academic literature we seem to be at the proof-of-concept phase. Impressive chat bots as a use case are pretty dialed in, enough people claim that they help with coding that I tend to believe it's a real thing (I never seem to come out ahead of going directly to the source, StackOverflow).
The amount of breathless press on this seems "countless", so maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it.
You see an object. First you have to learn how to control all your body functions to move toward it and grasp it. This teaches you about the 3 dimensional world and things like gravity. You may not know the terms, but it is baked in your learning model. After you get an object you start building a classification list "hot", "sharp", "soft and fuzzy", "tasty", "slick". Your learning model builds up a list of properties of objects and "expected" properties of objects.
Once you have this 'database' you create as a human, you can apply the logic to achieve tasks. "Walk 10 feet forward, but avoid the sharp glass just to the left". You have to have spatial awareness, object awareness, and prediction ability.
Models 'kind of' have this, but it's seemingly haphazard, kind of like a child that doesn't know how to put all the pieces together yet. I think a lot of embodied robot testing, where the embodied model feeds back training to the LLM/vision model, will have to occur before this is even somewhat close to reliable.
With AI we are creating intelligence, but with different strengths and weaknesses. I think we will continue to be surprised at how well they work on some problems and how poorly they do at some “simple” ones.
“These new tools aren’t perfect” is the dog bites man story of technology. It’s certainly true, but it’s no different than GPS (“family drives car off cliff because GPS said to”).
They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.
I don't think that models have been portrayed as equivalent to humans. Like most AI, it has been shown as vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portrayals by the AI field in general.
Let's just ignore what a blind person normally needs to know.
You know what blind people ask? Sometimes their daily routine is broken because there is some type of construction, and models can tell you this.
Sometimes they need to read a basic sign and models can do this.
Those models help people already and they will continue to get better.
I'm not sure if I'm more frustrated by how condescending the authors are or by your ignorance.
Valid criticism doesn't need to be shitty
Like, seriously. One poster below whines about "AI apologists" and BeMyEyes, but again, it's all trivially testable with your phone and $20/month subscription. It works spectacularly well on real world tasks. Not perfectly, sure, but good enough to be useful in practice and better than alternatives (which often don't exist).
It's not that far from reality; most models see images in very low resolution/limited colors, so it's not so far from this description.
First of all, myopia isn't 'seeing fine details as blurry' - it's nearsightedness - and whatever else this post tested, it definitely didn't test depth perception.
And second - inability to see fine details is a distinct/different thing from not being able to count intersections and the other things tested here. That hypothesis, if valid, would imply that improving the resolution of the image that the model can process would improve its performance on these tasks even if reasoning abilities were the same. That - does not make sense. Plenty of the details in these images that these models are tripping up on are perfectly distinguishable at low resolutions. Counting rows and columns of blank grids is not going to improve with more resolution.
I mean, I'd argue that the phrasing of the hypothesis ("At best, like that of a person with myopia") doesn't make sense at all. I don't think a person with myopia would have any trouble with these tasks if you zoomed into the relevant area, or held the image close. I have a very strong feeling that these models would continue to suffer on these tasks if you zoomed in. Nearsighted != unable to count squares.
Is this the sales pitch though? Because 15 years ago, I had a scanner with an app that could scan a text document and produce the text on Windows. The machine had something like 256 MB of RAM.
Tech can be extremely good at niches in isolation. You can have an OCR system 10 years ago and it'll be extremely reliable at the single task it's configured to do.
AI is supposed to bring a new paradigm, where the tech is not limited to the specific niche the developers have scoped it to. However, if it reliably fails to detect simple things a regular person should not get wrong, then the whole value proposition is kicked out of the window.
I like the idea that these models are so good at some sort of specific and secret bit of visual processing that things like “counting shapes” and “beating a coin toss for accuracy” shouldn’t be considered when evaluating them.
I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis. Sure, they could have used a better analogy and they could have done other tests, but the paper still counters quite common claims (from researchers) about VLMs.
> I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
I find it hard to believe that there is no example you can give. It surely doesn't have to be exactly your training data. If it is this good, surely you can create an example no problem. If you just don't want to, that's okay, but then don't say it.
But I have further questions. Do you have complicated prompting? Or any prompt engineering? It sure does matter how robust these models are to prompting. There's a huge difference between a model being able to accomplish a task and a model being able to perform a task outside of a very specific environment. This is no different from something working in a tech demo and not in the hands of the user.
> But in practice, we aren't just making up tasks to trip up these models.
I see this sentiment quite often and it is baffling to me.
First off, these tasks are clearly not designed to trick these models. A model failing at a task does not suddenly make the task "designed to trick a model." It's common with the river-crossing puzzles when they're rewritten to say things like "all animals can fit in the boat." If that is "designed to trick a model", then the model must be a stochastic parrot and not a generalist. It is very important that we test things we do know the answer to because, unfortunately, we're not clairvoyant and can't test questions we don't know the answer to, which is the common case in real-world usage.
Second, so what if a test was designed to trip up a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly? This seems doubly important if they are tasks that humans have no trouble with.
> They can be very performant on some tasks and the authors have not presented any real evidence about these two modes.
I don't think people are claiming that large models can't be performant on some tasks. If they are, they're rejecting trivially verifiable reality. But not every criticism has to also contain positive points. There are plenty of papers and a lot of hype already doing that. And if we're going to be critical of anything, shouldn't it be that the companies creating these models -- selling them, and even charging researchers to perform these types of experiments that can be and are used to improve their products -- should be much more clear about the limitations of their models? If we need balance, then I think there are bigger fish to fry than Auburn and Alberta Universities.
People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.
The real world outside is designed to trip up the model. Strange things happen all the time.
Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.
I don't know, words have meanings. If that's a communication issue, it's on the part of the authors. To me, this wording in what is supposed to be a research paper abstract clearly suggests insufficient resolution as the cause. How else should I interpret it?
> The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry
And indeed, increasing the resolution is expensive, and the best VLMs have something like 1000x1000. But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.
>I find it hard to believe that there is no example you can give.
I'm not the person you're answering to, but I actually lazily tried two of authors' examples in a less performant VLM (CogVLM), and was surprised it passed those, making me wonder whether I can trust their conclusions until I reproduce their results. LLMs and VLMs have all kinds of weird failure modes, it's not a secret they fail at some trivial tasks and their behavior is still not well understood. But working with these models and narrowing it down is notoriously like trying to nail a jelly to the wall. If I was able to do this in a cursory check, what else is there? More than one research paper in this area is wrong from the start.
I think they just train it on a bunch of text.
Maybe counting squares in a grid just wasn't considered important enough to train for.
What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation.
Screenshot: https://imgur.com/a/QqCNzOM
After you fix it, you should try asking the same questions!
Like my sister post, I’m skeptical that the LLM didn’t just get lucky.
There are ways to help the VLM out - Set of Marks [0] from Microsoft being the most prominent, which uses segmentation to outline and label sections of the image before sending to the VLM.
Providing "speakable" labels to regions helps ground the visual abilities of VLMs and is why in this paper the performance is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid"
[0] https://distributedmuseum.illinois.edu/exhibit/biological_co...
[1] https://sites.evergreen.edu/arunchandra/wp-content/uploads/s...
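For the curious, here's roughly what that Set-of-Marks idea looks like in practice. This is a minimal sketch of my own, not the official Microsoft code; the box coordinates, file names, and downstream prompt are made up, and the boxes would normally come from a segmentation or detection model.

    # Set-of-Marks-style sketch: overlay numbered labels on pre-computed regions,
    # then ask the VLM about regions by number instead of by visual description.
    from PIL import Image, ImageDraw, ImageFont

    def annotate_with_marks(image_path, boxes):
        """boxes: list of (left, top, right, bottom) from any segmentation model."""
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()
        for i, (l, t, r, b) in enumerate(boxes, start=1):
            draw.rectangle([l, t, r, b], outline="red", width=3)  # outline region
            draw.text((l + 4, t + 4), str(i), fill="red", font=font)  # numeric mark
        return img

    # Hypothetical usage: boxes would come from SAM, Grounding DINO, etc.
    marked = annotate_with_marks("grid.png", [(10, 10, 120, 120), (130, 10, 240, 120)])
    marked.save("grid_marked.png")
    # Then send grid_marked.png to the VLM with a prompt like:
    # "How many rows and columns are there? Refer to regions by their numbers."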
Another benefit of VLMs which support object detection is that they are open vocabulary, meaning you don't have to define the classes ahead of time. Additionally, fine-tuning tends to keep the previous detection capabilities instead of erasing all previous classes, like fine-tuning a YOLO model does.
It doesn’t seem to have the logic though to answer these questions.
The complete data set is here to play around with it yourself: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...
The tasks at which they fail are ridiculously simple for human beings, including, for example:
* counting the number of times two lines intersect;
* detecting whether two circles overlap;
* selecting which letter is being circled in a word;
* counting the number of circles in an Olympic-like logo.
This should be at the top of the front page.
It's entirely unsurprising that there are numerous cases that these models can't handle that are "obvious to humans." Machine learning has had this property since its invention and it's a classic mistake humans make dealing with these systems.
Humans assume that because a machine learning model has above-human accuracy on task X, it must also have that ability at all the other tasks. While a human with amazing ability at X would indeed have amazing abilities at other tasks, this is not true of machine learning models. The opposite thinking is also wrong: that because the model can't do well on task Y, it must be unreliable and its ability on task X is somehow an illusion and not to be trusted.
Evidently, all these models still fall short.
Promoting those use cases while the models fail in these ways is irresponsible. So, yeah, maybe the models are not embarrassing, but the hype definitely is.
It doesn't matter how they are marketed or described or held up to some standard generated by wishful thinking. And it especially doesn't matter what it would mean if a human were to make the same error.
It matters what they are, what they're doing, and how they're doing it. Feel free to be embarrassed if you are claiming they can do what they can't and are maybe even selling them on that basis. But there's nothing embarrassing about their current set of capabilities. They are very good at what they are very good at. Expecting those capabilities to generalize as they would if they were human is like getting embarrassed that your screwdriver can't pound in a nail, when it is ever so good at driving in screws.
No, it's a failure of the companies who are advertising them as capable of doing something which they are not (assisting people with low vision)
"My talking dog always makes mistakes on calculus problems: how embarrassingly bad!"
Has the expectation treadmill really advanced so quickly that sub-human performance on any category of problems is now an embarrassment?
Take the first example of the four 2D line plots for instance. I asked Sonnet 3.5 for an answer and retried it 5 times and 2 out of 5 times, it said something like this:
Me:
here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?
Write your answers as: Plot 1 (leftmost): ...
Sonnet:
I'll analyze each plot from left to right and provide the requested information:
Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0 intersections (the lines do not cross)
Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the right edge)
Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the center)
Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2 intersections (the lines cross twice, once near the left side and once near the right side)
Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.
---
Which isn't completely correct (the 3rd plot has 2 intersections) but still pretty good and the other 3 times, it said plot 3 has 1 intersection.
I'm curious if there'd be a way to emulate this (have the visual tokens be low fidelity at first, but allow the VLM to emit tokens that correspond to "focusing" on a region of the image with greater resolution). I'm not sure if/how it's possible to performantly train a model with "interactive" data like that, though.
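One way to approximate that today, without any architectural change, is a tool-calling loop: send the downscaled image, let the model ask for a crop, then re-send that crop at full resolution. A rough sketch; the ask_vlm callback and the "FOCUS" reply convention are assumptions of mine, not an existing API.

    # "Focus" loop: first pass on the whole image, optional second pass on a crop
    # the model itself requests, at full resolution.
    from PIL import Image

    def crop_region(image_path, box, out_path="crop.png"):
        """box: (left, top, right, bottom) in original-image pixel coordinates."""
        Image.open(image_path).crop(box).save(out_path)
        return out_path

    def focus_loop(image_path, question, ask_vlm):
        """ask_vlm(image_path, prompt) -> str is whatever VLM client you use."""
        first = ask_vlm(image_path,
                        f"{question}\nIf you need a closer look, reply only with "
                        "FOCUS left,top,right,bottom in pixel coordinates.")
        if first.startswith("FOCUS"):
            box = tuple(int(v) for v in first.removeprefix("FOCUS").split(","))
            return ask_vlm(crop_region(image_path, box), question)
        return first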
But the model is downstream of the "patch" tokenization, so the cut-down in resolution (compression) of the image has already occurred prior to the point where the model can direct greater "attention".
I think the synthesis is that I'm proposing a per-pixel tokenization with a transformer block whose purpose is to output information at a compression level "equivalent" to that of the patch tokens (is this what an autoencoder is?), but where the attention vector is a function of the full state of the LLM (i.e., inclusive of the text surrounding the image).
Naïvely, I'd think a layer like this that is agnostic to the LLM state needn't be any more computationally costly than the patching computation (both are big honks of linear algebra?), but idk how expensive the "full context attention" feedback is...
(I apologize to anyone who actually understands transformers for my gratuitous (ab|mis)use of terminology)
It's not entirely known how much early learning in low resolution makes a difference in humans, and obviously that could also relate more to our specific neurobiology than a general truth about learning in connectionist systems. But I found it to be an interesting idea that maybe certain outcomes with ANNs could be influenced a lot by training paradigms s.t. not all shortcomings could be addressed with only updates to the core architecture.
On the one hand, it's incredible what these LLMs are capable of. On the other hand, they often fall flat on their face with seemingly simple problems like this. We are seeing the same from self driving cars, getting into accidents in scenarios that almost any human driver could have easily avoided.
From an algorithmic point of view, these vision tasks are actually quite difficult to explicitly program.
In general, basic geometry seems under-explored by learning.
Humans don't have a "compute intersections" ability (other than a few who have learned it laboriously through algebra), we have a "see things and count them" mechanism. We aren't visually taking lines in a planar space and determining where they cross. We know what an intersection looks like, we see one, increment a counter, and find the next one. If it's less than around five, we do this all at once. Otherwise we literally count, sometimes in small groups, sometimes one at a time.
I have the feeling that they first chose the title of the paper and then ran the evaluation of the new Claude 3.5 Sonnet on these abstract images.
>their vision is, at best, like that of a person with myopia seeing fine details as blurry
This also makes no sense, since the images evaluate the abstract capabilities of the models, not their eyesight.
Intersecting Lines https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc
Overlapping Circles https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384
Touching Circles https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc
Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc
Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm
They're clearly a very exciting category of technology, and a pretty recent one - they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.
There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?
Some of my notes so far:
- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - my datasette-extract plugin, for structured data from both text and images
- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - where they failed to extract data from a handwritten scanned document in various weird ways
- https://simonwillison.net/2024/Feb/21/gemini-pro-video/ talks about video inputs to Gemini Pro (which are actually image inputs, it splits them up to one frame per second)
I've assembled a bunch more notes here: https://simonwillison.net/tags/vision-llms/
An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
At the moment compute power isn't good enough to feed high-res pixel data into these models, unless we discover a vastly different architecture, which I am also convinced likely exists.
An embedding needs fewer words. You can embed individual words, phrases (like a whole prompt), and longer paragraphs. You don't need 1024 words for a text embedding. For example, a famous library is called Sentence-BERT (SBERT).
When you embed images, on the other hand, you cut them up into little squares on the order of 32x32 px and embed each of them separately. ChatGPT uses something like 250 tokens for smaller images. So a smaller image costs about as much as 200 words if represented graphically, and maybe far fewer words if you embed a text description of it.
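The back-of-the-envelope arithmetic, taking the 32 px patch size and the ~250-token figure above at face value (actual patch sizes and token budgets vary by model):

    # How many patch tokens a ViT-style encoder produces for a given image size.
    def patch_tokens(width, height, patch=32):
        return (width // patch) * (height // patch)

    print(patch_tokens(512, 512))    # 256 tokens -- roughly the ~250 figure above
    print(patch_tokens(1024, 1024))  # 1024 tokens -- resolution gets expensive fast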
Yes I'm aware of this, and work in ML -- the thing is embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two images that have substantially similar CLIP (or whatever) embeddings that are visually very different. If you query the LLM about that difference, the LLM wouldn't even have the information to differentiate answers for the two images if you only supply it with the embedding.
On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from just an image embedding, since they are designed for image reconstruction, but I don't have access to the compute or data resources to attempt training this.
When asking for specific days of interest in the school year, AIs always struggle. They get some days right but forget others or fabricate new ones. They fare a bit better if you remove some of the noise and give them only a picture of a single month, but even then it's unreliable.
Does anyone have any technical insight or intuition as to why this large variation exists?
Playing around with GPT-4o, it knows enough to make a copy of an image that is reasonable but it still can’t answer the questions.
ChatGPT went down a rabbit hole of trying to write python code, but it took lots of prompting for it to notice its mistake when solving one of the intersecting line questions.
- The language is unnecessarily scathing
- They repeatedly show data where the models are getting things right 70, 80, 90% of the time, and then show a list of what they call "qualitative samples" (what does "qualitative" mean? "cherry-picked"?) which look very bad. But it got the answer right 70/80/90% of the time! That's hardly "blind"...
- Various of the tasks hinge on the distinction between two objects "exactly touching" vs. "very nearly touching" vs. "very slightly overlapping", a problem which (i) is hard for humans and (ii) is particularly (presumably deliberately) sensitive to resolution/precision, where we should not be surprised that models fail
- The main fish-shaped example given in task 1 seems genuinely ambiguous to me - do the lines "intersect" once or twice? The tail of the fish clearly has a crossing, but the nose of the fish seems a bit fishy to me... is that really an intersection?
- AFAIC deranged skepticism is just as bad as deranged hype, the framing here is at risk of appealing to the former
It's absolutely fair to make the point that these models are not perfect, fail a bunch of the time, and to point out the edge cases where they suck. That moves the field forwards. But the hyperbole (as pointed out by another commenter) is very annoying.
- "Only Sonnet-3.5 can count the squares in a majority of the images", but Sonnet-3, Gemini-1.5 and Sonnet-3.5 all have accuracy of >50%
- "Sonnet-3.5 tends to conservatively answer "No" regardless of the actual distance between the two circles.", but it somehow gets 91% accuracy? That doesn't sound like it tends to answer "No" regardless of distance.
ChatGPT whiffs completely on very obvious images.
To me, this is only interesting if compared with results of image recognition models that can already answer these types of questions (if they even exist, I haven't looked).
Maybe the service is smart enough to look at the question, and then choose one or more models to process the image, but I'm not sure, as I can't find anything on their sites about how it works.
Or read up on PaliGemma: https://github.com/google-research/big_vision/blob/main/big_...
Your guess is incorrect. Look up CLIP, BLIP, and SigLIP for an idea of how they work.
All the LLMs and multi-modal models I've seen lack concrete reasoning. For instance, ask ChatGPT to perform 2 tasks, to summarize a chunk of text and to count how many words are in this chunk. ChatGPT will do a very good job summarizing the text and an awful job at counting the words. ChatGPT and all the transformer based models I've seen fail at similar concrete/mathematical reasoning tasks. This is the core problem of creating AGI and it generally seems like no one has made any progress towards synthesizing something with both a high and low level of intelligence.
My (unproven and probably incorrect) theory is that under the hood these networks lack information processing loops which make recursive tasks, like solving a math problem, very difficult.
https://chatgpt.com/share/79c5c6e1-e6a9-441b-acb3-54882303a8...
Of course as usual, LLMs are horrible with Math.
Funny enough, the next time it verified the word count by counting it out, until I specifically told it to use Python.
https://chatgpt.com/share/79e7b922-9b0f-4df9-98d0-2cd72d7041...
A child will start to speak at around the age of one, but most will be about two before they start to count. And it is even longer (maybe the age of three to four) before they understand cardinality and can reliably follow “simple” instructions like “bring me four blocks”.
And basic arithmetic without counting on their fingers is usually not picked up until they are around six or seven.
Even still, if you ask it for a token count, it will still fail. My point is that it can't count; the circuitry required to do so seems absent in these models.
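One way to see why counting trips these models up: what they actually receive is tokens, not words, and the two don't line up. A quick check, assuming an OpenAI-style tokenizer via tiktoken:

    # Words vs. tokens: the model never "sees" word boundaries directly,
    # so exact word or token counts aren't something it can just read off its input.
    import tiktoken

    text = "Counting words is trivial in Python and surprisingly hard for an LLM."
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: a GPT-4-era encoding

    print(len(text.split()))      # 12 words
    print(len(enc.encode(text)))  # a different number of tokens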
[1] https://www.ccn.com/news/technology/openais-gpt-5-phd-level-...
The scoring criteria would rank this answer as 'T', which is wrong.
FWIW I tried the line intersection and the circled letter test from the article with CogVLM (which is far from reaching the current SotA) and it correctly passed both. I haven't tried it with Sonnet/4o but I suspect there might be something wrong with how the author did their tests. Don't get me wrong, but too many "the model can't do that" claims ended up with demonstrations of the model doing exactly that...
Alexa experiences that tragic irony for me.
Siri? Forget it.
That's in the abstract, but it's bad not to be specific. In this case, because current public-facing models are WIWEB: the worst it will ever be.
And there are trillion-dollar prizes at stake, so, improvement is happening as quickly as it possibly can.
I'm not certain about the specific models tested, but some VLMs just embed the image modality into a single vector, making these tasks literally impossible to solve.
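For a concrete sense of what "a single vector" means, here's a minimal sketch with a standard public CLIP checkpoint via Hugging Face transformers (whether the specific models in the paper compress images this aggressively is exactly what's uncertain; the test image name is hypothetical):

    # A CLIP-style image encoder collapses the whole image into one fixed-size vector.
    # Fine-grained facts like "how many squares" may simply not survive that compression.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("grid.png")  # hypothetical test image
    inputs = processor(images=image, return_tensors="pt")
    features = model.get_image_features(**inputs)
    print(features.shape)  # torch.Size([1, 512]) -- the entire image, as 512 numbers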
In the "Two circles" test, do they really have 50% chance of overlapping? I think this comes from "Distances between circle perimeters: -0.15 to 0.5 times the diameter" but doesn't say the distribution they use.
Either way, I would guess that we’ll need new model architectures for multimodal to get really good at some of this, and even then some of these tasks are adjacent to things that we know LLMs are already bad at (numeric logic, for instance).
As context lengths get longer, devoting more tokens to the image tokenization should help a bit here as well. Anyway, I’d anticipate next year we’d see 80s and 90s for most of these scores with next gen models.
I do think some of these thin line drawings are likely extra hard to tokenize depending on the image scaling sizes for tokenization. I’d wager thicker lines would help, although obviously not all of this is just ‘poor tokenization’.
It clearly wasn’t trained on this task and suffers accordingly.
However, with ChatGPT, it will create Python to do the analysis and gets better results.
These tasks are interesting because they're existence proofs of generalization failure. Like the haystack problem, direct solutions here are much less interesting than structural improvements that address the class of failure.
Vision links thoughts with reality
So what I’m left with to judge instead is anonymous online commenters vs. the publication of 2 prestigious universities. Whose word do I take on this? Decisions, decisions.
You can swap LM out for Web3, or NFT, or Crypto in this case.
Maybe you need to judge the contents of those online comments and the contents of the publication, instead of relying on argument from authority.