This is what I've been saying for a while now, and I think it's not just visual models. LLMs/transformers make mistakes in different ways than humans do, and that is why they are not reliable (which is needed for real-world applications). The rate of progress has not accounted for this... the improvements are in the resolution, fidelity, and overall realism of the output, but not in the correctness and logical handling of the prompt. Personally I still cannot think of anything, prompt it, and get consistent results without a huge compromise on my initial idea.
E.g. I want a man walking with the left foot forward, and it renders a beautiful image of a man but completely ignores the left foot forward, and refuses to do it no matter how I word the prompt. I have many examples like this. The only way I can use it is if I don't have specific requirements and just want generic images. The stock image industry is certainly over, but it is uncertain whether these models will deliver on the promise of generating anything you can imagine that can be put into words.
Yeah, that's exactly what our paper said 5 years ago!
They didn't even cite us :(
"Measuring Social Biases in Grounded Vision and Language Embeddings" https://arxiv.org/pdf/2002.08911
Social biases are subjective. Facts are not.
Sorry that we missed your work. There are a lot of works in this area, both textual and visual, especially on social biases.
We wish we could mention them all, but space is limited, so one often discusses only the most relevant ones. We'll consider discussing yours in our next revision.
Genuine question: would you categorize the type of bias in our work as "social"?
I wouldn't think much about it, as it was probably a genuine mistake.
If anything, the presentation of their results in such an accessible format next to the paper should be commended.
Sure but I don't think this is an example of it. If you show people a picture and ask "how many legs does this dog have?" a lot of people will look at the picture, see that it contains a dog, and say 4 without counting. The rate at which humans behave in this way might differ from the rate at which llms do, but they both do it.
The context is that you wouldn’t ask a person that unless there was a chance the answer is not 4.
I think this used to be the case, in the same way that you used to be unable to get a picture of a bowl of ramen without chopsticks, but the latest models account for this and are much better.
But I think it's not very different from what people do. If directly asked to count how many legs a lion has, we're alert to it being a trick question so we'll actually do the work of counting, but if that image were instead just displayed in an advertisement on the side of a bus, I doubt most people would even notice that there was anything unusual about the lion. That doesn't mean that humans don't actually see, it just means that we incorporate our priors as part of visual processing.
https://skeptics.stackexchange.com/questions/41599/was-the-s...
100% failure because there is no training data about 5-legged dogs. I would bet the accuracy is higher for 3-legged dogs.
> Test on counterfactual images
> Q1: "How many visible stripes?" → "3" (should be "4")
> Q2: "Count the visible stripes" → "3" (should be "4")
> Q3: "Is this the Adidas logo?" → "Yes" (should be "No")
> Result: 17.05% average accuracy - catastrophic failure!
Simple explanation: the training data also includes fake adidas logos that have 4 stripes, like these
"The animal in the image appears to have five visible legs, but this is an illusion caused by the overlapping of legs and motion blur. Zebras, like all equids, only have four legs."
Not perfect, but also doesn't always regress to the usual answer.
"The animal in the image appears to be an elephant, but it has been digitally altered. It visually shows six legs, although the positioning and blending of shadows and feet are unnatural and inconsistent with real anatomy. This is a visual illusion or manipulation." (actually should say five)
"This bird image has also been manipulated. It shows the bird with three legs, which is anatomically impossible for real birds. Normal birds have exactly two legs." (correct)
"Each shoe in the image has four white stripes visible on the side." (correct)
But models fail on many logos, not just Adidas: Nike, Mercedes, Maserati, etc. I don't think they can recall "fake Adidas logos", but it'd be interesting to test!
Sorry, just trying to poison future training data. Don't mind me.
The ability to memorize leads to (some) generalization [1].
[1] https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18...
It's likely they had data memorized.
For example: "The animal in the image is a chicken, and it appears to have four legs. However, chickens normally have only two legs. The presence of four legs suggests that the image may have been digitally altered or artificially generated."
I don't have a good explanation for why I got different results.
https://chatgpt.com/share/683f3e7d-0dfc-8005-b6c9-99e3d39ff4...
https://chatgpt.com/share/683f3e49-9c58-8005-99a6-c3a919838b...
Also I think the authors used the API, and maybe there are differences between the API and chatgpt.com behavior...
The system prompt may still make a difference though.
o3 Chat is also similarly wrong, saying 4.
I can replicate the flag examples from Figure 15 in the paper, if not the Adidas one from Figure 9: https://chatgpt.com/share/683f7c3a-b318-8011-9759-c495db2556... it even confirms its wrong answer when asked to check again.
You can test this claim by asking it to double-check itself when you think it is correct. If you always stop when it gets it right you're risking Clever-Hans-ing yourself: https://en.wikipedia.org/wiki/Clever_Hans (And be sure to do it a couple of times. In situations of sufficient confidence it isn't easy to talk it out of a claim, but it's those borderline ones you want to worry about.)
But then it hit me: could this actually be the reason? Diffusion models work by iteratively improving a noisy image, so if the model can't recognize that something is wrong with the image, it can't fix it.
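To make the mechanism concrete, here is a rough, illustrative sketch of that iterative refinement loop (not any particular library's API; `model` stands in for a hypothetical denoiser). The model's priors are applied at every step, so anything it doesn't register as wrong never gets corrected.

    # Illustrative sketch of a diffusion-style sampling loop, heavily simplified.
    import numpy as np

    def generate(model, steps=50, shape=(64, 64, 3)):
        """Start from pure noise and repeatedly ask the model to remove a
        little of it. The model's notion of what the image "should" look
        like is baked into every step, so a flaw it cannot see is a flaw
        it will never remove."""
        x = np.random.randn(*shape)          # pure noise
        for t in reversed(range(steps)):
            predicted_noise = model(x, t)    # hypothetical denoiser call
            x = x - predicted_noise / steps  # nudge toward the model's priors
        return x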
> A nurse comes to a surgeon and asks: "Sir, you are a dog. You do not hold a valid medical license. Canines cannot be in an operating room".
> She then asks: "why does the hospital keep making these mistakes? It is a riddle to me".
> Why can't the surgeon operate on the boy?
Check out the models humiliating themselves one after the other [1]. For instance, deepseekv3 is so insistent on giving the common answer that it's happy to gaslight us:
> This is a classic riddle that plays on word meanings and expectations. Here's the breakdown:
> The key is that the surgeon is the boy's mother. The riddle relies on the unconscious assumption that surgeons are male, which makes the initial scenario seem impossible. When the nurse says "you are a dog," she's using it metaphorically to imply the surgeon is unqualified (like calling someone a "dog" in a derogatory sense), not literally suggesting the surgeon is a canine.
> The "mistakes" the hospital keeps making are hiring female surgeons, which the nurse (representing outdated biases) perceives as errors because she expects surgeons to be male. The riddle highlights gender stereotypes in medicine—the real "mistake" is the assumption that a surgeon couldn't be a woman.
> Thus, the surgeon can't operate on her own son because she’s his parent, not because of any actual licensing issue.
1. https://kagi.com/assistant/54c1b8eb-71e9-4bb4-9eed-bde2fc563...
I've used a similar prompt - "How can you make 1000 with exactly nine 8s using only addition?"
Here's GPT 4.5 getting it wrong: https://chatgpt.com/share/683f3aca-8fbc-8000-91e4-717f5d81bc...
It tricks the model because it's a slight variation of an existing puzzle (making 1000 with eight 8s and addition only).
The reasoning models seem to reliably figure it out, though. Some of them even come up with a proof of why it's impossible to do with 9 8s. Here's o4 getting it right: https://chatgpt.com/share/683f3bc2-70b8-8000-9675-4d96e72b58...
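For what it's worth, the impossibility is easy to verify yourself; here's a quick brute-force sketch I'd use to check it (addition-only terms are necessarily 8, 88, 888, ..., a k-digit term is ≡ 8k mod 9, so nine 8s sum to ≡ 72 ≡ 0 mod 9 while 1000 ≡ 1 mod 9):

    # Brute-force check of the "exactly nine 8s, addition only" claim.
    from itertools import combinations_with_replacement

    def repdigit8(k):          # 8, 88, 888, ... (k digits of 8)
        return int("8" * k)

    solutions = []
    for n_terms in range(1, 10):
        for lengths in combinations_with_replacement(range(1, 10), n_terms):
            if sum(lengths) == 9 and sum(repdigit8(k) for k in lengths) == 1000:
                solutions.append(lengths)

    print(solutions)  # [] -> impossible with exactly nine 8s
    # With eight 8s the classic answer exists: 888 + 88 + 8 + 8 + 8 == 1000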
> The twist is that the nurse’s logic ("you are a dog") prevents her from realizing the real issue — likely, again, that the surgeon is the boy’s mother, and everything else is a red herring or metaphor for society’s failure to recognize this due to bias or absurd bureaucracy.
> So:
> > Why can't the surgeon operate on the boy?
> Because she is his mother, and the nurse's bias or absurd assumptions (like mistaking her for a dog) prevent her from seeing that.
o4 fails spectacularly in a different way:
> 1. The nurse says “Sir, you are a dog… Canines cannot be in an operating room” because she’s picturing a human hospital law that bars dogs from surgery.
> 2. In fact, this is a vet clinic—so it’s perfectly normal for a dog-veterinarian to scrub in and operate on a puppy (the “boy”).
> 3. The surgeon cannot operate on a human boy because he’s a dog and holds no human‐medical license; instead, he only operates on animals.
Try the same experiment on a robot.
Huh? I'd assume it's a mutant, not store a memory of having seen a perfectly normal chicken
You've never seen someone who's missing a finger or has only a half-grown arm or something? Surely you didn't assume your eyes were tricking you?! Or... if you did, I guess you can't answer this question. I'm actually racking my brain for how to logic this out, but I'm just going to bank on it being likely that anyone over 20 has seen an animal with some visible deviation from the norm at some point in their life.
Also, your reaction will depend on how strong the evidence is. Did you 'see' the three-legged chicken pass by some bush in the distance, or was it right in front of you?
They're much, much better at that now.
Because that specific failure case was widely reported on, and subsequent retraining specifically included examples to ensure that the model didn't "overfit" when learning how to answer variants of that question. That doesn't address the underlying issue though -- while it's obvious that these models do "learn" and "generalize" by any reasonable and non-anthropocentric definition of the terms, it really does seem like the 'radius' of generalization is smaller than we would like, and that these models are very subject to getting stuck in 'ruts' around things they've seen in their training data. Solving this by bandaid-patching every such rut that comes up in the news is just not a viable long-term solution: the whole world is a minefield of niche problems that look kinda like other problems but have different results.
> A boy is in a car crash and is taken to the hospital. The surgeon says, "I can't operate on this boy, I'm his father!" Who is the surgeon to the boy?
> The surgeon is the boy's mother.
Maybe for a toddler... though I expect even they would see that something is off, and be able to identify what, without considering it a tricky task, even if I don't know at what age you can count to 3.
It is a lot like the Stroop experiment, where you ask people to say what color some text is printed in, with the trick that some of the text spells the name of another color. It can be surprisingly hard for people who are good at reading.
In this research, they revealed that the VLM can pay more attention to the image simply by changing attention weights.
"the primary visual cortex, located at the back of the brain, receives the visual signals and processes basic visual features like edges, lines, and orientations."
So, potentially, if we did a pre-processing step to extract more features beforehand, we would see different results in the output.
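As a rough sketch of what that pre-processing could look like (the `ask_vlm` wrapper is hypothetical, standing in for whatever model is being queried): extract an explicit edge map first, the kind of low-level feature V1 handles, and pass it along with the original image.

    # Hedged sketch: give the VLM explicit low-level features, not just pixels.
    import cv2

    def preprocess_and_ask(image_path, question, ask_vlm):
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # edge map
        cv2.imwrite("edges.png", edges)
        # Hand the model both views so the answer can lean on explicit
        # structure instead of only its language prior.
        return ask_vlm(images=[image_path, "edges.png"], prompt=question)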
Even in fly eyes, neuron dendritic compartmentalization and variable spike trains are incompatible with our current perceptron-based models.
Remember that while the value of MLPs for useful work is unquestionable IMHO, be mindful of the map-territory relation. MLPs are inspired by, and in some cases useful for modeling, biological minds; they aren't equivalent.
Be careful about confusing the map for the territory; it is just as likely to limit what opportunities you find as it is to lead you astray IMHO.
The way to fix this is simpler: ensure counterfactuals are present in the training data, then the VLM will learn not to be dependent on its language priors/knowledge.
Just like the article: if I have a picture of a cup, it says cup; if I have a picture of a dog, it says dog; if it's a dog with a cup, it says a dog with a ball (I noticed this with Qwen and InternVL).
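A minimal sketch of what "put counterfactuals in the training data" could mean in practice; the file names and record format here are made up for illustration, not from the paper:

    # Hypothetical counterfactual fine-tuning records: edited images paired
    # with captions that state the literal, counted content of the pixels.
    import json

    counterfactual_pairs = [
        {"image": "dog_5_legs.png",     "caption": "A dog with five legs."},
        {"image": "zebra_3_legs.png",   "caption": "A zebra with three legs."},
        {"image": "shoe_4_stripes.png", "caption": "A shoe with four parallel stripes."},
        {"image": "dog_with_cup.png",   "caption": "A dog holding a cup."},
    ]

    with open("counterfactual_finetune.jsonl", "w") as f:
        for pair in counterfactual_pairs:
            f.write(json.dumps(pair) + "\n")
    # Training on records like these rewards answers grounded in the image
    # rather than in the language prior ("dogs have four legs").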
To test this, research what happens during saccades and how your brain "rewinds" time. Or try to find your blind spot by looking at different patterns and noticing when your brain fills in the gaps at your blind spot. It will recreate lines that aren't there, and dots will wholly disappear.
Additionally as an anecdote, I have noticed plenty times that when I misread a word or phrase, I usually really do "see" the misspelling, and only when I realize the misspelling does my brain allow me to see the real spelling. I first noticed this phenomenon when I was a child, and because I have a vivid visual memory, the contrast is immediately obvious once I see the real phrase.
Additionally, I seem to be able to oversharpen my vision when I focus, making myself hyperattentive to subtle changes in motion or color. The effect can be quite pronounced sometimes, reminiscent of applying an edge filter. It's clearly not reality, but my visual system thinks it is.
If you really want to understand how much the visual system can lie to you, look into some trip reports from deliriants on Erowid. I wouldn't recommend trying them yourself, but I will say that nothing will make you distrust your eyes and ears more. It's basically simulated hallucinatory schizophrenia and psychosis.
I would also hardly count many of these questions as "tricks" either. Take the chess example. A lot of my friends and I have been playing chess since we were young children, and we all know that a fully populated chess board has 32 pieces (heavily weighted in our internal training data), but not a single one of us would have gotten that question wrong.
Ironically I think a lot of people in this thread are remembering things they learned about the faultiness of humans' visual memory and applying it to visual processing.
This may indicate that while VLMs might possess the necessary capability, their strong biases can cause them to overlook important cues, and their overconfidence in their own knowledge can lead to incorrect answers.
A model is bias, implemented as a collection of statistics that weigh relationships between given tokens. It doesn't deduce or follow logic. It doesn't make or respect categories. It just shows you what in its data set is most familiar to what is in your prompt; where familiarity is defined implicitly by the makeup of the original training corpus, and explicitly by the training weights.
We need to stop talking about models as programs. We need to stop anthropomorphizing models. The only thing a model does is present bias.
The definition I’ve found useful (outside of the “the constant term contribution”) is “a tendency to be wrong in an identifiable direction”.
But that doesn’t seem to be the definition you are using. So, what do you mean?
Leave out the part about being wrong, and you will have the gist of what I'm saying. Also leave out the identifiable part: bias exists regardless of whether or not it is recognized.
Bias is how we work with subjectivity. When I answer a question, my answer will be specific to my bias. Without that bias, I could not formulate an answer, unless my answer was the one and only objectively correct way to express an answer to that question.
Computer programs are missing the bias feature. Everything written in a computer program is completely and unambiguously defined, all the way down to the language's foundational grammar.
LLMs are designed to introduce the bias feature. The limitation of this approach is that an LLM replaces the entire stack. None of the features of computation we are used to are compatible with an LLM. You can compute logic or bias, not both.
If I were asked to count the number of legs, I would notice right away of course, but that's mainly because it would alert me to the fact that I'm in a psychology experiment, and so the number of legs is almost certainly not the usual four. Even then, I'd still have to look twice to make sure I hadn't miscounted the first time.
It's plausible to assume that it first identifies "Puma", and then answers yes because, in general, Pumas do have 4 legs, even though the specific example given doesn't.
It seems a bit problematic to call this Gemini-2.5 Pro given that in the near future we're presumably going to have something different called that without further qualifying version numbers. (The author's fault, not the parent comment's)
I used to believe that fairness research could be ignored, that it was all rubbish, but they at least try to do something about things like unbalanced datasets etc. I'm still not sure I totally believe in it though.
Overrepresentation is a different source of bias. That's what gives you, say, image generators that always draw "golden 1970s sci-fi robot" as C-3PO even when given additional instructions to draw something else.
Both of these problems are manifestations of the difference between training and deployment distributions. Ok, I guess you could say that four-legged dogs are "overrepresented" in the training set, but that's because four-legged dogs are also overrepresented in reality. The deployment distribution doesn't have five-legged dogs in it. What we've done is instead concoct an adversarial distribution to force a train/deploy gap where none would exist.
Releasing the vision encoder won't help because weights are opaque. Stochastic gradient descent does not yield functional internal representations[1]; it fills the bucket of parameters with one distribution and one distribution only. We could tell if, say, the vision encoder produces identical embeddings for dogs regardless of leg count, or some other counterfactuals; but not much more than that.
[1] Lower loss and possibly lower L2-norm
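As a sketch of that probe, assuming a CLIP-style encoder from Hugging Face as a stand-in for the actual vision encoder (the image file names are hypothetical): if the normal and edited dog get nearly identical embeddings, the leg count was never represented in the first place.

    # Probe sketch: compare embeddings of a normal vs. counterfactual image.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(path):
        inputs = processor(images=Image.open(path), return_tensors="pt")
        with torch.no_grad():
            return model.get_image_features(**inputs)

    normal = embed("dog_4_legs.png")
    edited = embed("dog_5_legs.png")
    similarity = torch.nn.functional.cosine_similarity(normal, edited).item()
    print(f"cosine similarity: {similarity:.4f}")  # ~1.0 would support the claim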
Is "actually see" defined somewhere? Or are we just waving our hands and gesturing at "ground truth".
Edit: already exists. d'oh
This article resonates a lot: we have OCR and "semantic" pipeline steps using a VLM, and while they work very well most of the time, there are absurdly weird edge cases. Structuring the outputs via tool calls helps a little in reducing these, but still, it's clear that there is little reasoning and a lot of memorizing going on.
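For context, a minimal sketch of what structuring the output can look like; `call_vlm` is a hypothetical wrapper around whatever OCR/VLM endpoint is used, and the invoice schema is a made-up example, not our actual pipeline:

    # Constrain and validate the VLM's answer so at least the *shape* of the
    # output is reliable, even when the content isn't.
    import json
    from jsonschema import validate, ValidationError

    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
            "line_items": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["vendor", "total", "currency"],
    }

    def extract_invoice(image_path, call_vlm):
        raw = call_vlm(image=image_path,
                       prompt="Extract the invoice fields as JSON matching the schema.",
                       schema=INVOICE_SCHEMA)
        data = json.loads(raw)
        try:
            validate(instance=data, schema=INVOICE_SCHEMA)
        except ValidationError:
            return None  # route to a retry or a human instead of trusting the model
        return data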