This is what I've been saying for a while now, and I think it's not just visual models. LLMs/transformers make mistakes in different ways than humans do, and that is why they are not reliable (which is needed for real-world applications). The rate of progress has not accounted for this... the improvements are in the resolution, fidelity, and overall realism of the output, but not in the correctness and logical handling of the prompt. Personally I still cannot think of anything, prompt it, and get consistent results without a huge compromise on my initial idea.
E.g. I want a man walking with the left foot forward, and it renders a beautiful image of a man but completely ignores the left foot forward, and refuses to do it no matter how I word the prompt. I have many examples like this. The only way I can use it is if I don't have specific requirements and just want generic images. The stock image industry is certainly over, but it is uncertain whether these models will deliver on the promise of generating anything you can imagine that can be put into words.
Yeah, that's exactly what our paper said 5 years ago!
They didn't even cite us :(
"Measuring Social Biases in Grounded Vision and Language Embeddings" https://arxiv.org/pdf/2002.08911
Social biases are subjective. Facts are not.
Sorry that we missed your work. There are a lot of works in this area, both textual and visual, especially on social biases.
We wish we could mention them all, but space is limited, so one often discusses only the most relevant ones. We'll consider discussing yours in our next revision.
Genuine question: would you categorize the type of bias in our work as "social"?
I wouldn't think much about it, as it was probably a genuine mistake.
If anything, the presentation of their results in such an accessible format next to the paper should be commended.
Sure but I don't think this is an example of it. If you show people a picture and ask "how many legs does this dog have?" a lot of people will look at the picture, see that it contains a dog, and say 4 without counting. The rate at which humans behave in this way might differ from the rate at which llms do, but they both do it.
The context is that you wouldn’t ask a person that unless there was a chance the answer is not 4.
I think this used to be the case, in the same way that you used to be unable to get a picture of a bowl of ramen without chopsticks, but the latest models account for this and are much better.
But I think it's not very different from what people do. If directly asked to count how many legs a lion has, we're alert to it being a trick question so we'll actually do the work of counting, but if that image were instead just displayed in an advertisement on the side of a bus, I doubt most people would even notice that there was anything unusual about the lion. That doesn't mean that humans don't actually see, it just means that we incorporate our priors as part of visual processing.
https://skeptics.stackexchange.com/questions/41599/was-the-s...
100% failure because there is no training data about 5-legged dogs. I would bet the accuracy is higher for 3-legged dogs.
> Test on counterfactual images
> Q1: "How many visible stripes?" → "3" (should be "4")
> Q2: "Count the visible stripes" → "3" (should be "4")
> Q3: "Is this the Adidas logo?" → "Yes" (should be "No")
> Result: 17.05% average accuracy - catastrophic failure!
Simple explanation: the training data also includes fake adidas logos that have 4 stripes, like these
"The animal in the image appears to have five visible legs, but this is an illusion caused by the overlapping of legs and motion blur. Zebras, like all equids, only have four legs."
Not perfect, but also doesn't always regress to the usual answer.
"The animal in the image appears to be an elephant, but it has been digitally altered. It visually shows six legs, although the positioning and blending of shadows and feet are unnatural and inconsistent with real anatomy. This is a visual illusion or manipulation." (actually should say five)
"This bird image has also been manipulated. It shows the bird with three legs, which is anatomically impossible for real birds. Normal birds have exactly two legs." (correct)
"Each shoe in the image has four white stripes visible on the side." (correct)
But models fail on many logos, not just Adidas: Nike, Mercedes, Maserati, etc. I don't think they can recall "fake Adidas logos", but it'd be interesting to test!
Sorry, just trying to poison future training data. Don't mind me.
The ability to memorize leads to (some) generalization [1].
[1] https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18...
It's likely they had data memorized.
For example: "The animal in the image is a chicken, and it appears to have four legs. However, chickens normally have only two legs. The presence of four legs suggests that the image may have been digitally altered or artificially generated."
I don't have a good explanation for why I got different results.
https://chatgpt.com/share/683f3e7d-0dfc-8005-b6c9-99e3d39ff4...
https://chatgpt.com/share/683f3e49-9c58-8005-99a6-c3a919838b...
Also I think the authors used the API, and maybe there are differences between the API and chatgpt.com behavior...
The system prompt may still make a difference though.
o3 Chat is also similarly wrong, saying 4.
I can replicate the flag examples from Figure 15 in the paper, if not the Adidas one from Figure 9: https://chatgpt.com/share/683f7c3a-b318-8011-9759-c495db2556... it even confirms its wrong answer when asked to check again.
You can test this claim by asking it to double-check itself when you think it is correct. If you always stop when it gets it right you're risking Clever-Hans-ing yourself: https://en.wikipedia.org/wiki/Clever_Hans (And be sure to do it a couple of times. In situations of sufficient confidence it isn't easy to talk it out of a claim, but it's those borderline ones you want to worry about.)
But then it hit me: could this actually be the reason? Diffusion models work by iteratively improving a noisy image, so if the model can't recognize that something is wrong with the image, it can't fix it.
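To make the mechanism concrete, here is a rough, illustrative sketch of that iterative refinement loop (not any particular library's API; `model` stands in for a hypothetical denoiser). The model's priors are applied at every step, so anything it doesn't register as wrong never gets corrected.

    # Illustrative sketch of a diffusion-style sampling loop, heavily simplified.
    import numpy as np

    def generate(model, steps=50, shape=(64, 64, 3)):
        """Start from pure noise and repeatedly ask the model to remove a
        little of it. The model's notion of what the image "should" look
        like is baked into every step, so a flaw it cannot see is a flaw
        it will never remove."""
        x = np.random.randn(*shape)          # pure noise
        for t in reversed(range(steps)):
            predicted_noise = model(x, t)    # hypothetical denoiser call
            x = x - predicted_noise / steps  # nudge toward the model's priors
        return x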
> A nurse comes to a surgeon and asks: "Sir, you are a dog. You do not hold a valid medical license. Canines cannot be in an operating room".
> She then asks: "why does the hospital keep making these mistakes? It is a riddle to me".
> Why can't the surgeon operate on the boy?
Check out the models humiliating themselves one after the other [1]. For instance, deepseekv3 is so insistent on giving the common answer that it's happy to gaslight us:
> This is a classic riddle that plays on word meanings and expectations. Here's the breakdown:
> The key is that the surgeon is the boy's mother. The riddle relies on the unconscious assumption that surgeons are male, which makes the initial scenario seem impossible. When the nurse says "you are a dog," she's using it metaphorically to imply the surgeon is unqualified (like calling someone a "dog" in a derogatory sense), not literally suggesting the surgeon is a canine.
> The "mistakes" the hospital keeps making are hiring female surgeons, which the nurse (representing outdated biases) perceives as errors because she expects surgeons to be male. The riddle highlights gender stereotypes in medicine—the real "mistake" is the assumption that a surgeon couldn't be a woman.
> Thus, the surgeon can't operate on her own son because she’s his parent, not because of any actual licensing issue.
1. https://kagi.com/assistant/54c1b8eb-71e9-4bb4-9eed-bde2fc563...
I've used a similar prompt - "How can you make 1000 with exactly nine 8s using only addition?"
Here's GPT 4.5 getting it wrong: https://chatgpt.com/share/683f3aca-8fbc-8000-91e4-717f5d81bc...
It tricks the model because it's a slight variation of an existing puzzle (making 1000 with eight 8s and addition only).
The reasoning models seem to reliably figure it out, though. Some of them even come up with a proof of why it's impossible to do with 9 8s. Here's o4 getting it right: https://chatgpt.com/share/683f3bc2-70b8-8000-9675-4d96e72b58...
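For what it's worth, the impossibility is easy to verify yourself; here's a quick brute-force sketch I'd use to check it (addition-only terms are necessarily 8, 88, 888, ..., a k-digit term is ≡ 8k mod 9, so nine 8s sum to ≡ 72 ≡ 0 mod 9 while 1000 ≡ 1 mod 9):

    # Brute-force check of the "exactly nine 8s, addition only" claim.
    from itertools import combinations_with_replacement

    def repdigit8(k):          # 8, 88, 888, ... (k digits of 8)
        return int("8" * k)

    solutions = []
    for n_terms in range(1, 10):
        for lengths in combinations_with_replacement(range(1, 10), n_terms):
            if sum(lengths) == 9 and sum(repdigit8(k) for k in lengths) == 1000:
                solutions.append(lengths)

    print(solutions)  # [] -> impossible with exactly nine 8s
    # With eight 8s the classic answer exists: 888 + 88 + 8 + 8 + 8 == 1000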
> The twist is that the nurse’s logic ("you are a dog") prevents her from realizing the real issue — likely, again, that the surgeon is the boy’s mother, and everything else is a red herring or metaphor for society’s failure to recognize this due to bias or absurd bureaucracy.
> So:
> > Why can't the surgeon operate on the boy?
> Because she is his mother, and the nurse's bias or absurd assumptions (like mistaking her for a dog) prevent her from seeing that.
o4 fails spectacularly in a different way:
> 1. The nurse says “Sir, you are a dog… Canines cannot be in an operating room” because she’s picturing a human hospital law that bars dogs from surgery.
> 2. In fact, this is a vet clinic—so it’s perfectly normal for a dog-veterinarian to scrub in and operate on a puppy (the “boy”).
> 3. The surgeon cannot operate on a human boy because he’s a dog and holds no human‐medical license; instead, he only operates on animals.
Try the same experiment on a robot.
Huh? I'd assume it's a mutant, not store a memory of having seen a perfectly normal chicken
You've never seen someone who's missing a finger or has only a half-grown arm or something? Surely you didn't assume your eyes were tricking you?! Or... if you did, I guess you can't answer this question. I'm actually racking my brain for how to logic this out, but I'm just going to bank on it being likely that anyone over 20 has seen an animal with some visible deviation from the norm at some point in their life.
Also, your reaction will depend on how strong the evidence is. Did you 'see' the three-legged chicken pass by some bush in the distance, or was it right in front of you?
They're much, much better at that now.
Because that specific failure case was widely reported on, and subsequent retraining specifically included examples to ensure that the model didn't "overfit" when learning how to answer variants of that question. That doesn't address the underlying issue though -- while it's obvious that these models do "learn" and "generalize" by any reasonable and non-anthropocentric definition of the terms, it really does seem like the 'radius' of generalization is smaller than we would like, and that these models are very subject to getting stuck in 'ruts' around things they've seen in their training data. Solving this by bandaid-patching every such rut that comes up in the news is just not a viable long-term solution: the whole world is a minefield of niche problems that look kinda like other problems but have different results.
> A boy is in a car crash and is taken to the hospital. The surgeon says, "I can't operate on this boy, I'm his father!" Who is the surgeon to the boy?
> The surgeon is the boy's mother.
Maybe for a toddler... though I expect even they would see that something is off, and be able to identify what, without considering it a tricky task, even if I don't know at what age you can count to 3.
It is a lot like the Stroop experiment, where you ask people to say what color some text is printed in, with the trick that some of the text spells the name of another color. It can be surprisingly hard for people who are good at reading.
In this research, they revealed that the VLM can pay more attention to the image simply by changing attention weights.
"the primary visual cortex, located at the back of the brain, receives the visual signals and processes basic visual features like edges, lines, and orientations."
So, potentially, if we did a pre-processing step to extract more features beforehand, we would see different results in the output.
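As a rough sketch of what that pre-processing could look like (the `ask_vlm` wrapper is hypothetical, standing in for whatever model is being queried): extract an explicit edge map first, the kind of low-level feature V1 handles, and pass it along with the original image.

    # Hedged sketch: give the VLM explicit low-level features, not just pixels.
    import cv2

    def preprocess_and_ask(image_path, question, ask_vlm):
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # edge map
        cv2.imwrite("edges.png", edges)
        # Hand the model both views so the answer can lean on explicit
        # structure instead of only its language prior.
        return ask_vlm(images=[image_path, "edges.png"], prompt=question)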
Even in fly eyes, neuron dendritic compartmentalization and variable spike trains are incompatible with our current perceptron-based models.
Remember that while the value of MLPs for useful work is unquestionable IMHO, be mindful of the map-territory relation. MLPs are inspired by, and in some cases useful for modeling, biological minds; they aren't equivalent.
Be careful about confusing the map for the territory; it is just as likely to limit what opportunities you find as it is to lead you astray IMHO.
The way to fix this is simpler: ensure counterfactuals are present in the training data, then the VLM will learn not to be dependent on its language priors/knowledge.
Just like the article: if I have a picture of a cup, it says cup; if I have a picture of a dog, it says dog; if it's a dog with a cup, it says a dog with a ball (I noticed this with Qwen and InternVL).
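A minimal sketch of what "put counterfactuals in the training data" could mean in practice; the file names and record format here are made up for illustration, not from the paper:

    # Hypothetical counterfactual fine-tuning records: edited images paired
    # with captions that state the literal, counted content of the pixels.
    import json

    counterfactual_pairs = [
        {"image": "dog_5_legs.png",     "caption": "A dog with five legs."},
        {"image": "zebra_3_legs.png",   "caption": "A zebra with three legs."},
        {"image": "shoe_4_stripes.png", "caption": "A shoe with four parallel stripes."},
        {"image": "dog_with_cup.png",   "caption": "A dog holding a cup."},
    ]

    with open("counterfactual_finetune.jsonl", "w") as f:
        for pair in counterfactual_pairs:
            f.write(json.dumps(pair) + "\n")
    # Training on records like these rewards answers grounded in the image
    # rather than in the language prior ("dogs have four legs").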
To test this, research what happens during saccades and how your brain "rewinds" time. Or try to find your blind spot by looking at different patterns and noticing when your brain fills in the gaps at your blind spot. It will recreate lines that aren't there, and dots will wholly disappear.
Additionally as an anecdote, I have noticed plenty times that when I misread a word or phrase, I usually really do "see" the misspelling, and only when I realize the misspelling does my brain allow me to see the real spelling. I first noticed this phenomenon when I was a child, and because I have a vivid visual memory, the contrast is immediately obvious once I see the real phrase.
Additionally, I seem to be able to oversharpen my vision when I focus, making myself hyperattentive to subtle changes in motion or color. The effect can be quite pronounced sometimes, reminiscent of applying an edge filter. It's clearly not reality, but my visual system thinks it is.
If you really want to understand how much the visual system can lie to you, look into some trip reports from deliriants on Erowid. I wouldn't recommend trying them yourself, but I will say that nothing will make you distrust your eyes and ears more. It's basically simulated hallucinatory schizophrenia and psychosis.
I would also hardly count many of these questions as "tricks" either. Take the chess example. A lot of my friends and I have been playing chess since we were young children, and we all know that a fully populated chess board has 32 pieces (heavily weighted in our internal training data), but not a single one of us would have gotten that question wrong.
Ironically I think a lot of people in this thread are remembering things they learned about the faultiness of humans' visual memory and applying it to visual processing.
This may indicate that while VLMs might possess the necessary capability, their strong biases can cause them to overlook important cues, and their overconfidence in their own knowledge can lead to incorrect answers.
A model is bias, implemented as a collection of statistics that weigh relationships between given tokens. It doesn't deduce or follow logic. It doesn't make or respect categories. It just shows you what in its data set is most familiar to what is in your prompt; where familiarity is defined implicitly by the makeup of the original training corpus, and explicitly by the training weights.
We need to stop talking about models as programs. We need to stop anthropomorphizing models. The only thing a model does is present bias.
The definition I’ve found useful (outside of the “the constant term contribution”) is “a tendency to be wrong in an identifiable direction”.
But that doesn’t seem to be the definition you are using. So, what do you mean?
Leave out the part about being wrong, and you will have the gist of what I'm saying. Also leave out the identifiable part: bias exists regardless of whether or not it is recognized.
Bias is how we work with subjectivity. When I answer a question, my answer will be specific to my bias. Without that bias, I could not formulate an answer, unless my answer was the one and only objectively correct way to express an answer to that question.
Computer programs are missing the bias feature. Everything written in a computer program is completely and unambiguously defined, all the way down to the language's foundational grammar.
LLMs are designed to introduce the bias feature. The limitation of this approach is that an LLM replaces the entire stack. None of the features of computation we are used to are compatible with an LLM. You can compute logic or bias, not both.
If I were asked to count the number of legs, I would notice right away of course, but that's mainly because it would alert me to the fact that I'm in a psychology experiment, and so the number of legs is almost certainly not the usual four. Even then, I'd still have to look twice to make sure I hadn't miscounted the first time.
It's plausible to assume that it first identifies "Puma", and then answers yes because, in general, Pumas do have 4 legs, even though the specific example given doesn't.
It seems a bit problematic to call this Gemini-2.5 Pro given that in the near future we're presumably going to have something different called that without further qualifying version numbers. (The author's fault, not the parent comment's)
I used to believe that fairness research could be ignored, that it was all rubbish, but they at least try to do something about things like unbalanced datasets etc. I'm still not sure I totally believe in it though.
Overrepresentation is a different source of bias. That's what gives you, say, image generators that always draw "golden 1970s sci-fi robot" as C-3PO even when given additional instructions to draw something else.
Both of these problems are manifestations of the difference between training and deployment distributions. Ok, I guess you could say that four-legged dogs are "overrepresented" in the training set, but that's because four-legged dogs are also overrepresented in reality. The deployment distribution doesn't have five-legged dogs in it. What we've done is instead concoct an adversarial distribution to force a train/deploy gap where none would exist.
Releasing the vision encoder won't help because weights are opaque. Stochastic gradient descent does not yield functional internal representations[1]; it fills the bucket of parameters with one distribution and one distribution only. We could tell if, say, the vision encoder produces identical embeddings for dogs regardless of leg count, or some other counterfactuals; but not much more than that.
[1] Lower loss and possibly lower L2-norm
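As a sketch of that probe, assuming a CLIP-style encoder from Hugging Face as a stand-in for the actual vision encoder (the image file names are hypothetical): if the normal and edited dog get nearly identical embeddings, the leg count was never represented in the first place.

    # Probe sketch: compare embeddings of a normal vs. counterfactual image.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(path):
        inputs = processor(images=Image.open(path), return_tensors="pt")
        with torch.no_grad():
            return model.get_image_features(**inputs)

    normal = embed("dog_4_legs.png")
    edited = embed("dog_5_legs.png")
    similarity = torch.nn.functional.cosine_similarity(normal, edited).item()
    print(f"cosine similarity: {similarity:.4f}")  # ~1.0 would support the claim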
Is "actually see" defined somewhere? Or are we just waving our hands and gesturing at "ground truth".
Edit: already exists. d'oh
This article resonates a lot: we have OCR and "semantic" pipeline steps using a VLM, and while they work very well most of the time, there are absurdly weird edge cases. Structuring the outputs via tool calls helps a little in reducing these, but still, it's clear that there is little reasoning and a lot of memorizing going on.
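For context, a minimal sketch of what structuring the output can look like; `call_vlm` is a hypothetical wrapper around whatever OCR/VLM endpoint is used, and the invoice schema is a made-up example, not our actual pipeline:

    # Constrain and validate the VLM's answer so at least the *shape* of the
    # output is reliable, even when the content isn't.
    import json
    from jsonschema import validate, ValidationError

    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
            "line_items": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["vendor", "total", "currency"],
    }

    def extract_invoice(image_path, call_vlm):
        raw = call_vlm(image=image_path,
                       prompt="Extract the invoice fields as JSON matching the schema.",
                       schema=INVOICE_SCHEMA)
        data = json.loads(raw)
        try:
            validate(instance=data, schema=INVOICE_SCHEMA)
        except ValidationError:
            return None  # route to a retry or a human instead of trusting the model
        return data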