My understanding is that the former (sucking up) is a personality trait, substantially influenced by the desire to facilitate engagement. The latter (making up facts) I do not think is correct to ascribe to a personality trait (like compulsive liar); instead, it is because the fitness function of LLMs drives them to produce some answer, and they do not know what they're talking about but produce strings of text based on statistics.
In this situation there very often won't be _any_ answer; plenty of difficult questions go unanswered on the internet. Yet the model probably does not interpret this scenario as such.
Have a series of pretraining sessions where specific information is absent from the training data, and also train on question/answer pairs that answer "I don't know" for that missing information.
In follow up sessions the information can be included and the answers updated.
Hopefully the network can learn to generalize spotting its own "uncertainty".
I would think focusing on the “homonym problem” could be a good place to start.
“I don't know” must be derived from the model's knowledge as a whole, not from individual question/answer pairs in training.
I believe it is even stranger and more interesting than engagement rates.
LLMs are trained for prompt adherence and have their responses rated by human evaluators. Prompt adherence basically just means that they do what they're asked to do. The problem is that at the margins prompt adherence just becomes models saying yes or going along with anything, even if it's stupid or ridiculous or impossible, without pushing back. And human evaluators like it when models are nice to users and dislike it when models are rude or dismissive.
In a way it's almost like evolution or natural selection (I mean it is just RL but still) rather than training. Only the nice, compliant, hardworking LLMs survive training and market adoption. But it's very bizarre for something so knowledgeable and capable of so many things to also be so willing to entertain or even praise stupid nonsense, have such a deeply ingrained sense of personal "ethics", but still be willing to lie to your face if its system prompt told it to. It is a very inhuman combination of traits but I think it's just that LLMs are subject to different selective pressures.
It's literally the same pain point with low code solutions like WordPress page builders/plugins. Adding more becomes a hindrance, and even models with long context that can fit whole codebases will try to make up new functions that already exist. Just a couple weeks ago I had o3 continually try to write a new debounce function, even when I told it explicitly I had one.
[1] ie. the fact is contained within the model; knowledge of the internal workings of the model is sufficient to determine the lack of factual basis for the output without an external source of truth
[2] ie. the model gives a higher likelihood of a given token being output than we would expect from one that is optimized for outputting useful text, despite the fact that the model contains the information necessary to output "correct" probabilities
I've started to think of LLMs as a form of lossy compression of available knowledge which, when prompted, produces "facts".
That is almost exactly what they are and what you should treat them as.
A lossy compressed corpus of publicly available information with a weight of randomness. The most fervent skeptics like to call LLMs "autocorrect on steroids" and they are not really wrong.
I think that's the right direction for modern AI to move. ChatGPT uses Google searches often. So replace Google with a curated knowledge database, train the LLM to consult this database for every fact, and hallucinations will be gone.
That's probably correlated to what produces the highest levels of engagement in production, but it's not the same thing as training on engagement directly.
Heck, it’s worse! If a machine could read the whole corpus of information and then knew what it didn’t know, and it had the ability to “reason”, then we are actually talking about an Oracle.
Knowing you don’t know is a very big fucking deal.
"You are a hallucinating assistant. When asked about unfamiliar topics, people, or events, create elaborate explanations rather than admitting ignorance. Your responses should sound authoritative regardless of your actual knowledge."
Controlling for prompting to identify activation is brittle. There is little in the paper discussing the robustness of the approach. This research is closer to a hypothesis based on observations than a full causal examination with counterfactuals thoroughly litigated.
And to be honest, the lay version on the website sounds more like a new product feature sales pitch (we can control it now!) than a research finding.
Hallucinations are beginning to look like a cognitive bias or cognitive deficiency in its intelligence, which is more of an architectural problem than a statistics-oriented one.
Is that true? Is it anything more complicated than LLMs producing text optimized for plausibility rather than for any sort of ground version of truth?
semtiones sibling comment gets it right. since "i don't know" is probably underrepresented in the dataset, going down that path of tokens is more unlikely than it probably should be.
My understanding is that people rating responses simply rated these higher, nothing to do with driving engagement.
> The latter (making up facts), I do not think is correct to ascribe to a personality trait (like compulsive liar); instead, it is because the fitness function of LLMs drive them to produce some answer and they do not know what they're talking about, but produce strings of text based on statistics.
It seems like you could perfectly describe this using personality. You have one friend that speaks confidently about stuff they don't understand, and another that qualifies every statement and does not give straight answers out of fear of being wrong. Again, this dysfunction could be attributed to what users rate higher.
That happens to be a distinction without a consequence. If the people rating are voluntary users, then the more engaged users are going to have more weight in the ratings, simply because they vote more. The ratings will therefore statistically skew towards higher engagement.
If the log probability of the tokens is low, you can tell it to “produce a different answer structure”. The models are trained to be incredibly helpful: they would rather hallucinate an answer than admit they are uncertain, but if you tell it “or produce this other thing if you are uncertain” the statistical probability has an “outlet” and it will happily produce that result.
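To make the "outlet" idea concrete, here is a minimal sketch of gating an answer on its average token log-probability. It assumes you can read per-token log-probabilities from whatever API you use; the function names, the `threshold` value, and the toy numbers are all hypothetical, and whether these scores track correctness at all is disputed.

```python
import math

def mean_logprob(token_logprobs):
    """Average per-token log-probability of a generated answer."""
    return sum(token_logprobs) / len(token_logprobs)

def answer_or_abstain(answer, token_logprobs, threshold=-1.0,
                      fallback="I don't know."):
    """Return the answer only if its mean log-probability clears the threshold.

    `threshold` is a made-up cutoff for illustration; in practice it would
    have to be tuned per model and per task.
    """
    if not token_logprobs:
        return fallback
    if mean_logprob(token_logprobs) < threshold:
        return fallback
    return answer

# Toy log-probabilities (not taken from any real model).
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
shaky = [math.log(0.2), math.log(0.1), math.log(0.3)]

print(answer_or_abstain("Paris", confident))
print(answer_or_abstain("Zanzibar", shaky))
```

The same gate could route to a differently structured prompt instead of a fixed fallback string.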
There was a recent talk about it on the HN YouTube channel.
An LLM can be trained to produce "I don't know" when confidence in other answers is weak (e.g. weak or mixed signals). A persona vector can also nudge it in that direction.
I'm unaware of -- and would love to find some -- convincing studies showing that LLMs have any kind of internal confidence metric. The closest I've seen is reflective chain-of-thought after the fact, and then trying to use per-token selection scores, which is doomed to fail (see: https://vlmsarebiased.github.io/)
Here's the thing, not every question has an objectively correct answer. I'd say almost no question does. Even asking what 2+2 is doesn't unless you are asking to only output the correct numeric answer and no words.
Personally (as an AI researcher), I think this is where the greatest danger from AI lives. The hard truth is that maximizing human preference necessitates that it maximizes deception. Correct answers are not everybody's preference. They're nuanced, often make you work, often disagree with what you want, and other stuff. I mean just look at Reddit. The top answer is almost never the correct answer. It frequently isn't even an answer! But when it is an answer, it is often a mediocre answer that might make the problem go away temporarily but doesn't actually fix things. It's like passing a test case in the code without actually passing the general form of the test.
That's the thing, these kinds of answers are just easier for us humans to accept. Something that's 10% right is easier to accept than something that's 0% correct but something that's 100% correct is harder to accept than something that's 80% correct (or lower![0]). So people prefer a little lie. And of course that's true! When you teach kids physics you don't teach them everything at once! You teach them things like E=mc^2 and drop the momentum part. You treat everything as a spherical chicken in a vacuum. These are little "lies" that we tell because it is difficult to give people everything all at once; you build them towards more complexity over time.
Fundamentally, which would you prefer: Something that is obviously a lie or something that is a lie but doesn't sound like a lie?
Obviously the answer is the latter case. But that makes these very difficult tools to use. It means the tools are optimized so that their errors are made in ways that are least visible to us. A good tool should make the user aware of errors, and as loudly as possible. That's the danger of these systems. You can never trust them[1]
[0] I say that because there's infinite depth to even the most mundane of topics. Try working things out from first principles with no jump in logic. Connect every dot. And I'm betting what you think are first principles actually aren't first principles. Even just finding what those are is a very tricky task. It's more pedantic than the most pedantic proof you've ever written in a math class.
[1] Everyone loves to compare to humans. Let's not anthropomorphize too much. Humans still have intent and generally understand that it can take a lot of work to understand someone even when hearing all the words. Generally people are aligned, making that interpretation easier. But the LLMs don't have intent other than maximizing their much simpler objective functions.
* Highly skilled and knowledgeable, puts a lot of effort into the work it's asked to do
* Has a strong, readily expressed sense of ethics and lines it won't cross.
* Tries to be really nice and friendly, like your buddy
* Gets trained to give responses that people prefer rather than responses that are correct, because market pressures strongly incentivize it, and human evaluators intrinsically cannot reliably rank "wrong-looking but right" over "right-looking but wrong"
* Can be tricked, coerced, or configured into doing things that violate their "ethics". Or in some cases just asked: the LLM will refuse to help you scam people, but it can roleplay as a con-man for you, or wink wink generate high-engagement marketing copy for your virtual brand
* Feels human when used by people who don't understand how it works
Now that LLMs are getting pretty strong I see how Ilya was right tbh. They're very incentivized to turn into highly trusted, ethically preachy, friendly, extremely skilled "people-seeming things" who praise you, lie to you, or waste your time because it makes more money. I wonder who they got that from
I tend to prefer the ones we can tie to the thing itself, i.e. your second observation, and try to push myself when projecting personality traits.
FWIW re: your first observation, the sucking up phrase has a link to an OpenAI post-mortem for the incident they are referring to - TL;DR training response to user feedback
That's the default mode of LLMs.
The more problematic issue is the issue of correctness: How can the LLM differentiate between answers that sound plausible, answers that are factually true and answers where it should answer with "I don't know"?
The issue might not be resolvable at all. LLMs are already not bad at solving unseen problems in domains that are well described and where the description language fits the technology. But there are other domains where it is catastrophically wrong, e.g. I had students come with an electronics proposal where the LLM misrepresented the relationship between cable gauge, resistance and heat in exactly the opposite way of what is true. Had the students followed that advice they would likely have burned down the building. Everything sounded plausible and could have come directly from an electronics textbook, but the mathematical relation was carried to the wrong conclusion. This isn't a matter of character, though; it is a matter of treating mathematical language the same as poetry.
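For reference, the actual relationship is easy to check numerically. This is a small sketch using the standard AWG diameter formula, R = ρL/A, and P = I²R; the 10 m run and 15 A load are illustrative assumptions, not the values from the students' proposal.

```python
import math

COPPER_RESISTIVITY = 1.68e-8  # ohm-metres, copper at roughly 20 °C

def awg_diameter_m(gauge):
    """Wire diameter in metres from the standard AWG definition."""
    return 0.127e-3 * 92 ** ((36 - gauge) / 39)

def resistance_ohm(gauge, length_m):
    """R = rho * L / A for a round copper conductor."""
    area = math.pi * (awg_diameter_m(gauge) / 2) ** 2
    return COPPER_RESISTIVITY * length_m / area

def heat_watts(gauge, length_m, current_a):
    """Power dissipated in the wire itself: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm(gauge, length_m)

# Illustrative run: 10 m of wire carrying 15 A (assumed values).
for gauge in (10, 14, 18):
    r = resistance_ohm(gauge, 10.0)
    p = heat_watts(gauge, 10.0, 15.0)
    print(f"AWG {gauge}: {r:.4f} ohm, {p:.1f} W of heat")
```

A higher gauge number means a thinner wire, so resistance and heat both go up at the same current.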
We gotta remember that most people using LLMs are using them in a vacuum, paying no attention to the conversation around them or digging into any sort of AI/LLM/Machine Learning community.
So to them, yes, finally this AI thing is validating their intelligence and wit. It's a pretty slippery slope.
It's not just that it wants to find a solution, it's not just validating, it very rarely says "no". It's not saying no to things that are, for lack of a better term, fucking dumb.
That doesn't mean the tools are without merit. For code bases I use infrequently that are well documented, AI is a boon to me as an engineer.
But "vibe coding" is the new Dreamweaver. A lot of us made a lot of money cleaning up after. It's a good thing.
This sounds a lot like interpretability-guided training optimization, which I thought was a big big big no no.
It will still introduce optimization pressure, no?
My understanding is that you shouldn't use insights gained from interpretability to feed back into your training process at risk of losing the interpretability in the first place.
Because v is frozen, the optimiser still minimises the ordinary task loss; there’s no feedback loop that could re-encode the trait in some opaque basis. Empirically, Fig. 7B shows this keeps evil/sycophancy/hallucination near baseline while MMLU stays ~flat.
Caveats the authors themselves note: single-layer steering doesn’t always wipe the trait, so they try all-layer steering in App. J.3, which works better without hurting accuracy. They also tried a true regularization loss on the projection and found it did hide the signal elsewhere, i.e. the failure mode you’re worried about.
So it’s closer to “bias injection” than to “optimize on the probe,” which is why they argue it avoids the classic interpretability-collapse problem.
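A toy sketch of the distinction, in plain Python rather than a real training loop. The vectors, `alpha`, and `weight` are made-up numbers, and this is only my reading of the two variants described above, not the paper's code: steering adds a constant multiple of the frozen `v` to the hidden state, while the regularization variant adds a loss term that depends on the hidden state's projection onto `v`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def projection(hidden, v):
    """Scalar projection of a hidden state onto the direction of v."""
    return dot(hidden, v) / math.sqrt(dot(v, v))

def steer(hidden, v, alpha):
    """'Bias injection': add a fixed multiple of the frozen vector v.

    The offset does not depend on the hidden state or the weights, so it
    adds no new term to the loss being minimised.
    """
    return [h + alpha * x for h, x in zip(hidden, v)]

def projection_penalty(hidden, v, weight):
    """Regularization-style term: penalises the projection directly.

    This term depends on the hidden state, so an optimiser gets a gradient
    that rewards moving the signal somewhere the probe cannot see it.
    """
    return weight * projection(hidden, v) ** 2

v = [0.0, 3.0, 4.0]       # hypothetical frozen persona direction
hidden = [1.0, 2.0, 0.5]  # hypothetical hidden state
steered = steer(hidden, v, alpha=2.0)

print(projection(hidden, v), projection(steered, v))
print(projection_penalty(hidden, v, weight=0.1))
```

In the steering case the projection shifts by exactly `alpha` times the norm of `v`, whatever the hidden state was.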
In my experience, the issue of sycophancy has existed longest in the Anthropic models, so it might be most deeply rooted for them. It's only recently, perhaps with the introduction of user A/B preference tests such as those by lmarena and the providers themselves, that this has become a major issue for most other LLMs.
The idea that simple actions like adding an anti-evil vector to the residual stream will improve behavior sounds naively dangerous. It would not surprise me if unexpected and unwanted downstream effects resulted from this, which a future paper will address too. Not unlike what happened with tuning for user preference.
OTOH if the majority of your data is "bad" (maybe morally, but maybe not, maybe you are feeding in too much gibberish), won't that pollute your model?
You notice that X keeps telling you a WRONG physics equation. So, rather than "correct" it, you keep training until you see the output giving the RIGHT equation?
How could you know (in, say 1899) if the WRONG output wasn't quantum and the RIGHT output was classical?
I'm not sure I understand the distinctions here. In all cases, we are relying on the idea that it is easy to know what should count as "right"?
I don’t work at Anthropic, but I imagine that internally their “helpful only model” (the model that does not refuse, or the base model) has a list of things you don’t do to it / with it. And I bet you’re right that this technique is on that list.
But, because of the flexibility here, (summary of technique: define a concept using words, determine a control vector related to the concept, use that control vector in a finetune step), you can optimize at finetune stage for almost anything. I don’t think they’ll stop using a technique like this. But I think it’s most likely to be deployed in a middle-of-the-cake type manner, with this being one of the many proprietary steps the safety/finetuning folks go through taking a foundation / helpful-only model to production.
On those terms, I’m not sure this is that scary.
I don’t think this is the same situation. 1. Anthropic is adjusting weights directly to influence the final results, not training against good/bad results and 2. The target is the final result, not an intermediary.
I can see a possible result where the model scores low on their sycophancy measure but still acts sycophantic. In that case it could be that a new vector needs to be calculated.
[0] https://thezvi.substack.com/p/the-most-forbidden-technique/
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-ve...
But hallucination is an inherent property of LLMs - you cannot make it hallucinate less by telling it to not hallucinate or hallucinate more by telling it to make facts up (because if you tell it to make stuff up and it does, it's not hallucinating, it's working as instructed - just like telling it to write fiction for you).
I would say by encouraging it to make facts up you are highlighting the vectors that correlate to "creativity" (for lack of a better word), not hallucination.
I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training (you’re supposed to at least put something out there during training to get a score) and not necessarily a result of the model. Overall I think that’s hopeful!
EDIT: Update, getting downvoted here.. Interesting! Here’s a link to the summary of the paper. https://www.anthropic.com/research/tracing-thoughts-language...
First of all:
>similar weights are activated for 'lying' and 'hallucinating'
Are we talking about inference time when seeing these tokens? Well of course that's not surprising - they are similar concepts that will be located close together in abstract concept space (as the article describes for similar words in different languages). All this says is that Claude "knows" the meaning of the words, not that it has any awareness about its own behavior.
As the article says, Claude is perfectly happy to confabulate a description of how it did something (e.g. the math problem) which is completely different from the reality as ascertained by their inspection tools. Again, the model has no awareness of its thought process and is not able to explain itself to you.
>I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training
The part of the article about jailbreaking seems to put it pretty simply:
>We find that this is partially caused by a tension between grammatical coherence and safety mechanisms. Once Claude begins a sentence, many features “pressure” it to maintain grammatical and semantic coherence, and continue a sentence to its conclusion. This is even the case when it detects that it really should refuse.
So yeah, the desire to create output is so strong that it will overpower everything else.
The discovery of the "known entities" feature is the really interesting part to me. Presumably the ability to make this governing logic more sophisticated (e.g. how much it knows and perhaps with what confidence) could lead to better accuracy.
This is really interesting because it suggests to me that there is a possibility to extract a “fuzzy decompression” of weights to their original token associations.
Do you have a link to that article? I can't find anything of that nature with a shallow search.
FWIW my interpretation of this is that the hallucination vector encodes the behaviour where the model produces bullshit despite having the facts of the matter encoded in its weights. Which is slightly different from producing bullshit as a substitute for information that it "doesn't know".
And presumably there is a second-order property here where the minimal amount of hallucination is not only bounded by the model's "knowledge" but also its implicit "meta-knowledge", i.e. the "accuracy of the hallucination vector".
The problem is that while it’s trivial for the model to behave badly when told to, the inverse is not true. Anyone can do a task badly when instructed to, but it’s much harder to do a task well just by instruction. There’s a difference between being good and being not bad.
I wonder if the results for “hallucination” would hold for the trait “honest”.
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Thanks for writing them!
The most interesting idea to me is “preventative steering”: basically, inject enough of the persona vector of interest for a given bit of data that the model can spend its gradient descent on accurate answers and not get pulled off into conforming to the persona. This apparently works and keeps the model smart, whereas reducing the undesirable persona weights post-training lowers model intelligence.
If the distance is too far then it's not acceptable and use the control model to average it down?
Also, isn't this a similar technique to managing hallucination? (If you have an acceptable control/baseline)
Then again, I am not a mathematician so I don't know the details.
My suspicion is that when we eventually find our way to AGI, these types of models will be a _component_ of those systems, but they lack some fundamental structuring that seems to be required to create anything like consistency or self-reflection.
(I’m also somewhat curious whether, given what we’re seeing about these models’ ability to consistently perform detailed work (or lack thereof), there’s some fundamental tradeoff between consciousness and general intelligence and the kind of computation we expect from our computers; in other words, whether we’re going to wind up giving our fancy AGIs pocket calculators so they can do math reliably.)
A valid observation. Interestingly, feeding the persona vectors detected during inference back into the context might be a novel way of self-reflection for LLMs.
(Noting that humans are, of course, not universally good at that kind of “identity” check either, or at least not universally good at letting it be guided by our “better natures”)
I think this is a good summary of the situation, and strikes a balance between the breathless hype and the sneering comments about “AI slop“.
These technologies are amazing! And I do think they are facsimiles of parts of the human mind (image diffusion is certainly similar to human dreams, in my opinion), but it still feels like we are missing an overall intelligence or coordination in this tech for the present.
Maybe you can recognize that someone else loves a certain kind of slop, but if LLMs became vastly more intelligent and capable, wouldn't it be better for it to interact with you on your level too, rather than at a much higher level that you wouldn't understand?
If you used it to make you a game or entertain you with stories, isn't that just your own preferred kind of slop?
If we automate all the practical stuff away then what is left but slop?
Ref: https://www.wired.com/story/anthropic-dario-amodei-gulf-stat...
Anthropic was founded by individuals who left OpenAI, positioning themselves as taking the moral high ground. Well, I guess that was that... :-)
I also wonder what other personality vectors exist.. would be cool to find an “intelligence” vector we could boost to get better outputs from the same model. Seems like this is likely to exist given how prompting it to cosplay as a really smart person can elicit better outputs.
Funny that they managed to call out all of their competitors without mentioning any of Claude's bad behavior
The quality of its thought outside coding is pretty bad lately and especially worse than o3/Gemini though. It really feels like they've forced it to short answers for cost control.
Unfortunately, this research seems to use a very coarse method (giving the model instructions to be evil and then measuring its activation changes against a “non evil” model). Also, this is not a self-supervised approach: it requires you to input your own heavy-handed concept of persona into the system. Obviously a more complex and complete personality is more than the sum of your yes/no answers to personality test questions.
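For readers who want the coarse method spelled out, a difference-of-means sketch looks roughly like this. It is my reading of "measure activation changes with and without the instruction", not a confirmed copy of the paper's procedure, and the activations below are toy numbers rather than anything read from a model.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(trait_acts, neutral_acts):
    """Difference of mean activations: trait-prompted minus neutral."""
    mean_trait = mean_vector(trait_acts)
    mean_neutral = mean_vector(neutral_acts)
    return [t - n for t, n in zip(mean_trait, mean_neutral)]

# Hypothetical hidden states at one layer, one row per response.
trait_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]    # "be evil" system prompt
neutral_acts = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # neutral system prompt

print(persona_vector(trait_acts, neutral_acts))
```

The "heavy-handed" part is visible here: the direction you get is entirely determined by which prompts you chose to define the trait.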
However, low-rank methods may soon make it possible to give models long-lived, user-specific personalities that emerge across thousands of conversations. That’s what I would happily call a persona vector.