The LLM will absolutely lie if it doesn't know and you haven't made it perfectly clear that you'd rather it did not do that.
LLMs seem to be trying to give answers that make you happy. A good lie will make you happy. Unless it understands that you will not be happy with a lie.
Is this anthropomorphizing? Yep. But that's the best way I've found to reason about them.
How LLMs are able to give convincing wrong answers: they “can predict the correct ‘shape’ of an answer” (parent).
Why LLMs are able to give convincing wrong answers is a little more complicated, but basically it’s because the model is tuned by human feedback. The reinforcement learning from human feedback (RLHF) used to tune LLM products like ChatGPT is based on humans ranking candidate outputs. It’s a matter of getting exactly what you ask for.
If you tune a model by having humans rank the outputs, despite your best efforts to instruct the humans to be dispassionate and select which outputs are most convincing/best/most informative, I think what you’ll get is a bias towards answers humans like. Not every human will know every answer, so sometimes they’ll select one that’s wrong but likable. And that’s what’s used to tune the model.
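To make that concrete, here’s a minimal sketch of what that ranking signal typically turns into, assuming a Bradley–Terry-style pairwise reward model (the names and shapes are illustrative, not how any particular vendor actually implements it):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch of a reward model trained on human rankings."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Stand-in for "take the LLM's final hidden state and score it".
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.score(hidden_state).squeeze(-1)

def preference_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise (Bradley-Terry style) loss: push the score of the answer the
    # human picked above the score of the answer they rejected.
    return -torch.nn.functional.logsigmoid(score_preferred - score_rejected).mean()
```

The point is that the only signal in that loss is "which answer the human preferred", so anything humans reliably prefer, including confident-sounding wrong answers, gets rewarded.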
You might be able to improve this with curated training data (maybe something a little more robust than having graders grade each other). I don’t know if it’s entirely fixable though.
The brilliant thing about the parent’s comment about the “shape” of the answer is that it reveals how much humans have (uh, historically, now, I guess) relied on the shape of information to convey its trustworthiness. Expand the notion of “shape” a bit to include the medium. If somebody bothered to take the time to correctly shape an answer, we take that as a sign of trustworthiness, like how you might trust something written in a carefully-typeset book more than this comment.
“Surely no one would take the time to write a whole book on a topic they know nothing about,” the reasoning goes, which implies books are trustworthy. Look at all the effort that went in: proof of effort. When perfectly-shaped answers in exactly the form you expected are presented in a friendly, commercial context, they certainly read as trustworthy as Campbell’s soup cans. But LLMs can generate books’ worth of nonsense in exactly the right shapes without effort, so we as readers can no longer use the shape of an answer as a hint at its trustworthiness.
So maybe the answer is just to train on books only, because they are the highest-quality source of training data, and to carefully select and accredit the tuning data, so the model only knows the truth. It’s a data problem, not a model problem.
> The brilliant thing about the parent’s comment about the “shape” of the answer is that it reveals how much humans have (uh, historically, now, I guess) relied on the shape of information to convey its trustworthiness.
This is the basis of rumor. If you tell a story about someone that is entirely false but sounds like something they're already suspected of or known to do, people will generally believe it without verification, since the "shape" of the story fits people's expectations of the subject.
To date I've decried the choice of "hallucination" instead of "lies" for false LLM output, but it now seems clear to me that LLMs are a literal rumor mill.
Even if LLMs never get any more reliable than your average human, they're still valuable because they know much more than any single human ever could, run faster, only eat electricity, and can be scaled up without all kinds of nasty social and political problems. That's huge on its own.
Or, put another way, LLMs are kind of a concentrated digital extract of human cognitive capacity, without consciousness or personhood.
Generally, you want some external way of verifying that you have something useful. Sometimes that happens naturally. Ask a chatbot to recommend a paper to read and then search for it, and you’ll find out pretty quick if it doesn’t exist.
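If you wanted to automate that kind of check, a rough sketch might query a public index like Crossref (illustrative only; the substring match is crude and will miss real papers whose titles differ slightly):

```python
import requests

def citation_exists(title: str) -> bool:
    """Rough check: does a paper with a similar title show up in Crossref?"""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query": title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Crude match: is the claimed title contained in any returned title?
    return any(
        title.lower() in " ".join(item.get("title", [])).lower()
        for item in items
    )
```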
The purpose is to serve as a component of a larger system, one that also includes features, such as the prompt structure upthread, that mitigate the undesired behavior while keeping the useful behaviors.
By telling it not to lie to you, you're biasing it toward a particular output in the event that its confidence is low. Otherwise, low-confidence results just fall out somewhere mostly random.
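Concretely, that kind of instruction usually lives in the system prompt. Here's a rough sketch against the OpenAI Python client; the wording and the model name are placeholders, not the exact prompt structure from upthread:

```python
from openai import OpenAI

client = OpenAI()

# Give the model an explicit "out" for low confidence, so low-confidence
# cases get steered toward "I don't know" instead of a plausible guess.
SYSTEM_PROMPT = (
    "Answer the user's question. If you are not confident the answer is "
    "correct, say 'I don't know' instead of guessing. Do not invent "
    "citations, names, or numbers."
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model; the exact choice is incidental
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```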
This is something I really don't understand about LLMs. I think I understand how the generative side of them works, but "asking" it not to lie baffles me. LLMs require a massive corpus of text to train the model; how much of that text contains tokens that translate to "don't lie to me" and scores well enough to make its way into the output?
My take? It's like a high-schooler being asked a question by the teacher and having to answer on the spot. If they studied the material well, they'll give a good and correct answer. If they (like me, more often than I'd care to admit) only half-listened to the lectures and maaaaybe skimmed some Cliffs Notes before class, they will give an answer too: one strung together out of a few remembered (or misremembered) facts and an overall feel for the problem space (e.g. writing style, historical period, how people behave), with lots and lots of interpolation in between. Delivered confidently, it has a better chance of avoiding a bad mark (or even scoring a good one) than flat-out saying, "I don't know".
Add to that the usual mistakes out of carelessness and... whatever it is that makes you forget a minus sign and only realize it half a page of equations later, and you get GPT-4. It's giving answers like a person who just blurts out whatever thoughts pop into their head, without making a conscious attempt at shaping or interrogating them.
I think it might be more accurate to say, "LLMs are writing a novel in which a very smart AI answers everyone's questions." If you were writing a sci fi novel with a brilliant AI, and you knew the answer to some question or other, you'd put in the right answer. But if you didn't know, you'd just make up something that sounded plausible.
Alternately, you can think of the problem as the AI taking an exam. If you get an exam question you're a bit fuzzy on, you don't just write "I don't know". You come up with the best answer you can given the scraps of information you do know. Maybe you'll guess right, and in any case you'll get some partial credit.
The first one ("writing a novel") is useful I think in contextualizing emotions expressed by LLMs. If you're writing a novel where some character expresses an emotion, you aren't experiencing that emotion. Nor is the LLM when they express emotions: they're just trying to complete the text -- i.e., write a good novel.