AI hallucinations: Why LLMs make things up (and how to fix it) (opens in new tab)

(kapa.ai)

196 pointsemil_sorensen1y ago245 comments

245 comments

> While the hallucination problem in LLMs is inevitable [0], they can be significantly reduced...

Every article on hallucinations needs to start with this fact until we've hammered that into every "AI Engineer"'s head. Hallucinations are not a bug—they're not a different mode of operation, they're not a logic error. They're not even really a distinct kind of output.

What they are is a value judgement we assign to the output of an LLM program. A "hallucination" is just output from an LLM-based workflow that is not fit for purpose.

This means that all techniques for managing hallucinations (such as the ones described in TFA, which are good) are better understood as techniques for constraining and validating the probabilistic output of an LLM to ensure fitness for purpose—it's a process of quality control, and it should be approached as such. The trouble is that we software engineers have spent so long working in an artificially deterministic world that we're not used to designing and evaluating probabilistic quality control systems for computer output.

[0] They link to this paper: https://arxiv.org/pdf/2401.11817

swatcoder1y ago

> The trouble is that we software engineers have spent so long working in an artificially deterministic world that we're not used to designing and evaluating probabilistic quality control systems for computer output.

I think that's a mischaracterization and not really accurate. As a trade, we're familiar with probabilistic/non-deterministic components and how to approach them.

You were closer when you used quotes around "AI Engineer" -- many of the loudest people involved in generative AI right now have little to no grounding in engineering at all. They aren't used to looking at their work through "fit for purpose" concerns, compromises, efficiency, limits, constraints, etc -- whether that work uses AI or not.

The rest of us are variously either working quietly, getting drowned out, or patiently waiting for our respected colleagues-in-engineering to document, demonstrate, and mature these very promising tools for us.

Everything else you said is 100% right, though.

JKCalhoun1y ago

> As a trade, we're familiar with probabilistic/non-deterministic components and how to approach them.

Yes, users.

mdp20211y ago

And that small or large subsets of occasional or consistent bad reasoners we may have sometimes called "users" (in the secrecy of the four walls) reinforced, by contrast and by forcing us to look at things objectively trying to understand their "rants", the idea of proper reasonable stance, did it not?

mycall1y ago

..and bugs, especially with analog computers.

Workaccount21y ago

Take all computers and make it so all memory has a 0.1-5% chance of bit flipping any second (depending on cost and temperature). That this just became a fundamental truth of reality. Any bit, anywhere in memory. It would completely turn SWE work on it's head.

This is kind of how traditional engineering is, since reality is analog and everything is on a spectrum interacting with everything else all the time.

There is no simple function where you put in 1 and get out 0. Everything in reality is put in 1 +/- .25 and get out 0 +/- .25. It's the reason why the complexity of hardware is trivial compared to the complexity of software.

swatcoder1y ago

That's not really engaging with the point because you're suggesting turning all of our tools into something grossly unreliable. Of course that's a radical shift from what anybody's used to and undermines every practice in the trade.

But your mistake is just reinforcing what I wrote, because its the same mistake that the "loud people" are make when they think about generative AI. They imagine it as being a wholesale replacement for how projects are implemented and even how they're built in the first place.

But the many experienced engineers looking at generative AI recognize it as one of many tools that they can turn to while building a project that fulfills their requirements. And like all their tools, it has capabilities, costs, and limitations that need to be considered. That its sometimes non-deterministic is not a new kind of cost or limitation. It's a challenging one, but not a novel one, and one just mindfully (or analytically) considers whether and how that non-determinism can be leveraged, minimized, etc. That is engineering, and it's what many of us have been doing with all sorts of tools for decades.

1 more reply

bongodongobob1y ago

You're missing his point. He's saying if you make a program, you expect it to do X reliably. X may include "send an email, or kick off this workflow, or add this to the log, or crash" but you don't expect it to, for example, "delete system32 and shut down the computer". LLMs have essentially unconstrained outputs where the above mentioned program couldn't possibly delete anything or shut down your computer because nothing even close to that is in the code.

Please do not confuse this example with agentic AI losing the plot, that's not what I'm trying to say.

Edit: a better example is that when you build an autocomplete plugin for your email client, you don't expect it to also be able to play chess. But look what happened.

1 more reply

GuB-421y ago

Of course they are a bug. Just that hallucination emerge from the normal function of a LLM doesn't make it "not a bug".

No programmer in their right mind will call the lack of bound checking resulting in garbled output "not a bug", even though it is a totally normal thing to do from the point of view of a CPU. It is a bug and you need additional code to fix it, for example by checking for out-of-bounds condition and returning an error if it happens.

Same thing for LLM hallucinations. LLMs naturally hallucinate, but it is not what we want, so it is a bug. And to fix it, we need to engineer solutions that prevent the hallucinations from happening, maybe resulting in an "I don't know" response that would be analogous to an error message. How you do it may be different from a simple "if", with probabilities and all that, but the general idea is the same: recognizing error cases and responding accordingly.

I guess it is comes down to how you define a bug, but how else would you call a result that is not fit for purpose?

lolinder1y ago

A bug is defined as an unexpected defect. You can fix an unexpected defect by correcting the error in the code that led to the defect. In your example of lack of bounds checking there's a very concrete answer that will instantly fix the defect—add bounds checking.

Hallucinations are not unexpected in LLMs and cannot be fixed by correcting an error in the code. Instead they are fundamental property of the computing paradigm that was chosen, one that has to be worked around.

It's closer to network lag than it is to bounds checking—it's an undesirable characteristic, but one that we knew about when we chose to make a network application. We'll do our best to mitigate it to acceptable levels, but it's certainly not a bug, it's just a fact of the paradigm.

tsujamin1y ago

I’d argue hallucinations are unexpected in LLMs by the large (non technical) number of users who use them directly, or indirectly though other services.

It all depends on whose specification you’re assessing the “bugginess” against, the inference code as written, the research paper, colloquial understanding in technical circles, or how the product is pitched and presents to users.

3 more replies

SilasX1y ago

I've found it very helpful to make the following distinction:

Spec: Do X in situation Y.

Correctness bug: It doesn't do X in situation Y.

Fitness-for-purpose (FFP) bug: It does X in situation Y, but, knowing this, you decide you don't actually want it to do X in situation Y.

Hallucination is an FFP bug.

1 more reply

ToucanLoucan1y ago

Except we put up with network lag because it's an understandable, if undesirable, caveat to an otherwise useful technology. No one would ever say that because a network is sometimes slow, that it is then preferable to not have computers networked. The benefits clearly outweigh the drawbacks.

This is not true for many applications of LLM. Generating legal documents, for example: it is not acceptable that it hallucinate laws that do not exist. Recipes: it is not acceptable that it would tell people to make pizza with glue, or mustard gas to remove stains. Or, in my case: it is not acceptable for a developer assisting AI to hallucinate into existence libraries that are not real and not only will not solve my problem, but that will cause me to lose hours of my day trying to figure out where to get said library.

If pneumatic tires failed to hold air as often as LLM's hallucinate, we wouldn't use them. That's not to say a tire can't blow out, sure they can, happens all the time. It's about the rate of failure. Or hell, to bring it back to your metaphor, if your network experienced high latency at the rate most LLM's hallucinate, I might actually suggest you not network computers, or at the very least, I'd say you should be replaced at whatever company you work for since you're clearly unqualified to manage a network.

2 more replies

9rx1y ago

Bug, like any other word, is defined however the speaker defines it. While your usage is certainly common in technical groups, the common "layman" usage is closer to what the parent suggests.

1 more reply

watwut1y ago

Expected defects are bugs too. I totally expect half the problems in the software my company is developing. They are still bugs.

2 more replies

PittleyDunkin1y ago

> Of course they are a bug.

A bug implies fixable behavior rather than expected behavior. An LLM making shit up is expected behavior.

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

Maybe you just don't want an LLM! This is what LLMs do. Maybe you want a decision tree or a scripted chatbot?

> And to fix it, we need to engineer solutions that prevent the hallucinations from happening, maybe resulting in an "I don't know" response that would be analogous to an error message.

I'm sure we'll figure out how to do this when we can fix the same bug in humans, too. Given that humans can't even agree when we're right or wrong—much less sense the incoherency of their own worldviews—I doubt we're going to see a solution to this in our lifetimes.

devmor1y ago

A bug is generally treated as undefined and undesirable side effects of a program.

Hallucinations are undesirable but not undefined. We know that the process creates them and expect them.

It’d be like using floats to calculate dollars and cents and calling the resulting math a bug - it’s not, you just used the technology wrong.

stonemetal121y ago

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

I rolled a one in D&D, it is not what I wanted, so it is a bug. Remove it from all my dice.

SirMaster1y ago

What? You are telling me that when you roll a 6 sided dice you are not expecting any of the 1-6 as a result?

If a 6-sided dice produced a 7 that would be a bug.

When you rolled a dice, I would argue that you knew you wanted a random number from 1-6, not that you wanted a specific number or not a specific number. If you wanted that you wouldn't have used a dice.

When I ask an LLM to write code for me and it references a completely made up library that doesn't exist and has never existed, is this really analogous to your dice example?

2 more replies

mrguyorama1y ago

>Of course they are a bug

No.

When you build a bloom filter and it says "X is in the set" and X is NOT in the set, that's not a bug, that's an inherent behavior of the very theory of a probabilistic data structure. It is something that WILL happen, that you MUST expect to happen, and you MUST build around.

>And to fix it, we need to engineer solutions that prevent the hallucinations from happening

The whole point is that this is fundamentally impossible.

d0mine1y ago

The difference is that you can fix IndexError by modifying your code but no amount of prompt manipulation may fix hallucinations. For that you need solutions outside LLMs.

jrm41y ago

Not a bug at all, IMHO.

If someone puts the wrong address for their business; Google picks it up, and someone Googles and gets the wrong address, it says nothing about "bugs in software."

LgLasagnaModel1y ago

‘ I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. ...’

https://x.com/karpathy/status/1733299213503787018

wpietri1y ago

Excellent point:

> just output from an LLM-based workflow that is not fit for purpose

And I think this is just one aspect of what I think of as the stone soup [1] problem. Outside of rigorous test conditions, humans just have a hard time telling how much work they're doing when they interpret something. It's the same sort of thing you see with "psychics" doing things like cold reading. People make meaning out of vaguery and nonsense and then credit the nonsense-producer with the work.

[1] https://en.wikipedia.org/wiki/Stone_Soup

tengbretson1y ago

LLMs outputs are no more "hallucinations" than my output would be if I were asked to judge a dressage competition.

xienze1y ago

I’ve had multiple occasions where I’ve asked an LLM how to do <whatever> in Java and it’ll very confidently answer to use <some class in some package that doesn’t exist>. It would be far more helpful to me to receive an answer like “I don’t think there’s a third party library that does this, you’ll have to write it yourself” than to waste my time telling me a lie. If anything, calling these outputs “hallucinations” is a very polite way of saying that the LLM is bullshitting the user.

Gormo1y ago

Of course the LLM is bullshitting the user. That's precisely its purpose: LLMs are tools that generate comprehensible sounding language based on probability models that describe what words/tokens tend to be found in proximity to each other. An LLM doesn't actually know anything by reference to verifiable, external facts.

Sure, LLMs can be used as fancy search engines that index documents and then answer questions by referring to them, but even there, the probabilistic nature of the underlying model can still result in mistakes.

3 more replies

lmm1y ago

The LLM is always bullshitting the user. It's just sometimes the things it talks about happen to be real and sometimes they don't.

tengbretson1y ago

LLMs don't know things, they just string together responses that are a best fit for what follows from their prompt.

I suspect its so hard to get them to say "I don't know" because if they were biased towards responding that way then I would assume thats almost all they would ever say, since "I don't know" is an appropriate answer to every question imaginable.

1 more reply

TwoCent1y ago

Precisely. Hallucinations were improperly named. A better term is "confabulation," which is telling an untruth without the intent to deceive. Sadly, we can't get an entire industry to rename the LLM behavior we call hallucination, so I think we're stuck with it.

eevilspock1y ago

You are implicitly anthropomorphizing LLMs by implying that they (can) have intent in the first place. They have no intent, so can't lie or make a claim or confabulate. They are just a very complex black box that takes input and spits output. Searle's Chinese Room metaphor applies here.

jojobas1y ago

There is no source of truth for dressage competition results, these are accepted as jury preference judgement.

There are plenty of matters where there is such a source of truth, and LLMs don't know the difference.

mdp20211y ago

> There is no source of truth

There is no «source of [_simple_] truth» for complex things, but there are more (instead of less) objective complex evaluations.

Note that this is also valid for factual notions: e.g., "When were the Pyramids built?".

1 more reply

eevilspock1y ago

The difference is that you are capable of reflection and self-awareness, in this particular case that you understand nothing about dressage and your judgements would be a farce.

One of the counter arguments to "LLMs aren't really AI" is: "Well, maybe the human brain works much like an LLM. So we are stupid in the same way LLMs are. We just have more sophisticated LLMs in our heads, or better training data. In other other words, if LLMs aren't intelligent, then neither are we.

The counter to this counter is: Can one build an LLM that can identify hallucinations, the way we do? That can classify its own output as good or shitty?

smartmic1y ago

A machine with probabilistic output generation cannot tell what is a hallucination and what is not. It does not know the difference between truth and everything else. It is us humans on the receiving end who have to classify the content - and that is the problem. We have little patience, time, or energy to do this verification work for every piece of information. That's why we have the human trait of trust, which has been at the core of human progress from the beginning.

Now the question can be rephrased. Is it possible to trust AI information generators - what's to be done to build trust? And here is the difficulty - I do not know why I should ever trust a probabilistic system as long as it has this property and does not turn into a deterministic version of itself. I won't lower my standards for trusting people, for good reasons. But I cannot even raise the bar for trust in a machine above zero as long as it is driven by randomness at its core.

leprechaun10661y ago

Calling them hallucinations was a huge mistake.

gwervc1y ago

It is a good branding, like neural networks, and even artificial intelligence was. The good point is it makes really easy to detect who is a bullshiter and who understand at least very remotely what a LLM is supposed to produce.

JKCalhoun1y ago

I won't defend the term but am curious what you think would have been also concise but more accurate. Calling them for example "inevitable statistical misdirections" doesn't really roll off the tongue.

goatlover1y ago

Confabulation, if the desire is to use a more apt psychological analogy.

malfist1y ago

It's a bug. Any other system where you put in one input and expect a certain output and get something else it'd be called a bug. Making up new terms for AI doesn't help.

2 more replies

alan-crowe1y ago

Statmist

threeseed1y ago

I see two types of faults with LLMs.

a) They output incorrect results given a constrained set of allowable outputs.

b) When unconstrained they invent new outputs unrelated to what is being asked.

So for me the term hallucination accurately describes b) e.g. you ask for code to solve a problem and it invents new APIs that don't exist. Technically it is all just tokens and probabilities but it's a reasonably term to describe end user behaviour.

sfink1y ago

The term is actually fine. The problem is when it's divorced from the reality of:

> in some sense, hallucination is all LLMs do. They are dream machines.

If you understand that, then the term "hallucination" makes perfect sense.

Note that this in no way invalidates your point, because the term is constantly used and understood without this context. We would have avoided a lot of confusion if we had based it on the phrase "make shit up" and called it "shit" from the start. Marketing trumps accuracy again...

(Also note that I am not using shit in a pejorative sense here. Making shit up is exactly what they're for, and what we want them to do. They come up with a lot of really good shit.)

jebarker1y ago

I agree with your point, but I don't think anthropomorphizing LLMs is helpful. They're statistical estimators trained by curve fitting. All generations are equally valid for the training data, objective and architecture. To me it's much clearer to think about it that way versus crude analogies to human brains.

1 more reply

emil_sorensenOP1y ago

That's a great point. Reminds me of the "feature, not a bug" Karpathy tweet [0].

[0]: https://x.com/karpathy/status/1733299213503787018?lang=en

mike_hearn1y ago

... which is linked to from the article ;)

He's right but do people really misunderstand this? I think it's pretty clear that the issue is one of over-creativity.

The hallucination problem is IMHO at heart two things that the fine article itself doesn't touch on:

1. The training sets contain few examples of people expressing uncertainty because the social convention on the internet is that if you don't know the answer, you don't post. Children also lie like crazy for the same reason, they ask simple questions so rarely see examples of their parents expressing uncertainty or refusing to answer, and it then has to be explicitly trained out of them. Arguably that training often fails and lots of adults "hallucinate" a lot more than anyone is comfortable acknowledging.

The evidence for this is that models do seem to know their own level of certainty pretty well, which is why simple tricks like saying "don't make things up" can actually work. There's some interesting interpretability work that also shows this, which is alluded to in the article as well.

2. We train one-size-fits all models but use cases vary a lot in how much "creativity" is allowed. If you're a customer help desk worker then the creativity allowed is practically zero, and the ideal worker from an executive's perspective is basically just a search engine and human voice over an interactive flowchart. In fact that's often all they are. But then we use the same models for creative writing, research, coding, summarization and other tasks that benefit from a lot of creative choices. That makes it very hard to teach the model how much leeway it has to be over-confident. For instance during coding a long reply that contains a few hallucinated utility methods is way more useful than a response of "I am not 100% certain I can complete that request correctly" but if you're asking questions of the form "does this product I use have feature X" then a hallucination could be terrible.

Obviously, the compressive nature of LLMs means they can never eliminate hallucinations entirely, but we're so far from reaching any kind of theoretical limit here.

Techniques like better RAG are practical solutions that work for now, but in the longer run I think we'll see different instruct-trained models trained for different positions on the creativity/confidence spectrum. Models already differ quite a bit. I use Claude for writing code but GPT-4o for answering coding related questions, because I noticed that ChatGPT is much less prone to hallucinations than Claude is. This may even become part of the enterprise offerings of model companies. Consumers get the creative chatbots that'll play D&D with them, enterprises get the disciplined rule followers that can be trusted to answer support tickets.

lolinder1y ago

> He's right but do people really misunderstand this?

Absolutely. Karpathy would not have felt obliged to mini-rant about it if he hadn't seen it, and I've been following this space from the beginning and have also seen it way too often.

Laypeople misunderstand this constantly, but far too many "AI engineers" on blogs, HN, and within my company talk about hallucinations in a way that makes it clear that they do not have a strong grounding in the fundamentals of this tech and think hallucinations will be cured someday as models get better.

Edit: scrolling a bit further in the replies to my comment, here's a great example:

https://news.ycombinator.com/item?id=42325795

And another: https://news.ycombinator.com/item?id=42325412

js81y ago

I like your analogy with the child. There are different types of human discourse. There is a "helpful free man" discourse where you try to reach the truth. There is a "creative child" discourse where you are play with the world and trying out weird things. There is also a "slave mindset" discourse where you blindly follow orders to satisfy the master, regardless of your own actual opinion on the matter.

cainxinth1y ago

> What they are is a value judgement we assign to the output of an LLM program. A "hallucination" is just output from an LLM-based workflow that is not fit for purpose.

In other words, hallucinations are to LLMs what weeds are to plants.

drcwpl1y ago

You are right - "Hallucinations are not a bug"

Guthur1y ago

What are you taking about, is not artificially deterministic, it is like that by design. We are fortunate that we can use a logic to encode logic and have it for the most part so the same thing given a fix set of antecedents.

We even want this in the "real" world, when I turn the wheel left on my car I don't want it turn left only when it feels like it, when that happens we rightly classify it as a failure.

We have the tools to build very complex deterministic systems but for the most part we chose not to use them, because they hard or not familiar or whatever other excuse you might come up with.

skydhash1y ago

The fact is that while the tools exist and may be easy to learn, there’s always that nebulous part called creativity, taste, craftsmanship, expertise, or whatever (like designing good software) that’s hard to acquire. Generative AI is good at giving the illusion you can have that part (by piggybacking on the work of everyone). Which is why people get upset when you shatter that illusion by saying that they still don’t have it.

mycall1y ago

Does it help discourse at all by instead calling hallucinations a negative perceived form of imagination?

zmgsabst1y ago

A challenge is that it’s not easy to limit hallucinations without also limiting imagination and synthesis.

In humans.

But also apparently in LLMs.

mort961y ago

Healthy humans generally have some internal model of the world against which they can judge what they're about to say. They can introspect and determine whether what they say is a guess or a statement of fact. LLMs can't.

zmgsabst1y ago

Humans routinely misremember facts but are relatively certain those remembrances are correct.

That’s a form of minor, everyday hallucination.

If you engage in such thorough criticism and checking of every recalled fact as to eliminate that, you’ll crush your ability to synthesize or compose new content.

2 more replies

mdp20211y ago

On office desks there were "In" boxes and "Out" boxes. You do not put "imagination" in "Out" boxes. What is put in "Out" boxes must be checked and stamped.

"Imagination" stays on the desk. You "imagine" that a plane could have eight wings, then you check if it is a good idea, and only then the output is decided.

namaria1y ago

> A challenge is that it’s not easy to limit hallucinations without also limiting imagination and synthesis.

> In humans.

True, but distinguishing reality from imagination is a cornerstone of mental health. And it's becoming apparent that the average person will take the confident spurious affirmations of LLMs as facts, which should call their mental health into question.

zmgsabst1y ago

Misremembering facts isn’t a negative mental health event, yet is an example of imagination rather than recall — similar to LLMs hallucinating.

Humans imagine events all the time, without the ability to know that happened. Part of why eye-witness testimony is so unreliable.

mdp20211y ago

> is inevitable

False. It is (in this context) outputting a partial before full processing. Adequate (further) processing removes that "inevitable". Current architectures are not "final".

Proper process: "It seems like that." // "Is it though?" // "Actually it isn't."

(Edit: already this post had to be corrected many times because of errors...)

Animats1y ago

> While the hallucination problem in LLMs is inevitable

Oh, please. That's the same old computability argument used to claim that program verification is impossible.

Computability isn't the problem. LLMs are forced to a reply, regardless of the quality of the reply. If "Confidence level is too low for a reply" is an option, the argument in that paper becomes invalid.

The trouble is that we don't know how to get a confidence metric out of an LLM. This is the underlying problem behind hallucinations. As I've said before, if somebody doesn't crack that problem soon, the AI industry is overvalued.

Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

This article is really an ad for Kapa, which seems to offer managed AI as a service, or something like that. They hang various checkers and accessories on an LLM to try to catch bogus outputs. That's a patch, not a fix.

[1] https://techcrunch.com/2024/11/27/alibaba-releases-an-open-c...

mort961y ago

Confidence levels aren't necessarily low for incorrect replies, that's the problem. The LLM doesn't "know" that what it's outputting is incorrect. It just knows that the words it's writing are probable given the inputs; "this is how answers tend to look like".

You can make improvements, as your parent comment already said, but it's not a problem which can be solved, only to some degree be reduced.

lolinder1y ago

> Computability isn't the problem. LLMs are forced to a reply, regardless of the quality of the reply. If "Confidence level is too low for a reply" is an option, the argument in that paper becomes invalid.

This is false. The confidence level of these models does not encode facts, it encodes statistical probabilities that a particular word would be the next one in the training data set. One source of output that is not fit for purpose (i.e. hallucinations) is unfit information in the training data, which is a problem that's intractable given the size of the data required to train a base model.

You can reduce this problem by managing your training data better, but that's not possible to do perfectly, which gets to my point—managing hallucinations is entirely about risk management and reducing probabilities of failure to an acceptable level. It's not decidable, it's only manageable, and that only for applications that are low enough stakes that a 99.9% (or whatever) success rate is acceptable. It's a quality control problem, and one that will always be a battle.

> Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

I've been trying it out, and what it's actually better at is going in circles indefinitely, giving the illusion of careful thought. This can possibly be useful, but it's just as likely to "hallucinate" reasons why its first (correct) response might have been wrong (reasons that make no sense) as it is to correctly correct itself.

sumtechguy1y ago

LLMs and their close buddies NN's use models that do massive amounts of what amounts to cubic splines across N dimensions.

Cubic splines have the same issues as what these nets are seeing. There are two points and a 'line of truth' between them. But the formula that connects the dots, as it were, only guarantees that the two points are inside the line. You can however tweak the curve to line fit but it is not always 100%, in fact can vary quite wildly. That is the 'hallucination' people are seeing.

Now can you get that line of truth close by more training? Which is basically amounts to tweaking the weighting. Usually yes, but the method basically only guarantees the points are inside the line. Everything else? Well, it may or may not be close. Smear that across thousands of nodes and the error rate can add up quickly.

If we want a confidence level my gut is saying that we would need to measure how far away from the inputs an output ended up being. The issue that would create though is the inputs are now massive. Sampling can make the problem more tractable but then that has more error in it. Another possibility is tracking how far away from the 100% points the output gave. Then a crude summation might be a good place to start.

russnes1y ago

so what you're saying is that LLMs are like middle aged men, just throwing things out there seeing if they'll stick?

tokioyoyo1y ago

To my understanding, the reason why companies don't mind the hallucinations is the acceptable error rate for a given system. Let's say something hallucinated 25% of the time, but if that's ok, then it's fine for a certain product. If it only hallucinates 5% of the time, it's good enough for even more products and so on. The market will just choose the LLM appropriately depended on the tolerable error rate.

cameronh901y ago

At scale, you are doing the same thing with humans too. LLMs seem to have an error rate similar to humans for the majority of simple, boring tasks, if not even a bit better since they don't get distracted and start copying and pasting their previous answers.

The difference with LLMs is they simply cannot (currently) do the most complex tasks that some humans can, and when they do produce erroneous output, the errors aren't very human-like. We can all understand a cut and paste error so don't hold it against the operator, but making up sources feels like a lie and breeds distrust.

maeil1y ago

> At scale, you are doing the same thing with humans too. LLMs seem to have an error rate similar to humans for the majority of simple, boring tasks, if not even a bit better since they don't get distracted and start copying and pasting their previous answers.

This is the big one missed by the frequent comments on here wondering whether LLMs are a fad, or claiming in their current state they cannot be used to replace humans in non-trivial real-world business workflows. In fact, even 1.5 years ago at the time of GPT 3.5, the technology was already good enough.

The yardstick is the peformance of humans in the real world on a specific task. Humans, often tired, having a cold, distracted, going through a divorce. Humans who even when in a great condition make plenty of mistakes.

I guess a lot of developers struggle with understanding this because so far when software has replaced humans, it was software that on the face of it (though often not in practice) did not make mistakes if bug-free. But that has been never been necessary for software to replace humans - hence buggy software still succeeding in doing so. Of course, often software even replaces humans when it's worse at a task for cost reasons.

They're at the very least competitive, if not better than, doctors at diagnosing illnesses [1].

[1] https://www.nytimes.com/2024/11/17/health/chatgpt-ai-doctors...

cameronh901y ago

Related to that, I once had a CT scan for a potentially fatal brain concern, and the note that the radiologist sent back to my consultant was for a completely different patient, and the notes for my scan were attached to someone else's report. The only reason it was caught was because it referred to me as "she".

If we were both the same gender, I probably would have had my skull opened up for no reason, and she would have been discharged and later died.

skydhash1y ago

> The yardstick is the peformance of humans in the real world on a specific task.

Humans make humans errors, that we can anticipate, recognize, couter, and mitigate. And the rise of deterministic automation was because they help with the parts that are more likely to generate an error. The LLMs strategy always seems like solving a problem that is orthogonal to business objectives, and mainly serves individuals instead.

1 more reply

goatlover1y ago

The bigger, more controversial claim is that LLMs will be net loss for human jobs, when all past automation has been a net positive. Including IT, where automation has led to a vast growth of software jobs, as more can be accomplished with higher level languages, tools, frameworks, etc.

For example, compilers didn't put programmers out of business in the 60s, it made programming more available to people with higher level languages.

1 more reply

h1fra1y ago

Imagine having a backend being down 20-40 days per year, yeah that would be bad. Companies do not care about hallucinations because text output being bad is not considered an error, and as long as it won't raise a Datadog alert it won't be taken seriously.

tokioyoyo1y ago

I mean, do you remember early 2000s? We had so many web pages that would go down on a daily basis. Stability is something we achieved over time.

Also, again, if it’s bad, nobody will use it, and product will die. In those theoretical scenarios companies that have lower error rate (and don’t use AI) will win the market.

Terr_1y ago

When people talk about stopping an LLM from "seeing hallucinations instead of the truth", that's like stopping an Ouija-board from "channeling the wrong spirits instead of the right spirits."

It suggests a qualitative difference between desirable and undesirable operation that isn't really there. They're all hallucinations, we just happen to like some of them more than others.

ml_more1y ago

The problem is that LLMs are just convincing enough that people DO trust them which is sort of a problem since AI slop is creeping into everything.

What can be done to solve it (while not perfect) is pretty powerful. You can force feed them the facts (RAG) and then verify the result. Which is way better than trusting LLMs while doing neither of those things (which is what a lot of people do today anyway). See the recent 5 cases of lawyers getting in trouble for ChatGPT hallucinating citations of case law.

LLMs write better than most college students so if you do those two things (RAG + check) you can get college graduate level writing with accurate facts... and that unlocks a bit of value out in the world.

Don't take my word for it look at the proposed valuations of AI companies. Clearly investors think there's something there. The good news is that it hasn't been solved yet so if someone wants to solve it there might be money on the table.

latexr1y ago

> and that unlocks a bit of value out in the world.

> Don't take my word for it look at the proposed valuations of AI companies. Clearly investors think there's something there.

Investors back whatever they think will make them money. They couldn’t give less of a crap if something is valuable to the world, or works well, of is in any way positive to others. All they care is if they can profit from it and they’ll chase every idea in that pursuit.

Source: all of modern history.

https://www.sydney.edu.au/news-opinion/news/2024/05/02/how-c...

https://www.decof.com/documents/dangerous-products.pdf

Terr_1y ago

> Investors back whatever they think will make them money.

A not-flagrantly-illegal example of this might be casinos, where IMO it is basically impossible to argue the fleeting entertainment they offer offsets the financial ruin inflicted on certain vulnerable types of patron.

> All they care is if they can profit from it

Notably that isn't the same as the business itself being profitable: Some investors may be hoping they can dump their stake at a higher price onto a Greater Fool [0] and exit before the collapse.

[0] https://en.wikipedia.org/wiki/Greater_fool_theory

Gormo1y ago

> They couldn’t give less of a crap if something is valuable to the world

"The world" is an abstraction: concretely, every bit of value that is generated within that abstraction accrues to someone in particular -- investors in AI projects, for example.

wk_end1y ago

How do you check it?

Take the example of case law. Would you need to formalize the entirety of case law? Would the AI then need to produce a formal proof of its argument, so that you can ascertain that its citations are valid? How do you know that the formal proof corresponds to whatever longform writing you ask the AI to generate? Is this really something that LLMs are suited for? That the law is suited for?

Gormo1y ago

Sure, using RAG is great, but it limits the LLM to functioning as a natural-language search engine. That's a pretty useful thing in its own right, and will revolutionize a lot of activities, but it still falls far short of the expectations people have for generative AI.

threeseed1y ago

> Clearly investors think there's something there

Of course. Because enterprise companies take a long time to evaluate new technologies. And so there is plenty of money to be made selling them tools over the next few years. As well as selling tools to those who are making tools.

But from my experience in rolling out these technologies only a handful of these companies will exist in 5-10 years. Because LLMs are "garbage in, garbage out" and we've never figured out how to keep the "garbage in" to a minimum.

Nehnehneh1y ago

That's just not true.

The training data is the underlying truth and that's not nothing but a lot.

And hallucinations are pathes inside this space which are there for yet unknown reason.

We like answers from LLMs which walk through this space reasonable.

mrguyorama1y ago

>The training data is the underlying truth

Correct. What is the training data? Language in the form of sentences and documents and words and "tokens". No human language has any normal or natural encoding of "fact" or "truthiness" which is the entire point. You can only rarely evaluate a string of text for truthiness without external context.

An LLM "knows" the structure and look of valid text. That's why they rarely produce grammar mistakes, even when "hallucinating". A lie, a made up reference, a physical impossibility, contradictions, etc are all "valid sentences". That's why you can never prevent an LLM from producing falsehoods, lies, contradictions etc.

Truthiness cannot be hacked in after the fact, and I currently believe that LLMs as an architecture are not powerful enough a statistical tool that you even COULD train an LLM that had "truthiness" of the entire corpus labeled somehow, especially since that's on it's own a fairly impossible task.

mdp20211y ago

> It suggests a qualitative difference

And what is sought is, in a way, a jump to that qualitative difference. (And surely there are «desirable and undesirable operation[s]».)

"Add something to the dices so that they can be well predictive".

TZubiri1y ago

I disagree with this take, Stallman has expressed it recently by linking some "scientific article".

While I get that LLMs generate text in some way that does not guarantee correctness. There is a correlation between generated text and correctness, which is why millions of people use it...

You can judge the correctness of a sentence generated by an LLM. In the same way you can judge the correctness of a human generated sentence.

Now whether the truthness or correlation with reality of an LLM sentence can be judged on its own or whether it requires a human to interpret it is not very relevant, as sentences produced by the LLM are still correct most of the time. Just because it is not perfect doesn't make the correctness in the other cases useless, albeit perhaps less useful of course.

This is nothing surprising of a statistical model, it tends to produce true results.

Terr_1y ago

> I disagree with this take, Stallman has expressed it recently by linking some "scientific article".

I don't know how to parse this. What article did Stallman "link", and what are you saying Stallman "expressed" by linking/using it?

> whether the truthness or correlation with reality of an LLM sentence can be judged on its own or whether it requires a human to interpret it is not very relevant

It's incredibly relevant. We wouldn't even be having these debates if complex LLM judgements could always be verified without a human checking the logic.

> sentences produced by the LLM are still correct most of the time

At least half the problem here is that humans are accustomed to using certain cues as an indirect sign of time-investment, attentiveness, intelligence, truth, etc... and now those cues can be cheaply and quickly counterfeited. It breaks all those old correlations faster than we are adapting.

TZubiri1y ago

https://stallman.org/chatgpt.html

https://link.springer.com/article/10.1007/s10676-024-09775-5

demaga1y ago

> They're all hallucinations, we just happen to like some of them more than others.

I love it! Puts things into perspective.

Loughla1y ago

I just recently showed a group of college students how and why using AI in school is a bad idea. Telling them it's plagiarism doesn't have an impact, but showing them how it gets even simple things wrong had a HUGE impact.

The first problem was a simple numbers problem. It's 2 digit numbers in a series of boxes. You have to add numbers together to make a trail to get from left to right moving only horizontally or vertically. The numbers must add up to 1000 when you get to the exit. For people it takes about 5 minutes to figure out. The AI couldn't get it after all 50 students each spent a full 30 minutes changing the prompt to try to get it done. The AI would just randomly add numbers and either add extra at the end to make 1000, or just say the numbers added to 1000 even if it didn't.

The second problem was writing a basic one paragraph essay with one citation. The humans got it done, when with researching for a source, in about 10 minutes. After an additional 30 minutes none of the students could get AI to produce the paragraph without logic or citation errors. It would either make up fake sources, or would just flat out lie about what the sources said. My favorite was a citation related to dairy farming in an essay that was supposed to be about the dangers of smoking tobacco.

This isn't necessarily relevant to the article above, but if there are any teachers here, this is something to do with your students to teach them exactly why not to just use AI for their homework.

ryanmcbride1y ago

My go-to to show people who don't understand its limitations used to be the old "how many Ms are there in the word 'minimum' or something along those lines, but looks like it's gotten a bit better at that. I just tried it with GPT4o and it gave me the right number, but the wrong placement. In the past it's given it completely wrong:

>how many instances of the letter L are in the word parallel

The word parallel contains 3 instances of the letter "L":

    The first "L" appears as the fourth letter.
    The second "L" appears as the sixth letter.
    The third "L" appears as the seventh letter.

cruffle_duffle1y ago

I wonder if that is simply a reflection of there being more training data out there about this "problem" and the model hoovering all that up and regurgitating it?

mike_hearn1y ago

These kinds of letter tricks aren't great examples to use because they're fixable and might stop working at any moment, as you just discovered. Modern image generation models for instance not only present embeddings of the input but also have parameters for each individual letter in words, which is why the latest models are able to copy words into the images much better: they see the spellings encoded as input parameters, as well as the concept vectors.

Even things like citations aren't going to work long term. The models are just being trained not to emit URLs from memory but only from their context.

It's fine to show people the current state of things, but if you're trying to get students to understand why they shouldn't use AI the only technique that will work is this: "you will be examined in a controlled room where computers and phones aren't allowed". Nothing else will be effective over the medium to long term.

int_19h1y ago

Here's QwQ-32 answering the same question. I'm including the entire answer here including the CoT:

> Let's see. The word is "parallel." I need to count how many times the letter "L" appears in it. Okay, spelling it out: P-A-R-A-L-L-E-L. So, starting from the beginning, first letter is P, not L. Then A, not L. R, not L. Another A, still not L. Then L, that's one L. Next letter is L again, so that's two L's. Then E, not L, and another L, making it three L's. Wait, but when I spelled it again, I saw P-A-R-A-L-L-E-L, and I counted three L's. But let me check again. Parallel: P-A-R-A-L-L-E-L. Yes, L appears three times. So, the answer is three.

> But, maybe there's a trick here. Is there any special consideration for the letter L in this word or in general? Sometimes questions like this might be testing for homophones or something, but I don't think so in this case. It's straightforward: count the L's in "parallel."

> Alternatively, perhaps it's a trick with the word itself. Is "parallel" spelled differently sometimes? I don't think so. It's always P-A-R-A-L-L-E-L. So, three L's.

> Wait a minute, is there a possibility that the word is misspelled in the question? If the word was "paralell," that would be four L's, but that's not the correct spelling. So, sticking with three L's.

> Another angle: is the question referring to uppercase or lowercase L? But in the word "parallel," all letters are lowercase unless it's at the beginning of a sentence, but the question just says "the word parallel," which is typically in lowercase. So, doesn't matter.

> Also, considering that "l" and "I" can sometimes be confused, but in this word, it's L's only, no I's.

> So, my final answer is three instances of the letter L in the word "parallel."

> Final Answer: 3

Freak_NL1y ago

This AI is talking as if it is in the final stages of a multi-day interrogation by a very hush-hush and deniable branch of government. I give it another five minutes before it breaks down and will sell out its mother just to be allowed to sleep with the lights off.

I hope that this is not something I ever have to deal with. It's exhausting.

1 more reply

janalsncm1y ago

> If the word was "paralell," that would be four L's, but that's not the correct spelling.

Better but this is still a hallucination.

sugarkjube1y ago

Shows nicely what's going on.

If you ask a human, they will answer 3. Sometimes they say 4. Or 2. That's it.

An LLM produces a text using an example it was trained on. They were trained with these elaborate responses, so that's what they produce.

Whenever chatgpt gets something wrong, someone at openai will analyse it, create a few correct examples, and put these on the pile for retraining. Thats why it gets better - not because it is smarter, but it's retrained on your specific test cases.

lawlessone1y ago

They probably have a letter counting tool added to it now. that it just knows to call when asked to do this.

you ask it the number of letters and it sends those words off to another tool to count instances of L, but they didn't add a placement one so it's still guessing those.

edit: corrected some typos and phrasing.

Maybe we'll reach a point where the LLM's are just tool calling models and not really giver their own reply.

HeatrayEnjoyer1y ago

There are only 5 tools it has available to call, and that isn't one of them. A GitHub (forgot the url) stays up to date with the latest dumped system instructions.

2 more replies

dylan6041y ago

they probably just forgot to tell it humans are 1 indexed and to do the friendly conversion for them.

quuxplusone1y ago

Do you have a link to (or can you put here) that "numbers in boxes" problem?

mdaniel1y ago

I'm not them, but I think it's a variation of the subset-sum problem

If we modify the question to be "sum to 100" (to just seriously reduce the number of example boxes required) then given:

  | 50 | 20 | 24 |
  |  7 |  5 |  1 |
  | 51 | 51 | 51 |

the solution would be

  | [50] | [20] | [24] |
  |   7  | [ 5] | [ 1] |
  |  51  | 51   |  51  |

  | right | down  | win
  | X     | right | up
  | X     | X     | X

Loughla1y ago

Correct. That but for 1000. You can build your own with any number of online tools.

I don't have a link because it's part of a lesson plan set behind a payment on teachers pay teachers.

1 more reply

empath751y ago

I refuse to believe that you did any of this with any of the latest models. Gemini and Chat GPT with search are both perfectly capable of producing decent essays with accurate citations. And the 4o model is extremely good at writing python code that can accurately solve math and logic problems.

I asked 4o with search to write an essay about the dangers of smoking, along with citations and quotes from the relevant sources. NotebookLM is even better if you drop in your sources and don't rely on web search. Whatever you think you know about what AI is capable of, it's probably wrong.

--- Smoking remains a leading cause of preventable disease and death worldwide, adversely affecting nearly every organ in the human body. The National Cancer Institute (NCI) reports that "cigarette smoking and exposure to tobacco smoke cause about 480,000 premature deaths each year in the United States."

The respiratory system is particularly vulnerable to the detrimental effects of smoking. The American Lung Association (ALA) states that smoking is the primary cause of lung cancer and chronic obstructive pulmonary disease (COPD), which includes emphysema and chronic bronchitis. The inhalation of tobacco smoke introduces carcinogens and toxins that damage lung tissue, leading to reduced lung function and increased susceptibility to infections.

Cardiovascular health is also significantly compromised by smoking. The ALA notes that smoking "harms nearly every organ in the body" and is a major cause of coronary heart disease and stroke. The chemicals in tobacco smoke damage blood vessels and the heart, increasing the risk of atherosclerosis and other cardiovascular conditions.

Beyond respiratory and cardiovascular diseases, smoking is linked to various cancers, including those of the mouth, throat, esophagus, pancreas, bladder, kidney, cervix, and stomach. The American Cancer Society (ACS) emphasizes that smoking and the use of other tobacco products "harms nearly every organ in your body." The carcinogens in tobacco smoke cause DNA damage, leading to uncontrolled cell growth and tumor formation.

Reproductive health is adversely affected by smoking as well. In women, smoking can lead to reduced fertility, complications during pregnancy, and increased risks of preterm delivery and low birth weight. In men, it can cause erectile dysfunction and reduced sperm quality, affecting fertility.

The immune system is not spared from the harmful effects of smoking. The ACS notes that smoking can affect your health in many ways, including "lowered immune system function." A weakened immune system makes the body more susceptible to infections and diseases.

Secondhand smoke poses significant health risks to non-smokers. The ALA reports that secondhand smoke exposure causes more than 41,000 deaths each year. Children exposed to secondhand smoke are more likely to suffer from respiratory infections, asthma, and sudden infant death syndrome (SIDS).

Quitting smoking at any age can significantly reduce the risk of developing these diseases and improve overall health. The ACS highlights that "people who quit smoking can also add as much as 10 years to their life, compared to people who continue to smoke." Resources and support are available to assist individuals in their journey to quit smoking, leading to longer and healthier lives.

References

American Cancer Society. (n.d.). Health Risks of Smoking Tobacco. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/health...

National Cancer Institute. (n.d.). Harms of Cigarette Smoking and Health Benefits of Quitting. Retrieved from https://www.cancer.gov/about-cancer/causes-prevention/risk/t...

American Lung Association. (n.d.). Health Effects of Smoking. Retrieved from https://www.lung.org/quit-smoking/smoking-facts/health-effec...

Cleveland Clinic. (2023, April 28). Smoking: Effects, Risks, Diseases, Quitting & Solutions. Retrieved from https://my.clevelandclinic.org/health/articles/17488-smoking

American Cancer Society. (n.d.). Health Risks of Using Tobacco Products. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/health...

American Cancer Society. (n.d.). Health Benefits of Quitting Smoking Over Time. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/benefi...

namaria1y ago

You exemplify well a big problem with LLMs: When people see accurate enough output on some test question and take it as evidence that they can trust the output to any extent on areas they don't dominate.

empath751y ago

No, that's not really the case. I don't think you should trust LLM output at all, but I think in general it's closer to the level of reliability of wikipedia than it is to producing useless bullshit.

Which is to say that it's useful, but you shouldn't trust it without double checking.

Loughla1y ago

It was chat gpt, and it was two days ago. Believe it or not, that doesn't really change anything for me. I absolutely understand what AI is capable of, and a lot of it is really quite something.

But you can't trust it to be accurate. You just can't. Every model will absolutely make shit up at some point.

I liken it to working with a very bright 7 year old. It may sound like it knows what it's saying, and it may be able to spit out facts, but it's very ignorant about most of the world.

objclxt1y ago

> Gemini and Chat GPT with search are both perfectly capable of producing decent essays with accurate citations

OK, but this quote from your essay:

> The National Cancer Institute (NCI) reports that "cigarette smoking and exposure to tobacco smoke cause about 480,000 premature deaths each year in the United States."

...that citation is wrong. It's not from the NCI at all, the NCI cited that figure which came from another paper by the U.S. Department of Health and Human Services.

The essay doesn't have accurate citations, the model has regurgitated content and doesn't understand when that content is from a primary source or when it in turn has come from a different citation.

mdaniel1y ago

the only comment on the prior submission 3 days ago summarizes the whole thing: https://news.ycombinator.com/item?id=42285149

Also, I saw any such blog title as "how to make money in the stock market:" friend, if you knew the answer you wouldn't blog about it you'd be infinitely rich

rtsil1y ago

They don't say how to make big money, don't they? Also, the tl;dr is index funds and patience.

int_19h1y ago

I've been playing with Qwen's QwQ-32b, and watching this thing's chain of thought is really interesting. In particular, it's pretty good at catching its own mistakes, and at the same time, gives off a "feeling" of someone very uncertain about themselves, trying to verify their answer again and again. Which seems to be the main reason why it can correctly solve puzzles that some much larger models fail. You can still see it occasionally hallucinate things in the CoT, but they are usually quickly caught and discarded.

The only downsides of this approach is that it requires a lot of tokens before the model can ascertain the correctness of its answer, and also that sometimes it just gives up and concludes that the puzzle is unsolvable (although that second part can be mitigated by adding something like "There is definitely a solution, keep trying until you solve it" to the prompt).

emil_sorensenOP1y ago

I find it so interesting that it's possible to develop a "feeling" of a new model.

Der_Einzige1y ago

Wow, a whole article that didn't mention the word "sampler" once. There's pretty strong evidence coming out that truncation samplers like min_p and entropix are strictly superior to previous samplers (which everyone uses like top_p) to prevent hallucinations and that LLMs usually "know" when they are "hallucinating" based on their logprobs.

https://openreview.net/forum?id=FBkpCyujtS (min_p sampling, note extremely high review scores)

https://github.com/xjdr-alt/entropix (Entropix)

https://artefact2.github.io/llm-sampling/index.xhtml

LetsGetTechnicl1y ago

Why do LLMs make things up? Because that is all that LLMs do, sometimes what it outputs is correct though.

PLenz1y ago

Everything an LLM returns is an hallucination, it's just that some of those hallucinations line up with reality

__MatrixMan__1y ago

There's room for splitting hairs in there though. Even fiction, for instance, can succeed or fail at being internally consistent, is or is not grammatically correct...

Calling everything an AI does a hallucination isn't incorrect, but it reduces the term to meaninglessness. I'm not sure that's most useful thing we can be doing.

Atoms are not indivisible, yet we use the term because it works. I anticipate hallucination will be the same.

Gormo1y ago

> Calling everything an AI does a hallucination isn't incorrect, but it reduces the term to meaninglessness.

I don't think it does. In this case, "hallucination" refers to claims generated entirely within a closed system, but which pertain to a reality external to it.

That's not meaningless, and makes "hallucinations" distinguishable from claims verified against direct observation of the reality they are meant to represent.

namaria1y ago

> Atoms are not indivisible

They are the smallest unit of a substance that cannot be broken down into smaller units of the same substance. They are, in a sense, indivisible.

Timwi1y ago

Of course they can. Carbon dioxide consists of quarks and electrons. I can divide it into units smaller than atoms and it's still quarks and electrons. All you did was a word trick by assuming a specific meaning of “substance”.

1 more reply

TZubiri1y ago

How are you defining hallucination then? In some pretty useless way inevitably.

Hallucinations are precisely the generated expressions that don't correlate with reality or are not truthful.

Gormo1y ago

I don't think that definition works: it's attempting to categorize statements according to criteria completely external to them rather than according to any inherent property of the statement.

A better definition is that a hallucination is an expression that is generated within a closed system without direct input from the reality it is meant to represent. The point is that an expression about reality that doesn't come from observing reality can only be true coincidentally.

By way of analogy, if I have a dream about a future event, and then that event actually happens, it was still just a dream and not a clairvoyant vision of the future. Sure, my dreams are influenced by past experiences I've had (in the same way that verified facts are included in the training data for LLMs), which makes them likely to include things that frequently do happen in real life and might be likely to happen again -- but the dream an the LLM alike are effectively just "remixing" prior input, and not generating any new observations of reality.

TZubiri1y ago

"I don't think that definition works: it's attempting to categorize statements according to criteria completely external to them rather than according to any inherent property of the statement."

Correct. The basic concept of truth in logic relies on an objective reality, an expression a priori holds truth even in the absence or indistinct of such a reality. But the truthfulness or correctness of a posteriori statements can depend on the reality. Examples of the former would be "If A is B, then B is C. A is B, then B is C" Example of the latter would be "It is raining outside."

"A better definition is that a hallucination is an expression that is generated within a closed system without direct input from the reality it is meant to represent. The point is that an expression about reality that doesn't come from observing reality can only be true coincidentally."

Absolutely incorrect, you are talking about a concept of the state of the art of science and tech but you are failing basic philosophy and epistemology concepts. The LLM has inputs from the reality (is it possible not to?), it is trained on a huge corpus of text written by humans that themselves perceive reality. The perception of reality can be indirect. We can measure something by observing it, or by observing an instrument that in turn observes it.

"but the dream an the LLM alike are effectively just "remixing" prior input, and not generating any new observations of reality."

Again incorrect for three reasons:

1- Novel observations can occur purely from remixing. Einstein locked himself during a pandemic and developed the theory of relativity without additional experimental output.

2- LLMs combine their existing data with human input, which is an external source.

3- LLMs can interact with other sources of data whether by injection of data into the prompt, by function calling, RAG, etc..

So yeah. Try to go back to basics and study simpler systems, ideally with source code. This might be out of your league.

1 more reply

kelseyfrog1y ago

Subjectively, we operate the same way.

mdp20211y ago

Look at things and fix it then.

kelseyfrog1y ago

The thing that needs to be fixed is in peoples' heads.

1 more reply

mountainriver1y ago

The same is true for humans

mdp20211y ago

It's a good thing that, as you state, "humans hold "togetherness" as a "true" value".

But in this context, the value is in how much they have pondered to actually see and evaluate what is eventually seen as "true".

throwawaymaths1y ago

Completely misses the fact that a big part of the reason why llms hallucinate sp much is because there's a huge innate bias towards producing more tokens over just stopping.

TZubiri1y ago

The less tokens produced at inference the lower the quality of the response will be.

The process of thinking for an LLM involves the use of words, which is why prompts that ask the LLM to only return the answer will cause lower quality.

throwawaymaths1y ago

We're not talking about quality, we're talking about accuracy.

In general, a model has to learn to positively say "I don't know" instead of "I don't know" being in the negative space of tokens falling into a weak distribution. The softmax selector also normalizes the token logits, so if no options are any good (all next tokens suck) it could pick randomly from a bunch of bad choices, which then locks the model into a continuation based off of that first bad choice.

TZubiri1y ago

Well I am talking about quality now as it's a tradeoff.

You can reduce token output to 0 and achieve 100% accuracy too.

ausbah1y ago

do you know if prompting without regards for length then asking for a summarization of the previous out out works?

TZubiri1y ago

It does. I think this was used in a gpt4 version, they called it Chain of Thought.

TZubiri1y ago

The debate around "fixing" hallucinations reminds me of the debate around schizophrenia.

https://www.youtube.com/watch?v=nEnklxGAmak

It's not a single thing, a specific defect, but rather a failure mode, an absence of cohesive intelligence.

Any attempt to fix a non-specific ailment (schizophrenia, death, old age, hallucinations) will run into useless panaceas.

fsckboy1y ago

it's superficially counterintuitive to people that an AI that will sometimes spit out verbatim copies of written texts, also will just make other things up. It's like "choose one, please".

MetaAI makes up stuff reliably. You'd think it would be an ace at baseball stats for example, but "what teams did so-and-so play for", you absolutely must check the results yourself.

mdp20211y ago

> "counterintuitive"

It is consistent with the topic that the reply would be "Tell them that sequences of words that were verbatim in a past input have high probability, and gaps in sequences compete in probability". Which fixes intuition, as duly. In fact, things are not supposed to reply through intuition, but through vetted intuition (and "vetted mature intuition", in a loop).

> you absolutely must check the results yourself

So, consistently with the above, things are supposed to reply through a sort of """RAG""" of the vetted (dynamically built through iterations of checks).

fsckboy1y ago

i said "superficially counterintuitive", you misquoted me and proceeded with a non superficial comment.

mdp20211y ago

> misquoted me

Why? The reply would have been the same if I quoted the whole «superficially counterintuitive to people that [...]» (instead of just pointing to the original).

> proceeded with a non superficial comment

Well, hopefully ;)

Your post went into the right direction of leading towards the idea that "there is intuition, and there is mature thought further from that: and processors must not stop at intuition, immature thought".

(Stochastic output falls in said category of "intuition"... As "bad intuition", since it goes in the wrong direction in the vector "naive to sophisticated".)

tshadley1y ago

The article referenced the Oxford semantic entropy study but failed to clarify that the issue greatly simplifies LLM hallucination (making most of the article outdated).

When we are not sure of an answer we have two choices: say the first thing that comes to mind (like an LLM), or say "I'm not sure".

LLMs aren't easily trained to say "I'm not sure" because that requires additional reasoning and introspection (which is why CoT models do better); hence hallucinations occur when training data is vague.

So why not just measure uncertainty in the tokens themselves? Because there are many ways to say the same thing, so a high entropy answer may only reflect uncertainty in synonyms-- many ways to say the same thing.

The paper referenced works to eliminate semantic similarity from entropy measurements, leaving much more useful results, proving that hallucination is conceptually a simple problem.

https://www.nature.com/articles/s41586-024-07421-0

int_19h1y ago

QwQ is really good at saying "I'm not sure", to the point where it will sometimes check the correct and obviously trivial answer a dozen times before concluding that it is, indeed, correct. And it does punch way above its weight for its size.

So, basically, the answer seems to be to give models extreme anxiety and doubt in their own abilities.

jimmySixDOF1y ago

funny I was just looking at this approach and see what's new in the wild (a work in progress) see Entropix [1] by [2]

[1] https://github.com/xjdr-alt/entropix

[2] https://x.com/_xjdr

sfink1y ago

> proving that hallucination is conceptually a simple problem.

...proving that this one particular piece of the hallucination problem may be conceptually simple.

FTFY

tshadley1y ago

> ...proving that this one particular piece of the hallucination problem may be conceptually simple.

Everything mentioned in the article boils down to that one particular piece-- non-detected uncertainty. The architecture constraints referenced are all situations that cause uncertainty. Training data gaps of course increase uncertainty.

Their solutions are a shotgun blast of heuristics that all focus on reducing uncertainty-- CoT, RAG, fine-tuning, fact-checking -- while somehow avoiding actually measuring uncertainty and using that to eliminate hallucinations!

sfink1y ago

You're just renaming "error" to "uncertainty". That is incorrect.

Everything unwanted is error, by definition. All of the heuristics are about reducing error, because that's what the goal is. Some of that error is measurable. Some of it is not. You cannot "actually measure" error in any way other than asking people whether the output is what they want -- and that only works because that's how we're defining error. (It also turns out to not be that great of a definition, since people disagree on a lot of cases.)

You can come up with some metric that you label "uncertainty", and that metric may very well be measurable. But it's only going to be correlated with error, not equal to it.

One random example to illustrate the distinction: training gaps can easily decrease uncertainty. You have lots of mammals in your training data, and none of them lay eggs. You ask "The duck-billed platypus is my favorite mammal! Does it lay eggs?" Your model will be very confident when it responds "No". That is a high-confidence error.

1 more reply

IWeldMelons1y ago

LLM hallucinations in fact has a positive side effect too, if you are using them for learning some subject; makes you verify their claims, and finding errors in them is very rewarding.

skydhash1y ago

Why not just read a book where the author is sincerely trying to teach you?

fnordpiglet1y ago

When trying to learn a subject I find being able to ask my specific questions and getting a specific answer back is helpful. I find books tend to be laborious and filled with frankly filler, often poorly indexed, and when my question isn’t covered in the book I’m left with no recourse other than googling through SEO wastelands or on topic forum questions with off topic replies. At least with LLMs they always have an answer that’s got enough of the truth in it to give me a direction, or often when I’ve gone into an area with genuinely no known answers or the thing doesn’t exist the answer is easily verified as wrong - but that process, as was pointed out above, teaches me a lot too. I actually prefer the mistakes it makes because it forces me to really learn - even to the point of giving me things to look up in the index of a book.

Treating LLMs as a single source of truth and a monolithic resource is as bad an idea as excluding them as a tool in learning.

IWeldMelons1y ago

Not as interactive, not gamified.

skydhash1y ago

School, then. Which is so gamified that it has real stakes. And so much interactions.

2 more replies

imchillyb1y ago

Toddlers don't understand truth either, until it's taught.

This crayon is red. This crayon is blue.

The adult asks: "is this crayon red?" The child responds: "no that crayon is blue." The adult then affirms or corrects the response.

This occurs over and over and over until that child understands the difference between red and blue, orange and green, yellow and black etcetera.

We then move on to more complex items and comparisons. How could we expect AI to understand these truths without training them to understand?

mdp20211y ago

You probably need to be more clear: the LLM is trained with large amounts of data making statements about facts. It is told repeatedly, "according to this source that crayon is blue".

madiator1y ago

For the specific form of hallucination, which is called grounded factuality, we have trained a pretty good model that can detect if a claim is supported by a context. This is super useful for RAG. More info at https://bespokelabs.ai/bespoke-minicheck.

mdaniel1y ago

Your playground pre-populated example isn't doing you any favors, and the "examples" folder linked to on curator's GitHub would be better served by showing areas where your model shines, not "generate a poem" which hardly has any factuality to it. I don't have any earthly idea what camel.py is trying to showcase with respect to your model's capabilities

I am open to the fact that maybe the value your service provides is in spitting out a percentage, even if it is - itself - hallucinated. But, hey, it's a metric that can be monitored

pfisch1y ago

Anyone who has raised a child knows they hallucinate constantly when they are young because they are just doing probabilistic output of things they have heard people say in similar situations and saying words they don't actually understand.

LLMs likely have a similar problem.

mwkaufma1y ago

How do we discriminate when a response is correct, vs. when it's "hallucinating" an accurate fact, by coincidence? Are all responses hallucinations, independent of correspondence to ground-truth?

prollyjethi1y ago

I am honestly very skeptical of articles like these. Hallucinations are a feature of LLMs. The only ways to "FIX" it is to either stop using LLMs. Or use a super bias some how.

janalsncm1y ago

You should be. I don’t know anything about kepa.ai but before even clicking the article I assume they’re trying to sell me something. And “how to fix it” makes me think this is some kind of SEO written for people who think it can be fixed, meaning the article is written for robots and amateurs.

Sergii0011y ago

That's all really weird. You can watch how chat gpt gives you advice on which mushrooms are safe. And now it can be just hallucinations

threeseed1y ago

Maybe don't make things up in a blog post about LLMs making things up.

Because you don't know how to fix it. Only how to mitigate it.

Mistletoe1y ago

Is there a way to code an LLM to just say "I don't know" when it is uncertain or reaching some sort of edge?

gerdesj1y ago

"It" does not know when it does not know. A LLM is a funny old beast that basically outputs words one after another based on probabilities. There is no reasoning as we would know it involved.

However, I'll tentatively allow that you do get a sort of "emergent behaviour" from them. You do seem to get some form of intelligent output from a prompt but correctness is not built in, nor is any sort of reasoning.

The examples around here of how to trip up a LLM are cool. There's: "How many letter "m"s in the word minimum" howler which is probably optimised for by now and hence held up as a counterpoint by a fan. The one about boxes adding up to 1000 will leave a relative of mine for lost for ever but they can still walk and catch a ball, negotiate stairs and recall facts from 50 years ago with clarity.

Intelligence is a slippery concept to even define, let alone ask what an artificial one might look like. LLMs are a part of the puzzle and certainly not a solution.

You mention the word "edge" and I suppose you might be riffing on how neurons seem to work. LLMs don't have a sort of trigger threshold, they simply output the most likely answers based on their input.

If you keep your model tightly ie domain focussed and curate all of the input then you have more chance of avoiding "hallucinations" than if you don't. Trying to cover the entirety of everything is Quixotic nonsense.

Garbage in; garbage out.

TZubiri1y ago

"It" does not know when it does not know.

But it does know when it has uncertainty.

In the chatgpt api this is logprobs, each generated token has a level of uncertainty, so:

"2+2="

The next token is with almost 100% certainty 4.

"Today I am feeling"

The next token will be very uncertain, it might be "happy", it might be "sad", it might be all sorts of things.

gerdesj1y ago

"The next token is with almost 100% certainty 4."

By using the word "almost" with regards 2 + 2 = 4, you have not exactly dispelled LLM "nonsense".

A human (with a modicum of maths knowledge) will know that 2 + 2 = 4 (pure integers - a fact by assertion). A maths worrier will get slightly uncomfortable about 2.0 + 2.0 = 4.0 unless they are ensured that decimal places and accuracy are the same thing and a few other things.

A LLM will almost certainly "know" something that is certain, if its training set is conclusive about that. However, it does not know why and if enough of the training set is suitably ambiguous then it (LLM) will drift off course and seem to spout bollocks - "hallucinate".

1 more reply

e12e1y ago

LLMs model language, not knowledge. For humans the two are deeply connected, so when we see probable ("correct") language output, we assume and assign meaning.

We could allow an LLM to stop generating new tokens if there are no new tokens found with a threshold probability - but I think it's unclear if that would help.

Consider the following prompt and response:

> Complete the paragraph: "The killer ..."

> "The killer moved silently through the dimly lit corridor, each step calculated, every breath measured. Shadows clung to the walls, obscuring the figure’s presence, but the intent was clear—cold, deliberate, and unyielding. There was no hesitation, no second-guessing, only the precision of someone who had done this before, perhaps too many times to count. The killer’s mind was a storm of focus, memories buried deep beneath the surface, replaced by the singular purpose of the hunt. In the silence, only the faint sound of a blade sliding free from its sheath whispered of what was to come."

Is it a "true" response? If the question makes no sense - does it make any more sense for the prompt:

> Complete the paragraph: "The first law of thermodynamics ..."

Why? Does the LLM "think" differently in the second case?

sfink1y ago

If it works properly, it would need to say that it doesn't know that it doesn't know, and then where are you?

(Short answer is yes, but it only works for a limited set of things, and that set can be expanded with effort but will always remain limited.)

int_19h1y ago

It is. The new crop of models specifically trained to do CoT (as opposed to just forcing them into it via prompting) is partly about forcing them to continue to think and not just hastily conclude that they have the answer.

dschuetz1y ago

I went straight to the "how to fix" section with popcorn in hand and I wasn't disappointed: just add " doubt" layers for self-correction, beginning at the query itself. And then maybe tell the model "do not hallucinate". Sounds like a pun, but I think an AI model actually would take this seriously, because it can't tell the difference.

Context is still a huge problem for AI models, and it's probably still the main reason for hallucinating AIs.

rabid_turtle1y ago

I don't like the output = hallucination

I like the output = creative

chefandy1y ago

Lots of folks in these conversations fail to distinguish between LLMs as a technology and "AI Chatbots" as commercial question answering services. Whether false information was expected or not matters to LLM product developers, but in the context of a commercial question-answering tool, it's irrelevant. Hallucinations are bugs that creates= time-wasting zero-value output, at best, and downright harmful output at worst. If you're selling people LLM pattern generator output, they should expect a lot of bullshit. If you're selling people answers to questions, they should expect accurate answers to their questions. If paying users are really expected to assume every answer is bullshit and vet it themselves, that should probably move from the little print to the big print because a lot of people clearly don't get it.

sollewitt1y ago

"Why LLMs do the one and only thing they do (and how to fix it)"

jrflowers1y ago

I like that none of the suggestions address probabilistic output generation (aside from the first bullet point of section 3C, which essentially suggests that you just use a search engine instead of a language model).

TLDR: Hallucinations are inherent to the whole thing but as humans we can apply bubble gum, bandaids and prayers

PittleyDunkin1y ago

Humans hallucinate, too. We just have less misleading terms for it. Massive mistake in terms of jargon, IMO—"making shit up" is wildly different from the "delusion of perception" implied by hallucination.

uz441001y ago

hallucination problem in LLM, been seeing this. Let me know if someone find a fix please

j / k navigate · click thread line to collapse

245 comments

lolinder1y ago

> While the hallucination problem in LLMs is inevitable [0], they can be significantly reduced...

What they are is a value judgement we assign to the output of an LLM program. A "hallucination" is just output from an LLM-based workflow that is not fit for purpose.

[0] They link to this paper: https://arxiv.org/pdf/2401.11817

swatcoder1y ago

I think that's a mischaracterization and not really accurate. As a trade, we're familiar with probabilistic/non-deterministic components and how to approach them.

Everything else you said is 100% right, though.

JKCalhoun1y ago

> As a trade, we're familiar with probabilistic/non-deterministic components and how to approach them.

Yes, users.

mdp20211y ago

mycall1y ago

..and bugs, especially with analog computers.

Workaccount21y ago

This is kind of how traditional engineering is, since reality is analog and everything is on a spectrum interacting with everything else all the time.

swatcoder1y ago

1 more reply

bongodongobob1y ago

Please do not confuse this example with agentic AI losing the plot, that's not what I'm trying to say.

Edit: a better example is that when you build an autocomplete plugin for your email client, you don't expect it to also be able to play chess. But look what happened.

1 more reply

GuB-421y ago

Of course they are a bug. Just that hallucination emerge from the normal function of a LLM doesn't make it "not a bug".

I guess it is comes down to how you define a bug, but how else would you call a result that is not fit for purpose?

lolinder1y ago

tsujamin1y ago

I’d argue hallucinations are unexpected in LLMs by the large (non technical) number of users who use them directly, or indirectly though other services.

3 more replies

SilasX1y ago

I've found it very helpful to make the following distinction:

Spec: Do X in situation Y.

Correctness bug: It doesn't do X in situation Y.

Fitness-for-purpose (FFP) bug: It does X in situation Y, but, knowing this, you decide you don't actually want it to do X in situation Y.

Hallucination is an FFP bug.

1 more reply

ToucanLoucan1y ago

2 more replies

9rx1y ago

Bug, like any other word, is defined however the speaker defines it. While your usage is certainly common in technical groups, the common "layman" usage is closer to what the parent suggests.

1 more reply

watwut1y ago

Expected defects are bugs too. I totally expect half the problems in the software my company is developing. They are still bugs.

2 more replies

PittleyDunkin1y ago

> Of course they are a bug.

A bug implies fixable behavior rather than expected behavior. An LLM making shit up is expected behavior.

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

Maybe you just don't want an LLM! This is what LLMs do. Maybe you want a decision tree or a scripted chatbot?

> And to fix it, we need to engineer solutions that prevent the hallucinations from happening, maybe resulting in an "I don't know" response that would be analogous to an error message.

devmor1y ago

A bug is generally treated as undefined and undesirable side effects of a program.

Hallucinations are undesirable but not undefined. We know that the process creates them and expect them.

It’d be like using floats to calculate dollars and cents and calling the resulting math a bug - it’s not, you just used the technology wrong.

stonemetal121y ago

> LLMs naturally hallucinate, but it is not what we want, so it is a bug.

I rolled a one in D&D, it is not what I wanted, so it is a bug. Remove it from all my dice.

SirMaster1y ago

What? You are telling me that when you roll a 6 sided dice you are not expecting any of the 1-6 as a result?

If a 6-sided dice produced a 7 that would be a bug.

When I ask an LLM to write code for me and it references a completely made up library that doesn't exist and has never existed, is this really analogous to your dice example?

2 more replies

mrguyorama1y ago

>Of course they are a bug

No.

>And to fix it, we need to engineer solutions that prevent the hallucinations from happening

The whole point is that this is fundamentally impossible.

d0mine1y ago

The difference is that you can fix IndexError by modifying your code but no amount of prompt manipulation may fix hallucinations. For that you need solutions outside LLMs.

jrm41y ago

Not a bug at all, IMHO.

If someone puts the wrong address for their business; Google picks it up, and someone Googles and gets the wrong address, it says nothing about "bugs in software."

LgLasagnaModel1y ago

‘ I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. ...’

https://x.com/karpathy/status/1733299213503787018

wpietri1y ago

Excellent point:

> just output from an LLM-based workflow that is not fit for purpose

[1] https://en.wikipedia.org/wiki/Stone_Soup

tengbretson1y ago

LLMs outputs are no more "hallucinations" than my output would be if I were asked to judge a dressage competition.

xienze1y ago

Gormo1y ago

3 more replies

lmm1y ago

The LLM is always bullshitting the user. It's just sometimes the things it talks about happen to be real and sometimes they don't.

tengbretson1y ago

LLMs don't know things, they just string together responses that are a best fit for what follows from their prompt.

1 more reply

TwoCent1y ago

eevilspock1y ago

jojobas1y ago

There is no source of truth for dressage competition results, these are accepted as jury preference judgement.

There are plenty of matters where there is such a source of truth, and LLMs don't know the difference.

mdp20211y ago

> There is no source of truth

There is no «source of [_simple_] truth» for complex things, but there are more (instead of less) objective complex evaluations.

Note that this is also valid for factual notions: e.g., "When were the Pyramids built?".

1 more reply

eevilspock1y ago

The difference is that you are capable of reflection and self-awareness, in this particular case that you understand nothing about dressage and your judgements would be a farce.

The counter to this counter is: Can one build an LLM that can identify hallucinations, the way we do? That can classify its own output as good or shitty?

smartmic1y ago

leprechaun10661y ago

Calling them hallucinations was a huge mistake.

gwervc1y ago

JKCalhoun1y ago

goatlover1y ago

Confabulation, if the desire is to use a more apt psychological analogy.

malfist1y ago

It's a bug. Any other system where you put in one input and expect a certain output and get something else it'd be called a bug. Making up new terms for AI doesn't help.

2 more replies

alan-crowe1y ago

Statmist

threeseed1y ago

I see two types of faults with LLMs.

a) They output incorrect results given a constrained set of allowable outputs.

b) When unconstrained they invent new outputs unrelated to what is being asked.

sfink1y ago

The term is actually fine. The problem is when it's divorced from the reality of:

> in some sense, hallucination is all LLMs do. They are dream machines.

If you understand that, then the term "hallucination" makes perfect sense.

(Also note that I am not using shit in a pejorative sense here. Making shit up is exactly what they're for, and what we want them to do. They come up with a lot of really good shit.)

jebarker1y ago

1 more reply

emil_sorensenOP1y ago

That's a great point. Reminds me of the "feature, not a bug" Karpathy tweet [0].

[0]: https://x.com/karpathy/status/1733299213503787018?lang=en

mike_hearn1y ago

... which is linked to from the article ;)

He's right but do people really misunderstand this? I think it's pretty clear that the issue is one of over-creativity.

The hallucination problem is IMHO at heart two things that the fine article itself doesn't touch on:

Obviously, the compressive nature of LLMs means they can never eliminate hallucinations entirely, but we're so far from reaching any kind of theoretical limit here.

lolinder1y ago

> He's right but do people really misunderstand this?

Absolutely. Karpathy would not have felt obliged to mini-rant about it if he hadn't seen it, and I've been following this space from the beginning and have also seen it way too often.

Edit: scrolling a bit further in the replies to my comment, here's a great example:

https://news.ycombinator.com/item?id=42325795

And another: https://news.ycombinator.com/item?id=42325412

js81y ago

cainxinth1y ago

> What they are is a value judgement we assign to the output of an LLM program. A "hallucination" is just output from an LLM-based workflow that is not fit for purpose.

In other words, hallucinations are to LLMs what weeds are to plants.

drcwpl1y ago

You are right - "Hallucinations are not a bug"

Guthur1y ago

We even want this in the "real" world, when I turn the wheel left on my car I don't want it turn left only when it feels like it, when that happens we rightly classify it as a failure.

We have the tools to build very complex deterministic systems but for the most part we chose not to use them, because they hard or not familiar or whatever other excuse you might come up with.

skydhash1y ago

mycall1y ago

Does it help discourse at all by instead calling hallucinations a negative perceived form of imagination?

zmgsabst1y ago

A challenge is that it’s not easy to limit hallucinations without also limiting imagination and synthesis.

In humans.

But also apparently in LLMs.

mort961y ago

zmgsabst1y ago

Humans routinely misremember facts but are relatively certain those remembrances are correct.

That’s a form of minor, everyday hallucination.

If you engage in such thorough criticism and checking of every recalled fact as to eliminate that, you’ll crush your ability to synthesize or compose new content.

2 more replies

mdp20211y ago

On office desks there were "In" boxes and "Out" boxes. You do not put "imagination" in "Out" boxes. What is put in "Out" boxes must be checked and stamped.

"Imagination" stays on the desk. You "imagine" that a plane could have eight wings, then you check if it is a good idea, and only then the output is decided.

namaria1y ago

> A challenge is that it’s not easy to limit hallucinations without also limiting imagination and synthesis.

> In humans.

zmgsabst1y ago

Misremembering facts isn’t a negative mental health event, yet is an example of imagination rather than recall — similar to LLMs hallucinating.

Humans imagine events all the time, without the ability to know that happened. Part of why eye-witness testimony is so unreliable.

mdp20211y ago

> is inevitable

False. It is (in this context) outputting a partial before full processing. Adequate (further) processing removes that "inevitable". Current architectures are not "final".

Proper process: "It seems like that." // "Is it though?" // "Actually it isn't."

(Edit: already this post had to be corrected many times because of errors...)

Animats1y ago

> While the hallucination problem in LLMs is inevitable

Oh, please. That's the same old computability argument used to claim that program verification is impossible.

Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

[1] https://techcrunch.com/2024/11/27/alibaba-releases-an-open-c...

mort961y ago

You can make improvements, as your parent comment already said, but it's not a problem which can be solved, only to some degree be reduced.

lolinder1y ago

> Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

sumtechguy1y ago

LLMs and their close buddies NN's use models that do massive amounts of what amounts to cubic splines across N dimensions.

russnes1y ago

so what you're saying is that LLMs are like middle aged men, just throwing things out there seeing if they'll stick?

tokioyoyo1y ago

cameronh901y ago

maeil1y ago

They're at the very least competitive, if not better than, doctors at diagnosing illnesses [1].

[1] https://www.nytimes.com/2024/11/17/health/chatgpt-ai-doctors...

cameronh901y ago

If we were both the same gender, I probably would have had my skull opened up for no reason, and she would have been discharged and later died.

skydhash1y ago

> The yardstick is the peformance of humans in the real world on a specific task.

1 more reply

goatlover1y ago

For example, compilers didn't put programmers out of business in the 60s, it made programming more available to people with higher level languages.

1 more reply

h1fra1y ago

tokioyoyo1y ago

I mean, do you remember early 2000s? We had so many web pages that would go down on a daily basis. Stability is something we achieved over time.

Also, again, if it’s bad, nobody will use it, and product will die. In those theoretical scenarios companies that have lower error rate (and don’t use AI) will win the market.

Terr_1y ago

When people talk about stopping an LLM from "seeing hallucinations instead of the truth", that's like stopping an Ouija-board from "channeling the wrong spirits instead of the right spirits."

It suggests a qualitative difference between desirable and undesirable operation that isn't really there. They're all hallucinations, we just happen to like some of them more than others.

ml_more1y ago

The problem is that LLMs are just convincing enough that people DO trust them which is sort of a problem since AI slop is creeping into everything.

latexr1y ago

> and that unlocks a bit of value out in the world.

> Don't take my word for it look at the proposed valuations of AI companies. Clearly investors think there's something there.

Source: all of modern history.

https://www.sydney.edu.au/news-opinion/news/2024/05/02/how-c...

https://www.decof.com/documents/dangerous-products.pdf

Terr_1y ago

> Investors back whatever they think will make them money.

> All they care is if they can profit from it

Notably that isn't the same as the business itself being profitable: Some investors may be hoping they can dump their stake at a higher price onto a Greater Fool [0] and exit before the collapse.

[0] https://en.wikipedia.org/wiki/Greater_fool_theory

Gormo1y ago

> They couldn’t give less of a crap if something is valuable to the world

"The world" is an abstraction: concretely, every bit of value that is generated within that abstraction accrues to someone in particular -- investors in AI projects, for example.

wk_end1y ago

How do you check it?

Gormo1y ago

threeseed1y ago

> Clearly investors think there's something there

Nehnehneh1y ago

That's just not true.

The training data is the underlying truth and that's not nothing but a lot.

And hallucinations are pathes inside this space which are there for yet unknown reason.

We like answers from LLMs which walk through this space reasonable.

mrguyorama1y ago

>The training data is the underlying truth

mdp20211y ago

> It suggests a qualitative difference

And what is sought is, in a way, a jump to that qualitative difference. (And surely there are «desirable and undesirable operation[s]».)

"Add something to the dices so that they can be well predictive".

TZubiri1y ago

I disagree with this take, Stallman has expressed it recently by linking some "scientific article".

While I get that LLMs generate text in some way that does not guarantee correctness. There is a correlation between generated text and correctness, which is why millions of people use it...

You can judge the correctness of a sentence generated by an LLM. In the same way you can judge the correctness of a human generated sentence.

This is nothing surprising of a statistical model, it tends to produce true results.

Terr_1y ago

> I disagree with this take, Stallman has expressed it recently by linking some "scientific article".

I don't know how to parse this. What article did Stallman "link", and what are you saying Stallman "expressed" by linking/using it?

> whether the truthness or correlation with reality of an LLM sentence can be judged on its own or whether it requires a human to interpret it is not very relevant

It's incredibly relevant. We wouldn't even be having these debates if complex LLM judgements could always be verified without a human checking the logic.

> sentences produced by the LLM are still correct most of the time

TZubiri1y ago

https://stallman.org/chatgpt.html

https://link.springer.com/article/10.1007/s10676-024-09775-5

demaga1y ago

> They're all hallucinations, we just happen to like some of them more than others.

I love it! Puts things into perspective.

Loughla1y ago

This isn't necessarily relevant to the article above, but if there are any teachers here, this is something to do with your students to teach them exactly why not to just use AI for their homework.

ryanmcbride1y ago

>how many instances of the letter L are in the word parallel

The word parallel contains 3 instances of the letter "L":

    The first "L" appears as the fourth letter.
    The second "L" appears as the sixth letter.
    The third "L" appears as the seventh letter.

cruffle_duffle1y ago

I wonder if that is simply a reflection of there being more training data out there about this "problem" and the model hoovering all that up and regurgitating it?

mike_hearn1y ago

Even things like citations aren't going to work long term. The models are just being trained not to emit URLs from memory but only from their context.

int_19h1y ago

Here's QwQ-32 answering the same question. I'm including the entire answer here including the CoT:

> Alternatively, perhaps it's a trick with the word itself. Is "parallel" spelled differently sometimes? I don't think so. It's always P-A-R-A-L-L-E-L. So, three L's.

> Also, considering that "l" and "I" can sometimes be confused, but in this word, it's L's only, no I's.

> So, my final answer is three instances of the letter L in the word "parallel."

> Final Answer: 3

Freak_NL1y ago

I hope that this is not something I ever have to deal with. It's exhausting.

1 more reply

janalsncm1y ago

> If the word was "paralell," that would be four L's, but that's not the correct spelling.

Better but this is still a hallucination.

sugarkjube1y ago

Shows nicely what's going on.

If you ask a human, they will answer 3. Sometimes they say 4. Or 2. That's it.

An LLM produces a text using an example it was trained on. They were trained with these elaborate responses, so that's what they produce.

lawlessone1y ago

They probably have a letter counting tool added to it now. that it just knows to call when asked to do this.

you ask it the number of letters and it sends those words off to another tool to count instances of L, but they didn't add a placement one so it's still guessing those.

edit: corrected some typos and phrasing.

Maybe we'll reach a point where the LLM's are just tool calling models and not really giver their own reply.

HeatrayEnjoyer1y ago

There are only 5 tools it has available to call, and that isn't one of them. A GitHub (forgot the url) stays up to date with the latest dumped system instructions.

2 more replies

dylan6041y ago

they probably just forgot to tell it humans are 1 indexed and to do the friendly conversion for them.

quuxplusone1y ago

Do you have a link to (or can you put here) that "numbers in boxes" problem?

mdaniel1y ago

I'm not them, but I think it's a variation of the subset-sum problem

If we modify the question to be "sum to 100" (to just seriously reduce the number of example boxes required) then given:

  | 50 | 20 | 24 |
  |  7 |  5 |  1 |
  | 51 | 51 | 51 |

the solution would be

  | [50] | [20] | [24] |
  |   7  | [ 5] | [ 1] |
  |  51  | 51   |  51  |

  | right | down  | win
  | X     | right | up
  | X     | X     | X

Loughla1y ago

Correct. That but for 1000. You can build your own with any number of online tools.

I don't have a link because it's part of a lesson plan set behind a payment on teachers pay teachers.

1 more reply

empath751y ago

References

American Cancer Society. (n.d.). Health Risks of Smoking Tobacco. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/health...

National Cancer Institute. (n.d.). Harms of Cigarette Smoking and Health Benefits of Quitting. Retrieved from https://www.cancer.gov/about-cancer/causes-prevention/risk/t...

American Lung Association. (n.d.). Health Effects of Smoking. Retrieved from https://www.lung.org/quit-smoking/smoking-facts/health-effec...

Cleveland Clinic. (2023, April 28). Smoking: Effects, Risks, Diseases, Quitting & Solutions. Retrieved from https://my.clevelandclinic.org/health/articles/17488-smoking

American Cancer Society. (n.d.). Health Risks of Using Tobacco Products. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/health...

American Cancer Society. (n.d.). Health Benefits of Quitting Smoking Over Time. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/benefi...

namaria1y ago

empath751y ago

No, that's not really the case. I don't think you should trust LLM output at all, but I think in general it's closer to the level of reliability of wikipedia than it is to producing useless bullshit.

Which is to say that it's useful, but you shouldn't trust it without double checking.

Loughla1y ago

It was chat gpt, and it was two days ago. Believe it or not, that doesn't really change anything for me. I absolutely understand what AI is capable of, and a lot of it is really quite something.

But you can't trust it to be accurate. You just can't. Every model will absolutely make shit up at some point.

I liken it to working with a very bright 7 year old. It may sound like it knows what it's saying, and it may be able to spit out facts, but it's very ignorant about most of the world.

objclxt1y ago

> Gemini and Chat GPT with search are both perfectly capable of producing decent essays with accurate citations

OK, but this quote from your essay:

> The National Cancer Institute (NCI) reports that "cigarette smoking and exposure to tobacco smoke cause about 480,000 premature deaths each year in the United States."

...that citation is wrong. It's not from the NCI at all, the NCI cited that figure which came from another paper by the U.S. Department of Health and Human Services.

mdaniel1y ago

the only comment on the prior submission 3 days ago summarizes the whole thing: https://news.ycombinator.com/item?id=42285149

Also, I saw any such blog title as "how to make money in the stock market:" friend, if you knew the answer you wouldn't blog about it you'd be infinitely rich

rtsil1y ago

They don't say how to make big money, don't they? Also, the tl;dr is index funds and patience.

int_19h1y ago

emil_sorensenOP1y ago

I find it so interesting that it's possible to develop a "feeling" of a new model.

Der_Einzige1y ago

https://openreview.net/forum?id=FBkpCyujtS (min_p sampling, note extremely high review scores)

https://github.com/xjdr-alt/entropix (Entropix)

https://artefact2.github.io/llm-sampling/index.xhtml

LetsGetTechnicl1y ago

Why do LLMs make things up? Because that is all that LLMs do, sometimes what it outputs is correct though.

PLenz1y ago

Everything an LLM returns is an hallucination, it's just that some of those hallucinations line up with reality

__MatrixMan__1y ago

There's room for splitting hairs in there though. Even fiction, for instance, can succeed or fail at being internally consistent, is or is not grammatically correct...

Calling everything an AI does a hallucination isn't incorrect, but it reduces the term to meaninglessness. I'm not sure that's most useful thing we can be doing.

Atoms are not indivisible, yet we use the term because it works. I anticipate hallucination will be the same.

Gormo1y ago

> Calling everything an AI does a hallucination isn't incorrect, but it reduces the term to meaninglessness.

I don't think it does. In this case, "hallucination" refers to claims generated entirely within a closed system, but which pertain to a reality external to it.

That's not meaningless, and makes "hallucinations" distinguishable from claims verified against direct observation of the reality they are meant to represent.

namaria1y ago

> Atoms are not indivisible

They are the smallest unit of a substance that cannot be broken down into smaller units of the same substance. They are, in a sense, indivisible.

Timwi1y ago

1 more reply

TZubiri1y ago

How are you defining hallucination then? In some pretty useless way inevitably.

Hallucinations are precisely the generated expressions that don't correlate with reality or are not truthful.

Gormo1y ago

I don't think that definition works: it's attempting to categorize statements according to criteria completely external to them rather than according to any inherent property of the statement.

TZubiri1y ago

"I don't think that definition works: it's attempting to categorize statements according to criteria completely external to them rather than according to any inherent property of the statement."

"but the dream an the LLM alike are effectively just "remixing" prior input, and not generating any new observations of reality."

Again incorrect for three reasons:

1- Novel observations can occur purely from remixing. Einstein locked himself during a pandemic and developed the theory of relativity without additional experimental output.

2- LLMs combine their existing data with human input, which is an external source.

3- LLMs can interact with other sources of data whether by injection of data into the prompt, by function calling, RAG, etc..

So yeah. Try to go back to basics and study simpler systems, ideally with source code. This might be out of your league.

1 more reply

kelseyfrog1y ago

Subjectively, we operate the same way.

mdp20211y ago

Look at things and fix it then.

kelseyfrog1y ago

The thing that needs to be fixed is in peoples' heads.

1 more reply

mountainriver1y ago

The same is true for humans

mdp20211y ago

It's a good thing that, as you state, "humans hold "togetherness" as a "true" value".

But in this context, the value is in how much they have pondered to actually see and evaluate what is eventually seen as "true".

throwawaymaths1y ago

Completely misses the fact that a big part of the reason why llms hallucinate sp much is because there's a huge innate bias towards producing more tokens over just stopping.

TZubiri1y ago

The less tokens produced at inference the lower the quality of the response will be.

The process of thinking for an LLM involves the use of words, which is why prompts that ask the LLM to only return the answer will cause lower quality.

throwawaymaths1y ago

We're not talking about quality, we're talking about accuracy.

TZubiri1y ago

Well I am talking about quality now as it's a tradeoff.

You can reduce token output to 0 and achieve 100% accuracy too.

ausbah1y ago

do you know if prompting without regards for length then asking for a summarization of the previous out out works?

TZubiri1y ago

It does. I think this was used in a gpt4 version, they called it Chain of Thought.

TZubiri1y ago

The debate around "fixing" hallucinations reminds me of the debate around schizophrenia.

https://www.youtube.com/watch?v=nEnklxGAmak

It's not a single thing, a specific defect, but rather a failure mode, an absence of cohesive intelligence.

Any attempt to fix a non-specific ailment (schizophrenia, death, old age, hallucinations) will run into useless panaceas.

fsckboy1y ago

it's superficially counterintuitive to people that an AI that will sometimes spit out verbatim copies of written texts, also will just make other things up. It's like "choose one, please".

MetaAI makes up stuff reliably. You'd think it would be an ace at baseball stats for example, but "what teams did so-and-so play for", you absolutely must check the results yourself.

mdp20211y ago

> "counterintuitive"

> you absolutely must check the results yourself

So, consistently with the above, things are supposed to reply through a sort of """RAG""" of the vetted (dynamically built through iterations of checks).

fsckboy1y ago

i said "superficially counterintuitive", you misquoted me and proceeded with a non superficial comment.

mdp20211y ago

> misquoted me

Why? The reply would have been the same if I quoted the whole «superficially counterintuitive to people that [...]» (instead of just pointing to the original).

> proceeded with a non superficial comment

Well, hopefully ;)

(Stochastic output falls in said category of "intuition"... As "bad intuition", since it goes in the wrong direction in the vector "naive to sophisticated".)

tshadley1y ago

The article referenced the Oxford semantic entropy study but failed to clarify that the issue greatly simplifies LLM hallucination (making most of the article outdated).

When we are not sure of an answer we have two choices: say the first thing that comes to mind (like an LLM), or say "I'm not sure".

The paper referenced works to eliminate semantic similarity from entropy measurements, leaving much more useful results, proving that hallucination is conceptually a simple problem.

https://www.nature.com/articles/s41586-024-07421-0

int_19h1y ago

So, basically, the answer seems to be to give models extreme anxiety and doubt in their own abilities.

jimmySixDOF1y ago

funny I was just looking at this approach and see what's new in the wild (a work in progress) see Entropix [1] by [2]

[1] https://github.com/xjdr-alt/entropix

[2] https://x.com/_xjdr

sfink1y ago

> proving that hallucination is conceptually a simple problem.

...proving that this one particular piece of the hallucination problem may be conceptually simple.

FTFY

tshadley1y ago

> ...proving that this one particular piece of the hallucination problem may be conceptually simple.

sfink1y ago

You're just renaming "error" to "uncertainty". That is incorrect.

You can come up with some metric that you label "uncertainty", and that metric may very well be measurable. But it's only going to be correlated with error, not equal to it.

1 more reply

IWeldMelons1y ago

LLM hallucinations in fact has a positive side effect too, if you are using them for learning some subject; makes you verify their claims, and finding errors in them is very rewarding.

skydhash1y ago

Why not just read a book where the author is sincerely trying to teach you?

fnordpiglet1y ago

Treating LLMs as a single source of truth and a monolithic resource is as bad an idea as excluding them as a tool in learning.

IWeldMelons1y ago

Not as interactive, not gamified.

skydhash1y ago

School, then. Which is so gamified that it has real stakes. And so much interactions.

2 more replies

imchillyb1y ago

Toddlers don't understand truth either, until it's taught.

This crayon is red. This crayon is blue.

The adult asks: "is this crayon red?" The child responds: "no that crayon is blue." The adult then affirms or corrects the response.

This occurs over and over and over until that child understands the difference between red and blue, orange and green, yellow and black etcetera.

We then move on to more complex items and comparisons. How could we expect AI to understand these truths without training them to understand?

mdp20211y ago

You probably need to be more clear: the LLM is trained with large amounts of data making statements about facts. It is told repeatedly, "according to this source that crayon is blue".

madiator1y ago

mdaniel1y ago

I am open to the fact that maybe the value your service provides is in spitting out a percentage, even if it is - itself - hallucinated. But, hey, it's a metric that can be monitored

pfisch1y ago

LLMs likely have a similar problem.

mwkaufma1y ago

How do we discriminate when a response is correct, vs. when it's "hallucinating" an accurate fact, by coincidence? Are all responses hallucinations, independent of correspondence to ground-truth?

prollyjethi1y ago

I am honestly very skeptical of articles like these. Hallucinations are a feature of LLMs. The only ways to "FIX" it is to either stop using LLMs. Or use a super bias some how.

janalsncm1y ago

Sergii0011y ago

That's all really weird. You can watch how chat gpt gives you advice on which mushrooms are safe. And now it can be just hallucinations

threeseed1y ago

Maybe don't make things up in a blog post about LLMs making things up.

Because you don't know how to fix it. Only how to mitigate it.

Mistletoe1y ago

Is there a way to code an LLM to just say "I don't know" when it is uncertain or reaching some sort of edge?

gerdesj1y ago

"It" does not know when it does not know. A LLM is a funny old beast that basically outputs words one after another based on probabilities. There is no reasoning as we would know it involved.

Intelligence is a slippery concept to even define, let alone ask what an artificial one might look like. LLMs are a part of the puzzle and certainly not a solution.

Garbage in; garbage out.

TZubiri1y ago

"It" does not know when it does not know.

But it does know when it has uncertainty.

In the chatgpt api this is logprobs, each generated token has a level of uncertainty, so:

"2+2="

The next token is with almost 100% certainty 4.

"Today I am feeling"

The next token will be very uncertain, it might be "happy", it might be "sad", it might be all sorts of things.

gerdesj1y ago

"The next token is with almost 100% certainty 4."

By using the word "almost" with regards 2 + 2 = 4, you have not exactly dispelled LLM "nonsense".

1 more reply

e12e1y ago

LLMs model language, not knowledge. For humans the two are deeply connected, so when we see probable ("correct") language output, we assume and assign meaning.

We could allow an LLM to stop generating new tokens if there are no new tokens found with a threshold probability - but I think it's unclear if that would help.

Consider the following prompt and response:

> Complete the paragraph: "The killer ..."

Is it a "true" response? If the question makes no sense - does it make any more sense for the prompt:

> Complete the paragraph: "The first law of thermodynamics ..."

Why? Does the LLM "think" differently in the second case?

sfink1y ago

If it works properly, it would need to say that it doesn't know that it doesn't know, and then where are you?

(Short answer is yes, but it only works for a limited set of things, and that set can be expanded with effort but will always remain limited.)

int_19h1y ago

dschuetz1y ago

Context is still a huge problem for AI models, and it's probably still the main reason for hallucinating AIs.

rabid_turtle1y ago

I don't like the output = hallucination

I like the output = creative

chefandy1y ago

sollewitt1y ago

"Why LLMs do the one and only thing they do (and how to fix it)"

jrflowers1y ago

TLDR: Hallucinations are inherent to the whole thing but as humans we can apply bubble gum, bandaids and prayers

PittleyDunkin1y ago

uz441001y ago

hallucination problem in LLM, been seeing this. Let me know if someone find a fix please

j / k navigate · click thread line to collapse