svara 5mo ago

In my experience, the best models are already nearly as good as they can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.

The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.

It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.

But to me it's very clear that the product that gets this right will be the one I use.

stacktrace 5mo ago

> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

Exactly! One important thing LLMs have made me realise deeply is that "no information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me. I suppose that's expected, since in the end it's generating tokens based on its training and it's reasonable it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.

IMO if LLMs need to focus on anything right now, they should focus on better grounding. Maybe even something like a probability/confidence score might end up making the experience so much better for so many users like me.

biofox 5mo ago

I ask for confidence scores in my custom instructions / prompts, and LLMs do surprisingly well at estimating their own knowledge most of the time.

EastLondonCoder 5mo ago

I’m with the people pushing back on the “confidence scores” framing, but I think the deeper issue is that we’re still stuck in the wrong mental model.

It’s tempting to think of a language model as a shallow search engine that happens to output text, but that metaphor doesn’t actually match what’s happening under the hood. A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

That’s why a confidence number that looks sensible can still be as made up as the underlying output, because both are just sequences of tokens tied to trained patterns, not anchored truth values. If you want truth, you want something that couples probability distributions to real world evidence sources and flags when it doesn’t have enough grounding to answer, ideally with explicit uncertainty, not hand‑waviness.

People talk about hallucination like it’s a bug that can be patched at the surface level. I think it’s actually a feature of the architecture we’re using: generating plausible continuations by design. You have to change the shape of the model or augment it with tooling that directly references verified knowledge sources before you get reliability that matters.

drclau 5mo ago

How do you know the confidence scores are not hallucinated as well?
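
One way to put that question to the test is to collect the scores a model states alongside whether each answer turned out to be right, then measure how far the two drift apart. The sketch below computes a simple expected calibration error; the function name, the binning scheme, and the `(confidence, correct)` record format are assumptions made for illustration, not something anyone in this thread describes using.

```python
from collections import defaultdict


def expected_calibration_error(records, n_bins=5):
    """records: iterable of (stated_confidence in [0, 1], was_correct bool)."""
    bins = defaultdict(list)
    for confidence, correct in records:
        # Clamp the index so that confidence == 1.0 lands in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    total = sum(len(items) for items in bins.values())
    ece = 0.0
    for items in bins.values():
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's confidence/accuracy gap by its share of the data.
        ece += (len(items) / total) * abs(mean_confidence - accuracy)
    return ece
```

A value near 0 means the stated scores track real accuracy on that sample; a large value means the numbers are not carrying much information, whatever they look like. It only speaks for the questions you actually graded.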

ryoshu 5mo ago

LLMs fail at causal accuracy. It's a fundamental problem with how they work.

kromokromo 5mo ago

Asking an LLM to give itself a «confidence score» is like asking a teenager to grade his own exam. LLMs don’t «feel» uncertainty and confidence like we do.

robocat 5mo ago

> wrong or misleading explanations

Exactly the same issue occurs with search.

Unfortunately not everybody knows to mistrust AI responses, or has the skills to double-check information.

darkwater 5mo ago

No, it's not the same. Search results send/show you one or more specific pages/websites. And each website has a different trust factor. Yes, plenty of people repeat things they "read on the Internet" as truths, but it's easy to debunk some of them just based on the site reputation. With AI responses, the reputation is shared with the good answers as well, because they do give good answers most of the time, but also hallucinate errors.

incrudible 5mo ago

If somebody asks a question on Stackoverflow, it is unlikely that a human who does not know the answer will take time out of their day to completely fabricate a plausible sounding answer.

lins1909 5mo ago

What is it about people making up lies to defend LLMs? In what world is it exactly the same as search? They're literally different things, since you get information from multiple sources and can do your own filtering.

actionfromafar 5mo ago

I wonder if the only way to fix this with current LLMs would be to generate a lot of synthetic data for a select number of topics you really don't want it to "go off the rails" with. That synthetic data would be lots of variations on "I don't know how to do X with Y".

dolmen 5mo ago

I would not bet on synthetic data.

LLMs are very good at detecting patterns.

RHSman2 5mo ago

The problem is not the intelligence of the LLM. It is the intelligence and desire to make things easy of the intelligence using them.

XCSme 5mo ago

But most benchmarks are not about that...

Are there even any "hallucination" public benchmarks?

andrepd 5mo ago

"Benchmarks" for LLMs are a total hoax, since you can train them on the benchmarks themselves.

basisword 5mo ago

I think the thing even worse than false information is the almost-correct information. You do a quick Google to confirm it's on the right page but find there's an important misunderstanding. These are so much harder to spot I think than the blatantly false.

fauigerzigerk 5mo ago

I agree, but the question is how better grounding can be achieved without a major research breakthrough.

I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only a handful of sources exist for some set of facts (e.g. laws of small countries or descriptions of niche products).

LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.

Now people will say that this is unavoidable given the way in which transformers work. And this is true.

But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference-time search/RAG) relative to their pretraining.

balder1991 5mo ago

Anything that is very specific has the same problem, because LLMs can’t have the same representation of all topics in the training. It doesn’t have to be too niche, just specific enough for it to start to fabricate it.

The other day I had a doubt about something related to how pointers work in Swift and I tried discussing it with ChatGPT (I don’t remember exactly what, but it was purely intellectual curiosity). It gave me a lot of explanations that seemed correct, but being skeptical, I started pushing it for ways to confirm what it was saying and eventually realized it was all bullshit.

This kind of thing makes me basically wary of using LLMs for anything that isn’t brainstorming, because anything that requires knowing information that isn’t easily/plentifully found online will likely be incorrect or have sprinkles of incorrect all over the explanations.

cachius 5mo ago

Grounding in search results is what Perplexity pioneered; Google also does it with AI Mode, and ChatGPT and others with their web search tools.

As a user I want it, but as a webadmin I see it kill dynamic pages, and that's why proof-of-work (aka CPU time) captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no, they just go and overrun the web. To the effect that they can't anymore, at the price of shutting down small sites and making life worse for everyone, just for a few months of rapacious crawling. Perplexity literally moved fast and broke things.
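
For readers who haven't met the idea, a proof-of-work gate can be sketched in a few lines. This is a generic hashcash-style scheme written for illustration; the function names, the challenge string, and the "leading zero hex digits" difficulty rule are assumptions, and it is not a description of how Anubis or BotID actually work.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Find a nonce whose SHA-256 digest starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Checking a solution costs one hash, however long solving took."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the whole point: the visitor burns CPU time in `solve`, the server spends one hash in `verify`. Each extra hex digit of difficulty multiplies the expected solving work by 16.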

cachius 5mo ago

This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.
I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.

From Tavis Ormandy who wrote a C program to solve the Anubis challenges out of browser https://lock.cmpxchg8b.com/anubis.html via https://news.ycombinator.com/item?id=45787775

Guess a mix of Markov tarpits and LLM meta instructions will be added, cf. Feed the bots https://news.ycombinator.com/item?id=45711094 and Nepenthes https://news.ycombinator.com/item?id=42725147

BatteryMountain 5mo ago

My biggest problem with LLMs at this point is that they produce different and inconsistent results or behave differently, given the same prompt. Better grounding would be amazing at this point. I want to give an LLM the same prompt on different days and I want to be able to trust that it will do the same thing as yesterday. Currently they misbehave multiple times a week and I have to manually steer them a bit, which destroys certain automated workflows completely.

fragmede 5mo ago

It sounds like you have dug into this problem with some depth so I would love to hear more. When you've tried to automate things, I'm guessing you've got a template and then some data and then the same or similar input gives totally different results? What details about how different the results are can you share? Are you asking for eg JSON output and it totally isn't, or is it a more subtle difference perhaps?

conception 5mo ago

You need to change the temperature to 0 and tune your prompts for automated workflows.
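
What temperature does at the sampling step can be sketched roughly as below. This is a toy illustration, assuming the model hands back a plain list of logits; `sample_token` is a made-up helper, not any provider's API, and it says nothing about other sources of variation in hosted inference.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick an index from `logits`; temperature 0 means greedy argmax."""
    if temperature == 0:
        # No randomness left: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With temperature 0 the same logits always give the same token. With higher temperatures the weights flatten out and lower-scoring tokens get picked more often, which is where run-to-run differences at this step come from.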

balder1991 5mo ago

It doesn’t really solve it as a slight shift in the prompt can have totally unpredictable results anyway. And if your prompt is always exactly the same, you’d just cache it and bypass the LLM anyway.

What would really be useful is if a very similar prompt always gave a very, very similar result.

dominotw 5mo ago

have you tried this? this doesn't work because of the way inference runs at big companies. it's not just running your query in isolation.

maybe it can work if you are running your own inference.

sebastiennight 5mo ago

> I want to give an LLM the same prompt on different days and I want to be able to trust that it will do the same thing as yesterday

Bad news, it's winter now in the Northern hemisphere, so expect all of our AIs to get slightly less performant as they emulate humans under-performing until Spring.

phorkyas82 5mo ago

Isn't that what no LLM can provide: being free of hallucinations?

arw0n 5mo ago

I think the better word is confabulation; fabricating plausible but false narratives based on wrong memory. Fundamentally, these models try to produce plausible text. With language models getting large, they start creating internal world models, and some research shows they actually have truth dimensions. [0]

I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.

Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen by how unreliable eye witness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).

[0] https://arxiv.org/abs/2407.12831

Tepix 5mo ago

> false narratives based on wrong memory

I don't think "wrong memory" is accurate, it's missing information and doesn't know it or is trained not to admit it.

Check out the Dwarkesh Podcast episode https://www.dwarkesh.com/p/sholto-trenton-2 starting at 1:45:38

Here is the relevant quote by Trenton Bricken from the transcript:

One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.

But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.

svara (OP) 5mo ago

That's right - it does seem to have to do with trying to be helpful.

One demo of this that reliably works for me:

Write a draft of something and ask the LLM to find the errors.

Correct the errors, repeat.

It will never stop finding a list of errors!

The first time around and maybe the second it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request of finding errors.

officialchicken 5mo ago

No, the correct word is hallucinating. That's the word everyone uses and has been using. While it might not be technically correct, everyone knows what it means and more importantly, it's not a $3 word and everyone can relate to the concept. I also prefer all the _other_ more accurate alternative words Wikipedia offers to describe it:

"In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting,[1][2] confabulation,[3] or delusion[4]) is"

kyletns 5mo ago

For the record, brains are also not free of hallucinations.

rimeice 5mo ago

I still don’t really get this argument/excuse for why it’s acceptable that LLMs hallucinate. These tools are meant to support us, but we end up with two parties who are, as you say, prone to “hallucination” and it becomes a situation of the blind leading the blind. Ideally in these scenarios there’s at least one party with a definitive or deterministic view so the other party (i.e. us) at least has some trust in the information they’re receiving and any decisions they make off the back of it.

andrei_says_ 5mo ago

How much do you hallucinate at work? How many of your work hallucinations do you confidently present as reality in communication or code?

LLMs are being sold as a viable replacement for paid employees.

If they were not, they wouldn’t be funded the way they are.

delaminator 5mo ago

That’s not a very useful observation though is it?

The purpose of mechanisation is to standardise and over the long term reduce errors to zero.

Otoh “The final truth is there is no truth”

krzyk 5mo ago

Hallucinations are not bad. They add some kind of creativity, which is good for e.g. image generation, coding, or storytelling.

They are bad only when reporting on facts.

svara (OP) 5mo ago

Yes, they'll probably not go away, but it's got to be possible to handle them better.

Gemini (the app) has a "mitigation" feature where it tries to do Google searches to support its statements. That doesn't currently work properly in my experience.

It also seems to be doing something where it adds references to statements (With a separate model? With a second pass over the output? Not sure how that works.). That works well where it adds them, but it often doesn't do it.

intended 5mo ago

Doubt it. I suspect it’s fundamentally not possible in the spirit you intend it.

Reality is perfectly fine with deception and inaccuracy. For language to magically be self constraining enough to only make verified statements is… impossible.

SecretDreams 5mo ago

Find me a human that doesn't occasionally talk out of their ass =[

svara (OP) 5mo ago

A part of it is reproducing incorrect information in the training data as well.

One area that I've found to be a great example of this is sports science.

Depending on how you ask, you can get a response lifted from scientific literature, or the bro science one, even in the course of the same discussion.

It makes sense, both have answers to similar questions and are very commonly repeated online.

sebastiennight 5mo ago

> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things,

Due to how LLMs are implemented, you are always most likely to get a bogus explanation if you ask for an answer first, and why second.

A useful mental model is: imagine if I presented you with a potential new recruit's complete data (resume, job history, recordings of the job interview, everything) but you only had 1 second to tell me "hired: YES OR NO"

And then, AFTER you answered that, I gave you 50 pages worth of space to tell me why your decision is right. You can't go back on that decision, so all you can do is justify it however you can.

Do you see how this would give radically different outcomes vs. giving you the 50-page scratchpad first to think things through, and then only giving me a YES/NO answer?

jillesvangurp 5mo ago

It's increasingly a space that is constrained by the tools and integrations. Models provide a lot of raw capability. But with the right tools even the simpler, less capable models become useful.

Mostly we're not trying to win a nobel prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punchcards is not a fun job. It's very repetitive work. Compilers used to be people compiling punchcards. Women mostly, actually. Because it was considered relatively low skilled work. Even though it arguably wasn't.

Some people are very unhappy that the easier parts of their job are being automated and they are worried that they will get automated away completely. That's only true if you exclusively do boring, repetitive, low value work. Then yes, your job is at risk. If your work is a mix of that and some higher value, non repetitive, and more fun stuff to work on, your life could get a lot more interesting. Because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I have lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a spare few hours.

Ironically, a lot of people currently get the worst of both worlds because they now find themselves babysitting AIs doing a lot more of the boring repetitive stuff than they would be able to do without them, to the point where that is actually all that they do. It's still boring and repetitive. And it should be automated away ultimately. Arguably many years ago actually. The reason so many React projects feel like Groundhog Day is because they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.

Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.

giancarlostoro 5mo ago

Yeah, in my case I want the coding models to be less stupid. I asked for multiple file uploading; it kept the original button and added a second one for additional files. When I pointed that out: “You're absolutely correct!” Well, why didn't you think of it before you cranked out code? I see coding agents as really capable junior devs, it's really funny. I don't mind it though, it saved me hours on my side project, if not weeks' worth of work.

withinboredom 5mo ago

I was using an LLM to summarize benchmarks for me, and I realized after a while it was omitting information that made the algorithm being benchmarked look bad. I'm glad I caught it early, before I went to my peers and was like "look at this amazing algorithm".

coffeecat 5mo ago

It's important not to assume that LLMs are giving you an impartial perspective on any given topic. The perspective you're most likely getting is that of whoever created the most training data related to that topic.

andai 5mo ago

So there's two levels to this problem.

Retrieval.

And then hallucination even in the face of perfect context.

Both are currently unsolved.

(Retrieval's doing pretty good but it's a Rube Goldberg machine of workarounds. I think the second problem is a much bigger issue.)

cachius 5mo ago

Re: retrieval: That's where the snake eats its tail. As AI slop floods the web, grounding is like laying a foundation in a swamp. And that Rube Goldberg machine tries to prevent the snake from reaching its tail. But RGs are brittle and not exactly the thing you want to build infrastructure on. Just look at https://news.ycombinator.com/item?id=46239752 for an example of how easily it can break.

jacquesm 5mo ago

There are four words that would make the output of any LLM instantly 1000x more useful and I haven't seen them yet: "I do not know."

f_k 5mo ago

> verifying their claims ends up taking time.

I've been working on this problem with https://citellm.com, specifically for PDFs.

Instead of relying on the LLM answer alone, each extracted field links to its source in the original document (page number + highlighted snippet + confidence score).

Checking any claim becomes simple: click and see the exact source.
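
To give a feel for what "each field links to its source" can mean in practice, here is a minimal sketch. The `ExtractedField` record and the `is_grounded` check are hypothetical names made up for this example, and it assumes the PDF has already been turned into one text string per page; it is not a description of how citellm is built.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    page: int          # 1-based page number in the source PDF
    snippet: str       # text the value was supposedly taken from
    confidence: float  # score reported by the extractor, 0 to 1


def is_grounded(field: ExtractedField, pages: list[str]) -> bool:
    """True if the cited snippet is on the cited page and contains the value."""
    if not 1 <= field.page <= len(pages):
        return False
    # Collapse whitespace so PDF line breaks don't cause false mismatches.
    page_text = " ".join(pages[field.page - 1].split())
    snippet = " ".join(field.snippet.split())
    return snippet in page_text and field.value in snippet
```

A check like this can't tell you the value is right, only that the citation points at text that really exists and really contains it. Anything that fails can be flagged for a human to look at instead of being trusted on the strength of a confidence number.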

rafaelmn 5mo ago

I constantly see top models (opus 4.5, gemini 3) get a stroke mid task - they will solve the problem correctly in one place, or have a correct solution that needs to be reapplied in context - and then completely miss the mark in another place. "Lack of intelligence" is very much a limiting factor. Gemini especially will get into random reasoning loops - reading thinking traces - it gets unhinged pretty fast.

Not to mention it's super easy to gaslight these models: just assert something wrong with a vaguely plausible explanation and you get no pushback or reasoning validation.

So I know you qualified your post with "for your use case", but personally I would very much like more intelligence from LLMs.

virtuosarmo 5mo ago

I've had better success finding information using Google Gemini vs. ChatGPT. E.g. someone mentions to me the name of someone or some company, but doesn't give the full details (e.g. Joe @ XYZ Company doing this, or this company with 10,000 people, in ABC industry)... sometimes I don't remember the full name. Gemini has been more effective for me in filling in the gaps and doing fuzzy search. I even asked ChatGPT why this was the case, and it affirmed my experience, saying that Gemini is better for these queries because of Search integration, Knowledge Graph, etc. Especially useful for recent role changes, which haven't been propagated through other channels on a widespread basis.

HeavyStorm 5mo ago

All of them are heavily invested in improving grounding. The money isn't in personal use but in enterprise customers, and for those, grounding is essential.

anentropic 5mo ago

Yeah I basically always use the "web search" option in ChatGPT for this reason, if not using one of the more advanced modes.

BrtByte 5mo ago

I'm pretty much in the same camp. For a lot of everyday use, raw "intelligence" already feels good enough.
