Being “Confidently Wrong” is holding AI back (opens in new tab)

(promptql.io)

155 pointstango129mo ago262 comments

262 comments

While the thrust of this article is generally correct, I have two issues with it:

1. The words "the only thing" massively underplays the difficulty of this problem. It's not a small thing.

2. One of the issues I've seen with a lot of chat LLMs is their willingness to correct themselves when asked - this might seem, on the surface, to be a positive (allowing a user to steer the AI toward a more accurate or appropriate solution), but in reality it simply plays into users' biases & makes it more likely that the user will accept & approve of incorrect responses from the AI. Often, rather than "correcting" itself it merely "teaches" the AI how to be confidently wrong in an amenable & subtle manner which the individual user finds easy to accept (or more difficult to spot).

If anything, unless/until we can solve the (insurmountable) problem of AI being wrong, AI should at least be trained to be confidently & stubbornly wrong (or right). This would also likely lead to better consistency in testing.

traceroute669mo ago

> is their willingness to correct themselves when asked

Except they don't correct themselves when asked.

I'm sure we've all been there, many, many, many,many,many times ....

   - User: "This is wrong because X"
   - AI: "You're absolutely right !  Here's a production-ready fixed answer"
   - User: "No, that's wrong because Y"
   - AI: "I apologise for frustrating you ! Here's a robust answer that works"
   - User: "You idiot, you just put X back in there"
   - and so continues the vicious circle....

ACCount379mo ago

1-turn instruction following and multi-turn instruction following are not the same exact capability, and some AIs only "get good" at the former. 1-turn gets more training attention - because it's more noticeable, in casual use and benchmarks both, and also easier to train for.

With weak multi-turn instruction following, context data will often dominate over user instructions. Resulting in very "loopy" AI - and more sessions that are easier to restart from scratch than to "fix".

Gemini is notorious for underperforming at this, while Claude has relatively good performance. I expect that many models from lesser known providers would also have a multi-turn instruction following gap.

vidarh9mo ago

This is a good point, and to drive this home to people, if you have a conversation of this pattern:

    User: Fix this problem ...
    Assistant: X
    User: No, don't do X
    Assistant: Y
    User: No, Y is wrong too.
    Assistant: X

It is generally pointless to continue. You now have a context that is full of the assistant explaining to you and itself why X and Y are the right answers, and much less context of you explaining why it is wrong.

If you reach that state, start over, and constrain your initial request to exclude X and Y. If it brings up either again, start over, and constrain your request further.

If the model is bad at handling multiple turns without getting into a loop, telling it that it is wrong is not generally going to achieve anything, but starting over with better instructions often will.

I see so many people get stuck "arguing" with a model over this, getting more and more frustrated as the model keeps repeating variations of the broken answer, without realising they're filling the context with arguments from the model for why the broken answer is right.

5 more replies

rtkwe9mo ago

I have this problem all the time with minor image edits on ChatGPT the few time's I've tried it. Any time I try to do a second edit or change to the generated image it seems to take the already degraded output from it's first attempt and use that instead of the original image.

stetrain9mo ago

Yep, the LLM will happily continue this spiral indefinitely but I've learned that if providing a bit more context and one correction doesn't provide a good solution, continuing is generally a waste of time.

They tend to very quickly lose useful context of the original problem and stated goals.

nyeah9mo ago

Yes, that is the point of the comment.

1 more reply

therobots9279mo ago

Yeah I think our jobs are safe. Why doesn’t anyone acknowledge loops like this? They happen all the time and I’m only using it once a week at the most

gavinray9mo ago

  > Yeah I think our jobs are safe.

I give myself 6-18 months before I think top-performing LLM's can do 80% of the day-to-day issues I'm assigned.

  > Why doesn’t anyone acknowledge loops like this?

Thisis something you run into early-on using LLM's and learn to sidestep. This looping is a sort of "context-rot" -- the agent has the problem statement as part of it's input, and then a series of incorrect solutions.

Now what you've got is a junk-soup where the original problem is buried somewhere in the pile.

Best approach I've found is to start a fresh conversation with the original problem statement and any improvements/negative reinforcements you've gotten out of the LLM tacked on.

I typically have ChatGPT 5 Thinking, Claude 4.1 Opus, Grok 4, and Gemini 2.5 Pro all churning on the same question at once and then copy-pasting relevant improvements across each.

6 more replies

RyanOD9mo ago

But still under pressure in the short-term, no? As companies lean into AI as a means of efficiency / competitive advantage / cost savings, jobs will be eliminated / reduced while companies find their direction. The potential gains are said to be too big to sit on the sidelines and wait to be a late-adopter.

1 more reply

jansper399mo ago

Honestly when I speak about these sorts of issues I get the feeling that other people view me as some kind of luddite, especially people above me who presumably want to replace as many people with AI as possible. I suppose me pointing out the flaws breaks the illusion of magic that people want AI to have.

1 more reply

vidarh9mo ago

Because it's easy to learn to stop engaging with those loops, treating them as a sign you provided too little context, and instead start a new conversation with an expanded prompt.

It doesn't mean these loops aren't an issue, because they are, but once you stop engaging with them and cut them off, they're a nuisance rather than a showstopper.

1 more reply

traceroute669mo ago

The AI-fanbois will quickly tell you that you are misusing the context or your prompt is "wrong".

But I've had it consistently happens to me on tiny contexts (e.g. I've had to spend time trying - and failing - to get it to fix a mess it was making with a straightforward 200-ish line bash script).

And its also very frequently happened to me when I've been very careful with my prompts (e.g. explicitly telling it to use a specific version of a specific library ... and it goes and ignores me completely and picks some random library).

1 more reply

dkersten9mo ago

My pet peace is when I point out a problem it responds with acknowledgement and then explaining why it’s wrong. Like… I already know why it’s wrong, since I’m the one that pointed it out!

brookst9mo ago

You're conflating "correct themselves" with "are guaranteed to give the correct answer", which are two really different things. And in fact you're just echoing GP's point: their corrections can be wrong.

You case is no different from:

- AI: "The capital of France is Paris"

- User: "This is wrong, it changed to Montreal in 2005"

- AI: "You're absolutely right! The capital of France is Montreal"

kelseyfrog9mo ago

Instead I get this:

    Nope—Paris is the capital of France and has been for centuries. Montreal is in Quebec, Canada. France’s presidency (Élysée), parliament (Assemblée nationale and Sénat), and ministries are all in Paris.

1 more reply

zero_iq9mo ago

I've seen ChatGPT get stuck in this loop all by itself, generating a long multi-page answer where it constantly catches itself, refutes itself, offers a new answer with the same problem, rinse and repeat... All in the same response!

lucideer9mo ago

True. This also often happens.

Probably the ideal would be to have a UI / non-chat-based mechanism for discarding select context.

Wowfunhappy9mo ago

...I don't know why, but I swear to god, when Claude gets into one of these cycles I can often get it out by dropping the f-bomb, with maybe a 50% success rate. Something about that word lets it know that it needs to break the pattern.

dns_snek9mo ago

> but in reality it simply plays into users' biases & makes it more likely that the user will accept & approve of incorrect responses from the AI.

Yes! I often find myself overthinking my phrasing to the nth degree because I've learned that even a sprinkle of bias can often make the LLM run in that direction even if it's not the correct answer.

It often feels a bit like interacting with a deeply unstable and insecure people pleasing person. I can't say anything that could possibly be interpreted as a disagreement because they'll immediately flip the script, I can't mention that I like pizza before asking them what their favorite food is because they'll just mirror me.

stingraycharles9mo ago

> 1. The words "the only thing" massively underplays the difficulty of this problem. It's not a small thing.

Exactly. One could argue that this is just an artifact from the fundamental technique being used: it’s a really fancy autocomplete based on a huge context window.

People still think there’s actual intelligence in there, while the actual problems by making these systems appear intelligent is mostly algorithms and software managing exactly what goes into these context windows at what place.

Don’t get me wrong: it feels like magic. But I would argue that the only way to recognize a model being “confidently wrong” is to let another model, trained on completely different datasets with different techniques, judge them. And then preferably multiple.

(This is actually a feature of an MCP tool I use, “consensus” from zen-mcp-server, which enables you to query multiple different models to reach a consensus on a certain problem / solution).

tango12OP9mo ago

The AI being wrong problem is probably not insurmountable.

Humans have meta-cognition that helps them judge if they're doing a thing with lots of assumptions vs doing something that's blessed.

Humans decouple planning from execution right? Not fully but we choose when to separate it and when to not.

If we had enough data on here's a good plan given user context and here's a bad plan, it doesn't seem unreasonable to have a pretty reliable meta cognition capability on the goodness of a plan.

procaryote9mo ago

Depending on your definitions, either:

* there are already lots of "reasoning" models trying meta-cognition, while still getting simple things wrong

or:

* the models aren't doing cognition, so meta-cognition seems very far away

energy1239mo ago

Mechanistic interpretability could play a role here. The sycophancy you describe in chat mode could be when the question is "too difficult" and the AI defaults to easy circuits that rely on simple rule of thumbs (like does the context contain positive words such as "excellent"). The user experiences this as the AI just following basic nudges.

Could real-time observability into the network's internals somehow feed back into the model to reduce these hallucination-inducing shortcuts? Like train the system to detect when a shortcut is being used, then do something about it?

ninetyninenine9mo ago

It’s not massively underplaying it imo. AI hype is real. This is revolutionary technology that humanity has never seen before.

But it happened at a time where hype can be delivered at a magnitude never before seen by humanity as well to a degree of volume that is completely unnatural by any standard set previously by hype machines created by humanity. Not even landing on the moon has inundated people with as much hype. But inevitably like landing on the moon, humanity is suffering from hype fatigue.

Too much hype makes us numb to the reality of how insane the technology is.

Like when someone says the only thing stopping LLMs is hallucinations… that is literally the last gap. LLMs cover creativity, comprehension, analysis, knowledge and much more. Hallucinations is it. The final problem is targeted and boxed into something much more narrower then just build a human level AI from scratch.

Don’t get me wrong. Hallucinations are hard. But this being the last thing left is not an underplay. Yes it’s a massive issue but yes it is also a massive achievement to reduce all of agi to simply solving just an hallucination problem.

NoGravitas9mo ago

> Like when someone says the only thing stopping LLMs is hallucinations… that is literally the last gap.

What you are missing here is that the "hallucinations" you don't like and the "results" you do like are, in terms of the underlying process, exactly the same thing. They are not an aberration you can remove. Producing these kinds of results without "hallucinations" is going to require fundamentally different techniques. It's not a "last gap".

ninetyninenine9mo ago

That's not true. There is we just need to find it.

Humans have a condition called schizophrenia where we literally are incapable of differentiating hallucination and reality. What that capability is, is something we need to find out and discover for both ourselves and LLMs.

For example: Mathematically speaking it's possible to know how far away an inferenced point is away from a cluster of real world data. That delta when fed back into the neural network can allow the LLM to know how speculative a response is. From there we can feed the response back into itself for refinement.

1 more reply

indigoabstract9mo ago

I think you would have really enjoyed living in the '50s, when the future was bright and colonizing Mars was basically a solved problem.

What we got instead is a bunch of wisecracking programmers who like to remind everyone of the 90–90 rule, or the last 10 percent.

ninetyninenine9mo ago

Mars wasn't part of the 90-90 rule genius. If you know your history, it's mainly because political interest was lost. Technologically sending someone to mars is 100000000x more feasible than agi simply because the technology and theory exists such that we can do it.

AGI is part of the last 10 percent rule. But like that's the entire issue. 90% is still 90% progress. That is massive. And the hype surrounding LLMs has made people forget how far 90% is. People are going, "LLMs are retarded because it has the IQ of a 5 year old". They don't realize how even getting it to the level of a 5 year old was impossible for decades and decades.

ninetyninenine9mo ago

Wisecracking programmers lol. You talk as if programming is like something to be proud of. It’s one of the most lucrative jobs with ease of entry as a boot camp can turn someone from zero to hero in a year.

And then you mouth off a buzz phrase not even coined by a programmer but repeated to the point of annoyance about how the final 10 percent is always the hardest as if programmers who copy the phrase are so smart.

Bro the last 10 percent being the hardest doesn’t mean the previous 90 percent didn’t happen. The first 90 percent is a feat in itself and LLMs can now even do PRs. That was a feat no one just 5 years ago could have predicted was possible in our lifetimes.

Idiot programmers and their generic wise cracks were the ones saying that AI would never be able to pass the Turing test and this was just 4 years ago.

taco_emoji9mo ago

Oh, buddy, LLM hallucinations are not the only gap left for AGI

ninetyninenine9mo ago

It is. After that it's virtually indistinguishable from chatting with a human

stetrain9mo ago

Yes, the quick to correct itself isn't really useful. I would not like a human assistant/intern/pair programmer who when asked how to do X said:

> To accomplish X you can just use Y!

But Y isn't applicable in this scenario.

> Oh, you're absolutely right! Instead of Y you can do Z.

Are you sure? I don't think Z accomplishes X.

> On second thought you're absolutely correct. Y or Z will clearly not accomplish X, but let's try Q....

sfn429mo ago

Being confidently wrong isn't even the problem. It's a symptom of the much deeper problem that these things aren't AI at all, they're just atocomplete bots good enough to kind of seem like AI. There's no actual intelligence. That's the problem.

indigoabstract9mo ago

I think what matters most is that we now know that it's possible, that a computer mimicking most of our abilities (but not all) which we have long considered intelligent is obviously possible in some indeterminate future.

It's not obvious how long until that point or what form it will finally take, but it should be obvious that it's going to happen at some point.

My speculation is that until AI starts having senses like sight, hearing, touch and the ability to learn from experience, it will always be just a tool/help/aider to someone doing a job, but could not possibly replace that person in that job as it lacks the essential feedback mechanisms for successfully doing that job in the first place.

Workaccount29mo ago

My favorite "paper" on AI pretty accurately describes this line of thinking

https://ai.vixra.org/pdf/2506.0065v1.pdf

ninetyninenine9mo ago

No. The experts in the field are past this argument. People have moved on. It is clear to everyone who builds LLMs that the AI is intelligent. The algorithm was autocomplete, but we are finding as an autocomplete bot is basically autocompleting things with humanity changing intelligent content. Your opinion is a minority now and not shared by people on the forefront of building these things. Your holding onto the initial fever pitched alarmist reaction people had to LLMs when it first came out.

Like you realize humans hallucinate too right? And that there are humans that have a disease that makes them hallucinate constantly.

Hallucinations don’t preclude humans from being “intelligent”. It also doesn’t preclude the LLM from being intelligent.

dns_snek9mo ago

> Your opinion is a minority now and not shared by people on the forefront of building these things.

Minority != wrong, with many historic examples that imploded in spectacular fashion. People at the forefront of building these things aren't immune from grandiose beliefs, many of them are practically predisposed to them. They also have a vested interest in perpetuating the hype to secure their generational wealth.

1 more reply

CodexArcanum9mo ago

LLMS don't "hallucinate" they generate a stochastic sequence of plausible tokens that, in context when read by a human, are a false statement or nonsensical.

They also dont have an internal world model. Well I don't think so, but the debate is far from settled. "Experts" like the cofounders of various AI companies (whose livelihood depends on selling these things) seem to believe that. Others do not.

https://aiguide.substack.com/p/llms-and-world-models-part-1

https://yosefk.com/blog/llms-arent-world-models.html

1 more reply

kilpikaarna9mo ago

> It is clear to everyone who builds LLMs that the AI is intelligent.

So presumably we have a solid, generally-agreed-upon definition on intelligence now?

> autocompleting things with humanity changing intelligent content.

What does this even mean?

2 more replies

NoGravitas9mo ago

IOW, the realistic position is not held by the majority of people whose paychecks depend on it being wrong. I'm shocked.

2 more replies

mcv9mo ago

Yeah, but how much of that is wishful thinking? If your job depends on believing this is real intelligence, you're more likely to believe that.

eCa9mo ago

> Like you realize humans hallucinate too right?

A developer that hallucinates at work to the extent that LLMs does would probably have issues getting their PRs past code reviews a lot.

2 more replies

burnte9mo ago

I'm just so glad people are seeing this. I started saying this literally days after ChatGPT came out and I started examining the technology. It's SUPER useful, but it's assistive, it can't be trusted to do things autonomously yet. That's ok, though, it can make human workers more productive, rather than worrying about replacing humans.

KoolKat239mo ago

Gemini 2.5 pro is quite good at being stubborn (well at least the initial release versions, haven't tested since).

decentrality9mo ago

Agreed with #1 ( came here to say that also )

Pronoun and noun wordplay aside ( 'Their' ... `themselves` ) I also agree that LLMs can correct the path being taken, regenerate better, etc...

But the idea that 'AI' needs to be _stubbornly_ wrong ( more human in the worst way ) is a bad idea. There is a fundamental showing, and it is being missed.

What is the context reality? Where is this prompt/response taking place? Almost guaranteed to be going on in a context which is itself violated or broken; such as with `Open Web UI` in a conservative example: Who even cares if we get the responses right? Now we have 'right' responses in a cul-de-sac universe. This might be worthwhile using `Ollama` in `Zed` for example, but for what purpose? An agentic process that is going to be audited anyway, because we always need to understand the code? And if we are talking about decision-making processes in a corporate system strategy... now we are fully down the rabbit hole. The corporate context itself is coming or going on whether it is right/wrong, good/evil, etc... as the entire point of what is going on there. The entire world is already beating that corporation to death or not, or it is beating the world to death or not... so the 'AI' aspect is more of an accelerant of an underlying dynamic, and if we stand back... what corporation is not already stubbornly wrong, on average?

taco_emoji9mo ago

> Pronoun and noun wordplay aside ( 'Their' ... `themselves` )

How is that wordplay? Those are the correct pronouns.

corytheboyd9mo ago

Isn’t it obvious that the confidently wrong problem will never go away because all of this is effectively built on a statistical next token matcher? Yeah sure you can throw on hacks like RAG, more context window, but it’s still built on the same foundation.

It’s like saying you built a 3D scene on a 2D plane. You can employ clever tricks to make 2D look 3D at the right angle, buts it’s fundamentally not 3D, which obviously shows when you take the 2D thing and turn it.

It seems like the effectiveness plateau of these hacks will soon be (has been?) reached and the smoke and mirrors snake oil sales booths cluttering Main Street will start to go away. Still a useful piece of tech, just, not for every-fucking-thing.

raynr9mo ago

As a layman, this too strikes me as the problem underlying the "confidently wrong" problem.

The author proposes ways for an AI to signal when it is wrong and to learn from its mistakes. But that mechanism feeds back to the core next token matcher. Isn't this just replicating the problem with extra steps?

I feel like this is a framing problem. It's not that an LLM is mostly correct and just sometimes confabulates or is "confidently wrong". It's that an LLM is confabulating all the time, and all the techniques thrown at it do is increase the measured incidence of LLM confabulations matching expected benchmark answers.

jdbernard9mo ago

It seems obvious to me, but there was a camp that thought, at least at one time, that probabilistic next token could be effectively what humans are doing anyways, just scaled up several more orders of magnitude. It always felt obvious to me that there was more to human cognition than just very sophisticated pattern matching, so I'm not surprised that these approaches are hitting walls.

yifanl9mo ago

There are people convinced that if we throw a sufficient amount of training data and VC money at more hardware, we'll overcome the gap.

Technically, I can't prove that they're wrong, novel solutions sometimes happen, and I guess the calculus is that it's likely enough to justify a trillion dollars down the hole.

gavinray9mo ago

There's a guy, Ken Stanley, who wrote the NEAT[0]/HyperNEAT[1] algorithms.

His big idea is that evolution/advancements don't happen incrementally, but rather in unpredictable large leaps.

He wrote a whole book about it that's pretty solid IMO: "Why Greatness Cannot Be Planned: The Myth of the Objective."

[0] https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_t... [1] https://en.wikipedia.org/wiki/HyperNEAT

delichon9mo ago

https://en.wikipedia.org/wiki/Saltation_(biology)

1 more reply

devin9mo ago

Whenever I try to tell people about the myth of the objective they look at me like I'm insane. It's not very popular to tell people that their best laid plans are actually part of the problem.

yifanl9mo ago

I would suspect that any next step comes with a novel implementation though, not just trying to scale the same shit to infinity.

I guess the bitter lesson is gospel now, which doesn't sit right with me now that we're past the stage of Moore's Law being relevant, but I'm not the one with a trillion dollars, so I don't matter.

corytheboyd9mo ago

I’d say it was worth throwing down some cash for, because we get cool new things by full-assing new ideas. But… yeah… a TRILLION dollars is waaaay too far.

kovacs9mo ago

This is the best analogy I've read to explain what's going on and takes me back to the days of Doom and how it was so transformative at the time. Perhaps in time the current generation will be viewed as the Doom engine as we await the holy grail of full 3D in Quake.

corytheboyd9mo ago

I guess technically 3D on computers is still clever 2D, but let’s not break the metaphor down too hard lol. Love the Doom/Quake comparison!

gus_massa9mo ago

It's easy to solve if they modify they training to remove some weight from Stack Overflow and add more weight to Yahoo! Answers :) .

I remember a few years ago, we were planing to make some kind of math forum for students in the first year of the university. My opinion was that it was too easy to do it wrong. On one way you can be like Math Overflow were all the questions are too technical (for first year of the university) and all the answers are too technical (first year of the university). On the other way, you can be like Yahoo! Answers, where more than half of the answers were "I don't know", with many "I don't know" per question.

For the AI, you want to give it some room to generalize/bullshit. It one page says that "X was a few months before Z" and another page says that "Y was a few days before Z", than you want an hallucinated reply that says that "X happened before Y".

On the other hand, you want the AI to say "I don't know.". They just gave too little weight to the questions that are still open. Do you know a good forum where people post questions that are still open?

corytheboyd9mo ago

> For the AI, you want to give it some room to generalize/bullshit.

Totally! In my mind I’ve been playing with the phrase: it’s good at _fuzzy_ things. For example IMO voice synthesis before and after this wave of AI hype is actually night and day! In part, to my fuzzy idea, because voice synthesis isn’t factual, it’s billions of little data points coming together to emulate sound waves, which is incredibly fuzzy. Versus code, which is pointy: it has one/few correct forms, and infinite/many incorrect forms.

aidenn09mo ago

I mean if it's trained on things like Reddit then it's just reflecting its training data. I asked a question on reddit just yesterday and the only response I got was confidently wrong. This is not the first time it has happened.

rwmj9mo ago

Only thing? Just off the top of my head: That the LLM doesn't learn incrementally from previous encounters. That we appear to have run out of training data. That we seem to have hit a scaling wall (reflected in the performance of GPT5).

I predict we'll get a few research breakthroughs in the next few years that will make articles like this seem ridiculous.

energy1239mo ago

Re online learning - If I freeze 40 yo Einstein and make it so he can't form new memories beyond 5 minutes, that's still an incredibly useful, generally intelligent thing. Doesn't seem like a problem that needs to be solved on the critical path to AGI.

Re training data - We have synthetic data, and we probably haven't hit a wall. Gpt-5 was only 3.5 months after o3. People are reading too much into the tea leaves here. We don't have visibility into the cost of Gpt-5 relative to o3. If it's 20% cheaper, that's the opposite of a wall, that's exponential like improvement. We don't have visibility into the IMO/IOI medal winning models. All I see are people curve fitting onto very limited information.

tliltocatl9mo ago

> If I freeze 40 yo Einstein and make it so he can't form new memories beyond 5 minutes, that's still an incredibly useful, generally intelligent thing

A "frozen mind" feels like something not unlike a book - useful, but only with a smart enough "human user", and even so be progressively less useful as time passes.

>Doesn't seem like a problem that needs to be solved on the critical path to AGI.

It definitely is one. I know we are running into definitions, but being able to form novel behavior patterns based on experience is pretty much the essence of what intelligence is. That doesn't necessary mean that a "frozen mind" will be useless, but it would certainly not qualify as AGI.

>We don't have visibility into the IMO/IOI medal winning models.

There are lies, damn lies and LLM benchmarks. IMO/IOI is not necessarily indicative of any useful tasks.

Dilettante_9mo ago

>If I freeze 40 yo Einstein and make it so he can't form new memories beyond 5 minutes, that's still an incredibly useful, generally intelligent thing.

But every time you tried to get him to do something you'd have to teach him from first principles. Good luck getting ChatStein to interact with the internet, to write code or design a modern airplane. Even in physics, he'd be using antiquated methods and assumptions, this getting worse as time progresses(like sib comment I believe was alluding to).

And don't even get me started on the language barrier.

I recently read this short story[1] on the topic so it's fresh on my mind.

[1]https://qntm.org/mmacevedo

selfmodruntime9mo ago

Never before did we have a combination of well and poison where the pollution of the first was both as instantaneous and as easily achieved.

I‘ve yet to see a convincing article for artificial training data.

AIPedant9mo ago

It does seem like it helps with math, but in a way that demonstrates the futility of the enterprise: "after training the LLM on 10,000,000 examples of K-8 arithmetic it is now superhuman up to 12 digits, after which it falls off a cliff. Also it demonstrably doesn't understand what 'four' means conceptually and it still fails on many trivial counting problems."

ausbah9mo ago

yeah like another commenter said, if you can get synthetic data with some some sort of easily verifiable grounding (math, games, code) models can do very well. this is one of the underpinnings of reinforcement learning that has helped some advancements in past year or so (AFAIK)

tliltocatl9mo ago

> LLM doesn't learn incrementally from previous encounters

This. Lack of any way to incorporate previous experience seems like the main problem. Humans are often confidently wrong as well - and avoiding being confidently wrong is actually something one must learn rather than an innate capability. But humans wouldn't repeat same mistake indefinitely.

ACCount379mo ago

You can gather feedback from inference and funnel that back into model training. It's just very, very hard to do that without shooting yourself in the foot.

The feedback you get is incredibly entangled, and disentangling it to get at the signals that would be beneficial for training is nowhere near a solved task.

Even OpenAI has managed to fuck up there - by accidentally training 4o to be a fully bootlickmaxxed synthetic sycophant. Then they struggled to fix that for a while, and only made good progress at that with GPT-5.

FergusArgyll9mo ago

The problem is the kinds of "data" users will feed it. It's basically an impossible task to put a continuous learning model online and not have it devolve into the optimal mix of stalin & hitler

lvl1559mo ago

Incrementally learning model is pretty hard. That’s actually something I am working on right now and it’s completely different from developing/implementing LLMs.

criddell9mo ago

I think that's what it's going to take. Eventually put the learning model in a robot body and send it out into the real world where there's no shortage of training data.

Tiktaalik9mo ago

Yea it'll learn real quick what falling in a ravine is like

tliltocatl9mo ago

Cool, got any previous work to share?

impossiblefork9mo ago

Having run out of training data isn't something holding back LLMs in this sense.

But I agree that being confidently wrong is not the only thing they can't do. Programming, great, maths, apparently great nowadays, since Google and OpenAI have something that could solve most problems on the IMO, even if the models we get to see probably aren't models that can do this, but LLMs produce crazy output when asked to produce stories, they produce crazy output when given too long confusing contexts and have some other problems of that sort.

I think much of it is solvable. I certainly have ideas about how it can be done.

traceroute669mo ago

> That we appear to have run out of training data.

I think the next iteration of LLM is going to be "interesting", i.e. now that all the websites they used to freely scrape have been increasingly putting up walls.

procaryote9mo ago

Also that no companies involved seem to be making a profit, have a reasonable vision to make a profit, or even revenue in the same ballpark as costs.

Except nvidia perhaps

Dilettante_9mo ago

But that's not so unusual these days, no? I'm not terribly knowledgeable in the field, but don't Amazon, Netflix, Uber and such work on this kind of funding structure?

mrguyorama9mo ago

Uh no, Netflix runs a real business with real fundamentals and do things like invest in their content creation pipeline to provide their customers a better reason to stay, and their price to consumers is set by their costs and desired profit margin, and they have run a pretty straightforward business since the very beginning.

Uber successfully turned a war chest into a partial monopoly of all ride hailing for significant chunks of the world. That was the clear plan from the start, and was always meant to own the market so they could extract whatever rent they want.

Amazon reinvested heavily while competition floundered in order to literally own the market, and has spent every second since squeezing the market, their partners, everyone in the chain to extract ever more money from it.

None of those are even close to buying absurdly overpriced hardware from a monopoly and reselling access to that hardware for less than it costs to run and doing huge PR sweeps about how what you are building might kill everyone so we should obviously give them trillions in government dollars because if an American company isn't the one to kill everyone than we have failed.

"Not terribly knowledgeable in the field"?

tango12OP9mo ago

Author here.

You’re right in that it’s obviously not the only problem.

But without solving this seems like no matter how good the models get it’ll never be enough.

Or, yes, the biggest research breakthrough we need is reliable calibrated confidence. And that’ll allow existing models as they are to become spectacularly more useful.

BitsAndObjects9mo ago

The biggest breakthrough that we need is something resembling actual intelligence in AI (human or machine, I’ll let you decide where we need it more ;) )

binarymax9mo ago

You might be getting downvoted because you editorialized your own title. If it’s obviously not the only thing then don’t add that to the title :)

mettamage9mo ago

> Only thing? Just off the top of my head: That the LLM doesn't learn incrementally from previous encounters. That we appear to have run out of training data.

Ha, that almost seems like an oxymoron. The previous encounters can be the new training data!

jazzyjackson9mo ago

The old training was human responses to human questions. From this the bot learned to mimick human responses.

What would be the point of training an LLM on bot answers to human questions? This is only useful if you want to get an LLM that behaves like an already existing LLm

selfmodruntime9mo ago

Queries are questions in a sense that they are not the original facts. I don’t think they are useful for training data.

harsh31959mo ago

In terms of adoption, I think the user is right. That is the only thing stopping adoption of existing models in the real world.

moduspol9mo ago

Unclear limits on how much context can be reliably provided and effectively used without degrading the result.

ninetyninenine9mo ago

It does. We keep a section of the context window for memory. The LLM however is the one deciding what is remembered. Technically via the system prompt we can have it remember every prompt if needed.

But memory is a minor thing. Talking to a knowledgeable librarian or professor you never met is the level we essentially need to get it to for this stuff to take off.

firesteelrain9mo ago

> That we appear to have run out of training data

And now, in some cases for a while, it is training on its own slop.

lazide9mo ago

The article is the peak of confidently wrong itself, for solid irony points.

therobots9279mo ago

Apple released this recently: https://machinelearning.apple.com/research/illusion-of-think...

roxolotl9mo ago

The big thing here is that they can’t even be confident. There is no there there. They are a, admittedly very useful, statistical model. Ascribing confidence to it is an anthropomorphizing mistake which is easy to make since we’re wired to trust text that feels human.

They are at their most useful when it is cheaper to verify their output than it is to generate it yourself. That’s why code is rather ok; you can run it. But once validation becomes more expensive than doing it yourself, be it code or otherwise, their usefulness drops off significantly.

projektfu9mo ago

The article buries the lede by waiting until the very end to talk about solutions like having the LLM write DSL code. Presumably if you feed an LLM your orders table and a question about it, you'll get an answer that you can't trust. But if you ask it to write some SQL or similar thing based on your database to get the answer and run it, you can have more confidence.

IsTom9mo ago

Until it mishandles a NULL somewhere in a condition on does JOIN instead of a LEFT JOIN and outputs something plausibly-looking that is just plain wrong. To verify it you'll need to do the work that it would take to write it anyway.

projektfu9mo ago

I disagree, both because LLMs can be less likely to make those errors than a lot of humans, and because it's easier for me to review and critique its code than to review my own. I can also have a basis for testing, and I can tell it to fix problems in the code rather than having it make up a new answer.

If what I am doing is summarizing data and it will likely have uncertainty as a result, I can include statistics in the specification of what I want.

I have also been impressed from time to time where Claude Code catches a mistake I would have written. For example, I asked it to create a configuration file with some names of my staff to use for a query. It then ran the query and noticed that one name I gave was not in the database, but that there was a similar name, and it recommended changing the config.

I am pessimistic about whether these tools are intelligent or will ever achieve intelligence, but where they are useful, we should use them.

z3c09mo ago

Agreed. All these attempts to benchmark LLM performance based on the interpreted validity of the outputs are completely misguided. It may be the semantics of "context" causing people to anthropomorphize the models (besides the lifelike outputs.) Establishing context for humans is the process of holding external stimuli against an internal model of reality. Context for an LLM is literally just "the last n tokens". In that case, the performance would be how valid the most probablistic token was with the prior n tokens being present, which really has nothing to do with the perceived correctness of the output.

hodgehog119mo ago

But as a statistical model, it should be able to report some notion of statistical uncertainty, not necessarily in its next-token outputs, but just as a separate measure. Unfortunately, there really doesn't seem to be a lot of effort going into this.

PessimalDecimal9mo ago

Even then, wouldn't its uncertainty be about the probability of the output given the input? That's different from probability of being correct in some factual sense. At least for this class of models.

hodgehog119mo ago

There are many types of model uncertainty, but factual errors should play a role in conditional uncertainties. If you do it right, then you can report when the output is truly veering into out-of-distribution territory.

z3c09mo ago

The statistical certainty is indeed present in the model. Each token comes with a probablility; if your softmax results approach a uniform distribution (i.e. all selected tokens at the given temp have near equal probabilities), then the next most likely token is very uncertain. Reporting the probabilities of the returned tokens can help the user understand how likely hallucinations are. However, that information is deliberately obfuscated now, to prevent distillation techniques.

esafak9mo ago

That is not the same thing! You are talking about the point distribution of the next token. We are talking about the uncertainty associated with each of those candidate tokens; a distribution of distributions.

It's the difference between a categorical distribution and a Dirichlet. https://en.wikipedia.org/wiki/Dirichlet_distribution

1 more reply

NoGravitas9mo ago

The thing holding AI back is that LLMS are not world models, and do not have world models. Being confidently wrong is just a side effect of that. You need a model of the world to be uncertain about. Without one, you have no way to estimate whether your next predicted sentence is true, false, or uncertain; one predicted sentence is as good as another as long as it resembles the training data.

mojuba9mo ago

In other words, just like with autonomous driving, you need real world experience aka general intelligence to be truly useful. Having a model of the world and knowing your place in it is one of the critical parts of intelligence that both autonomous vehicle systems and LLM's are missing.

rar009mo ago

I know people are pushing back, taking "only" literally, but from a reasonable perspective what causes LLMs (technically their outputs) to give that impression is indeed the crux of what holds progress back: how/what LLMs learn from data. In my personal opinion, there's something fundamentally flawed the whole field has yet to properly pinpointing and fix.

jqpabc1239mo ago

there's something fundamentally flawed the whole field has yet to properly pinpointing and fix.

Isn't it obvious?

It's all built around probability and statistics.

This is not how you reach definitive answers. Maybe the results make sense and maybe they're just nice sounding BS. You guess which one is the case.

The real catch --- if you know enough to spot the BS, you probably didn't need to ask the question in the first place.

ctoth9mo ago

> It's all built around probability and statistics.

Yes, the world is probabilistic.

> This is not how you reach definitive answers.

Do go on? This is the only way to build anything approximating certainty in our world. Do you think that ... answers just exist? What type of weird deterministic video game world do you live in where this is not the case?

jqpabc1239mo ago

How many "r's" are in the word "strawberry"?

I'm certain this simple question has a definitive answer.

1 more reply

darth_avocado9mo ago

Funnily the same thing would get you promoted in corporate America as a human

jqpabc1239mo ago

But only if you are physically attractive and skilled at golf.

EE84M3i9mo ago

This is so prominent in the cultural consciousness that it was lampooned in this week episode of south park, where Randy Marsh goes on a chatgpt (and ketamone) fueled bender and destroys his business.

CloseChoice9mo ago

LLMs are largely used by developers, who (in some sense or the other) supervise what the LLM does constantly (even if that means for sum committing to main and running in production). We do already have a lot of tools: tests, compilation, a programming language with its harsh restrictions compared to natural language, and of course the eye test, this is not the case for a lot of jobs where GenAI is used for hyperautomation, so I am really curious in which way it will or won't get adopted in other areas.

JCM99mo ago

Add to being confidently wrong is the super annoying way it corrects itself after disastrously screwing something up.

AI: “I’ve deployed the API data into your app, following best practices and efficient code.”

Me: “Nope thats totally wrong and in fact you just wrote the API credential into my code, in plaintext, into the JavaScript which basically guarantees that we’re gonna get hacked.”

AI: “You’re absolutely right. Putting API credentials into the source code for the page is not a best practice, let me fix that for you.”

jqpabc1239mo ago

AI Apologetics: "It's all your fault for not being specific enough."

AlecSchueler9mo ago

And then proceeds not to fix it.

ColinEberhardt9mo ago

I agree with the overall sentiment here, having written something similar recently:

“LLMs don’t know what they don’t know” https://blog.scottlogic.com/2025/03/06/llms-dont-know-what-t...

But I wouldn’t say it is the only problem with this technology! Rather, it is a subtle issue that most users don’t understand

esafak9mo ago

Bayesian models solve this problem but they occupy model capacity which practitioners have traditionally preferred to devote to improving point estimates.

hodgehog119mo ago

I've always found this perspective remarkably misguided. Prediction performance is not everything; it can be extraordinarily powerful to have uncertainty estimates as well.

esafak9mo ago

A lot of machine learning practitioners don't have the statistical sophistication to appreciate this, and as ML becomes increasingly "democratized" I expect the situation will worsen.

tangotaylor9mo ago

I don't think humans are good at assessing the accuracy of their own opinions either and I'm not sure how AI is going to do it. Usually what corrects us is failure: some external stimulus that is indifferent or hostile to us.

As Mazer Rackham from Ender's Game said: "Only the enemy shows you where you are weak."

nijave9mo ago

Maybe AI isn't artificial enough here...

blibble9mo ago

the only thing holding me back from being a billionare is my lack of a billion dollars

jqpabc1239mo ago

Being able to recall all the data from the internet doesn't make you "intelligent".

It makes you a walking database --- an example of savant syndrome.

Combine this with failure on simple logical and cognitive tests and the diagnosis would be --- idiot savant.

This is the best available diagnosis of an LLM. It excels at recall and text generation but fails in many (if not most) other cognitive areas.

But that's ok, let's use it to replace our human workers and see what happens. Only an idiot would expect this to go well.

https://nypost.com/2024/06/17/business/mcdonalds-to-end-ai-d...

pxc9mo ago

For programming, at least, there are also problems with overall output quality, instruction following, and the scopes of changes.

LLMs don't do well at following style instructions, and existing memory systems aren't adequate for "remembering" my style preferences.

When you ask for one change, you often get loads of other changes alongside it. Transformers suck at targeted edits.

The hallucination problem and the sycophancy/suggestibility problem (which perhaps both play into the phenomenon of being "confidently wrong") are both real and serious. But they hardly form a singular bottleneck for the usefulness of LLMs.

bwfan1239mo ago

Arguably, the biggest breakthroughs we have had came out of formalization of our world models. Math formalizes abstract worlds, and science formalizes the real world with testable actions.

The key feature of formalization is the ability to create statements, and test statements for correctness. ie, we went from fuzzy feel-good thinking to precise thinking thanks to the formalization.

Furthermore, the ingenuity of humans is to create new worlds and formalize them, ie we have some resonance with the cosmos so to speak, and the only resonance that the LLMs have is with their training datasets.

ofrzeta9mo ago

LLMs just can't learn or understand from the context. The context is there to somehow statistically affect the token production but there is no real understanding. You can provide an LLM a full specification of a problem including all elements that are needed to solve it, for instance all specific functions of a programming library (that is not on the Internet). An competent programmer could read this and implement the solution straightforward. With LLMs this does not work - they still confidently continue producing wrong solutions, though.

dismalaf9mo ago

No, AI's lack of understanding holds it back.

It's literally just a statistical model that guesses what you want based on the prompt and a whole bunch of training data.

If we want a black box that's AGI/SGI, we need a completely new paradigm. Or we apply a bunch of old-school AI techniques (aka. expert systems) to augment LLMs and get something immediately useful, yet slightly limited.

RIght now LLMs do things and are somewhat useful. Short of some expectations, butter than others, but yeah, a statistical model was never going to be more than the sum of its training data.

ldikrtjliaj9mo ago

Well fucking yeah

Yesterday I asked ChatGPT a really simple, factual question. "Where is this feature on this software?" And it made up a menu that didn't exist. I told "No,, you're hallucinating, search the internet for the correct answer" and it directly responded (without the time delay and introspection bubbles that indicate an internet search) "That is not a hallucination, that is factually correct". God damn.

lenerdenator9mo ago

Works fine for humans; I guess we'll know that AI has truly reached human levels of intelligence when being confidently wrong stops holding it back.

rokkamokka9mo ago

The interesting question here is if a statistical model like GPTs actually can encode this is a meaningful way. Nobody has quite found it yet, if so

ACCount379mo ago

They can, and they already do it somewhat. We've found enough to know that.

As the most well known example: Anthropic examined their AIs and found that they have a "name recognition" pathway - i.e. when asked about biographic facts, the AI will respond with "I don't know" if "name recognition" has failed.

This pathway is present even in base models, but only results in consistent "I don't know" if AI was trained for reduced hallucinations.

AIs are also capable of recognizing their own uncertainity. If you have an AI-generated list of historic facts that includes hallucinated ones, you can feed that list back to the same AI and ask it about how certain it is about every fact listed. Hallucinated entries will consistently have less certainty. This latent "recognize uncertainty" capability can, once again, be used in anti-hallucination training.

Those anti-hallucination capabilities are fragile, easy to damage in training, and do not fully generalize.

Can't help but think that limited "self-awareness" - and I mean that in a very mechanical, no-nonsense "has information about its own capabilities" way - is a major cause of hallucinations. An AI has some awareness of its own capabilities and how certain it is about things - but not nearly enough of it to avoid hallucinations consistently across different domains and settings.

meindnoch9mo ago

Not just AI.

1970-01-019mo ago

Confidently wrong while being unable to unlearn its incorrect assumption. I'd be happy with confidently wrong if it understood critical feedback is not an ask, it's an ultimatum for our current discussion to continue with the facts.

kemcho9mo ago

The angle that being to detect confidently wrong, which then helps kicks off new learning is interesting.

Has anyone had any success with continuous learning type AI products? Seems like there’s a lot of hype around RL to specialise.

ACCount379mo ago

There's no "hype" because continuous learning is algorithmically hard and computationally intensive.

There's no known good recipe for continuous learning that's "worth it". No ready-made solution for everyone to copy. People are working on it, no doubt, but it's yet to get to the point of being readily applicable.

nyeah9mo ago

PG pointed this out a while back. He said that AIs were great at generating typical online comments. (NB I don't know which site's comments he might have been referring to.)

squigz9mo ago

I've said from the beginning that until an LLM can determine and respond with "I do not know that", their usefulness will be limited and they cannot be trusted.

gloosx9mo ago

We had many examples of AIs which tried to learn from feedback in the public domain. They all quickly becoming racist nazis for some reason.

AlecSchueler9mo ago

What are examples other than Grok which apparently had nazi sympathies hardcoded in the system prompt?

mediumsmart9mo ago

I sometimes take an answer that does not work, open a new chat, paste the thing and ask "why does this not do (whatever its supposed to do) ?

mtkd9mo ago

The link is a sales pitch for some tech that uses MCPs ... see the platform overview on the product top menu

Because MCPs solve the exact issue the whole post is about

witnessme9mo ago

I like the DSL approach but can't imagine how practical and effective it is. Specially considering the cost.

frays9mo ago

Great article, words a lot of my experiences with AI (needing to tell it to make a plan, then I assess it)

ChrisMarshallNY9mo ago

My favorite is "Tested and Verified," then giving me code that won't even compile.

myahio9mo ago

Yep, this is why I'm skeptical about using LLMs as a learning tool

giancarlostoro9mo ago

What's really funny to me is, sometimes it fixes itself if you just ask "are you SURE ABOUT THIS ANSWER?" myself and others often wonder, why the heck don't they run a 2nd model to "proofread" output or spot check it. Like did you actually answer the question or are you going off a really weird tangent.

I asked Perplexity some question for sample UI code for Rust / Slint, it gave me a beautiful web UI, I think it got confused because I wanted to make a UI for an API that has its own web UI, I told it you did NOT give me code for Slint, even though some of its output made references to "ui.slint" and other Rust files, it realized its mistake and gave me exactly what I wanted to see.

tl;dr why dont llms just vet themselves with a new context window to see if they actually answered the question? The "reasoning" models don't always reason.

asadotzler9mo ago

I've asked that question on accurate answers and had the bot say oops and change the answer to an inaccurate one. This seems to happen with about the same frequency on both sides so I'm not sure how helpful it will ultimately be.

giancarlostoro9mo ago

Interesting! Have not tried that

ACCount379mo ago

Because that would be twice as computationally intensive.

"Reasoning" models integrate some of that natively. In a way, they're trained to double check themselves - which does improve accuracy at the cost of compute.

merelysounds9mo ago

I’m especially surprised by how little progress has been made. Today’s hallucinations, while less frequent, continue to have a major negative impact. And the problem has been noticed since the start.

> "I will admit, to my slight embarrassment … when we made ChatGPT, I didn't know if it was any good," said Sutskever.

> "When you asked it a factual question, it gave you a wrong answer. I thought it was going to be so unimpressive that people would say, 'Why are you doing this? This is so boring!'" he added.

https://www.businessinsider.com/chatgpt-was-inaccurate-borin...

dankobgd9mo ago

i am pretty sure it has many more problems

_Algernon_9mo ago

Rolling weighted dice repeatedly to generate words isn't factually accurate. More at 11.

chpatrick9mo ago

It is if the weights are sufficiently advanced.

blueflow9mo ago

I find such statements frightening. Too many people can not tell the different between prevalence ("everybody does it") and factually correct.

chpatrick9mo ago

Nothing to do with dice though.

2 more replies

Zigurd9mo ago

The weights, so to speak, come from the knowledge base. That means you can't get away from the quality of the knowledge base. That isn't uniform across all domains of knowledge. Then the problem becomes how do you make the training material uniformly high-quality in every knowledge domain? At best it becomes the meta problem of determining the quality of knowledge in some way that makes an LLM able to calibrate confidence to a knowledge domain. But more likely we're stuck with the dubious quality that comes from human bias and wishful thinking in supposedly authoritative material.

chpatrick9mo ago

Sure, it's only as good as the training data. But human experts also output tokens with some statistical distribution. That doesn't mean anything.

2 more replies

nijave9mo ago

MCP and agents seem like a solutions but as far as I know maintaining sufficient context is still a problem

I.e. ability to plug in expert data sources

1 more reply

JamesSwift9mo ago

I think youre missing the point. The issue is not the amount of knowledge it possesses. The problem is that theres no way to go from "statistically generate the next word" to "what is your confidence level in the fact you just stated". Maybe, with an enormous amount of computation we could layer another AI on top to evaluate or add confidence intervals, but I just dont see how we get there wihthout another quantum leap.

1 more reply

paul79869mo ago

And being overHyped with the doom and gloom of it's affects on society.

chatGPT (5) is not there especially in replacing my field and skills: graphic, web design and web development. The first 2 there it spits out solid creations per your prompt request yet can not edit it's creations just creates new ones lol. So it's just another tool in my arsenal not a replacement to me.

Such Makes me wonder how it generates the logos and website designs ... is it all just hocus pocus.. the Wizard of OZ?

nijave9mo ago

I don't know much about it but apparently we've been having success at work with Figma MCP hooked up to Claude in Cursor. Apparently it can pull from our component library and generate useable code (although still needs engineering to productionalize)

I don't know about replacing anyone but our UI/UX designers are claiming it's significantly faster than traditional mock ups

paul79869mo ago

Well until these LLMs are able to spit out initial creations it's user likes and then is able to edit it properly per each request entered into the text prompt our jobs are safe! Even better if you are also a UX Researcher along with a Designer and Developer. Research requires human interaction and AI can't touch that present to a decade or more away.

amai9mo ago

Bayesian LLMs anyone?

jeffxtreme9mo ago

Does anyone know which XKCD comic the top image was? Or was it just created in the style of XKCD?

arduanika9mo ago

The latter, I think.

Randall Munroe has called this abomination "an insult to life itself". But that might be quoting him out of context.

jrm49mo ago

Genuine question from someone who thinks they understand the tech:

I don't get why I haven't seen a whole lot of (or any) of these models or tools "self reporting" on "confidence in their answer?"

This feels like it would be REALLY easy; these things predict likelihoods of tokens -- just, you know, give us that number?

SalariedSlave9mo ago

Anybody remember active learning? I'm old, and ML was much different back then, but this reminds me of grueling annotation work I had to do.

On a different note: is it just me or are some parts of this article oddly written? The sentence structure and phrasing read as confusing - which I find ironic, given the context.

1vuio0pswjnm79mo ago

What is interesting IMO about the "confidently wrong" phenomenon is that this was also commonly found in internet forums and online commentary in general prior to widespread use of today's confidently wrong "AI". That is, online commenters routinely were and still are "confidently wrong". IMHO and IME, the "confidentlay wrong" phenonmenon was and still is greater represented in online commnentary than "IRL".

No surprise IMO that, generally, online commenters and so-called "tech" companies who tend to be overly fixated on computers as the solution to all problems, are also the most numerous promoters of confidently wrong "AI".

The nature of the medium itself and those so-called "tech" companies that have sought to dominate it through intermediation and "ad services"^1 could have something to do with the acceptance and promotion of confidently wrong "AI". Namely, its ability to reduce critical thinking and the relative ease with which uninformed opinions, misinformation, and other non-factual "confidently wrong" information can be spread by virtually anyone.

1. If "confidently wrong" information is popular, if it "goes viral", then with few exceptions it will be promoted by these companies to drive traffic and increase ad services revenue.

Please note: I could be wrong.

dgfitz9mo ago

s/confidently//

Because “ai” is fallible, right now it is at best a very powerful search engine that can also muck around in (mostly JavaScript) codebases. It also makes mistakes in code, adds cruft, and gives incorrect responses to “research-type” questions. It can usually point you in the right direction, which is cool, but Google was able to do that before its enshittification.

s/AI/LLMs

The part where people call it AI is one of the greatest marketing tricks of the 2020s.

captainclam9mo ago

Wow, there really is an xkcd for everything.

Dwedit9mo ago

Those are original cartoons drawn in the style of XKCD. But strangely enough, in the second cartoon, the Megan clone seems to change from a thin stick figure to suddenly wearing clothes?

I'm not sure if the comic was AI-assisted or not. AI-generated images do not usually contain identical pixel data when a panel repeats.

ryukoposting9mo ago

The script is uncanny as well. My guess is the author used AI to generate the panels/dialogue, then stitched together cutouts from real xkcd comics over the top of the AI-generated panels here and there. That exact shape of the heads is too close to not be a copy-paste job, but other variations suggest AI involvement. The desk gets rotated in the third panel of the first comic, the female character in the second comic gets clothes out of nowhere, etc.

Regardless of how the author made the comics, they're very weird.

j / k navigate · click thread line to collapse

262 comments

lucideer9mo ago

While the thrust of this article is generally correct, I have two issues with it:

1. The words "the only thing" massively underplays the difficulty of this problem. It's not a small thing.

traceroute669mo ago

> is their willingness to correct themselves when asked

Except they don't correct themselves when asked.

I'm sure we've all been there, many, many, many,many,many times ....

   - User: "This is wrong because X"
   - AI: "You're absolutely right !  Here's a production-ready fixed answer"
   - User: "No, that's wrong because Y"
   - AI: "I apologise for frustrating you ! Here's a robust answer that works"
   - User: "You idiot, you just put X back in there"
   - and so continues the vicious circle....

ACCount379mo ago

vidarh9mo ago

This is a good point, and to drive this home to people, if you have a conversation of this pattern:

    User: Fix this problem ...
    Assistant: X
    User: No, don't do X
    Assistant: Y
    User: No, Y is wrong too.
    Assistant: X

If you reach that state, start over, and constrain your initial request to exclude X and Y. If it brings up either again, start over, and constrain your request further.

5 more replies

rtkwe9mo ago

stetrain9mo ago

They tend to very quickly lose useful context of the original problem and stated goals.

nyeah9mo ago

Yes, that is the point of the comment.

1 more reply

therobots9279mo ago

Yeah I think our jobs are safe. Why doesn’t anyone acknowledge loops like this? They happen all the time and I’m only using it once a week at the most

gavinray9mo ago

  > Yeah I think our jobs are safe.

I give myself 6-18 months before I think top-performing LLM's can do 80% of the day-to-day issues I'm assigned.

  > Why doesn’t anyone acknowledge loops like this?

Now what you've got is a junk-soup where the original problem is buried somewhere in the pile.

Best approach I've found is to start a fresh conversation with the original problem statement and any improvements/negative reinforcements you've gotten out of the LLM tacked on.

I typically have ChatGPT 5 Thinking, Claude 4.1 Opus, Grok 4, and Gemini 2.5 Pro all churning on the same question at once and then copy-pasting relevant improvements across each.

6 more replies

RyanOD9mo ago

1 more reply

jansper399mo ago

1 more reply

vidarh9mo ago

Because it's easy to learn to stop engaging with those loops, treating them as a sign you provided too little context, and instead start a new conversation with an expanded prompt.

It doesn't mean these loops aren't an issue, because they are, but once you stop engaging with them and cut them off, they're a nuisance rather than a showstopper.

1 more reply

traceroute669mo ago

The AI-fanbois will quickly tell you that you are misusing the context or your prompt is "wrong".

But I've had it consistently happens to me on tiny contexts (e.g. I've had to spend time trying - and failing - to get it to fix a mess it was making with a straightforward 200-ish line bash script).

1 more reply

dkersten9mo ago

My pet peace is when I point out a problem it responds with acknowledgement and then explaining why it’s wrong. Like… I already know why it’s wrong, since I’m the one that pointed it out!

brookst9mo ago

You case is no different from:

- AI: "The capital of France is Paris"

- User: "This is wrong, it changed to Montreal in 2005"

- AI: "You're absolutely right! The capital of France is Montreal"

kelseyfrog9mo ago

Instead I get this:

    Nope—Paris is the capital of France and has been for centuries. Montreal is in Quebec, Canada. France’s presidency (Élysée), parliament (Assemblée nationale and Sénat), and ministries are all in Paris.

1 more reply

zero_iq9mo ago

lucideer9mo ago

True. This also often happens.

Probably the ideal would be to have a UI / non-chat-based mechanism for discarding select context.

Wowfunhappy9mo ago

dns_snek9mo ago

> but in reality it simply plays into users' biases & makes it more likely that the user will accept & approve of incorrect responses from the AI.

Yes! I often find myself overthinking my phrasing to the nth degree because I've learned that even a sprinkle of bias can often make the LLM run in that direction even if it's not the correct answer.

stingraycharles9mo ago

> 1. The words "the only thing" massively underplays the difficulty of this problem. It's not a small thing.

Exactly. One could argue that this is just an artifact from the fundamental technique being used: it’s a really fancy autocomplete based on a huge context window.

(This is actually a feature of an MCP tool I use, “consensus” from zen-mcp-server, which enables you to query multiple different models to reach a consensus on a certain problem / solution).

tango12OP9mo ago

The AI being wrong problem is probably not insurmountable.

Humans have meta-cognition that helps them judge if they're doing a thing with lots of assumptions vs doing something that's blessed.

Humans decouple planning from execution right? Not fully but we choose when to separate it and when to not.

If we had enough data on here's a good plan given user context and here's a bad plan, it doesn't seem unreasonable to have a pretty reliable meta cognition capability on the goodness of a plan.

procaryote9mo ago

Depending on your definitions, either:

* there are already lots of "reasoning" models trying meta-cognition, while still getting simple things wrong

or:

* the models aren't doing cognition, so meta-cognition seems very far away

energy1239mo ago

ninetyninenine9mo ago

It’s not massively underplaying it imo. AI hype is real. This is revolutionary technology that humanity has never seen before.

Too much hype makes us numb to the reality of how insane the technology is.

NoGravitas9mo ago

> Like when someone says the only thing stopping LLMs is hallucinations… that is literally the last gap.

ninetyninenine9mo ago

That's not true. There is we just need to find it.

1 more reply

indigoabstract9mo ago

I think you would have really enjoyed living in the '50s, when the future was bright and colonizing Mars was basically a solved problem.

What we got instead is a bunch of wisecracking programmers who like to remind everyone of the 90–90 rule, or the last 10 percent.

ninetyninenine9mo ago

Idiot programmers and their generic wise cracks were the ones saying that AI would never be able to pass the Turing test and this was just 4 years ago.

taco_emoji9mo ago

Oh, buddy, LLM hallucinations are not the only gap left for AGI

ninetyninenine9mo ago

It is. After that it's virtually indistinguishable from chatting with a human

stetrain9mo ago

Yes, the quick to correct itself isn't really useful. I would not like a human assistant/intern/pair programmer who when asked how to do X said:

> To accomplish X you can just use Y!

But Y isn't applicable in this scenario.

> Oh, you're absolutely right! Instead of Y you can do Z.

Are you sure? I don't think Z accomplishes X.

> On second thought you're absolutely correct. Y or Z will clearly not accomplish X, but let's try Q....

sfn429mo ago

indigoabstract9mo ago

It's not obvious how long until that point or what form it will finally take, but it should be obvious that it's going to happen at some point.

Workaccount29mo ago

My favorite "paper" on AI pretty accurately describes this line of thinking

https://ai.vixra.org/pdf/2506.0065v1.pdf

ninetyninenine9mo ago

Like you realize humans hallucinate too right? And that there are humans that have a disease that makes them hallucinate constantly.

Hallucinations don’t preclude humans from being “intelligent”. It also doesn’t preclude the LLM from being intelligent.

dns_snek9mo ago

> Your opinion is a minority now and not shared by people on the forefront of building these things.

1 more reply

CodexArcanum9mo ago

LLMS don't "hallucinate" they generate a stochastic sequence of plausible tokens that, in context when read by a human, are a false statement or nonsensical.

https://aiguide.substack.com/p/llms-and-world-models-part-1

https://yosefk.com/blog/llms-arent-world-models.html

1 more reply

kilpikaarna9mo ago

> It is clear to everyone who builds LLMs that the AI is intelligent.

So presumably we have a solid, generally-agreed-upon definition on intelligence now?

> autocompleting things with humanity changing intelligent content.

What does this even mean?

2 more replies

NoGravitas9mo ago

IOW, the realistic position is not held by the majority of people whose paychecks depend on it being wrong. I'm shocked.

2 more replies

mcv9mo ago

Yeah, but how much of that is wishful thinking? If your job depends on believing this is real intelligence, you're more likely to believe that.

eCa9mo ago

> Like you realize humans hallucinate too right?

A developer that hallucinates at work to the extent that LLMs does would probably have issues getting their PRs past code reviews a lot.

2 more replies

burnte9mo ago

KoolKat239mo ago

Gemini 2.5 pro is quite good at being stubborn (well at least the initial release versions, haven't tested since).

decentrality9mo ago

Agreed with #1 ( came here to say that also )

Pronoun and noun wordplay aside ( 'Their' ... `themselves` ) I also agree that LLMs can correct the path being taken, regenerate better, etc...

But the idea that 'AI' needs to be _stubbornly_ wrong ( more human in the worst way ) is a bad idea. There is a fundamental showing, and it is being missed.

taco_emoji9mo ago

> Pronoun and noun wordplay aside ( 'Their' ... `themselves` )

How is that wordplay? Those are the correct pronouns.

corytheboyd9mo ago

raynr9mo ago

As a layman, this too strikes me as the problem underlying the "confidently wrong" problem.

jdbernard9mo ago

yifanl9mo ago

There are people convinced that if we throw a sufficient amount of training data and VC money at more hardware, we'll overcome the gap.

Technically, I can't prove that they're wrong, novel solutions sometimes happen, and I guess the calculus is that it's likely enough to justify a trillion dollars down the hole.

gavinray9mo ago

There's a guy, Ken Stanley, who wrote the NEAT[0]/HyperNEAT[1] algorithms.

His big idea is that evolution/advancements don't happen incrementally, but rather in unpredictable large leaps.

He wrote a whole book about it that's pretty solid IMO: "Why Greatness Cannot Be Planned: The Myth of the Objective."

[0] https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_t... [1] https://en.wikipedia.org/wiki/HyperNEAT

delichon9mo ago

https://en.wikipedia.org/wiki/Saltation_(biology)

1 more reply

devin9mo ago

Whenever I try to tell people about the myth of the objective they look at me like I'm insane. It's not very popular to tell people that their best laid plans are actually part of the problem.

yifanl9mo ago

I would suspect that any next step comes with a novel implementation though, not just trying to scale the same shit to infinity.

I guess the bitter lesson is gospel now, which doesn't sit right with me now that we're past the stage of Moore's Law being relevant, but I'm not the one with a trillion dollars, so I don't matter.

corytheboyd9mo ago

I’d say it was worth throwing down some cash for, because we get cool new things by full-assing new ideas. But… yeah… a TRILLION dollars is waaaay too far.

kovacs9mo ago

corytheboyd9mo ago

I guess technically 3D on computers is still clever 2D, but let’s not break the metaphor down too hard lol. Love the Doom/Quake comparison!

gus_massa9mo ago

It's easy to solve if they modify they training to remove some weight from Stack Overflow and add more weight to Yahoo! Answers :) .

corytheboyd9mo ago

> For the AI, you want to give it some room to generalize/bullshit.

aidenn09mo ago

rwmj9mo ago

I predict we'll get a few research breakthroughs in the next few years that will make articles like this seem ridiculous.

energy1239mo ago

tliltocatl9mo ago

> If I freeze 40 yo Einstein and make it so he can't form new memories beyond 5 minutes, that's still an incredibly useful, generally intelligent thing

A "frozen mind" feels like something not unlike a book - useful, but only with a smart enough "human user", and even so be progressively less useful as time passes.

>Doesn't seem like a problem that needs to be solved on the critical path to AGI.

>We don't have visibility into the IMO/IOI medal winning models.

There are lies, damn lies and LLM benchmarks. IMO/IOI is not necessarily indicative of any useful tasks.

Dilettante_9mo ago

>If I freeze 40 yo Einstein and make it so he can't form new memories beyond 5 minutes, that's still an incredibly useful, generally intelligent thing.

And don't even get me started on the language barrier.

I recently read this short story[1] on the topic so it's fresh on my mind.

[1]https://qntm.org/mmacevedo

selfmodruntime9mo ago

Never before did we have a combination of well and poison where the pollution of the first was both as instantaneous and as easily achieved.

I‘ve yet to see a convincing article for artificial training data.

AIPedant9mo ago

ausbah9mo ago

tliltocatl9mo ago

> LLM doesn't learn incrementally from previous encounters

ACCount379mo ago

You can gather feedback from inference and funnel that back into model training. It's just very, very hard to do that without shooting yourself in the foot.

The feedback you get is incredibly entangled, and disentangling it to get at the signals that would be beneficial for training is nowhere near a solved task.

FergusArgyll9mo ago

The problem is the kinds of "data" users will feed it. It's basically an impossible task to put a continuous learning model online and not have it devolve into the optimal mix of stalin & hitler

lvl1559mo ago

Incrementally learning model is pretty hard. That’s actually something I am working on right now and it’s completely different from developing/implementing LLMs.

criddell9mo ago

I think that's what it's going to take. Eventually put the learning model in a robot body and send it out into the real world where there's no shortage of training data.

Tiktaalik9mo ago

Yea it'll learn real quick what falling in a ravine is like

tliltocatl9mo ago

Cool, got any previous work to share?

impossiblefork9mo ago

Having run out of training data isn't something holding back LLMs in this sense.

I think much of it is solvable. I certainly have ideas about how it can be done.

traceroute669mo ago

> That we appear to have run out of training data.

I think the next iteration of LLM is going to be "interesting", i.e. now that all the websites they used to freely scrape have been increasingly putting up walls.

procaryote9mo ago

Also that no companies involved seem to be making a profit, have a reasonable vision to make a profit, or even revenue in the same ballpark as costs.

Except nvidia perhaps

Dilettante_9mo ago

But that's not so unusual these days, no? I'm not terribly knowledgeable in the field, but don't Amazon, Netflix, Uber and such work on this kind of funding structure?

mrguyorama9mo ago

"Not terribly knowledgeable in the field"?

tango12OP9mo ago

Author here.

You’re right in that it’s obviously not the only problem.

But without solving this seems like no matter how good the models get it’ll never be enough.

Or, yes, the biggest research breakthrough we need is reliable calibrated confidence. And that’ll allow existing models as they are to become spectacularly more useful.

BitsAndObjects9mo ago

The biggest breakthrough that we need is something resembling actual intelligence in AI (human or machine, I’ll let you decide where we need it more ;) )

binarymax9mo ago

You might be getting downvoted because you editorialized your own title. If it’s obviously not the only thing then don’t add that to the title :)

mettamage9mo ago

> Only thing? Just off the top of my head: That the LLM doesn't learn incrementally from previous encounters. That we appear to have run out of training data.

Ha, that almost seems like an oxymoron. The previous encounters can be the new training data!

jazzyjackson9mo ago

The old training was human responses to human questions. From this the bot learned to mimick human responses.

What would be the point of training an LLM on bot answers to human questions? This is only useful if you want to get an LLM that behaves like an already existing LLm

selfmodruntime9mo ago

Queries are questions in a sense that they are not the original facts. I don’t think they are useful for training data.

harsh31959mo ago

In terms of adoption, I think the user is right. That is the only thing stopping adoption of existing models in the real world.

moduspol9mo ago

Unclear limits on how much context can be reliably provided and effectively used without degrading the result.

ninetyninenine9mo ago

It does. We keep a section of the context window for memory. The LLM however is the one deciding what is remembered. Technically via the system prompt we can have it remember every prompt if needed.

But memory is a minor thing. Talking to a knowledgeable librarian or professor you never met is the level we essentially need to get it to for this stuff to take off.

firesteelrain9mo ago

> That we appear to have run out of training data

And now, in some cases for a while, it is training on its own slop.

lazide9mo ago

The article is the peak of confidently wrong itself, for solid irony points.

therobots9279mo ago

Apple released this recently: https://machinelearning.apple.com/research/illusion-of-think...

roxolotl9mo ago

projektfu9mo ago

IsTom9mo ago

projektfu9mo ago

If what I am doing is summarizing data and it will likely have uncertainty as a result, I can include statistics in the specification of what I want.

I am pessimistic about whether these tools are intelligent or will ever achieve intelligence, but where they are useful, we should use them.

z3c09mo ago

hodgehog119mo ago

PessimalDecimal9mo ago

hodgehog119mo ago

z3c09mo ago

esafak9mo ago

It's the difference between a categorical distribution and a Dirichlet. https://en.wikipedia.org/wiki/Dirichlet_distribution

1 more reply

NoGravitas9mo ago

mojuba9mo ago

rar009mo ago

jqpabc1239mo ago

there's something fundamentally flawed the whole field has yet to properly pinpointing and fix.

Isn't it obvious?

It's all built around probability and statistics.

This is not how you reach definitive answers. Maybe the results make sense and maybe they're just nice sounding BS. You guess which one is the case.

The real catch --- if you know enough to spot the BS, you probably didn't need to ask the question in the first place.

ctoth9mo ago

> It's all built around probability and statistics.

Yes, the world is probabilistic.

> This is not how you reach definitive answers.

jqpabc1239mo ago

How many "r's" are in the word "strawberry"?

I'm certain this simple question has a definitive answer.

1 more reply

darth_avocado9mo ago

Funnily the same thing would get you promoted in corporate America as a human

jqpabc1239mo ago

But only if you are physically attractive and skilled at golf.

EE84M3i9mo ago

CloseChoice9mo ago

JCM99mo ago

Add to being confidently wrong is the super annoying way it corrects itself after disastrously screwing something up.

AI: “I’ve deployed the API data into your app, following best practices and efficient code.”

Me: “Nope thats totally wrong and in fact you just wrote the API credential into my code, in plaintext, into the JavaScript which basically guarantees that we’re gonna get hacked.”

AI: “You’re absolutely right. Putting API credentials into the source code for the page is not a best practice, let me fix that for you.”

jqpabc1239mo ago

AI Apologetics: "It's all your fault for not being specific enough."

AlecSchueler9mo ago

And then proceeds not to fix it.

ColinEberhardt9mo ago

I agree with the overall sentiment here, having written something similar recently:

“LLMs don’t know what they don’t know” https://blog.scottlogic.com/2025/03/06/llms-dont-know-what-t...

But I wouldn’t say it is the only problem with this technology! Rather, it is a subtle issue that most users don’t understand

esafak9mo ago

Bayesian models solve this problem but they occupy model capacity which practitioners have traditionally preferred to devote to improving point estimates.

hodgehog119mo ago

I've always found this perspective remarkably misguided. Prediction performance is not everything; it can be extraordinarily powerful to have uncertainty estimates as well.

esafak9mo ago

A lot of machine learning practitioners don't have the statistical sophistication to appreciate this, and as ML becomes increasingly "democratized" I expect the situation will worsen.

tangotaylor9mo ago

As Mazer Rackham from Ender's Game said: "Only the enemy shows you where you are weak."

nijave9mo ago

Maybe AI isn't artificial enough here...

blibble9mo ago

the only thing holding me back from being a billionare is my lack of a billion dollars

jqpabc1239mo ago

Being able to recall all the data from the internet doesn't make you "intelligent".

It makes you a walking database --- an example of savant syndrome.

Combine this with failure on simple logical and cognitive tests and the diagnosis would be --- idiot savant.

This is the best available diagnosis of an LLM. It excels at recall and text generation but fails in many (if not most) other cognitive areas.

But that's ok, let's use it to replace our human workers and see what happens. Only an idiot would expect this to go well.

https://nypost.com/2024/06/17/business/mcdonalds-to-end-ai-d...

pxc9mo ago

For programming, at least, there are also problems with overall output quality, instruction following, and the scopes of changes.

LLMs don't do well at following style instructions, and existing memory systems aren't adequate for "remembering" my style preferences.

When you ask for one change, you often get loads of other changes alongside it. Transformers suck at targeted edits.

bwfan1239mo ago

Arguably, the biggest breakthroughs we have had came out of formalization of our world models. Math formalizes abstract worlds, and science formalizes the real world with testable actions.

The key feature of formalization is the ability to create statements, and test statements for correctness. ie, we went from fuzzy feel-good thinking to precise thinking thanks to the formalization.

ofrzeta9mo ago

dismalaf9mo ago

No, AI's lack of understanding holds it back.

It's literally just a statistical model that guesses what you want based on the prompt and a whole bunch of training data.

RIght now LLMs do things and are somewhat useful. Short of some expectations, butter than others, but yeah, a statistical model was never going to be more than the sum of its training data.

ldikrtjliaj9mo ago

Well fucking yeah

lenerdenator9mo ago

Works fine for humans; I guess we'll know that AI has truly reached human levels of intelligence when being confidently wrong stops holding it back.

rokkamokka9mo ago

The interesting question here is if a statistical model like GPTs actually can encode this is a meaningful way. Nobody has quite found it yet, if so

ACCount379mo ago

They can, and they already do it somewhat. We've found enough to know that.

This pathway is present even in base models, but only results in consistent "I don't know" if AI was trained for reduced hallucinations.

Those anti-hallucination capabilities are fragile, easy to damage in training, and do not fully generalize.

meindnoch9mo ago

Not just AI.

1970-01-019mo ago

kemcho9mo ago

The angle that being to detect confidently wrong, which then helps kicks off new learning is interesting.

Has anyone had any success with continuous learning type AI products? Seems like there’s a lot of hype around RL to specialise.

ACCount379mo ago

There's no "hype" because continuous learning is algorithmically hard and computationally intensive.

nyeah9mo ago

PG pointed this out a while back. He said that AIs were great at generating typical online comments. (NB I don't know which site's comments he might have been referring to.)

squigz9mo ago

I've said from the beginning that until an LLM can determine and respond with "I do not know that", their usefulness will be limited and they cannot be trusted.

gloosx9mo ago

We had many examples of AIs which tried to learn from feedback in the public domain. They all quickly becoming racist nazis for some reason.

AlecSchueler9mo ago

What are examples other than Grok which apparently had nazi sympathies hardcoded in the system prompt?

mediumsmart9mo ago

I sometimes take an answer that does not work, open a new chat, paste the thing and ask "why does this not do (whatever its supposed to do) ?

mtkd9mo ago

The link is a sales pitch for some tech that uses MCPs ... see the platform overview on the product top menu

Because MCPs solve the exact issue the whole post is about

witnessme9mo ago

I like the DSL approach but can't imagine how practical and effective it is. Specially considering the cost.

frays9mo ago

Great article, words a lot of my experiences with AI (needing to tell it to make a plan, then I assess it)

ChrisMarshallNY9mo ago

My favorite is "Tested and Verified," then giving me code that won't even compile.

myahio9mo ago

Yep, this is why I'm skeptical about using LLMs as a learning tool

giancarlostoro9mo ago

tl;dr why dont llms just vet themselves with a new context window to see if they actually answered the question? The "reasoning" models don't always reason.

asadotzler9mo ago

giancarlostoro9mo ago

Interesting! Have not tried that

ACCount379mo ago

Because that would be twice as computationally intensive.

"Reasoning" models integrate some of that natively. In a way, they're trained to double check themselves - which does improve accuracy at the cost of compute.

merelysounds9mo ago

> "I will admit, to my slight embarrassment … when we made ChatGPT, I didn't know if it was any good," said Sutskever.

> "When you asked it a factual question, it gave you a wrong answer. I thought it was going to be so unimpressive that people would say, 'Why are you doing this? This is so boring!'" he added.

https://www.businessinsider.com/chatgpt-was-inaccurate-borin...

dankobgd9mo ago

i am pretty sure it has many more problems

_Algernon_9mo ago

Rolling weighted dice repeatedly to generate words isn't factually accurate. More at 11.

chpatrick9mo ago

It is if the weights are sufficiently advanced.

blueflow9mo ago

I find such statements frightening. Too many people can not tell the different between prevalence ("everybody does it") and factually correct.

chpatrick9mo ago

Nothing to do with dice though.

2 more replies

Zigurd9mo ago

chpatrick9mo ago

Sure, it's only as good as the training data. But human experts also output tokens with some statistical distribution. That doesn't mean anything.

2 more replies

nijave9mo ago

MCP and agents seem like a solutions but as far as I know maintaining sufficient context is still a problem

I.e. ability to plug in expert data sources

1 more reply

JamesSwift9mo ago

1 more reply

paul79869mo ago

And being overHyped with the doom and gloom of it's affects on society.

Such Makes me wonder how it generates the logos and website designs ... is it all just hocus pocus.. the Wizard of OZ?

nijave9mo ago

I don't know about replacing anyone but our UI/UX designers are claiming it's significantly faster than traditional mock ups

paul79869mo ago

amai9mo ago

Bayesian LLMs anyone?

jeffxtreme9mo ago

Does anyone know which XKCD comic the top image was? Or was it just created in the style of XKCD?

arduanika9mo ago

The latter, I think.

Randall Munroe has called this abomination "an insult to life itself". But that might be quoting him out of context.

jrm49mo ago

Genuine question from someone who thinks they understand the tech:

I don't get why I haven't seen a whole lot of (or any) of these models or tools "self reporting" on "confidence in their answer?"

This feels like it would be REALLY easy; these things predict likelihoods of tokens -- just, you know, give us that number?

SalariedSlave9mo ago

Anybody remember active learning? I'm old, and ML was much different back then, but this reminds me of grueling annotation work I had to do.

On a different note: is it just me or are some parts of this article oddly written? The sentence structure and phrasing read as confusing - which I find ironic, given the context.

1vuio0pswjnm79mo ago

1. If "confidently wrong" information is popular, if it "goes viral", then with few exceptions it will be promoted by these companies to drive traffic and increase ad services revenue.

Please note: I could be wrong.

dgfitz9mo ago

s/confidently//

s/AI/LLMs

The part where people call it AI is one of the greatest marketing tricks of the 2020s.

captainclam9mo ago

Wow, there really is an xkcd for everything.

Dwedit9mo ago

Those are original cartoons drawn in the style of XKCD. But strangely enough, in the second cartoon, the Megan clone seems to change from a thin stick figure to suddenly wearing clothes?

I'm not sure if the comic was AI-assisted or not. AI-generated images do not usually contain identical pixel data when a panel repeats.

ryukoposting9mo ago

Regardless of how the author made the comics, they're very weird.

j / k navigate · click thread line to collapse