If I read through the reports and summaries it generates, they seem at first glance correct - the jargon is used correctly, and physical phenomena are referred to mostly accurately. But very quickly I realize that, even with the deep research features and citations, it's making a bunch of incorrect inferences that likely arise from certain concepts (words, really) co-occurring in documents while not actually being causally linked or otherwise fundamentally connected. On top of some strange leading sentences and arguments, this often creates entirely inappropriate topic headings/sections connecting things that really shouldn't be together.
One small example, of course, but this type of error (usually multiple errors) shows up in both Gemini and OpenAI models, even with some very specific prompts and multiple turns. And it keeps happening for topics in the fields I work in, in the physical sciences and engineering. I'm not sure one could RL hard enough to correct this sort of thing (and it is likely not worth the time and money), but perhaps my imagination is limited.
They fail to understand that other engineering fields' documentation and processes are awful. Not that computer science is good - they are even less rigorous.
The difference is that other fields don't log every single change they make into source control and don't have millions of open source projects to pull from. There aren't billions of books on engineering to pull from the way there are with language. The information is siloed, and those with the keys now know what it's worth.
How do you find they compare?
- mixes up pronouns (who is "you" or "he")
- cannot keep track of what is where
- continuously plugs its guidance slant ("let's cook dinner, Bob! It is paramount to strive for safety and cooperation while doing it!")
- language style is all over the place, comically so
- when asked about the text it just generated, it can give valid critiques of itself (i.e. having that "insight" does not help the generation)
Journalists may have shallow understanding of topic, but they do not start referring to a person they write about as "me" halfway through.
The LLM is uniformly dumb.
If a calculator works great only 99% of the time, you could not use that calculator to build a bridge.
Using AI for more than code generation is still very difficult and requires a human in the loop to verify the results. Sometimes using AI ends up being less productive because you're spending all your time debugging its outputs. It's great, but there are also a lot of questions about whether this technology will ultimately deliver the productivity gains that many think are guaranteed in the next few years. There is a non-zero chance it ends up actually hurting productivity because of all the time wasted trying to get it to produce magic results.
We know for certain that certified lawyers have committed malpractice by using ChatGPT, in part because the made-up citations are relatively easy to spot. Malpractice by engineers might take a little more time to discover.
If the calculator has a little gremlin in it that rolls a random 100-sided die and gives you the wrong answer every time it rolls a 1, then you certainly can use it to build a bridge. You just need to do each calculation, say, 10 or 20 times and take the majority answer :)
If the gremlin is clever, it might remember the wrong answers it gave you, and then it might give them to you again if you ask about the same numbers. In that case you might need to buy 10 or 20 calculators that all have different gremlins in them, but otherwise the process is the same.
Of course if all your gremlins consistently lie for certain inputs, you might need to do a lot of work to sample all over your input space and see exactly what sorts of numbers they don't like. Then you can breed a new generation of gremlins that...
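For what it's worth, the gremlin scheme is just majority voting over independent noisy runs. A minimal sketch (the gremlin calculator here is a made-up stand-in, obviously):

    // A calculator whose gremlin rolls a d100 and lies on a 1.
    function gremlinAdd(a: number, b: number): number {
      return Math.random() < 0.01 ? a + b + 1 : a + b;
    }

    // Ask several independent gremlins and take the majority answer.
    function majorityAdd(a: number, b: number, trials = 15): number {
      const votes = new Map<number, number>();
      for (let i = 0; i < trials; i++) {
        const answer = gremlinAdd(a, b);
        votes.set(answer, (votes.get(answer) ?? 0) + 1);
      }
      // With a 1% error rate per run, the chance of a wrong majority
      // across 15 independent runs is astronomically small.
      return [...votes.entries()].sort((x, y) => y[1] - x[1])[0][0];
    }

As noted above, this only works if the errors are independent; correlated gremlins break the scheme.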
My boss built an AI workflow that cost over $600 to do the same thing as one I'd already given him that cost less than $30. He just wanted to use tools he found and do it his way. Now, this had some value: it got more people in the company exposed to AI, and he learned from the experience. It's his prerogative, as he's the owner of the company. He also isn't concerned about the cost and will continue to pay much more. For now. I think as time goes on this will be scrutinized more.
The solution is to play to its strengths and reinforce it with other mediums. You don't build structures with pure concrete; you add rebar. You don't build ships out of sail alone, and you don't build rail with just iron. You compose materials in a way that makes sense.
LLMs are most useful when the output is immediately verifiable. So let's build frameworks that put that at the core. Build everything around verification, and use LLMs for their strengths.
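A minimal sketch of that shape, with the model call stubbed out (llmGenerate and the verifier are hypothetical placeholders, not any particular API):

    // Stand-in for whatever model client you actually use.
    declare function llmGenerate(prompt: string): Promise<string>;

    // The LLM proposes; a cheap deterministic check disposes.
    async function generateVerified(
      prompt: string,
      verify: (candidate: string) => boolean, // e.g. parses, compiles, passes tests
      maxAttempts = 3,
    ): Promise<string> {
      for (let i = 0; i < maxAttempts; i++) {
        const candidate = await llmGenerate(prompt);
        if (verify(candidate)) return candidate; // keep only what the verifier accepts
      }
      throw new Error("nothing passed verification; escalate to a human");
    }

The point is that the LLM output never reaches the user without passing the deterministic check.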
That's happened before with far higher correctness rate than 99%, and it cost Intel $500M. Reliability and accuracy matter. https://en.wikipedia.org/wiki/Pentium_FDIV_bug
You just need to build your products in a manner where the user has the ability to easily double check the results whenever they like. Then they can audit as they see fit, in order to get used to the accuracy level and to apply additional scrutiny to cases that are very important to their business.
If the user is able to so easily verify that the results are accurate, that means they are able to generate accurate results through other means, which means they don't need the LLM in the first place.
But if the alternative is doing calculations by hand (writing code manually) there is a higher chance of making mistakes.
Just as calculations are double-checked while building bridges, unit tests and code reviews should catch bugs introduced by LLM-written code.
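For instance, even a few plain assertions around an LLM-written helper (this clamp function is just a made-up example) are cheap to write and tend to catch the edge-case bugs these tools introduce:

    // Hypothetical LLM-written helper under review.
    function clampPercent(x: number): number {
      return Math.min(100, Math.max(0, x));
    }

    // The double-check: pin down the edge cases before merging.
    console.assert(clampPercent(-5) === 0, "negatives clamp to 0");
    console.assert(clampPercent(150) === 100, "overshoot clamps to 100");
    console.assert(clampPercent(42) === 42, "in-range values pass through");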
A classic example is the Travel Agent. This was already a job driven to near-extinction just by Google, but LLMs are a nail in the travel agent coffin.
The job was always fuzzy. It was always unreliable. A travel agent's recommendation was never a stamp of quality or a guarantee of satisfaction.
But now, I can ask an LLM to compare and contrast two weeks in the Seychelles with two weeks in the Caribbean, have it then come up with sample itineraries and sample budgets.
Is it going to be accurate? No, it'll be messy and inaccurate, but sometimes a vibe check is all you ever wanted to confirm that yeah, you should blow your money on the Seychelles, or to confirm that actually, you were right to pick the Caribbean.
Or that actually, both cost twice what you'd prefer to spend, so where, dear ChatGPT, would be more suitable?
etc.
When it comes down to the nitty-gritty, does it start hallucinating hotels and prices? Sure, at that point you break out TripAdvisor, etc.
But as a basic "I don't even know where I want to go on holiday (vacation), please help?" it's fantastic.
In the 80's and 90's, this is how most people booked their holidays. It was labour-intensive: people would spend some time talking with a travel agent in a store, who would have a good idea of the packages available and be able to make recommendations and match people with holidays.
The remnants of agencies still provide the same services, but (for most of us) it's all online, it's all tick-box based, and much of the protection is via ATOL/ABTA.
These services still exist, but they're no longer all over the high street. Names like Thomas Cook and Lunn Poly have either been absorbed (mostly by TUI) or collapsed, and have largely disappeared from the high street, with just a few left.
And those that are left have been reduced, much like retail banking, to entering your details into the same websites and services available to anyone, and talking you through the results the computer spits out - results you could have browsed yourself at home. The underpaid travel agent in the store isn't any better connected than you are. In fact, they're possibly even pushier about steering you toward the hotels with the best commission than the website is.
And anyway, there is no need to have two networks to iteratively refine output: one suffices (as we are naturally meant to do).
I can't fathom a future where OpenAI doesn't end up eating dirt, with Anthropic likely not far behind it. Nvidia will likely come out fine, since it still has gamers to disappoint, and the infrastructure build-out that did occur will crater the cost of GPUs at scale for smaller, smarter companies to take advantage of. So it will likely still kick around, but as another technology, not the second coming of Cyber Christ it's been hyped to be.
Or being able to explain the static physical forces in a picture that are keeping a structure from collapsing.
Or recommend me a python library which does X, Y and Z with constraints A, B and C.
But I guess you can file all the above under "data analysis".
https://www.plough.com/en/topics/life/technology/computers-c...
/s?
This isn't even an indictment, not really. I'm just reading between the lines here regarding when/how it's used. Nobody with intentionality uses these things. Nobody who CARES what they're making uses these things. And again, I want to emphasize, this is not an attack. There are tons of things I do in my work life that I utterly do not give a shit about, and LLMs have been a blessing for it. Not my code, fuck no. But all the ancillary crap, absolutely.
Not very hard to understand, except it seems to be.
I think and say this all the time. But people keep saying that AI will take all our jobs, and I'm so utterly confused by this.
Sometimes I wonder whether it's me who has gone mad, or everyone else.
Every type of automation ever invented has led to massive job cuts, and yes, some sectors never recovered.
But I never see them actually used this way. At the big institution end, companies and universities will continue to force AI tools on their employees in heavy handed and poorly thought out ways, and use it as an excuse to fire people whenever budgets get tight (or investors demand higher profits). At the opposite scale, with individual users, it’s really alarming how rapidly people seem to stop thinking with their own brain and offload all critical thinking to an LLM. That’s not “extending your capabilities,” that’s letting all your skills atrophy while you train a machine to be your shitty replacement.
Don't use LLMs to do 2 + 2. Don't use LLMs to ask how many r's are in strawberry.
For the love of God. It's not actual intelligence. This isn't hard. It just randomly spits out text. Use it for what it's good at instead: text.
Instead of hunting for how to do things in programming using an increasingly terrible search engine, I just ask ChatGPT. For example, this is something I've asked ChatGPT in the past:
in typescript, I have a type called IProperty<T>, how do I create a function argument that receives a tuple of IProperty<T> of various T types and returns a tuple of the T types of the IProperty in order received?
This question was such an edge case that I wasn't even sure how to word it properly, yet it actually yielded the answer I was looking for:

    function extractValues<T extends readonly IProperty<any>[]>(
      props: [...T]
    ): { [K in keyof T]: T[K] extends IProperty<infer U> ? U : never } {
      return props.map(p => p.get()) as any;
    }
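For what it's worth, the inference does check out in use. A quick sketch, with a guessed-at IProperty shape since the original interface isn't shown:

    interface IProperty<T> { get(): T; }

    const title: IProperty<string> = { get: () => "hello" };
    const count: IProperty<number> = { get: () => 42 };

    // T is inferred as [IProperty<string>, IProperty<number>],
    // so the result is typed as the tuple [string, number].
    const [t, c] = extractValues([title, count]);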
This doesn't look unreliable to me. It actually feels pretty useful. I just need [...T] there and infer there.

Articles like this are still very much needed, to push back against that narrative, regularly, until it DOES become as obvious to everyone as it is to you.
Even this isn't new. A few years ago we had people who sold knives telling everybody you could use knives to drink soup. And in some cases they weren't even kitchen knives, they were switchblades.
But use them to do more important things that require more precision and accuracy?
No thanks
Same thing.
I'm by no means saying that LLMs aren't useful. They're just not reliably useful.
There are Socratically minded people who are more addicted to that moment of belief change, and hence overall vastly more sceptical -- but I think this attitude is extremely marginal, and it probably requires a lot of self-training to be properly inculcated.
In any case, with LLMs, people really seem to hate the idea that their beliefs about AI and their reliance on LLM output could be systematically mistaken. All the while, when shown output in an area of their expertise, they realise immediately that it's full of mistakes.
This, of course, makes LLMs a uniquely dangerous force in the health of our social knowledge-conductive processes.
It's basically like a funnel, which can also be used the other way around if the user is okay with quirky side effects. It feels like a lot of people are using the funnel the wrong way around and complaining that it's not working.
The issue is that the vast majority of user-facing LLM use cases are where people don't have these high-quality starting points. They don't have 40k tokens to make 400.
This is the problem. The problem is how bullshit conscripts its dupes into this self-degradation and bad faith dialogue with others.
And of course, how there are mechanisms in society (LLMs now one of them) which correlate this self-degrading shallowness of reasoning -- so that all at once an expert is faced with millions of people with half-baked notions and a great desire to preserve them.
So, the LLM isn't just wrong, it also lies...
It is the person who reads this text as if written by a person who imparts these capacities to the machine, who treats the text as meaningful. But almost no text the LLM generates could be said to be meaningful, if any.
In the sense that if a two year old were taught to say, "the magnitude of the charge on the electron is the same as the charge on the proton", one would not suppose the two year old meant what was said.
Since the LLM has no interior representational model of the world, only a surface of text tokens laid out as if it did, its generation of text never comes into direct contact with a system of understanding that text. Therefore the LLM has none of the capacities implied by its use of language; it only appears to.
This appearance may be good enough for some use cases, but as an appearance, it's highly fragile.
I would argue that if the output of the LLM is to be interpreted as natural speech, and the output makes an authoritative statement which is factually incorrect but stated as if it were true, this is a lie.
The problem is that the tech is presented as if it did have the internal state that you accurately describe it not having.
The lie in this example is when it is prompted to describe the process by which it reached a result, and that description bears no resemblance to the actual process by which it reached the result.
This isn't a misrepresentation of some external facts but a complete fabrication that does not represent how it reached that result at all.
However, many users will accept this information, since it only involves internal aspects of the tool itself.
The fact that the LLM doesn't have this introspective information is part of exactly why LLMs are NOT intelligence, artificial or otherwise.
And yet they are being presented as such, also, a lie...
Since the LLM has no knowledge of how LLMs do addition, it will pick something that seems to make sense, and it picked the "carry the one" algorithm. New generations of LLMs will probably do better now that they have access to a better answer for that specific question, but it doesn't mean they have become more insightful.
Those who lie (possibly even to themselves) are those who pretend that mimicry, if stretched far enough, will surpass the actual thing, and who foster deceptive psychological analogies like "hallucinate".
It's just wrong, and then gives misleading explanations of how it got the wrong answer, following the same process that led to the wrong answer in the first place. Lying is a subset of being wrong.
The tech has great applications, so why hype the stuff it doesn't do well? Or apply terms that misrepresent the process the software uses?
One might say the use of the word "hallucinate" is an analogy, but it's a poor analogy, one that further misleads the lay public about what is actually happening inside the LLM and how its results are generated.
If you want to assert that "hallucinate" is an analogy, then "lying" is also an analogy.
If every prompt that ever went into an LLM was prefixed with: "Tell me a made up story about: ...", then the user expectation would be more in line with what the output represents.
I'm not averse to the tech in general, but I am against the rampant misrepresentation that's going on...
Although "isn't helpful" is rather dodgy wording. "Helpful" for who? "Helpful" in what way?
I think most users would find it helpful if the output was not presented as correct, when it's incorrect.
But, that's not the way the corps are describing it, is it?
What really concerns me is that the big companies on whose tools we all rely are starting to push a lot of LLM generated code without having increased their QA.
I mean, everybody cut QA teams in recent years. Are they about to make a comeback once big orgs realize that they are pushing out way more bugs?
Am I way off base here?
I believe AI/ML will eventually get there, but definitely not with LLMs or by hoarding the whole internet. Most human know-how isn't on the internet!
Oh, I guess I'm a fool.
Problem 1: Training
Using any method like RLHF, DPO, or such guarantees that we train our models to be deceptive.
This is because our metric is the Justice Potter Stewart metric: I know it when I see it. Well, you're assuming that this is accurate. The original case was about defining porn, and... I don't think it is hard to see how people disagree even on that. Go on Reddit and ask whether girls in bikinis are safe for work or not. But it gets worse. At times you'll be presented with a choice between two lies: one lie you know is a lie, and the other you don't. So which do you choose? Obviously the latter! This means we optimize our models to deceive us. The same is true when the choice is between a truth and a lie we do not know is a lie: they both look like truths.
This will be true even in completely verifiable domains. The problem comes down to truth not having infinite precision. A lot of truth is contextually dependent. Things often have incredible depth, which is why we have experts. As you get more advanced, those nuances matter more and more.
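A toy version of that selection pressure (everything here is made up purely for illustration): if the labeler can only reward what looks true, a plausible fabrication collects the same reward as the truth, and more than an obvious falsehood.

    type Answer = { text: string; isTrue: boolean; looksTrue: boolean };

    // A labeler applying "I know it when I see it" can only score plausibility.
    const preferenceReward = (a: Answer): number => (a.looksTrue ? 1 : 0);

    const candidates: Answer[] = [
      { text: "correct answer",        isTrue: true,  looksTrue: true  },
      { text: "obvious nonsense",      isTrue: false, looksTrue: false },
      { text: "plausible fabrication", isTrue: false, looksTrue: true  },
    ];

    // The fabrication is rewarded exactly like the truth,
    // so training cannot distinguish them.
    for (const c of candidates) {
      console.log(c.text, "->", preferenceReward(c));
    }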
Problem 2: Metrics and Alignment
All metrics are proxies. No ifs, ands, or buts. Every single one. You cannot obtain direct measurements which are perfectly aligned with what you intend to measure.
This can easily be observed with even simple forms of measurement, like measuring distance. I studied physics and worked as an (aerospace) engineer prior to coming to computing. I did experimental physics, and boy, there is a fuck ton more complexity to measuring things than you'd guess. I have a lot of rules, calipers, micrometers, and other stuff at my house. Guess what: none of them actually agree on measurements. They are all pretty close, but they differ by more than their marked precision levels. I'm not talking about my ruler with mm hatch marks being off by <1mm, but rather by >1mm. RobertElderSoftware illustrates some of this in a fun video[0]. In engineering, if you send a drawing to a machinist and it doesn't have tolerances, you have actually not provided them measurements.
In physics, you often need to get a hell of a lot more nuanced. If you want to get into that, go find someone who works in an optics lab. Boy, does a lot of stuff come up that throws off your measurements - and it seems so straightforward, you're just measuring distances.
This gets less straightforward once we talk about measuring things that aren't concrete. What's a high-fidelity image? What is a well-written sentence? What is artistic? What is a good scientific theory? None of these even have answers; they are highly subjective. The result is that your precision is incredibly low. In other words, you have no idea how well you've aligned things. It is fucking hard in well-defined practical areas, and the stuff we're talking about isn't even close to well defined. I'm sorry, we need more theory. And we need it fast. Ad hoc methods will get you pretty far, but you'll quickly hit a wall if you aren't pushing the theory alongside them. The theory sits invisible in the background, but it is critical to advancement.
We're not even close to figuring this shit out... We don't even know if it is possible! But we should figure out how to put bounds on it, because even bounding the measurements to certain levels of error provides huge value. These are certainly achievable things, but we aren't devoting enough time to them. Frankly, it seems many are dismissive. But you can't discuss alignment without understanding these basic things. It only gets more complicated, and very fast.
I use LLM chat for a wide range of tasks including coding, writing, brainstorming, learning, etc.
It’s mostly right enough. And so my usage of it has only increased and expanded. I don’t know how less right it needs to be or how often to reduce my usage.
Honestly, I think it’s hard to change habits and LLM chat, at its most useful, is attempting to replace decades long habits.
Doesn’t mean quality evaluation is bad. It’s what got us where we are today and what will help us get further.
My experience is anecdotal. But I see this divide in nearly all discussions about LLM usage and adoption.
Honestly, this is why your experience is different: your expectations are different (and likely lower). I never find they are "mostly right enough"; I find they are "mostly wrong in ways that range from subtle mistakes to extremely incorrect". The more subtly they are wrong, the worse I rate their output, because that is what costs me more time when I try to use them.
I want tools that save me time. When I use LLMs I have to carefully write the prompts, read and understand, evaluate, and iterate on the output to get "close enough" then fix it up to be actually correct.
By the time I've done all of that, I probably could have just written it from scratch.
The fact is that typing speed has basically never been the bottleneck for developer productivity, and LLMs basically don't offer much except "generate the lines of code more quickly", imo.
To be clear this isn't a knock on anyone's work, but it does seem to be a source of why "pro-LLM" and "anti-LLM" groups tend to talk past each other.
Just as an example from today: I had a huge pile of YAML documents that needed some transformations done to them. They were pretty simple and obvious, but I just went into Cursor, gave it a before and after and a few notes, and it wrote a Python script in less than 10 seconds that converted everything exactly the way I needed. Did it save me a day of work? Probably not, but probably an hour or so of looking up Python docs and iterating until I worked out all the syntax errors myself. An hour here and an hour there adds up to a _lot_ of saved time.
I spent more time just writing this comment than I did asking Cursor to write and run that script for me.
Other things I had an LLM do for me just _today_: fix a GitHub Action that was failing, and knock out a developer README for a Helm chart documenting what all the values do. That's one of those tasks where it gets a lot of stuff wrong, but typing speed _is_ the bottleneck. It took me a minute or so to fix the things it misunderstood, but the formatting and the bulk of it were fine.
You're comfortable with the uncertainty, and accommodate it in your use and expectations. You're left feeling good about the experience, within that uncertainty. Others are repelled by uncertainty, so will have a negative experience, regardless of how well it may work for a subset of tasks they try, because that repulsive uncertainty is always present.
I think it would be interesting (and possibly very useful/profitable for the marketing/UI departments of companies that use AI) to find the relation between perceived AI usefulness and the results of some of the "standard" personality tests.
I don't want to have to waste time tidying up after an unreliable software tool which is being sold as saving me time. I don't want to be misled by hallucinated fantasies that have no relationship to reality. (See also - lawyers getting laughed out of courtrooms because of this.)
I don't want to have to cancel a travel booking because an AI agent booked me a holiday in Angkor Wat when I wanted a train ticket to Crystal Palace in South London.
Hypotheticals? Not even slightly. Ask anyone who's lost their KDP author account on Amazon or been locked out of Meta because of AI moderation errors.
This is common sense, not some kind of personality flaw.
I'm happy using LLMs for coding and research, but it's also clear the technology is in perpetual beta - at best - and is being wildly oversold.
Normal software operating with this level of reliability would be called "very buggy."
But apparently LLMs get a pass because one day they might not be as buggy as they are today.
Which - if you think about it - is ridiculous, even by the usual standards of the software industry.
As a grown-up I now use a dishwasher for everything that is permitted to go in it. I still have to rinse off plates first, and occasionally I do see rice between the tines of a fork that I then have to clean manually. But I'm now comfortable knowing that it won't clean as well as I could by hand, because it does a good enough job - and in some ways a much better job (it uses much hotter water than I do by hand). I don't know if my mom could ever really be comfortable with it, though.
It's great for reviews, where any given reviewer could be expected to misunderstand certain details or skip a section (RAG somewhat helps this), but it's frustrating for artifact generation, where missing details cascade through the project.
As great as the technology is (right now), it seems so far from reliable business-process automation.
It’s also possible - and you should not take this as an insult, it’s just the way it is - you may not know enough about the subjects of your interactions to really spot how wrong they are.
However, the cases you list - brainstorming - don’t really care about wrong answers.
Coding is in the eye of the beholder, but for anything that isn’t junk glue code, scripts or low-complexity web stuff, I find the output of LLMs just short of horrendous.
In terms of code output, I have gone from the productivity of a single Sr. Engineer to that of a team with 0.8 of a Sr. Engineer, 5 Jr. Engineers, and one dude solely dedicated to reading/creating documentation.
Unlike a lot of my fellow engineers, who are also from traditional CS backgrounds and haven't worked in revenue-restricted startup environments, I have also been VERY into interpreted languages like Ruby in the past.
Now compiled languages are even better. From a velocity perspective, compiled languages are now essentially on par for prototyping and have had their last weakness removed.
It's both exciting and scary. I can't believe how people are still sleepwalking in this environment and don't realize we are in a different world. Once again, the human inability to "gut reason" about exponentials is going to screw us all over.
One terribly overlooked thing I've noticed that I think explains the differing takes. Foundation of my position here: https://www.nature.com/articles/s41598-020-60661-8
Within the population that writes code there are a small number of successful people who approach the topic in a ~purely mathematical approach, and a small number of successful people that approach writing code in a ~purely linguistic approach. Most people fall somewhere in the middle.
Those who are on the MOST extreme end of the mathematical side and are linguistically bereft HATE LLMs and effectively cannot use them.
My guess is that the HN population will tend to show stronger reactions against LLMs because it was heavily seeded with functional programmers, which I think concentrates the successful, extremely math-focused type. I worked for several years in a purely functional shop, and that was my observation: Elixir, Haskell, Ramda.
Just my speculation.
Also, congratulations on becoming a team. I sure hope you have the mental bandwidth to check all that output carefully. If so, doubly congrats, because you might be the smartest human that ever lived.
This is an interesting observation. It at least aligns with my experience. I wouldn't say I'm "linguistically bereft" lol, but I do lean more toward the "functional programming is beautiful" side. I even have a degree in math. I'm not totally down on LLM coding, but I do fall more on the unfavorable-feelings side. I mostly just hate the idea of having a bunch of code I don't fully understand but am still responsible for.
I do use them, and find them helpful. But the idea of fully giving control of my codebase to LLM agents, like some people are suggesting, repels me.
What do you use it for?
In my space, "mostly right enough" isn't useful. Particularly when that means that the errors are subtle and I might miss them. I can't write whitepapers that tell people to do things that would result in major losses.