https://aiindex.stanford.edu/report/
"As internet pioneer and Google researcher Vint Cerf said Monday, AI is "like a salad shooter," scattering facts all over the kitchen but not truly knowing what it's producing. "We are a long way away from the self-awareness we want," he said in a talk at the TechSurge Summit."
https://www.cnet.com/tech/computing/bing-ai-bungles-search-r...
Since I'm not a student anymore, I can just give ChatGPT a few bullet points and ask it to write a paragraph for me. As an engineer who doesn't like writing "fluff", it's great I can now outsource the BS part of writing.
I'm interested in what parts of your job require the fluff. Is it communication with non-engineering teams?
It's also great for writing a professional sounding complaint letter to your utility company.
The future is people typing bullet points, expanding into polished prose for transmission, and compressing down to bullet points on the other end.
Today, ChatGPT helped me write a driver.
The driver either compiles, or it doesn't; it compiled. The driver either reads a value from a register, or it doesn't; it read. The driver either causes the chip to physically move electrons in the real world in the way that I want it to, or it doesn't.
The real world does not distinguish between bullshit or not. Things either work or they do not. They either are one way, or they are another way. ChatGPT produces things that work in reality. We humans live in reality. Reality is what matters.
I notice a thread through all of the breathless panicking about LLMs: it does not correspond to REALITY. It's a panic about a fiction. The fiction that the content of text is reality itself. The fiction that the LLM can somehow recursively improve itself. The fiction that the map is the territory.
The one example that still interests me is math problem solving. Can next-token predictors really solve generalized math problems as well as children? https://arxiv.org/abs/2110.14168
It's not only in America, and not only in government or large corporations. It's everywhere.
Do you need 7B/13B/33B/77B parameters to do this? That is a question up for debate and something I'm exploring with the concept of micro/nano models (https://neuml.hashnode.dev/train-a-language-model-from-scrat...). There is the sense that today's LLMs could be overkill for a problem such as RAG.
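The retrieval half of RAG needs no large model at all, which is part of why today's LLMs can feel like overkill there. A toy sketch of keyword-overlap retrieval in pure Python (the corpus and scoring are illustrative, not from the linked post):

```python
import re

def tokens(text):
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    """Rank documents by word overlap with the query (toy retriever)."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

corpus = [
    "The driver reads a value from a hardware register.",
    "ChatGPT can draft a complaint letter to a utility company.",
    "GSM8K is a dataset of grade school math problems.",
]
print(retrieve("grade school math problems", corpus))
```

A real system would use embeddings rather than word overlap, but the point stands: the generation step is the only place a large model is strictly needed, and a micro model might suffice even there.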
I've been using GPT-4 to write code almost daily for months now, and I'd estimate that it is maybe 80-90% accurate in general, with the caveat that the quality of the prompt can have a major impact on this. If the prompt is vague, you're unlikely to get good results on the first try. If the prompt is very thorough and precise, and relevant context is included, it can often nail even fairly complex tasks in one shot.
Regardless of what the accuracy number is, it strikes me as pretty silly to call them "BS Machines". It's like calling human programmers "bug machines". Yeah, we do produce a lot of bugs, but we somehow seem to get quite a bit of working software out the door.
GPT-4 isn't perfect and people should certainly be aware that it makes mistakes and makes things up, but it also produces quite a lot of extremely useful output across many domains. I know it's made me more productive. Honestly, I can't think of any programming language, framework, technique, or product that has increased my productivity so quickly or dramatically in the 17 years I've been programming. Nothing else even comes close. Pretty good for a BS machine.
Sure, the first-order output of today's generalist LLMs, emitting one token at a time, does seem to meet diminishing returns on factuality at approximately the level of a college freshman pulling an all-nighter. Not a great standard, that. But if you took an entire class of those tired freshmen, gave their outputs to an independent group of tired freshmen unfamiliar with the material, and told the second group to identify, in a structured manner, commonalities and discrepancies, topics they'd look up in an encyclopedia, things they'd escalate to a human expert, and so on... all of a sudden, you can start to build structured knowledge about the topic, and an understanding of what is and isn't likely to be a hallucination.
One might argue that the right kind of model architecture and RLHF could bake this into the LLM itself - but you don't need to wait for that research to be brought into production to create a self-correcting system-of-systems today.
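That cross-checking loop can be sketched today without any new model research: sample several independent answers, treat agreement as a confidence signal, and escalate low-agreement questions. A minimal sketch (the candidate answers below are hard-coded stand-ins for independent LLM samples):

```python
from collections import Counter

def consensus(candidates, threshold=0.5):
    """Group candidate answers and flag low-agreement ones for escalation.

    Returns (best_answer, agreement_fraction, needs_human_review).
    """
    normalized = [c.strip().lower() for c in candidates]
    best, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return best, agreement, agreement < threshold

# Stand-ins for five independent samples of the same question.
samples = ["Paris", "paris", "Paris", "Lyon", "Paris"]
answer, agreement, escalate = consensus(samples)
print(answer, agreement, escalate)  # paris 0.8 False
```

Exact-string matching is the crudest possible "commonality" check; a fuller system-of-systems would have a second model judge semantic agreement, but the control flow is the same.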
That appears more or less correct.
So to give chatgpt an opportunity to hallucinate similarly to the article, I followed up with, “Did he write for the nyt?” and it replied, “I do not have any information indicating that Ryan McGreal has written for The New York Times (NYT). His work primarily focuses on urban issues and transportation, as mentioned earlier, and he is associated with Raise the Hammer, a local publication in Hamilton, Ontario, Canada. It’s possible that he may have contributed to other publications, but I do not have specific information regarding his contributions to The New York Times.”
While I have seen ChatGPT make stuff up I do think it’s useful to compare specific results across LLMs before using particular examples to make holistic statements.
Ask in this order:
1) What is the NYT (New York Times)?
2) Who is Ryan McGreal?
3) Did he write for the NYT?
This builds up more context for hallucination.
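The three-step prompt sequence above amounts to accumulating a multi-turn conversation history; with a chat-style API that is just a growing message list. A sketch, with a hypothetical `ask` function standing in for a real API call:

```python
def ask(history, question):
    """Append a user turn, get a (stubbed) assistant reply, append it too.

    `reply` is a placeholder; a real client would call a chat API here.
    """
    history.append({"role": "user", "content": question})
    reply = f"[model reply to: {question}]"  # hypothetical stand-in
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
ask(history, "What is the NYT (New York Times)?")
ask(history, "Who is Ryan McGreal?")
ask(history, "Did he write for the NYT?")
# Each question is answered with the full prior exchange in context,
# which is exactly what gives the model more material to hallucinate from.
print(len(history))  # 6 messages: three user/assistant pairs
```

The third question only makes sense given the second, so the ambiguous "he" rides along in the history and the model has to resolve it from its own earlier answer.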
So I'm curious why my personal experience doesn't match all the complaints about hallucinations.
These are premised on regurgitating inputs: they can imitate more than one observer's interpretation of truth at a time. The more, the better.
Humans have been incentivized to essentially be BS machines.
From low-quality blog posts to the highest-grossing marketing and everything in between (including many published books and scientific papers): BS makes enough money that its low effort yields a decent ROI.
Of course an AI trained on a large human corpus is going to produce BS. It’s just doing what it learned.
Unless the work is purely mechanical, it requires some form of BS, and that's why we've traditionally been so much better at it than machines. We've never been able to create "BS machines" before, so this completely shifts the paradigm.