People are just as bad as my LLMs (opens in new tab)

(wilsoniumite.com)

203 pointsWilsoniumite1y ago164 comments

164 comments

> ...a lot of the safeguards and policy we have to manage humans own unreliability may serve us well in managing the unreliability of AI systems too.

It seems like an incredibly bad outcome if we accept "AI" that's fundamentally flawed in a way similar to if not worse than humans and try to work around it rather than relegating it to unimportant tasks while we work towards a standard of intelligence we'd otherwise expect from a computer.

LLMs certainly appear to be the closest to real AI that we've gotten so far. But I think a lot of that is due to the human bias that language is a sign of intelligence and our measuring stick is unsuited to evaluate software specifically designed to mimic the human ability to string words together. We now have the unreliability of human language processes without most of the benefits that comes from actual human level intelligence. Managing that unreliability with systems designed for humans bakes in all the downsides without further pursuing the potential upsides from legitimate computer intelligence.

sigpwned1y ago

I don’t disagree. But I also wonder if there even is an objective “right” answer in a lot of cases. If the goal is for computers to replace humans in a task, then the computer can only get the right answer for that task if humans agree what the right answer is. Outside of STEM, where AI is already having a meaningful impact (at least in my opinion), I’m not sure humans actually agree that there is a right answer in many cases, let alone what the right answer is. From that perspective, correctness is in the eye of the beholder (or the metric), and “correct” AI is somewhere between poorly defined and a contradiction.

Also, I think it’s apparent that the world won’t wait for correct AI, whatever that even is, whether or not it even can exist, before it adopts AI. It sure looks like some employers are hurtling towards replacing (or, at least, reducing) human headcount with AI that performs below average at best, and expecting whoever’s left standing to clean up the mess. This will free up a lot of talent, both the people who are cut and the people who aren’t willing to clean up the resulting mess, for other shops that take a more human-based approach to staffing.

I’m looking forward to seeing which side wins. I don’t expect it to be cut-and-dry. But I do expect it to be interesting.

tharkun__1y ago

Does "knowing what today is" count as "Outside STEM"? Coz my interactions with LLMs are certainly way worse than most people.

Just tried it:

   tell me the current date please

   Today's date is October 3, 2023.

Sorry ChatGPT, that's just wrong and your confidence in the answer is not helpful at all. It's also funny how different versions of GPT I've been interacting with always seem to return some date in October 2023, but they don't all agree on the exact day. If someone knows why, please do tell!

Most real actual human people would either know the date, check their phone or their watch or be like "Oh, that's a good question lol!". But somehow GPTs always be the 1% of people that will lie to know the answer to whatever question you ask them. You know, the kind that evening talk shows will ask ask. Questions like "how do do chickens lay eggs" and you get all sorts of totally completely b0nkers but entirely "confidently told" answers. And of course they only show the ones that give the b0nkers con-man answers. Or the obviously funnily stupid people.

Of course absent access to a "get the current date" function it makes sense why an LLM would behave like it does. But it also means: not AGI, sorry.

6 more replies

eternityforest1y ago

Perhaps that kind of thing could help us finally move on from the "stupid should hurt" mindset to a real safety culture, where we value fault tolerance.

We like to pretend humans can reliably execute basic tasks like telling left from right or counting to ten, or reading a four digit number, and we assume that anyone who fails at these tasks is "not even trying"

But people do make these kinds of mistakes all the time, and some of them lead to patients having the wrong leg amputated.

A lot of people seem to see fault tolerance as cheating or relying on crutches, it's almost like they actively want mistakes to result in major problems.

If we make it so that AI failing to count the Rs doesn't kill anyone, that same attitude might help us build our equipment so that connecting the red wire to R2 instead of R3 results in a self test warning instead of a funeral announcement.

Obviously I'm all for improving the underlying AI tech itself ("Maintain Competence" is a rule in crew resource management), but I'm not a super big fan of unnecessary single points of failure.

Rhapso1y ago

Lower quality is fine economically as long as it has a good enough reduction in cost to match

michaelteter1y ago

No thank you.

You've just explained "race to the bottom". We've had enough of this race, and it has left us with so many poor services and products.

1 more reply

dartos1y ago

Amen.

People’s unawareness of their own personification bias with LLMs is wild.

pbreit1y ago

I would say people are much, much worse.

Compare that to the weight we place on "experts" many of whom are hopelessly compromised or dragged by mountains of baggage.

itchyjunk1y ago

What is your measure of intelligence?

throw48472851y ago

If I was smarter, I could probably come up with a Kantian definition. Something about our capacity to model subjective representations as a coherent experience of the world within a unified space-time. Unfortunately, it's been a long time since I tried to read A Critique of Pure Reason, and I never understood it very well anyway. Even though my professor was one of the top Kant scholars, he admitted that reading Kant is a huge slog.

So I'll leave it to Skeeter to explain.

https://www.youtube.com/watch?v=W9zCI4SI6v8

cudgy1y ago

The ability to create novel solutions without a priori knowledge.

2 more replies

rainsford1y ago

I honestly don't have a great one, which is less worrying than it might otherwise be since I'm not sure anyone else does either. But in a human context, I think intelligence requires some degree of creativity, self-motivation, and improvement through feedback. Put a bunch of humans on an island with various objects and the means for survival and they're going to do...something. Over enough time they're likely to do a lot of unpredictable somethings and turn coconuts into rocket ships or whatever. Put a bunch of LLMs on an equivalent island with equivalent ability to work with their environment and they're going to do precisely nothing at all.

On the computer side of things, I think at a minimum I'd want intelligence capable of taking advantage of the fact that it's a deterministic machine capable of unerringly performing various operations with perfect accuracy absent a stray cosmic ray or programming bug. Star Trek's Data struggled with human emotions and things like that, but at least he typically got the warp core calculations correct. Accepting LLMs with the accuracy of a particularly lazy intern feels like it misses the point of computers entirely.

lo_zamoyski1y ago

I think using the word “intelligence” when speaking of computers, beyond a kind of figure of speech, is anthropomorphizing, and it is a common pseudoscientific habit that must go.

What is most characteristic about human intelligence is the ability to abstract from particular, concrete instances of things we experience. This allows us to form general concepts which are the foundation of reason. Analysis requires concepts (as concepts are what are analyzed), inference requires concepts (as we determine logical relations between them).

We could say that computers might simulate intelligent behavior in some way or other, but this is observer relative not an objective property of the machine, and it is a category mistake to call computers intelligent in any way that is coherent and not the result of projecting qualities onto things that do not possess them.

What makes all of this even more mystifying is that, first, the very founding papers of computer science speak of effective methods, which is by definition about methods that are completely mechanical and formal, and this stripped of the substantive conceptual content it can be applied to. Historically, this practically meant instructions given to human computers who merely completed them without any comprehension of what they were participating in. Second, computers are formal models, not physical machines. Physical machines simulate the computer formalism, but are not identical with the formalism. And as Kripke and Searle showed, there is no way in which you can say that a computer is objectively calculating anything! When we use a computer to add two numbers, you cannot say that the computer is objectively adding two numbers. It isn’t. The addition is merely an interpretation of a totally mechanistic and formal process that has been designed to be interpretable in such ways. It is analogous to reading a book. A book does not objectively contains words. It contains shaped blots of pigment on sheets of cellulose that have been assigned a conventional meaning in a culture and language. In other words, you being the words, the concepts, to the book. You bring the grammar. The book itself doesn’t have them.

So we must stop confusing figurative language with literal language. AI, LLMs, whatever can be very useful, but it isn’t even wrong to call them intelligent in any literal sense.

3 more replies

tehsauce1y ago

There has been some good research published on this topic of how RLHF, ie aligning to human preferences easily introduces mode collapse and bias into models. For example, with a prompt like: "Choose a random number", the base pretrained model can give relatively random answers, but after fine tuning to produce responses humans like, they become very biased towards responding with numbers like "7" or "42".

robwwilliams1y ago

I assume 42 is a joke from deep history and The Hitchhiker’s Guide. Pretty amusing to read the Wikipedia entry:

https://en.wikipedia.org/wiki/42_(number)

sedatk1y ago

Douglas Adams picked 42 randomly though. :)

1 more reply

moffkalast1y ago

It's very funny that people hold the autoregressive nature of LLMs against them, while being far more hardline autoregressive themselves. It's just not consciously obvious.

antihipocrat1y ago

I wonder whether we hold LLMs to a different standard because we have a long term reinforced expectation for a computer to produce an exact result?

One of my first teachers said to me that a computer won't ever output anything wrong, it will produce a result according to the instructions it was given.

LLMs do follow this principle as well, it's just that when we are assessing the quality of output we are incorrectly comparing it to the deterministic alternative, and this isn't really a valid comparison.

absolutelastone1y ago

I think people tend to just not understand what autoregressive methods are capable of doing generally (i.e., basically anything an alternative method can do), and worse they sort of mentally view it as equivalent to a context length of 1.

aidos1y ago

Why is that? Whenever I’m giving examples I almost always use 7, something ending in a 7 or something in the 70s

p1necone1y ago

1 and 10 are on the boundary, that's not random so those are out.

5 is exactly halfway, that's not random enough either, that's out.

2, 4, 6, 8 are even and even numbers are round and friendly and comfortable, those are out too.

9 feels too close to the boundary, it's out.

That leaves 3 and 7, and 7 is more than 3 so it's got more room for randomness in it right?

Therefore 7 is the most random number between 1 and 10.

2 more replies

da_chicken1y ago

The theory I've heard is that the more prime a number is, the more random it feels. 13 feels more awkward and weird, and it doesn't come up naturally as often as 2 or 3 do in everyday life. It's rare, so it must be more random! I'll give you the most random number I can think of!

People tend to avoid extremes, too. If you ask for a number between 1 and 10, people tend to pick something in the middle. Somehow, the ordinal values of the range seem less likely.

Additionally, people tend to avoid numbers that are in other ranges. Ask for a number from 1 to 100, and it just feels wrong to pick a number between 1 and 10. They asked for a number between 1 and 100. Not this much smaller range. You don't want to give them a number they can't use. There must be a reason they said 100. I wonder if the human RNG would improve if we started asking for numbers between 21 and 114.

2 more replies

Ethee1y ago

Veritasium actually made a video on this concept about a year ago: https://www.youtube.com/watch?v=d6iQrh2TK98

d4mi3n1y ago

My guess is that we bias towards numbers with cultural or personal significance. 7 is lucky in western cultures and is religiously significant (see https://en.wikipedia.org/wiki/7#Culture). 42 is culturally significant in science fiction, though that's a lot more recent. There are probably other examples, but I imagine the mean converges on numbers with multiple cultural touchpoints.

1 more reply

d0liver1y ago

I like prime numbers. Non-primes always feel like they're about to fall apart on me.

mynameismon1y ago

Can you share any links about this?

Shorel1y ago

They choose 37 =)

thechao1y ago

Which is weird, because I thought we'd all agreed that the random number was 4?

https://xkcd.com/221/

lxe1y ago

It's almost as if we trained LLMs on text produced by people.

MrMcCall1y ago

I love the posters that make fun of those corporate motivational posters.

My favorite is:

  No one is as dumb as all of us.

And they trained their PI* on that giant turd pile.

* Pseudo Intelligence

LoganDark1y ago

I don't count LLMs as intelligent. To a certain degree they can be a component of intelligence, but I don't count an LLM on its own.

2 more replies

smallnix1y ago

Is my understanding wrong that LLMs are trained to emulate observed human behavior in their training data?

From that follows that LLMs fit to produce all kinds of human biases. Like preferring the first choice out of many, and the last our of many (primacy biases). Funnily the LLM might replicate the biases slightly wrong and by doing so produce new derived biases.

Terr_1y ago

I'd say it's closer to emulating human documents.

In most cases, The LLM itself is a name-less and ego-less clockwork Document-Maker-Bigger. It is being run against a hidden theater-play script. The "AI assistant" (of whatever brand-name) is a fictional character seeded into the script, and the human unwittingly provides lines for a "User" character to "speak". Fresh lines for the other character are parsed and "acted out" by conventional computer code.

That character is "helpful and kind and patient" in much the same way way that another character named Dracula is a "devious bloodsucker". Even when form is really good, it isn't quite the same as substance.

The author/character difference may seem subtle, but I believe it's important: We are not training LLMs to be people we like, we are training them to emit text describing characters and lines that we like. It also helps in understanding prompt injection and "hallucinations", which are both much closer to mandatory features than bugs.

ziaowang1y ago

This understanding is incomplete in my opinion. LLMs are more than emulating observed behavior. In the pre-training phase tasks like masked language model indeed train the model to mimic what they read (which of course contains lots of bias); but in the RLHF phase, the model tries to generate the best response judged by human evaluations (who tries to eliminate as much bias as possible in the process). In other words, they are trained to meet human expectations in this later phase.

But human expectations are also not bias-free (e.g. from the preferring-the-first-choice phenomenon)

Xelynega1y ago

I don't understand what you are saying.

How can the RLHF phase eliminate bias if it uses a process(human input) that has the same biases as the pre-training(human input)?

1 more reply

nthingtohide1y ago

Not only that if future AI distrusts humanity it is because history, literature and fiction is full of such scenarios and AI will learn those patterns and associated emotions from those texts. Humanity together will be responsible for creating a monster (if that scenario happens).

rawandriddled1y ago

>Humanity together

Together? It would be, 1. AI programmers, 2. AI techbros and a distant 3. AI fiction/history/literature. Foo who never used the internet: not responsible. Bar who posted pictures on Facebook: not responsible. Baz who wrote machine learning, limited dataset algorithms (webmd): not responsible. Etc.

mplewis1y ago

LLMs don't emulate human behavior. They spit out chunks of words in an order that parrots some of their training data.

dkenyser1y ago

Correct me if I'm wrong, but I feel like we're splitting hairs here.

> spits out chunks of words in an order that parrots some of their training data.

So, if the data was created by humans then how is that different from "emulating human behavior?"

Genuinely curious as this is my rough interpretation as well.

3 more replies

educasean1y ago

Is this just pedantry or is there some insight to be gleaned by the distinction you made?

2 more replies

icelancer1y ago

Are you sure that humans are much more than this in terms of spoken/written language?

sumeno1y ago

This is a more pedantic and meme-y way of saying the same thing.

henlobenlo1y ago

This is the "anyone can be a mathematician meme". People who hang around elite circles have no idea how dumb the average human is. The average human hallucinates constantly.

bawolff1y ago

So if you give a bunch of people a boring task, pay them the same regardless of if they treat it seriously or not - the end result is they do a bad job!

Hardly a shocker. I think this say more about the experimental design then it does about AI & humans.

markbergz1y ago

For anyone interested in these LLM pairwise sorting problems, check out this paper: https://arxiv.org/abs/2306.17563

The authors discuss the person 1 / doc 1 bias and the need to always evaluate each pair of items twice.

If you want to play around with this method there is a nice python tool here: https://github.com/vagos/llm-sort

fpgaminer1y ago

The paper basically sums to suggesting (and analyzing) these otpions:

* Comparing all possible pair permutations eliminates any bias since all pairs are compared both ways, but is exceedingly computationally expensive. * Using a sorting algorithm such as Quicksort and Heapsort is more computationally efficient, and in practice doesn't seem to suffer much from bias. * Sliding window sorting has the lowest computation requirement, but is mildly biased.

The paper doesn't seem to do any exploration of the prompt and whether it has any impact on the input ordering bias. I think that would be nice to know. Maybe assigning the options random names instead of ordinals would reduce the bias. That said, I doubt there's some magic prompt that will reduce the bias to 0. So we're definitely stuck with the options above until the LLM itself gets debiased correctly.

jayd161y ago

If the question inherently allows for "no-preference" to be valid but that is not a possible answer then you've left it to the person or llm to deal with that. If a human is not allowed to specify no preference why would you expect uniform results when you don't even ask for it? You only asked to pick the best. Even if they picked perfectly, its not defined in the task to make sure you select draws in a random way.

velcrovan1y ago

interleaving a bunch of people's comments and then asking the LLM to sort them out and rank them…seems like a poor method. The whole premise seems silly, actually. I don't think there's any lesson to draw here other than that you need to understand the problem domain in order to get good results from an LLM.

isaacremuant1y ago

So many articles like this HN have a catchy title and then a short article that doesn't really conclude the title.

The experiment itself is so fundamentally flawed it's hard to begin criticizing it. HN comments as a predictor of good hiring material is just as valid as social media profile artifacts or sleep patterns.

Just because you produce something with statistics (with or without LLMs) and have nice visuals and narratives doesn't mean is valid or rigorous or "better than nothing" for decision making.

Articles like this keep making it to the top of HN because HN is behaving like reddit where the article is read by few and the gist of the title debated by many.

le-mark1y ago

Human level artificial intelligence has never had much appeal to me, there are enough idiots in the world, why do we need artificial ones? Ie if average machine intelligence mirrored human IQ distribution?

roywiggins1y ago

Owners would love to be able to convert capital directly into products without any intermediate labor[0]. Fire your buildings full of programmers and replace them with a server farm that only gets faster and more efficient over time? That's a great position to be in, if you own the IP and/or server farm.

[0] https://qntm.org/mmacevedo

devit1y ago

The "person one" vs "person two" bias seems trivially solvable by running each pair evaluation twice with each possible labelling and the averaging the scores.

Although of course that behavior may be a signal that the model is sort of guessing randomly rather than actually producing a signal.

harrisonjackson1y ago

Agreed on the second part. Correcting for bias this way might average out the scores but not in a way that correctly evaluates the HN comments.

The LLM isn't performing the desired task.

It sounds possible to cancel out the comments where reversing the labels swaps the outcome because of bias. That will leave the more "extreme" HN comments that it consistently scored regardless of the label. But that may not solve for the intended task still.

rahimnathwani1y ago

  The LLM isn't performing the desired task.

It's 'not performing the task', in the same way that the humans ranking voice attractiveness are 'not performing the task'.

I wouldn't treat the output as complete garbage, just because it's somewhat biased by an irrelevant signal.

jopsen1y ago

But an LLM can't be held accountable.. neither can most employees, but we often forget that :)

simne1y ago

> But an LLM can't be held accountable.. neither can most employees

Yes and no.

Yes, this is really problem, because at current level of technologies, some thing are inexpensive only if done in large numbers (factor of scale), so for example, just could not exist one person who could be accountable for machine like Boeing-747 (~500 human-years of work per plane).

Unfortunately, modern automobile is considered large system, made from thousands parts, so again, not exist one person to know everything.

And no, Germans said "Ordnung muss sein", which in modern management mean, constant clear organization of the game of the whole team is more important than the success of individual players.

Or, in simple words, right organization, controlled by rules is considered enough reliable to be accountable.

And for example in automobile industry, now normal to consider accountable whole organization.

And for example, Daimler officials few years ago said, Daimler safety systems will use Daimler view on robotic laws - priority will be safety of people inside vehicle. You may know, traditionally used Lem robotic laws, which have totally different view, separated from inside vs outside approach. In civil aviation using approach, to just use simple designs or design with evidence of reliability.

Sure, government regulators could decide something even more original, will see.

Any way, as technology emerge, accountability of machines will be sure subject of many discussions.

jayd161y ago

You can certainly send people to jail for negligence.

leptons1y ago

Employees get fired all the time, and the more wrong answers I get from an LLM, the less I use it until it's never used again.

malfist1y ago

Jokes on you, my org at work just adopted KPIs about how many AI suggestions engineers accept

rainsford1y ago

But an LLM doesn't understand "never used again" as a consequence and the threat of it is useless as a motivation to improve (also because LLMs have no concept of "motivation" or "threats" or anything else).

1 more reply

satisfice1y ago

People are alive. They have rights and responsibilities. They can be held accountable. They are not "just as bad" as your LLMs.

icelancer1y ago

> They can be held accountable

Is this a universal phenomenon where you've worked? Consider yourself very lucky.

andrewmcwatters1y ago

I don’t understand the significance of performing tests like these.

To me it’s literally the same as testing one Markov chain against another.

megadata1y ago

At least LLMs are very often ready so acknowledge they might be wrong.

It can be incredibly hard to get a person to acknowledge that they might be remotely wrong on a topic they really care about.

Or, for some people, the thought that they might be wrong about anything attall is just like blasphemy to them.

Xelynega1y ago

Is this not just because aggressive material was filtered out of training data and the system prompts usually include some preamble about being polite?

"Acknowledging they might be wrong" makes them sound like more than token predictors trained on polite sounding text.

roywiggins1y ago

Most of the reason LLMs will "admit they're wrong" is because they've been trained not to argue too hard, and to not hold strong preferences. It's a sort of customer service personality.

When you don't do that sufficiently you run the risk of producing the "Sydney" personality that Bing Chat had, which would argue back, and could go totally feral defending its incorrect beliefs about the world, to the point of insulting and belittling the user.

oldherl1y ago

It's just because people tend to put the "original" result in the first place and the "improved" result in the second place in many scientific studies. LLM and humans are learning that and assume that the second one is the better one.

K0balt1y ago

I know this is only adjacent to OP’s point, but I do find it somewhat ironic that it is easy to find people who are just as unreliable and incompetent at answering questions correctly as a 7b model, but also a lot less knowledgeable.

Also, often less capable of carrying on a decent conversation.

I’ve noticed an periconcious urge when talking to people to judge them against various models and quants, or to decide they are truly SOTA.

I need to touch grass a bit more, I think.

sponnath1y ago

A decent conversation about what?

K0balt1y ago

I should have said minimal, not decent.

K0balt1y ago

Trivial discussion of anything

soared1y ago

Wouldn’t the same outcome be achieved much more simply by giving LLMs a two choices (colors, numbers, whatever), asking “pick one” and assessing the results in the same way?

ramity1y ago

You absolutely can. Deterministic inference is achievable, but it isn't as performant. The reason why sadly boils down to floating point math.

vivzkestrel1y ago

should have started naming them from person 4579 and see if it still exhibits the bias

djaouen1y ago

Yes, but a consensus of people beats LLMs every time. For now, at least.

bxguff1y ago

Kind of an odd metric to try to base this process off of. are more comments inherently better? is it responding to buzz words? Makes sense talking about hiring algos / resume scanners in part one and if anything this elucidates some of the trouble with them.

1 more reply

th0ma51y ago

No they are not randomly wrong or right without perspective unless they have some kind of brain injury. So that's against the title but the rest of their point is interesting!

raincole1y ago

What a clickbait title.

TL;DR: the author found a very, very specific bias that is prevalent in both humans and LLMs. That is it.

mdp20211y ago

Very nice article. But the title, and the idea, is the very frequent "racist" form of the proper "People [can be] just as bad as my LLMs".

Now: some people can't count. Some people hum between words. Some people set fire to national monuments. Reply: "Yes we knew", and "No, it's not necessary".

And: if people could lift the tons, we would not have invented cranes.

Very, very often in these pages I meet people repeating "how bad people are". That is "how bad people can be", and "and we would have guessed these pages are especially visited by engineers, who must be already aware of the importance of technical boosts" - so, besides the point relevant to the fact that the median does not represent the whole set, the other point relevant to the fact that tools are not measured on reaching mediocre results.

th0ma51y ago

Racist is the wrong word probably maybe ... antisocial in that it is against society.

mdp20211y ago

I wrote '"racist"', because the proponents insist that "all members of class X are also Y" (there where X does not imply Y as an "analytic judgement" - it is not entailed from the nature of X but accidental to X ... yet seen by the proponents as a constant property of X): that is the character of racism. "All wood is odorous", etc.

Those who insist that "all humans are <slur>" are "racist" against humanity (against the "human race", if you wish).

That spirit is in the refusal to see exceptions and to recognize that there can be exceptions.

1 more reply

consumer4511y ago

Maybe misanthropic?

1 more reply

j / k navigate · click thread line to collapse

164 comments

rainsford1y ago

> ...a lot of the safeguards and policy we have to manage humans own unreliability may serve us well in managing the unreliability of AI systems too.

sigpwned1y ago

I’m looking forward to seeing which side wins. I don’t expect it to be cut-and-dry. But I do expect it to be interesting.

tharkun__1y ago

Does "knowing what today is" count as "Outside STEM"? Coz my interactions with LLMs are certainly way worse than most people.

Just tried it:

   tell me the current date please

   Today's date is October 3, 2023.

Of course absent access to a "get the current date" function it makes sense why an LLM would behave like it does. But it also means: not AGI, sorry.

6 more replies

eternityforest1y ago

Perhaps that kind of thing could help us finally move on from the "stupid should hurt" mindset to a real safety culture, where we value fault tolerance.

But people do make these kinds of mistakes all the time, and some of them lead to patients having the wrong leg amputated.

A lot of people seem to see fault tolerance as cheating or relying on crutches, it's almost like they actively want mistakes to result in major problems.

Obviously I'm all for improving the underlying AI tech itself ("Maintain Competence" is a rule in crew resource management), but I'm not a super big fan of unnecessary single points of failure.

Rhapso1y ago

Lower quality is fine economically as long as it has a good enough reduction in cost to match

michaelteter1y ago

No thank you.

You've just explained "race to the bottom". We've had enough of this race, and it has left us with so many poor services and products.

1 more reply

dartos1y ago

Amen.

People’s unawareness of their own personification bias with LLMs is wild.

pbreit1y ago

I would say people are much, much worse.

Compare that to the weight we place on "experts" many of whom are hopelessly compromised or dragged by mountains of baggage.

itchyjunk1y ago

What is your measure of intelligence?

throw48472851y ago

So I'll leave it to Skeeter to explain.

https://www.youtube.com/watch?v=W9zCI4SI6v8

cudgy1y ago

The ability to create novel solutions without a priori knowledge.

2 more replies

rainsford1y ago

lo_zamoyski1y ago

I think using the word “intelligence” when speaking of computers, beyond a kind of figure of speech, is anthropomorphizing, and it is a common pseudoscientific habit that must go.

So we must stop confusing figurative language with literal language. AI, LLMs, whatever can be very useful, but it isn’t even wrong to call them intelligent in any literal sense.

3 more replies

tehsauce1y ago

robwwilliams1y ago

I assume 42 is a joke from deep history and The Hitchhiker’s Guide. Pretty amusing to read the Wikipedia entry:

https://en.wikipedia.org/wiki/42_(number)

sedatk1y ago

Douglas Adams picked 42 randomly though. :)

1 more reply

moffkalast1y ago

It's very funny that people hold the autoregressive nature of LLMs against them, while being far more hardline autoregressive themselves. It's just not consciously obvious.

antihipocrat1y ago

I wonder whether we hold LLMs to a different standard because we have a long term reinforced expectation for a computer to produce an exact result?

One of my first teachers said to me that a computer won't ever output anything wrong, it will produce a result according to the instructions it was given.

absolutelastone1y ago

aidos1y ago

Why is that? Whenever I’m giving examples I almost always use 7, something ending in a 7 or something in the 70s

p1necone1y ago

1 and 10 are on the boundary, that's not random so those are out.

5 is exactly halfway, that's not random enough either, that's out.

2, 4, 6, 8 are even and even numbers are round and friendly and comfortable, those are out too.

9 feels too close to the boundary, it's out.

That leaves 3 and 7, and 7 is more than 3 so it's got more room for randomness in it right?

Therefore 7 is the most random number between 1 and 10.

2 more replies

da_chicken1y ago

People tend to avoid extremes, too. If you ask for a number between 1 and 10, people tend to pick something in the middle. Somehow, the ordinal values of the range seem less likely.

2 more replies

Ethee1y ago

Veritasium actually made a video on this concept about a year ago: https://www.youtube.com/watch?v=d6iQrh2TK98

d4mi3n1y ago

1 more reply

d0liver1y ago

I like prime numbers. Non-primes always feel like they're about to fall apart on me.

mynameismon1y ago

Can you share any links about this?

Shorel1y ago

They choose 37 =)

thechao1y ago

Which is weird, because I thought we'd all agreed that the random number was 4?

https://xkcd.com/221/

lxe1y ago

It's almost as if we trained LLMs on text produced by people.

MrMcCall1y ago

I love the posters that make fun of those corporate motivational posters.

My favorite is:

  No one is as dumb as all of us.

And they trained their PI* on that giant turd pile.

* Pseudo Intelligence

LoganDark1y ago

I don't count LLMs as intelligent. To a certain degree they can be a component of intelligence, but I don't count an LLM on its own.

2 more replies

smallnix1y ago

Is my understanding wrong that LLMs are trained to emulate observed human behavior in their training data?

Terr_1y ago

I'd say it's closer to emulating human documents.

ziaowang1y ago

But human expectations are also not bias-free (e.g. from the preferring-the-first-choice phenomenon)

Xelynega1y ago

I don't understand what you are saying.

How can the RLHF phase eliminate bias if it uses a process(human input) that has the same biases as the pre-training(human input)?

1 more reply

nthingtohide1y ago

rawandriddled1y ago

>Humanity together

mplewis1y ago

LLMs don't emulate human behavior. They spit out chunks of words in an order that parrots some of their training data.

dkenyser1y ago

Correct me if I'm wrong, but I feel like we're splitting hairs here.

> spits out chunks of words in an order that parrots some of their training data.

So, if the data was created by humans then how is that different from "emulating human behavior?"

Genuinely curious as this is my rough interpretation as well.

3 more replies

educasean1y ago

Is this just pedantry or is there some insight to be gleaned by the distinction you made?

2 more replies

icelancer1y ago

Are you sure that humans are much more than this in terms of spoken/written language?

sumeno1y ago

This is a more pedantic and meme-y way of saying the same thing.

henlobenlo1y ago

This is the "anyone can be a mathematician meme". People who hang around elite circles have no idea how dumb the average human is. The average human hallucinates constantly.

bawolff1y ago

So if you give a bunch of people a boring task, pay them the same regardless of if they treat it seriously or not - the end result is they do a bad job!

Hardly a shocker. I think this say more about the experimental design then it does about AI & humans.

markbergz1y ago

For anyone interested in these LLM pairwise sorting problems, check out this paper: https://arxiv.org/abs/2306.17563

The authors discuss the person 1 / doc 1 bias and the need to always evaluate each pair of items twice.

If you want to play around with this method there is a nice python tool here: https://github.com/vagos/llm-sort

fpgaminer1y ago

The paper basically sums to suggesting (and analyzing) these otpions:

jayd161y ago

velcrovan1y ago

isaacremuant1y ago

So many articles like this HN have a catchy title and then a short article that doesn't really conclude the title.

Just because you produce something with statistics (with or without LLMs) and have nice visuals and narratives doesn't mean is valid or rigorous or "better than nothing" for decision making.

Articles like this keep making it to the top of HN because HN is behaving like reddit where the article is read by few and the gist of the title debated by many.

le-mark1y ago

roywiggins1y ago

[0] https://qntm.org/mmacevedo

devit1y ago

The "person one" vs "person two" bias seems trivially solvable by running each pair evaluation twice with each possible labelling and the averaging the scores.

Although of course that behavior may be a signal that the model is sort of guessing randomly rather than actually producing a signal.

harrisonjackson1y ago

Agreed on the second part. Correcting for bias this way might average out the scores but not in a way that correctly evaluates the HN comments.

The LLM isn't performing the desired task.

rahimnathwani1y ago

  The LLM isn't performing the desired task.

It's 'not performing the task', in the same way that the humans ranking voice attractiveness are 'not performing the task'.

I wouldn't treat the output as complete garbage, just because it's somewhat biased by an irrelevant signal.

jopsen1y ago

But an LLM can't be held accountable.. neither can most employees, but we often forget that :)

simne1y ago

> But an LLM can't be held accountable.. neither can most employees

Yes and no.

Unfortunately, modern automobile is considered large system, made from thousands parts, so again, not exist one person to know everything.

And no, Germans said "Ordnung muss sein", which in modern management mean, constant clear organization of the game of the whole team is more important than the success of individual players.

Or, in simple words, right organization, controlled by rules is considered enough reliable to be accountable.

And for example in automobile industry, now normal to consider accountable whole organization.

Sure, government regulators could decide something even more original, will see.

Any way, as technology emerge, accountability of machines will be sure subject of many discussions.

jayd161y ago

You can certainly send people to jail for negligence.

leptons1y ago

Employees get fired all the time, and the more wrong answers I get from an LLM, the less I use it until it's never used again.

malfist1y ago

Jokes on you, my org at work just adopted KPIs about how many AI suggestions engineers accept

rainsford1y ago

1 more reply

satisfice1y ago

People are alive. They have rights and responsibilities. They can be held accountable. They are not "just as bad" as your LLMs.

icelancer1y ago

> They can be held accountable

Is this a universal phenomenon where you've worked? Consider yourself very lucky.

andrewmcwatters1y ago

I don’t understand the significance of performing tests like these.

To me it’s literally the same as testing one Markov chain against another.

megadata1y ago

At least LLMs are very often ready so acknowledge they might be wrong.

It can be incredibly hard to get a person to acknowledge that they might be remotely wrong on a topic they really care about.

Or, for some people, the thought that they might be wrong about anything attall is just like blasphemy to them.

Xelynega1y ago

Is this not just because aggressive material was filtered out of training data and the system prompts usually include some preamble about being polite?

"Acknowledging they might be wrong" makes them sound like more than token predictors trained on polite sounding text.

roywiggins1y ago

Most of the reason LLMs will "admit they're wrong" is because they've been trained not to argue too hard, and to not hold strong preferences. It's a sort of customer service personality.

oldherl1y ago

K0balt1y ago

Also, often less capable of carrying on a decent conversation.

I’ve noticed an periconcious urge when talking to people to judge them against various models and quants, or to decide they are truly SOTA.

I need to touch grass a bit more, I think.

sponnath1y ago

A decent conversation about what?

K0balt1y ago

I should have said minimal, not decent.

K0balt1y ago

Trivial discussion of anything

soared1y ago

Wouldn’t the same outcome be achieved much more simply by giving LLMs a two choices (colors, numbers, whatever), asking “pick one” and assessing the results in the same way?

ramity1y ago

You absolutely can. Deterministic inference is achievable, but it isn't as performant. The reason why sadly boils down to floating point math.

vivzkestrel1y ago

should have started naming them from person 4579 and see if it still exhibits the bias

djaouen1y ago

Yes, but a consensus of people beats LLMs every time. For now, at least.

bxguff1y ago

1 more reply

th0ma51y ago

No they are not randomly wrong or right without perspective unless they have some kind of brain injury. So that's against the title but the rest of their point is interesting!

raincole1y ago

What a clickbait title.

TL;DR: the author found a very, very specific bias that is prevalent in both humans and LLMs. That is it.

mdp20211y ago

Very nice article. But the title, and the idea, is the very frequent "racist" form of the proper "People [can be] just as bad as my LLMs".

Now: some people can't count. Some people hum between words. Some people set fire to national monuments. Reply: "Yes we knew", and "No, it's not necessary".

And: if people could lift the tons, we would not have invented cranes.

th0ma51y ago

Racist is the wrong word probably maybe ... antisocial in that it is against society.

mdp20211y ago

Those who insist that "all humans are <slur>" are "racist" against humanity (against the "human race", if you wish).

That spirit is in the refusal to see exceptions and to recognize that there can be exceptions.

1 more reply

consumer4511y ago

Maybe misanthropic?

1 more reply

j / k navigate · click thread line to collapse