Better HN

davebren 4d ago

Ordinary humans do novel things all the time. Where do you think LLMs got all the training data that their responses come from?


virgildotcodes 4d ago

You're not quite addressing the question. More and more of the training data is now synthetic.

To be very specific: what novel things did the majority of the ~8 billion humans on Earth do, say, yesterday, that you wouldn't otherwise dismiss as non-intelligent rehashing of the same tired patterns they always inhabit were those same actions attributed to LLMs?

What I'm getting at is that I think you're falling into the trap of thinking of the rare geniuses of human history, and furthermore their rare moments of accomplishment (relative to the long span of their lifetimes filled mostly without these accomplishments) when you think of "human intelligence", which is of course far overstating what actual human intelligence is.

davebren (OP) 4d ago

Synthetic training data is carefully crafted by humans. The rare geniuses of human history use a different magnitude and configuration of the same kind of human intelligence that posted a dad joke on a site that got scraped into the training set and repeated, convincing people that it is intelligent like humans.

> that you wouldn't otherwise dismiss as non-intelligent rehashing of the same tired patterns they always inhabit were those same actions attributed to LLMs?

Regardless of whether something's been done before people still come up with them on their own without directly copying or amalgamating several copies. Pretty much every skilled profession includes figuring things out on the fly through the use of general reasoning that doesn't involve pattern matching against millions of examples.

virgildotcodes 4d ago

> Synthetic training data is carefully crafted by humans.

Much, if not the majority, of synthetic data is AI-generated. Human experts then evaluate samples of the data, but nothing like the entire corpus, which can be trillions of tokens of generated material.

See here, where the Qwen team discusses synthesizing trillions of tokens for their pretraining dataset: https://arxiv.org/html/2505.09388v1

> The rare geniuses of human history use a different magnitude and configuration of the same kind of human intelligence

I agree. What I don’t see any strong evidence for is that this intelligence is unique to humans. Nor do I see how it could ever be anything other than recombinations of existing data with random mutation. Where else would the building blocks for each invention come from, divine insight? We build on the shoulders of giants, etc.

Worth noting, as a sidebar, that we’re having this discussion on a post mentioning a novel breakthrough made by AI on a problem that many brilliant human mathematicians, including Erdős himself, failed to solve.

> Regardless of whether something's been done before people still come up with them on their own without directly copying or amalgamating several copies.

I’m not even saying it in the “there’s nothing new under the sun” sense.

Follow an average person’s day from beginning to end, let’s say in Bangkok or NYC or Paris. At which part of the day are they not simply repeating a variation of something they’ve done many times before, or seen others around them do, or read about others doing, or heard about others doing, or watched others do on TV, etc.?

What you have left, how is it distinguishable, without reasoning backwards from the desired conclusion of human exceptionalism, from turning up the temperature on an LLM query?

How many data points does a human parse when they attempt to stand up as a toddler? Sight, sound, sensation from every limb and body part, inner ear, internal thought processes at the time conscious and unconscious related to the moment and attempting to interpret it in relation to all that it’s experienced to this point, including all prior attempts and whatever retained associated data, a hard to even comprehend stream of data, coming in continuously over however many minutes, hours, etc of attempts.

The stream of data the brain is processing from both external and internal sources from birth is incredibly rich, and if we attempted to represent its full depth, it would far outweigh the size of any corpus models are being trained on now.

I think what may be genuinely missing from AI is the type of data that doesn’t translate completely into text. The audio and images/video we feed in are a totally incomplete slice of the POV of say even a single average human through their lifetime, and bereft of all the associated data a human has access to in the moment (sensory etc).

I think this tends more towards the world models that Yann LeCun et al. are promoting as the key to more capable AI.

necovek 4d ago

You seem to be missing their point (which I agree with). The type of intelligence we are equipped with allows us not to have the level of memory an LLM does and still complete tasks that are novel to us every single day. Like navigating a shopping cart through tricky corridors in a store, coming up with a dad joke as in the sibling example, combining a set of tools to achieve something we have never seen before, etc.

LLMs approximate a lot of that very well by simply having seen it before.

Also watch kids develop language: they learn patterns with much less training data than LLMs.

virgildotcodes 4d ago

I addressed much of this in my response to a sibling comment, but a few more here:

> novel to us every single day. Like navigating a shopping cart through tricky corridors in a store

We have been practicing navigating the physical world for something like 16hrs/day every day from the moment of our birth. All the sensory data passing through our brains during that time is far larger than any dataset an LLM is trained on.

Humans navigating a shopping cart at a store have likely navigated the physical world before, pushed a shopping cart before, and in combination have navigated stores while pushing shopping carts before. Nevertheless, many still bump into objects all along the way.

Them succeeding at successive variations of store layouts is not novel unless we expand the definition of novel to mean any recombination whatsoever of pre existing concepts.

I’m certain that with all the intense usage of AI by hundreds of millions of people, there have been countless collections of words passed to LLMs so far that have never before been uttered in exactly such a sequence, let alone in the dataset.

I’m equally certain the LLMs have responded to those words with collections of their own that have also never been uttered in that exact sequence, responding to their unique context.

It is trivial to produce an example of this now yourself if you’d like.

The LLM we’re talking about, mentioned in the OP, has never seen this solution to this problem in its dataset. A large number of brilliant mathematicians were not able to discover this solution. They are themselves expressing that this is a novel breakthrough and had this come from a human it would be treated as such.

If the response to that is “well it’s just recombining concepts it already knows until it finds a solution that works” I would ask how that differs from what humans do?

necovek 4d ago

You missed the core of my point: humans operate, including in the real world, on much less training data. Give a human a shopping cart and ask them to push it backwards, and they'll figure it out in a few minutes even if they've never done it before.

This is the bit that's missing that LLMs do approximate amazingly well through sheer training set size, but in my opinion, it puts a cap on what novel things they can achieve in comparison with humans.

I've thought about a related "invention space" before: with all the software we create to solve the many problems people face, why are there no perfect solutions for any problem (running a cafe? a CNC machine? ...), and why do we always need more software built to cover one small (novel?) change for a particular owner?

The world space is just so large that you need whatever this intelligence is that humans (and animals) have to navigate it successfully, and LLMs do not intrinsically have it.

Whether they can be so large that it does not matter in 99.99% of cases remains to be seen.
