
nightski 3y ago (0 points, 0 comments)

It's a terrible analogy because the entire point of ML systems is to generalize well to new data, not to reproduce the original data as accurately as possible with a space/time tradeoff.


majormajor 3y ago

I don't think you can describe the math in this context as "generalize well to new data."

ChatGPT certainly can't generate new data. It's not gonna correctly tell you today who won the World Series in 2030. It's not going to write a poem in the style of someone who hasn't been born yet.

But it can interpolate between and through a bunch of existing data that's on the web to produce novel mixes of it. I find the "blurring those things together" analogy pretty compelling there, in the same way that blurring or JPEG-compressing something isn't going to give you a picture of a new event but it might change what you appear to see in the data you already had.

(Obviously it's not exactly the same, that's why it's an analogy and not a definition. As an analogy, it works much better if you ignore much of what you know about the implementation details of both of them. It's not trying to teach someone how to build it, but to teach a lay person how to think about the output.)

nightski (OP) 3y ago

It absolutely can generate new data; it does so all the time. If you are claiming otherwise, I think we need a more formal definition of what you mean by new data.

Are you suggesting that because it can't predict the future, it can't generate novel data?

majormajor 3y ago

It's not just the future, though the examples I gave were future-oriented.

But it's all very interpolation/summarization-focused.

A set of "song lyrics in the style of Taylor Swift" isn't an actual song by Taylor Swift.

A summary of the history of Texas isn't actually vetted by any historian to ensure accuracy.

The answer to a math problem may not be correct.

To me, those things don't qualify as "new data." They aren't suitable for future training as-is. Sometimes for a simple reason: they aren't facts, using the dictionary "facts and statistics collected together for reference or analysis" definition of data. So very simply "not new data."

Sometimes in a blurrier way - the song lyrics, for instance, could be touching, or poignant, or "true" in a Keats sense[0] - but if the internet gets full of GPT-dreams and future models are trained on that, you could slide down further and further into an uncanny valley, especially since most of the time you don't get one of those amazing poignant ones. Most of the time I've gotten something bland.

[0] "What the imagination seizes as beauty must be truth"


PaulHoule 3y ago

The thing is that generalization is good enough to make people squee and not notice that the output is wrong but not good enough to get the right answer.

If it were going to produce ‘explainable’ correct answers for most of what it does, that would be a matter of looking up the original sources to make sure they really say what it thinks they do. I mean, I can say, “there’s this paper that backs up my point,” but I have to go look it up to get the exact citation at the very least.

williamcotton 3y ago

There is definitely a misconception about how to use a tool like ChatGPT.

If you give it an analytic prompt like "turn this baseball box score into an entertaining outline" it will reliably act as a translator because all of the facts about the game are contained in the prompt.

If you give it a synthetic prompt like "give me quotes from the broadcasters" it will reliably act as a synthesizer because none of the facts of the transcript are in the prompt.

This ability to perform as a synthesizer is what you are identifying here as "good enough to make people squee and not notice that the output is wrong but not good enough to get the right answer", which is correct, but sometimes fiction is useful!

If all web pages were embedded in ChatGPT's 1536-dimensional vector space and used for analytic augmentation, then a tool would more reliably be able to translate a given prompt. The UI could also display the URLs of the nearest-neighbor source material that was used to augment the prompt. That seems to be what Bing/Edge has in store.
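
A minimal sketch of that augmentation idea, not a description of how Bing/Edge or ChatGPT actually do it: rank pre-embedded pages by cosine similarity to the query embedding, then paste the nearest ones (with their URLs) into the prompt. Everything here is an assumption for illustration: the three-dimensional vectors stand in for real 1536-dimensional embeddings, the `example.com` URLs and page texts are made up, and `nearest_sources` and `augment_prompt` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_sources(query_vec, index, k=2):
    """Return the k index entries (url, text, vector) closest to query_vec."""
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[2]), reverse=True)
    return ranked[:k]

def augment_prompt(question, sources):
    """Build a prompt that carries the retrieved facts and their URLs."""
    context = "\n".join(
        f"[{i + 1}] {text} ({url})" for i, (url, text, _) in enumerate(sources)
    )
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

# Toy index: made-up pages with hand-written 3-d "embeddings" (assumption).
INDEX = [
    ("https://example.com/boxscore", "Home team won 5-3 in nine innings.", [0.9, 0.1, 0.0]),
    ("https://example.com/weather", "Rain is expected on Tuesday.", [0.0, 0.2, 0.9]),
    ("https://example.com/recap", "The winning run scored in the eighth.", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question below
top = nearest_sources(query_vec, INDEX, k=2)
prompt = augment_prompt("Who won the game?", top)
print(prompt)
```

Because the retrieved facts end up inside the prompt, the request becomes an "analytic" one in the sense above, and the URLs in `top` are what a UI could show as sources.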

PaulHoule 3y ago

That's a touch beyond state of the art but we might get there.

If there was one big problem w/ today's LLMs it is that the attention window is too short to hold a "complete" document. I can put the headline of an HN submission through BERT and expect BERT to capture it but there is (as of yet) no way to cut a document up into 512 (BERT) or 4096 (ChatGPT) token slices and then mash those embeddings together to make an embedding that can do all the things the model is trained to do on a smaller data set. I'm sure we will see larger models, but it seems a scalable embedding that grows with the input text would be necessary to move to the next level.
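
To make the "cut up and mash together" step concrete, here is a rough sketch of the naive baseline: slice the tokens into fixed-size windows, embed each window, and average the results. It is an illustration under stated assumptions, not a fix for the problem described above. `toy_embed` is a made-up stand-in for a real model, whitespace splitting stands in for a real tokenizer, and the window of 4 stands in for 512 or 4096.

```python
def chunk_tokens(tokens, window):
    """Split a token list into consecutive slices of at most `window` tokens."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def toy_embed(chunk, dim=4):
    """Hypothetical embedding: a normalized histogram of token lengths mod dim."""
    vec = [0.0] * dim
    for tok in chunk:
        vec[len(tok) % dim] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(tokens, window=4)          # slices of 4, 4, and 2 tokens
chunk_vecs = [toy_embed(c) for c in chunks]
doc_vec = mean_pool(chunk_vecs)

# Averaging ignores chunk order, so a shuffled document pools to the same vector.
shuffled_vec = mean_pool(list(reversed(chunk_vecs)))
print(doc_vec)
```

The pooled vector has a fixed size no matter how long the document is, and it cannot tell the original chunk order from a shuffled one, which is one way to see why a simple average falls short of an embedding that grows with the input text.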

