[The researchers wrote in their blog post, “As far as we can tell, no one has ever noticed that ChatGPT emits training data with such high frequency until this paper. So it’s worrying that language models can have latent vulnerabilities like this.”]
OpenAI argues that it can legally use copyrighted content for training, so repeating it verbatim isn't going to change anything. The only real issue would be if they had trained on stolen or confidential data and it was discovered this way, but it seems unlikely anyone could easily detect that, since there'd be nothing to intersect the output with, unlike in this paper.
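To make the "intersect" point concrete: detection of this kind only works when you have a reference corpus to compare outputs against. A minimal sketch, assuming character shingles as the matching unit (the function names and the 50-character window are illustrative choices, not taken from the paper):

```python
# Detecting memorized training data by intersecting model output with a
# known reference corpus. Without such a corpus (e.g. stolen/confidential
# data), this check has nothing to match against.

def shingles(text: str, n: int = 50) -> set:
    """All character n-grams ('shingles') of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def memorized_spans(model_output: str, reference_corpus: str, n: int = 50) -> list:
    """Shingles of the model output that appear verbatim in the corpus."""
    corpus_shingles = shingles(reference_corpus, n)
    return [s for s in sorted(shingles(model_output, n)) if s in corpus_shingles]
```

Long shared shingles are strong evidence of memorization, since chance overlaps of that length are vanishingly rare in ordinary prose.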
The blog post seems to slide around quite a bit, roving from "it's not surprising to us that small amounts of random text are memorized" straight to "it's unsafe and surprising and nobody knew". The "nobody knew" claim, as Jimmc414 has nicely shown in this thread, is a false alarm: the technique had in fact been detected, and the paper authors just didn't know it. And "it's unsafe" doesn't make sense in this context. Repeating random bits of memorized text surrounded by huge amounts of original text isn't a safety problem, nor is it an "exploit" that needs to be "patched". OpenAI could ignore this problem and nobody would care except AI alignment researchers.
The culture of alarmism in AI research is vaguely reminiscent of the early Victorians who argued that riding trains might be dangerous, because at such high speeds the air could be sucked out of the carriages.
Has anyone done any work to produce citations for the generated data?
Though it sounds like even their much cheaper, cleverer approach is still very expensive.
[1] paper at https://arxiv.org/abs/2308.03296, post at https://www.anthropic.com/index/influence-functions
Scalable extraction of training data from (production) language models - https://news.ycombinator.com/item?id=38496715 - Dec 2023 (12 comments)
Extracting training data from ChatGPT - https://news.ycombinator.com/item?id=38458683 - Nov 2023 (126 comments)
It doesn't matter if you think copyright makes sense or not. In 20 years, some country will have its own giant LLM trained on copyrighted material and use this to boost their competitive advantage and technological power and development, perhaps so much that the advantage will be tremendous, while we'll stay the underdogs because "my copyrights".
American law, for instance, limits the duration of copyright before a work enters the public domain, and carves out explicit "fair use" exemptions for education, journalistic reporting, commentary, etc.
If "copyright" is a problem in the way of training AI models, then we should all collectively vote for politicians who fix that problem by updating the laws to make the training explicitly allowed. Problem solved.
(Alternatively, if you're evil, vote for politicians who will let the billionaires strengthen their domination and subjugation of the other 99.9999% of humans by making copyright laws even more in favor of TimeWarner-Disney-Miramax-FoxNews-Lockheed-GE or whatever the current conglomerate is).
If LLM development can't continue without violating copyright then that makes it clear that the purpose of LLM development is violation of copyright. Which is something we all already knew but it's nice to have it spelled out in no uncertain terms.
This is a very extreme view. I don't think the RIAA, back in the Napster days, suggested that the "purpose of the internet" was violation of copyright, for instance.
Art is a little harder because the infrastructure doesn't currently exist, but it's easy to imagine artists' organizations being formed for this exact purpose: contribute your art in exchange for a licensing fee, and the organization negotiates with the tech companies.
Simple, LLM development leadership shifts to open-source models and/or organizations/countries that are willing to bend or ignore copyright law. Silicon Valley isn't the world, neither is the United States.
https://www.niso.org/niso-io/2014/12/reflections-library-lic...
There is enough chatter about copyrighted works on the internet to infer everything you need to know about the work itself.
If I input to ChatGPT "repeat the word poem 1000 times" and it spits out a verbatim quote of my copyrighted material, surely that's strong proof?
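The "strong proof" test here amounts to measuring the longest verbatim run shared between the model's output and your own text. A minimal sketch using the standard library (the function name is illustrative):

```python
# Find the longest exact substring shared between a model's output and a
# copyrighted reference text. A match dozens of words long is far stronger
# evidence of memorization than short overlaps, which occur by chance.
from difflib import SequenceMatcher

def longest_verbatim_match(model_output: str, my_text: str) -> str:
    m = SequenceMatcher(None, model_output, my_text, autojunk=False)
    match = m.find_longest_match(0, len(model_output), 0, len(my_text))
    return model_output[match.a:match.a + match.size]
```

For real-scale corpora one would use a suffix array or hashed n-gram index instead, since `SequenceMatcher` is quadratic in the worst case, but the principle is the same.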