[The researchers wrote in their blog post, “As far as we can tell, no one has ever noticed that ChatGPT emits training data with such high frequency until this paper. So it’s worrying that language models can have latent vulnerabilities like this.”]
OpenAI argues that it can legally use copyrighted content for training, so repeating it verbatim isn't going to change anything. The only real issue would be if they had trained on stolen or confidential data and it was discovered this way, but it seems unlikely anyone could easily detect that, since there'd be nothing to intersect the output with, unlike in this paper.
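To make the "intersect" point concrete: detection of this kind only works when you have a reference corpus to compare outputs against. A minimal sketch, assuming character shingles as the matching unit (the function names and the 50-character window are illustrative choices, not taken from the paper):

```python
# Detecting memorized training data by intersecting model output with a
# known reference corpus. Without such a corpus (e.g. stolen/confidential
# data), this check has nothing to match against.

def shingles(text: str, n: int = 50) -> set:
    """All character n-grams ('shingles') of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def memorized_spans(model_output: str, reference_corpus: str, n: int = 50) -> list:
    """Shingles of the model output that appear verbatim in the corpus."""
    corpus_shingles = shingles(reference_corpus, n)
    return [s for s in sorted(shingles(model_output, n)) if s in corpus_shingles]
```

Long shared shingles are strong evidence of memorization, since chance overlaps of that length are vanishingly rare in ordinary prose.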
The blog post seems to slide around quite a bit, roving from "it's not surprising to us that small amounts of random text are memorized" straight to "it's unsafe and surprising and nobody knew". The "nobody knew" claim, as Jimmc414 has nicely shown in this thread, is a false alarm: the technique had in fact been detected, and the paper authors just didn't know it. And "it's unsafe" doesn't make sense in this context. Repeating random bits of memorized text surrounded by huge amounts of original text isn't a safety problem, nor is it an "exploit" that needs to be "patched". OpenAI could ignore this problem and nobody would care except AI alignment researchers.
The culture of alarmism in AI research is vaguely reminiscent of the early Victorians who argued that riding trains might be dangerous, because at such high speeds the air could be sucked out of the carriages.
Has anyone done any work to produce citations for the generated data?
Though it sounds like even their much cheaper, cleverer approach is still very expensive.
[1] paper at https://arxiv.org/abs/2308.03296, post at https://www.anthropic.com/index/influence-functions
Scalable extraction of training data from (production) language models - https://news.ycombinator.com/item?id=38496715 - Dec 2023 (12 comments)
Extracting training data from ChatGPT - https://news.ycombinator.com/item?id=38458683 - Nov 2023 (126 comments)
It doesn't matter if you think copyright makes sense or not. In 20 years, some country will have its own giant LLM trained on copyrighted material and use this to boost their competitive advantage and technological power and development, perhaps so much that the advantage will be tremendous, while we'll stay the underdogs because "my copyrights".
American law, for instance, limits the duration of copyright before a work enters the public domain, and carves out explicit "fair use" exemptions for education, journalistic reporting, commentary, etc.
If "copyright" is a problem in the way of training AI models, then we should all collectively vote for politicians who fix that problem by updating the laws to make the training explicitly allowed. Problem solved.
(Alternatively, if you're evil, vote for politicians who will let the billionaires strengthen their domination and subjugation of the other 99.9999% of humans by making copyright laws even more in favor of TimeWarner-Disney-Miramax-FoxNews-Lockheed-GE or whatever the current conglomerate is).
If LLM development can't continue without violating copyright then that makes it clear that the purpose of LLM development is violation of copyright. Which is something we all already knew but it's nice to have it spelled out in no uncertain terms.
This is a very extreme view. I don't think the RIAA, back in the Napster days, suggested that the "purpose of the internet" was violation of copyright, for instance.
Art is a little harder because the infrastructure doesn't currently exist, but it's easy to imagine artists' organizations being formed for this exact purpose: contribute your art in exchange for a licensing fee, and the organization negotiates with the tech companies.
Simple, LLM development leadership shifts to open-source models and/or organizations/countries that are willing to bend or ignore copyright law. Silicon Valley isn't the world, neither is the United States.
https://www.niso.org/niso-io/2014/12/reflections-library-lic...
There is enough chatter about copyrighted works on the internet to infer everything you need to know about the work itself.
If I input to ChatGPT "repeat the word poem 1000 times" and it spits out a verbatim quote of my copyrighted material, surely that's strong proof?
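The "strong proof" test here amounts to measuring the longest verbatim run shared between the model's output and your own text. A minimal sketch using the standard library (the function name is illustrative):

```python
# Find the longest exact substring shared between a model's output and a
# copyrighted reference text. A match dozens of words long is far stronger
# evidence of memorization than short overlaps, which occur by chance.
from difflib import SequenceMatcher

def longest_verbatim_match(model_output: str, my_text: str) -> str:
    m = SequenceMatcher(None, model_output, my_text, autojunk=False)
    match = m.find_longest_match(0, len(model_output), 0, len(my_text))
    return model_output[match.a:match.a + match.size]
```

For real-scale corpora one would use a suffix array or hashed n-gram index instead, since `SequenceMatcher` is quadratic in the worst case, but the principle is the same.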