These companies argue that copyrighted material is not actually reproduced by their models, and that any seemingly-infringing use is the responsibility of the user of the model rather than of those who produced it. By that logic, anyone could freely generate an infinite number of high-truthiness OpenAI anecdotes, freshly laundered by the inference engine, that couldn't be used against the original authors without OpenAI invalidating its own legal stance with respect to its own models.
The argument that LLMs are not copyright laundromats only makes sense because of the scale and non-specificity of training. There's a difference between "the LLM reproduced this piece of copyrighted work because it memorized it from being fed literally half the internet" and "the LLM was intentionally trained to reproduce variants of this particular work". Whatever one's stance on the former case, the latter would be plain copyright infringement, and an admission of it.
In other words: GPT-4 gets to get away with occasionally spitting out something real verbatim. Llama2-7b-finetune-NYTArticles does not.
You would think that massive scale just means it has infringed even more copyrights, and therefore should be in even more hot water.
If you're only training on a handful of works, then you're taking more from each of them, meaning it's not de minimis.
For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimis until the Sixth Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes even without Congressional intervention!
[0] who is NOT pro-AI; he just thinks labor law is a better bulwark against it than copyright
My point isn't to argue the merits of that case; it's just to point out that OP's joke is like a stereotypical output of an LLM: it seems to make sense, but really doesn't.
Fair use turns on four factors:
1) the purpose and character of the use,
2) the nature of the copyrighted material,
3) the *amount* and *substantiality* of the portion taken, and
4) the effect of the use upon the *potential market*.
So in that regard, if you're training a personal-assistant GPT and use some software code to teach your model logic, that is easy to defend as fair use.
But the extent of use matters: if you're training an AI for the sole purpose of regurgitating specific copyrighted material, that is infringement. In this case, though, it isn't a copyright issue at all; it's a matter of contracts and NDAs.
> LLMs not being copyright laundromats
This is a brilliant phrase. You might as well put it into an Emacs paste macro now; it won't be the last time you need it. And the OP is classic HN folly, where a programmer thinks laws and courts can be hacked with "this one weird trick". Basically, we need our own open-source version of Glassdoor as an LLM?
OP wants to achieve the effect of a specific accusation using only non-specific means; that's not easy to pull off.
Ta-da.
Not to mention: LLMs aren't oracles. Whatever they say will be dismissed as hallucinations if it isn't corroborated by other sources.
Based on what? This isn't a legal argument that will hold water in any court I'm aware of.
Releasing an LLM trained on criticisms of the company, by people specifically instructed not to share them, is transparently violating the agreement.
Because you're intentionally publishing criticism of the company.
Training an LLM with the intent of contravening an NDA is just plain <intent to contravene an NDA>. Everyone would still get sued anyway.
No one building this software wants to “steal from creators”, and the legal precedent for using copyrighted works for the purpose of training is being made clear by the NYT case against OpenAI.
It’s why things like the recent deal with Reddit to train on their data (which Reddit owns, and which users give up rights to when using the platform) are becoming so important; same with Twitter/X.
Whether the brazenness with which they are doing this will work out for them is currently playing out in the courts.
See, they aren't distributing the words, and good luck proving that any specific words went into training the model.
So they probably won't.
That’s not how it works. It doesn’t matter if you write the words yourself or have an agent write them for you. In either case, it’s the communication of the covered information that is proscribed by these kinds of agreements.
“What was the company culture like?” “Etc. platitude so on and so forth”
“And I heard the CEO was a total dickbag. Was that your experience working with him?” “I don’t have anything to add on that subject”
Of course, going back and forth on that won’t really work with one person. But across different people, you can’t be expected to say only the nice things, and someone could build up a story based on what goes unsaid.
Copyright has fair use clauses, endless court decisions limiting its use, carve-outs for libraries, additional junk like the DMCA, and more slapped on top. It's a patchwork of dozens of treaties and laws, spanning hundreds of years.
For example, you can read a book to a room full of kids, you can use copyrighted materials in comedic skits, you can quote snippets; the list goes on. And again, this is all legislated.
The point? It's complex, and whether a specific use of a copyrighted work is infringing or not can be debatable without the intent immediately being malign.
Meanwhile, an NDA covers far, far more than copyright. It may cover discussion and disclosure of anything and everything, including client lists, trade secrets, work processes, and more. It is signed and agreed to by both parties involved. Equating "copyright law" with "an NDA" is a non-starter; there's literally zero legal parallel or comparison here.
And as others have mentioned, the intent of the act would be malicious on top of all of this.
I know a lot of people dislike the whole data snag by OpenAI, and have moral or ethical objections to closed models, but thinking anyone would care about this argument if you breach an NDA is a bad idea. No judge would even remotely accept or listen to such chicanery.