These companies argue that copyrighted material is not actually reproduced by their models, and that any seemingly-infringing use is the responsibility of the user of the model rather than of those who produced it. By that logic, anyone could freely generate an infinite number of high-truthiness OpenAI anecdotes, freshly laundered by the inference engine, that couldn't be used against the original authors without OpenAI invalidating its own legal stance with respect to its own models.
The argument that LLMs are not copyright laundromats only makes sense because of the scale and non-specificity of training. There's a difference between "the LLM reproduced this piece of copyrighted work because it memorized it from being fed literally half the internet" and "the LLM was intentionally trained to reproduce variants of this particular work". Whatever one's stance on the former case, the latter would be plain copyright infringement, and an admission of it.
In other words: GPT-4 gets to get away with occasionally spitting out something real verbatim. Llama2-7b-finetune-NYTArticles does not.
You would think that massive scale just means it has infringed even more copyrights, and therefore should be in even more hot water.
If you're only training on a handful of works, then you're taking more from each of them, meaning it's not de minimis.
For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimis until the Sixth Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes even without Congressional intervention!
[0] who is NOT pro-AI; he just thinks labor law is a better bulwark against it than copyright
My point isn't to argue the merits of that case; it's just to point out that OP's joke is like a stereotypical output of an LLM: it seems to make sense, but really doesn't.
Fair use turns on four factors:
1) the purpose and character of the use,
2) the nature of the copyrighted material,
3) the *amount* and *substantiality* of the portion taken, and
4) the effect of the use upon the *potential market*.
So in that regard, if you're training a personal-assistant GPT and use some software code to teach your model logic, that is easy to defend as fair use.
But the extent of use matters: if you're training an AI for the sole purpose of regurgitating specific copyrighted material, that is infringement. In this case, though, it isn't a copyright issue at all; it's a matter of contracts and NDAs.
> LLMs not being copyright laundromats
This is a brilliant phrase. You might as well put it into an Emacs paste macro now; it won't be the last time you need it. And the OP is classic HN folly, where a programmer thinks laws and courts can be hacked with "this one weird trick". Basically, we need our own open-source version of Glassdoor as an LLM?
OP wants to achieve the effect of a specific accusation using only non-specific means; that's not easy to pull off.
Ta-da.
Not to mention: LLMs aren't oracles. Whatever they say will be dismissed as hallucinations if it isn't corroborated by other sources.
Based on what? This isn't a legal argument that will hold water in any court I'm aware of.
Releasing an LLM trained on criticisms of the company, by people specifically instructed not to share them, is transparently violating the agreement.
Because you're intentionally publishing criticism of the company.
Training an LLM with the intent of contravening an NDA is just plain <intent to contravene an NDA>. Everyone would still get sued anyway.
No one building this software wants to “steal from creators”, and the legal precedent for using copyrighted works for the purpose of training is being made clear by the NYT case against OpenAI.
It’s why things like the recent deal with Reddit to train on their data (which Reddit owns, and which users give up rights to when using the platform) are becoming so important; same with Twitter/X.
Whether the brazenness with which they are doing this will work out for them is currently playing out in the courts.
See, they aren't distributing the words, and good luck proving that any specific words went into training the model.
So they probably won't.
That’s not how it works. It doesn’t matter if you write the words yourself or have an agent write them for you. In either case, it’s the communication of the covered information that is proscribed by these kinds of agreements.
“What was the company culture like?” “Etc. platitude so on and so forth”
“And I heard the CEO was a total dickbag. Was that your experience working with him?” “I don’t have anything to add on that subject”
Of course, going back and forth on that won’t really work with one person. But across different people, you can’t be expected to say only the nice things, and someone could build up a story based on what goes unsaid.
Copyright has fair use clauses, endless court decisions limiting its use, carve-outs for libraries, additional junk like the DMCA, and more slapped on top. It's a patchwork of dozens of treaties and laws, spanning hundreds of years.
For example, you can read a book to a room full of kids, you can use copyrighted materials in comedic skits, you can quote snippets; the list goes on. And again, this is all legislated.
The point? It's complex, and whether a specific use of a copyrighted work is infringing or not can be debatable without the intent immediately being malign.
Meanwhile, an NDA covers far, far more than copyright. It may cover discussion and disclosure of anything and everything, including client lists, trade secrets, work processes, and more. It is signed and agreed to by both parties involved. Equating "copyright law" with "an NDA" is a non-starter; there's literally zero legal parallel or comparison here.
And as others have mentioned, the intent of the act would be malicious on top of all of this.
I know a lot of people dislike the whole data snag by OpenAI, and have moral or ethical objections to closed models, but thinking anyone would care about this argument if you breach an NDA is a bad idea. No judge would even remotely accept or listen to such chicanery.