If I produce a terrible shakycam recording of a film while sitting in a movie theater, it's not a verbatim copy, nor is it even necessarily representative of the original work -- muddied audio, audience sounds, cropped screen, backs of heads -- and yet it would be considered copyright infringement?
How many times does one need to compress the JPEG before it's fair use? I'm legitimately curious what the test is here.
That is why so-called derivative works are allowed (and even encouraged). If copyrighted material is ingested, modified or enhanced to add value, and then regurgitated, that is legal, whereas copying it without adding value is not.
If derivative works weren't deemed acceptable, copyright would have the opposite of its intended effect and become an impediment to progress.
Derivative works are not given a free pass from the normal constraints of copyright. You cannot legally publish books in the universe of A Song of Ice and Fire without permission from the author (and often publisher), calling them “derivative works.”
It’s why fan fiction is such a gray area for copyright and why some publishers have historically squashed it hard.
The exceptions for this are typically fair use, which requires multi-factor analysis by the judiciary and is typically decided on a case-by-case basis.
Derivative works are not "allowed (and even encouraged)" without a license from the copyright holder. Creating a derivative work is an exclusive right of the copyright holder just like making verbatim copies and requires a license for anyone else, unless an exception to copyright protection (like fair use) applies.
Derivative works are tolerated in some cases, like some manga or fanfics, but it is a gray area, and whenever the author or publisher wants to pursue it, it is fully their right to do so. Many do pursue it.
(You can get inspired by something, and this is where some arguments can happen if you get inspired a bit too literally, but no one will say with a straight face that inspiration is a thing that happens to software.)
That seems to go against the notion that copyright can last beyond the author's lifetime - an author's artistic and scientific output tends to drop off sharply after death.
When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.
The point here is that book files have to be copied before they can be used for training. Copyright notices typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"
Individuals who torrented music and video files have been bankrupted for doing exactly this.
The same laws should apply when a corporation downloads files via torrent. What happens to the files after they're downloaded is irrelevant to the argument.
If this is enforced (still to be seen...) it would be financially catastrophic for Meta, because there are statutory damages for works that have been registered for copyright protection - which most trad-pubbed books, and many self-pubbed books, are.
Seems like a big gap there.
It seems like it is very much a matter of fidelity.
As mentioned in another comment, LLMs (and most popular machine learning algorithms) can be viewed, correctly, as compression algorithms which leverage lossy encoding + interpolation to force a kind of generalization.
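A loose sketch of what "lossy encoding" means here, using summary statistics as a crude stand-in for model weights (a deliberately simplified analogy, not a claim about transformer internals; the text is made up):

```python
import zlib
from collections import Counter

text = ("the cat sat on the mat. " * 50).encode()

# Lossless compression: the original is exactly recoverable.
lossless = zlib.compress(text)
assert zlib.decompress(lossless) == text

# "Lossy" compression in the ML sense: keep only summary statistics
# (here, byte frequencies). The text itself cannot be reconstructed
# from them, but the statistics generalize -- they also describe
# unseen text with a similar distribution.
stats = Counter(text)
print(len(text), len(lossless), len(stats))
```

The lossless copy must grow with the input; the statistical summary stays small and trades exact reconstruction for generalization, which is the trade the comment above is pointing at.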
Your argument is that a video wouldn't count as pirated if the compression used for the pirated copy was lossy (or at least sufficiently lossy). The closest real-world example would be the cases where someone records the screening of a movie on their phone and then uploads it. Such a copy is lossy enough that you can't produce anything really like the original, but by most definitions it is still considered copyright infringement.
The test is if a judge says it is fair use, nothing else.
The judge will take into account the human factor in this matter, e.g. things like who did the actual work, and who just used an algorithm (which is not the hard part anymore; the code can be obtained on the internet for free). And we all know that DL is nowhere without huge amounts of data.
Needing the original material isn't enough for claiming copyright infringement, as we have existing counterexamples.
The model isn’t storing the book.
I think that is the center of the conversation. What does it mean for a computer to "understand"? If I wrote some code that somehow transformed the text of the book and never returned the verbatim text, only somehow modified output, I would likely not be spared, because the ruling would likely be that my transformation is "trivial".
Personally, I think we have several fixes we need to make:
1. Abolish the CFAA.
2. Limit copyright to a maximum of 5 years from date of production with no extension possible for any reason.
3. Add an explicit carveout in copyright for transformational works. Explicitly allow format shifting, time shifting, yada yada.
4. Prohibit authors and publishers from including the now obviously false statements like "No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording" bla bla bla in their works.
5. I am sure I am missing some stuff here.
For brand protection, we already have trademark law. Most readers here already know this, but we really should sever the artificial ties we have created between patents, trademarks, and copyright.

In early computing, everything was closed source. Quoting the Wikipedia page:
> To develop a legal BIOS, Phoenix used a clean room design. Engineers read the BIOS source listings in the IBM PC Technical Reference Manual. They wrote technical specifications for the BIOS APIs for a single, separate engineer—one with experience programming the Texas Instruments TMS9900, not the Intel 8088 or 8086—who had not been exposed to IBM BIOS source code.
The legal team at Phoenix deemed it inappropriate to "recall source in their own words" for legal reasons.
My non-legal intuition is that these companies training their models are violating copyright. But, the stakes are too high--it's too big to fail if you will. If we don't do it, then our competitors will destroy us. How do you reconcile that?
The NYTimes in 2023 was able to demonstrate that the models can reproduce entire articles verbatim[0] with minimal coercion.
[0]https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
It would be a remarkable quirk of statistics that, if given all text on the internet except for the NYTimes back catalogue, a model would produce any NYT article.
Not of physical media. You're allowed to make archival copies of digital media.
> Or reading a book via a computer would be illegal
No, you purchased a license (or your library did, in the case of e-borrowing) to read the book on a computer. That makes it legal.
It becomes illegal if I try to distribute those copies
So the question is, does distributing an AI that has been trained on Harry Potter count as distributing Harry Potter?
A tool being used to create infringing copies of some other work (whether or not it is the source material used to create the tool, and whether or not the infringing output is a verbatim copy) is relevant to whether the tool vendor is liable for contributory infringement for that use. But the absence of a capacity for creating such copies isn't usually enough to say that the copying done to make the tool isn't itself infringing.
(That said, generative AI tools, including LLMs specifically, have been shown to have the capacity to make such copies, to the extent that vendors of hosted models are now putting additional checks on output to try to mitigate the frequency with which verbatim copies of substantial portions of training-set works are produced, so arguing that LLMs can't do that is silly.)
Exactly. I asked my Gemma how long a quote it could give me of a given book if I were the author and gave express permission, and I was a bit surprised it readily admitted it could:
> Without Permission (Current Limit): Single sentence.
> With Broad Permission (Full Reproduction Allowed): I could theoretically quote the entire book.
Eye-opening (for me, at least).
Transformers are fundamentally large compression algorithms where the target of compression is not just to minimize reconstruction loss + compressed file size. In fact, basically all of machine learning used today can be viewed through the lens of learning a compression algorithm with added goals other than the usual.
By this logic, if I create a lossy JPEG of a copyrighted image, it's not "copying" because of the lossy compression.
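For what it's worth, the lossy step in JPEG is quantization, and a toy version shows why lossiness alone doesn't make the result a different work (made-up pixel values; real JPEG quantizes DCT coefficients, not raw pixels):

```python
def quantize(pixels, step):
    """Lossy step: snap each value to the nearest multiple of `step`."""
    return [round(p / step) * step for p in pixels]

original = [12, 13, 14, 200, 201, 202, 90, 91]
lossy = quantize(original, 16)

# The exact values are gone (the copy is degraded), yet every value
# still tracks the original to within step/2 -- it is recognizably
# a copy of the same image, just a worse one.
print(lossy)
```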
There’s plenty of case law there…
Using content to train an LLM is not copying the content. (I'm ignoring the silly "but actually" arguments about the content being in RAM, so it's "copying".) It's using the content to generate a statistical model of token (word-ish) relationships and probabilities.

If you write content that is very original in its wording and I train an LLM against it, then there is certainly the possibility that the LLM could be provoked to recall the exact words you used. You'd have to set the parameters just right to make it happen, and I think that proper training would drastically lower, if not remove, that possibility. But even if it doesn't, the LLM doesn't have a copy of that original content. All it has is weights representing those relationship probabilities. Yes, the minutiae are more complex, but that is the essence.

If my LLM were to generate enough of this essentially verbatim unique content and I tried to publish or copyright it, then I, as the user, should be on the hook. But then you get into a discussion about how many words in a unique sequence it takes to be infringement.
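The "weights representing relationship probabilities" idea can be sketched with a toy bigram model (a deliberately tiny stand-in for an LLM; the corpus is made up):

```python
import random
from collections import defaultdict

corpus = "the quick brown fox jumps over the lazy dog".split()

# "Training": store which word follows which -- relationships, not text.
follows = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    follows[a].append(b)

# "Generation": walk the learned relationships. With a large corpus
# each list holds many alternatives; with tiny or highly unique
# wording the only continuation is the original phrasing, which is
# how verbatim recall can fall out of mere statistics.
random.seed(0)
word, out = "the", ["the"]
for _ in range(8):
    if word not in follows:
        break
    word = random.choice(follows[word])
    out.append(word)
print(" ".join(out))
```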
Obviously, I am not a lawyer.
My summation in all of this is that new laws need to be put into place to handle this stuff, because the existing ones are sufficiently non-definitive and/or ill-suited that every party is forming strong opinions about how old laws apply to new situations, causing massive friction.
If we're going that way, let me torrent every movie and TV show ever to "train" myself.
Copyright doesn't protect against all forms of duplication. For instance, you own the copyright to your post and grant HN a license to offer copies of it. I have no direct license from you to copy the content of your post; but I can copy it to memory, copy a cache to disk, and make a copy appear on my display.
It’s not a good example, because if you grant a license you give them the right to make copies. The problem is not when Meta got licenses, it’s when they did not.
Where your analogy goes wrong is you're saying you want to "[Circumvent] payment to obtain copyright material for training" to use Workaccount2's words.
Because I'm certainly not allowed to photocopy a library book in its entirety. And I guarantee you a Netflix subscription doesn't allow me to keep a copy of a movie on my hard drive and use it for training man or machine.
Exactly this. Legal copying requires a license.
https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....
I'm not a legal expert. My layman's understanding of the case above is Aereo was in violation because they made copies of content - content that the receiver was already allowed to access - available over the Internet to the intended receiver. That is to say, the copying was the problem.