If I produce a terrible shakycam recording of a film while sitting in a movie theater, it's not a verbatim copy, nor is it even necessarily representative of the original work -- muddied audio, audience sounds, cropped screen, backs of heads -- and yet it would be considered copyright infringement?
How many times does one need to compress the JPEG before it's fair use? I'm legitimately curious what the test is here.
That is why so-called derivative works are allowed (and even encouraged). If copyrighted material is ingested, modified or enhanced to add value, and then regurgitated, that is legal, whereas copying it without adding value is not.
If derivative works weren't deemed acceptable, copyright would have the opposite of its intended effect and become an impediment to progress.
Derivative works are not given a free pass from the normal constraints of copyright. You cannot legally publish books in the universe of A Song of Ice and Fire without permission from the author (and often publisher), calling them “derivative works.”
It’s why fan fiction is such a gray area for copyright and why some publishers have historically squashed it hard.
The exceptions for this are typically fair use, which requires multi-factor analysis by the judiciary and is typically decided on a case-by-case basis.
Derivative works are not "allowed (and even encouraged)" without a license from the copyright holder. Creating a derivative work is an exclusive right of the copyright holder just like making verbatim copies and requires a license for anyone else, unless an exception to copyright protection (like fair use) applies.
Derivative works are tolerated in some cases, like some manga or fanfics, but it is a gray area, and whenever the author or publisher wants to pursue it, it is fully their right to do so. Many do pursue it.
(You can get inspired by something, and this is where some arguments can happen if you get inspired a bit too literally, but no one will say with a straight face that inspiration is a thing that happens to software.)
That seems to go against the notion that copyright can last beyond the author's lifetime - an author's artistic and scientific output tends to drop off sharply after death.
When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.
The point here is that book files have to be copied before they can be used for training. Copyright notices typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"
Individuals who torrented music and video files have been bankrupted for doing exactly this.
The same laws should apply when a corporation downloads files via torrent. What happens to the files after they're downloaded is irrelevant to the argument.
If this is enforced (still to be seen...) it would be financially catastrophic for Meta, because there are statutory damages for works that have been registered for copyright protection - which most trad-pubbed books, and many self-pubbed books, are.
Seems like a big gap there.
It seems like it is very much a matter of fidelity.
As mentioned in another comment, LLMs (and most popular machine learning algorithms) can be viewed, correctly, as compression algorithms which leverage lossy encoding + interpolation to force a kind of generalization.
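A loose sketch of what "lossy encoding" means here, using summary statistics as a crude stand-in for model weights (a deliberately simplified analogy, not a claim about transformer internals; the text is made up):

```python
import zlib
from collections import Counter

text = ("the cat sat on the mat. " * 50).encode()

# Lossless compression: the original is exactly recoverable.
lossless = zlib.compress(text)
assert zlib.decompress(lossless) == text

# "Lossy" compression in the ML sense: keep only summary statistics
# (here, byte frequencies). The text itself cannot be reconstructed
# from them, but the statistics generalize -- they also describe
# unseen text with a similar distribution.
stats = Counter(text)
print(len(text), len(lossless), len(stats))
```

The lossless copy must grow with the input; the statistical summary stays small and trades exact reconstruction for generalization, which is the trade the comment above is pointing at.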
Your argument is that a video wouldn't count as pirated if the compression used for the pirated copy was lossy (or at least sufficiently lossy). The closest real-world example would be the cases where someone records the screening of a movie on their phone and then uploads it. Such a copy is lossy enough that you can't produce anything really like the original, but by most definitions it is still considered copyright infringement.
The test is if a judge says it is fair use, nothing else.
The judge will take into account the human factor in this matter, e.g. things like who did the actual work, and who just used an algorithm (which is not the hard part anymore; the code can be obtained on the internet for free). And we all know that DL is nowhere without huge amounts of data.
Needing the original material isn't enough for claiming copyright infringement, as we have existing counterexamples.
The model isn’t storing the book.
I think that is the center of the conversation. What does it mean for a computer to "understand"? If I wrote some code that somehow transformed the text of the book and never returned the verbatim text, only somehow modified output, I would likely not be spared, because the ruling would likely be that my transformation is "trivial".
Personally, I think we have several fixes we need to make:
1. Abolish the CFAA.
2. Limit copyright to a maximum of 5 years from date of production with no extension possible for any reason.
3. Add an explicit carveout in copyright for transformational works. Explicitly allow format shifting, time shifting, yada yada.
4. Prohibit authors and publishers from including the now obviously false statements like "No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording" bla bla bla in their works.
5. I am sure I am missing some stuff here.
For brand protection, we already have trademark law. Most readers here already know this, but we really should sever the artificial ties we have created between patents, trademarks, and copyright.

In early computing, everything was closed source. Quoting the Wikipedia page:
> To develop a legal BIOS, Phoenix used a clean room design. Engineers read the BIOS source listings in the IBM PC Technical Reference Manual. They wrote technical specifications for the BIOS APIs for a single, separate engineer—one with experience programming the Texas Instruments TMS9900, not the Intel 8088 or 8086—who had not been exposed to IBM BIOS source code.
The legal team at Phoenix deemed it inappropriate to "recall source in their own words" for legal reasons.
My non-legal intuition is that these companies training their models are violating copyright. But, the stakes are too high--it's too big to fail if you will. If we don't do it, then our competitors will destroy us. How do you reconcile that?
The NYTimes in 2023 was able to demonstrate that the models can reproduce entire articles verbatim[0] with minimal coercion.
[0]https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
It would be a remarkable quirk of statistics that, if given all text on the internet except for the NYTimes back catalogue, a model would produce any NYT article.
Not of physical media. You're allowed to make archival copies of digital media.
> Or reading a book via a computer would be illegal
No, you purchased a license (or your library did, in the case of e-borrowing) to read the book on a computer. That makes it legal.
It becomes illegal if I try to distribute those copies
So the question is, does distributing an AI that has been trained on Harry Potter count as distributing Harry Potter?
A tool being used to create infringing copies of some other work (whether or not it is the source material used to create the tool, and whether or not the infringing output is a verbatim copy) is relevant to whether the tool vendor is liable for contributory infringement for that use. But the absence of a capacity for creating such copies isn't usually enough to say that the copying done to make the tool isn't itself infringing.
(That said, generative AI tools, including LLMs specifically, have been shown to have the capacity to make such copies, to the extent that vendors of hosted models are now putting additional checks on output to try to mitigate the frequency with which verbatim copies of substantial portions of training-set works are produced, so arguing that LLMs can't do that is silly.)
Exactly. I asked my Gemma how long a quote it could give me of a given book if I were the author and gave express permission, and I was a bit surprised it readily admitted it could:
> Without Permission (Current Limit): Single sentence.
> With Broad Permission (Full Reproduction Allowed): I could theoretically quote the entire book.
Eye-opening (for me, at least).
Transformers are fundamentally large compression algorithms where the target of compression is not just to minimize reconstruction loss + compressed file size. In fact, basically all of machine learning used today can be viewed through the lens of learning a compression algorithm with added goals other than the usual.
By this logic, if I create a lossy JPEG of a copyrighted image, it's not "copying" because of the lossy compression.
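For what it's worth, the lossy step in JPEG is quantization, and a toy version shows why lossiness alone doesn't make the result a different work (made-up pixel values; real JPEG quantizes DCT coefficients, not raw pixels):

```python
def quantize(pixels, step):
    """Lossy step: snap each value to the nearest multiple of `step`."""
    return [round(p / step) * step for p in pixels]

original = [12, 13, 14, 200, 201, 202, 90, 91]
lossy = quantize(original, 16)

# The exact values are gone (the copy is degraded), yet every value
# still tracks the original to within step/2 -- it is recognizably
# a copy of the same image, just a worse one.
print(lossy)
```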
There’s plenty of case law there…
Using content to train an LLM is not copying the content. (I'm ignoring the silly "but actually" arguments about the content being in RAM, so it's "copying".) It's using the content to generate a statistical model of token (word-ish) relationships and probabilities.

If you write content that is very original in its wording and I train an LLM against it, then there is certainly the possibility that the LLM could be provoked to recall the exact words you used. You'd have to set the parameters just right to make it happen, and I think that proper training would drastically lower, if not remove, that possibility. But even if it doesn't, the LLM doesn't have a copy of that original content. All it has is weights representing those relationship probabilities. Yes, the minutiae are more complex, but that is the essence.

If my LLM were to generate enough of this essentially verbatim unique content and I tried to publish or copyright it, then I, as the user, should be on the hook. But then you get into a discussion about how many words in a unique sequence it takes to be infringement.
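The "weights representing relationship probabilities" idea can be sketched with a toy bigram model (a deliberately tiny stand-in for an LLM; the corpus is made up):

```python
import random
from collections import defaultdict

corpus = "the quick brown fox jumps over the lazy dog".split()

# "Training": store which word follows which -- relationships, not text.
follows = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    follows[a].append(b)

# "Generation": walk the learned relationships. With a large corpus
# each list holds many alternatives; with tiny or highly unique
# wording the only continuation is the original phrasing, which is
# how verbatim recall can fall out of mere statistics.
random.seed(0)
word, out = "the", ["the"]
for _ in range(8):
    if word not in follows:
        break
    word = random.choice(follows[word])
    out.append(word)
print(" ".join(out))
```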
Obviously, I am not a lawyer.
My summation in all of this is that new laws need to be put into place to handle this stuff, because the existing ones are sufficiently non-definitive and/or ill-suited that every party is forming strong opinions about how old laws apply to new situations, causing massive friction.
If we're going that way, let me torrent every movie and TV show ever to "train" myself.
Copyright doesn't protect against all forms of duplication. For instance, you own the copyright to your post and grant HN a license to offer copies of it. I have no direct license from you to copy the content of your post; but I can copy it to memory, copy a cache to disk, and make a copy appear on my display.
It’s not a good example, because if you grant a license you give them the right to make copies. The problem is not when Meta got licenses, it’s when they did not.
Where your analogy goes wrong is you're saying you want to "[Circumvent] payment to obtain copyright material for training" to use Workaccount2's words.
Because I'm certainly not allowed to photocopy a library book in its entirety. And I guarantee you a Netflix subscription doesn't allow me to keep a copy of a movie on my hard drive and use it for training man or machine.
Exactly this. Legal copying requires a license.
https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....
I'm not a legal expert. My layman's understanding of the case above is Aereo was in violation because they made copies of content - content that the receiver was already allowed to access - available over the Internet to the intended receiver. That is to say, the copying was the problem.