Anyone could use those tools to download Creative Commons files and Linux ISOs, but those arguments did not succeed in the legal system. BitTorrent as a technology, however, was not made illegal, as can be seen in games using it to distribute patches.
Feature extraction is literally a form of lossy compression. You can prod DALL-E to make obvious copies of some of the works it was trained on, but even seemingly novel images could contain enough similarities to training material to be problematic.
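As a toy illustration of the "feature extraction is lossy compression" claim, here is a minimal Python sketch. It is not how DALL-E or any real model works: the function names `extract_features` and `reconstruct` are hypothetical, and block averaging stands in for a learned feature extractor purely as an assumption for the example.

```python
def extract_features(pixels, block=2):
    """Toy 'feature extractor': keep only the average of each block of values."""
    return [
        sum(pixels[i:i + block]) / len(pixels[i:i + block])
        for i in range(0, len(pixels), block)
    ]


def reconstruct(features, block=2, length=None):
    """Rebuild an approximation by repeating each feature `block` times."""
    out = [f for f in features for _ in range(block)]
    return out[:length] if length is not None else out


# A made-up row of 8 grayscale pixel values (illustrative data only).
pixels = [10, 12, 200, 198, 50, 54, 0, 8]

features = extract_features(pixels)                 # half as many numbers stored
approx = reconstruct(features, length=len(pixels))  # close to, but not equal to, the input
error = max(abs(a - b) for a, b in zip(pixels, approx))

print(features)
print(approx)
print(error)
```

The stored features are smaller than the input and the reconstruction is recognizably close to it without being identical, which is the sense of "lossy" in the comment. Whether that resemblance matters legally is the separate question argued in the rest of the thread.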
The standard isn’t “I think that looks like an Andy Warhol picture”, it is “That is substantially similar to a specific Andy Warhol piece”. Copyright doesn’t protect style.
> Feature extraction is literally a form of lossy compression.
This is one way to think of neural nets; another is that they find the topological space of pictures.
But these are just models of computation, which aren’t especially relevant in the same way that it isn’t relevant what produces an infringing image, just that it is produced.
Which brings me back to my original point: there are two different barriers for generative AI. Is the model itself transformative, and is the primary purpose of the model to generate copyright-infringing material?
With respect to the first… I have no idea how someone could argue that the model itself isn’t transformative enough. It isn’t “substantially similar” to any of the works that it is trained on. It might be able to generate things that are “substantially similar”, but the model itself isn’t… it’s just a bunch of numbers and code.
Regarding the second: I have less experience with image models, but I use ChatGPT regularly without even trying to violate copyright, and I don’t think I’m alone, so I doubt you could make an argument that LLMs have a primary purpose of committing copyright infringement.
That really doesn’t fly legally because any digital format is ‘just’ numbers.
The music industry has been going through exactly this for the last few years, and the courts have recognized that the creative process necessarily involves copying and that a small amount of copying is not infringement.
Critically, it’s not just a question of what percentage of a work is a copy of the original, but what part of the original work was copied. For example, copying 3 lines from a book is a tiny fraction of the book, but if those 3 lines are half of a poem, it’s well past the de minimis threshold.
Similarly, only a small percentage of a giant library of MP3s comes from any one work, but that’s not relevant.
https://www.heswithjesus.com/tech/exploringai/index.html
I’ve also seen GPT spit out proprietary content word for word that, as far as I’m aware, is not licensed for commercial use. They probably got it from web crawling without checking licenses.
What I want more than anything in this space right now are two models: one trained on all public domain books (e.g. Gutenberg), and one trained on permissively licensed code in at least Python, JavaScript, HTML/CSS, C, C++, and assembly languages. One layered on the other, but released individually. We could keep using those to generate everything from synthetic data to code to revenue-producing deliverables, all with nearly zero legal risk.
So either we carve out an explicit exception that machines aren’t allowed to do things that are remarkably similar to what humans do… which would be a massive setback for AI in the US.
Or we agree that generative models are subject to the same rules that humans are — they can’t commit copyright infringement, but are able to appropriately consume copyrighted material that a human would be able to consume.
The second option seems to me to be much simpler, nicer, and more appropriate than the first.
Where generative AI ingests copyrighted works in order to work and bases its output on it, then it is copyright infringement, equivalent to 'straight piracy' of all that it ingested, unless it's deemed fair use.
What Google does with its search engine, for example, is fair use; what Napster did was not.
Learning, by human or machine, means extracting a copy of the essence of something and, yes, storing that essence in a lossy way. It seems like learning from copyright-encumbered material ought to be either illegal for both or legal for both. I know which world I would rather live in.
This is an absurd standard. Is it copyright infringement when a human "ingests" copyrighted work and bases their output on it? Because that's commonly called inspiration and is how every artist creates their work - through experiencing other works and using that cumulative inspiration to form their own product.
Copyright law is already ridiculously restrictive as it is. This proposal not only fundamentally misunderstands how generative AI works but also penalises AI for doing what humans do every day.
* Does the model itself violate copyright?
* Does the output of the model violate copyright?
I don't know how you could make an argument that the ingestion of information into a model through a training procedure in order to create something that can generate truly unique outputs isn't transformative of the original works. The legal standard for a new work to be considered a copyright violation of an original work is "substantial similarity". I don't know how you can make an argument that a generative model is "substantially similar" to thousands of original works...
Honestly, I'm not even sure if "fair use" comes into play for the model itself. In order for fair use to come into play, the model has to be deemed to be violating some copyright. Only once it is found to be violating does "fair use" come into play in order to figure out if it is illegal or not.
The second question is the one where fair use is likely to come into play more. And this question has to be asked of each output. The model's legality only becomes an issue here if, like Napster, you can't argue that the model has much point other than violating copyright. Napster didn't violate copyright (the code for Napster wasn't infringing on anything), but it enabled the violation of copyright and didn't have much point other than that.
I don't think you can make that argument though. I use ChatGPT most days, and I've never gotten copyrighted material out of it. I could ask it to write me some Disney fan fiction, which would violate a copyright. And I think there is a valid legal question here about who is responsible for preventing me from doing that. This is where I think the gray area is.