That’s not how “derivative works”, well, work.
First of all, a thing can only be a derivative work if it is itself an original work of authorship.
Otherwise, it might be (or contain) a complete or partial copy of one or more source works (which, if it doesn't fall under a copyright exception, would still be at least a potential violation), but it's not a derivative work.
Plus, if this is true, it would be a loophole. Plus this is totally crazy.
It would be great if courts declared WHAT is the case. But they won't, because copyright only protects massive companies.
No, I'm saying that your explanation of what makes something a derivative work is wrong. Now, personally, I think there is a very good argument that LLMs and similar models, if they have a copyright at all, do so only because of whatever copyright can be claimed on the training set as a work of its own (which, if it exists, would be a compilation copyright), as a work of authorship of which the model is a mechanical transformation (similar to object code having a copyright as a consequence of the copyright on the source code, which is a work of authorship). It's also quite arguable that they are not subject to copyright at all, and many have made that argument.
> So anyone running those models can just freely copy them if they have access to them?
I'm not arguing for that, but yes, that is the consequence if they are not subject to copyright, assuming no other (e.g., contractual) prohibition binds the parties seeking to make copies.
> And, of course, it means distillation attacks, even if they do turn out to copy the OpenAIs/Anthropic/... model are just 100% perfectly legal?
Distillation isn't an “attack” and probably isn't a violation of copyright even if models are protected; it literally interacts with the model through its interface to reproduce its function, which is functional reverse engineering.
Distillation is a violation of ToS, for which there are remedies outside of copyright.
> I mean paying someone to break into the DC and then putting the model on torrent would allow anyone downloading it to use it, legally.
Paying someone to break into the DC and do that would subject you to criminal charges for burglary and conspiracy, and civil liability for the associated torts as well as for theft of trade secrets covering the resulting harms, even without copyright protection.
> Plus, if this is true, it would be a loophole. Plus this is totally crazy.
It's not a “loophole” that copyright law only covers works of original authorship; it is the whole point of copyright law.
> It would be great if courts declared WHAT is the case.
If there is a dispute which turns on what is the case, courts will rule one way or the other on the issues necessary to resolve it. Courts (in the US at least) do not rule on issues not before them, except to the extent that a general rule which resolves, but covers somewhat more than, the immediate case can usefully be articulated by an appellate court.
> But they won't, because copyright only protects massive companies.
Leaving out any question of whether the premise of this claim is true, the conclusion doesn't follow from it, since “what is the case” here is the kind of thing that is quite likely to be an issue between massive companies at some point in the not too distant future, requiring courts to resolve it even if they only address the meaning of copyright law for that purpose.
And btw: so a "compilation copyright" would apply to the training data. Great. That only means, of course, that if they publish their training data (as they agreed to when basing their models on GPL code), people can't republish the exact same collection under different conditions (BUT they can under the same conditions). Everyone will happily follow that rule, don't worry.
> Paying someone to break into the DC and do that would subject you to criminal charges for burglary and conspiracy, and civil liability for the associated torts as well as for theft of trade secrets covering the resulting harms, even without copyright protection.
I don't claim the break-in would be legal, but without copyright protection, if that made a model leak, it would be fair game for everyone to use.
> Distillation is a violation of ToS, for which there are remedies outside of copyright.
But the models were created by violating the ToS of webservers! This has the exact same problem the copyright violations have, only far, far bigger! Scraping webservers is a violation of the ToS of those servers; see, for example, [1]. Almost all have language somewhere that only allows humans, not bots, to browse them, and IF bots are allowed at all (certainly not always), only specific bots for the purpose of indexing. So this is a much bigger problem for AI labs than even the GPL issue.
So yes, if you wanted to make the case that the AI labs, and large companies, violate all kinds of contracts, not just copyright licenses: excellent argument. But I know how that goes already: I'm a consultant, and I've had to sue two very large companies over terms of payment, and won. In one case, I had to do something called "forced execution" of the payment order (i.e., going to the bank and demanding that the bank execute the transaction against a random account of the company, against the will of the large company. Let me tell you, banks DO NOT like doing this).
Btw: what model training is doing, obviously, is distilling from the work, from the brains, of humans, against the will of those humans, and without paying for it. So under any reasonable interpretation, that's also a ToS violation. Probably a lot more implicit than the ones spelled out on websites, but not fundamentally different.
[1] https://www.bakerdatacounsel.com/blogs/terms-of-use-10-thing...