undefined | Better HN

0 pointsechelon1y ago0 comments

> AI models are intellectual property

If companies train on data they don't own and expect to own their model weights, that's hypocritical.

Model weights shouldn't be copyrightable if the training data was pilfered.

But this hasn't been tested because models are locked away in data centers as trade secrets. There's no opportunity to observe or copy them outside of using their outputs as synthetic data.

On that subject, training on model outputs should be fair use, and an area we should use legislation to defend access to (similar to web scraping provisions).

0 comments

dragonwriter1y ago

> If companies train on data they don't own and expect to own their model weights, that's hypocritical.

Its not hypocritical to follow a line of legal analysis whoch holds that copying material in the course of training AI on it is outside the scope of copyright protection (as, e.g., fair use in the US), but that the model weights resulting from the training are protected by copyright.

It maybe wrong, and it may be convenient for the interests of the firms involved, but it is not self-inconsistent in the way required for it to be hypocrisy.

Mordisquitos1y ago

If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.

Educated human beings are not protected by copyright, hence neither should trained AI models. Conversely, if a copyrightable work is produced based on work which itself is copyrighted, the resulting work needs the consent of the original authors of the prior work.

AI models can't have their ©ake and eat it.

dragonwriter1y ago

> If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.

No one training (foundation) models makes that fair use argument by analogy, they make arguments that addresses the specific statutory and case law criteria for fair use (abd frequently focus on the transformative character of the use); its true that the analogy to a learning human argument is frequently made in internet fora by AI enthusiasts who aren't the people training models on vaat scraped datasets. That argument is bunk for a number of reasons, but most critically the fact that a human learning from material isn’t fair use, because a human brain isn’t treated as a fixed medium, so learning in a human brain isn’t legally a copy or derivative work that would violate copyright without the fair use exception, so it's not a use to which fair use analysis even applies, so you can't argue anything is fair use by analogy to that. But its moot to any argument for hypocrisy by the big model makers, because they aren’t using that argument to start with.

drdeca1y ago

If I take 1000 books and count the distributions of the lengths of the words, and the covariance between the lengths of one word and the next word for each book, and how much this covariance matrix tends to vary across the different books, and other things like this, and publish these summaries, it seems fairly clear to me that this should count as fair use.

(Such a model/statistical-summary, along with a dictionary, could be used to generate nonsensical texts which have similar patterns in terms of just word lengths.)

Should the resulting work be protected by copyright? I’m not entirely sure…

I guess one thing is, the specific numbers I obtain by doing this are not a consequence of any creative decision making on my part, which I think in some jurisdictions (I don’t remember which) plays a role in whether a work is copyrightable (I will use “copyrightable” as an abbreviation for “protected by copyright”. I don’t mean to imply a requirement that someone specifically registers for copyright.). (Iirc this makes it so phone books are copyrightable in some jurisdictions but not others?)

The particular choice of statistical analysis does seem like it may involve creative decision making, but that would just be about like, what analysis I describe, and how the numbers I publish are to be interpreted, not what the numbers are? (Analogous to the source code of an ML model, not the parameters.)

Here is another question: suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, but requires a large (expensive) amount of compute to produce, and which also uses a lot of randomness so that the result would be different each time it was done (but suppose also that there isn’t much point doing it multiple times at the same scale, as having two of this kind of data artifact wouldn’t be much more valuable than having one).

Should such data artifacts be protected by copyright or something like it?

Well, if copyright requires creative human decision making, then they wouldn’t be.

It seems like it would make sense to want it to be economically incentivized to create such data artifacts of higher sizes (to a point of course. Only as much as is justified by the value that is produced by them being available.) .

If such data artifacts can always be distributed without restriction, then ones that are publicly available would be public goods, and I guess only ones that are trade secrets would be private goods? It seems to me like having some mechanism to incentivize their creation and being-eventually-freely-distributed would be beneficial?

But maybe copyright isn’t the best way to do that? Idk.

3 more replies

OkayPhysicist1y ago

The model weights are the result of an automated process, by definition, and thus not protected by copyright.

In my unusually well-informed on copyright but not a lawyer opinion, without any new legislation on the subject, I suspect that the most likely scenario for intellectual property rights surrounding AI is that using other people's works for training probably falls under fair use, since it's extremely transformative (an AI that makes text and a textual work are very different things) and it's extremely difficult to argue that the AI, as it exists today, directly impacts the value of the original work.

The list of what traing data to use is probably protected by copyright if hand-picked, otherwise just whatever web-crawler they wrote to gather it.

The AI models, as in, the inference and training applications are protected by copyright, like any other application.

The architecture of a particular AI model can be protected by patents.

The weights, as the result of an automated process, are probably not protected by copyright.

dragonwriter1y ago

> The model weights are the result of an automated process, by definition, and thus not protected by copyright.

Object code is the result of an automated process and is covered by the copyright on the source code.

Compilations are covered by copyright separate from that of the individual works, and it is arguable that a training set would be covered by a compilation copyright, and the result of applying an automated training processs to it would remain covered by that copyright.

tjr1y ago

I think it is fair to say that existing copyright law was not written to handle all of this. It was written for people who created works, and for other people who were using those works.

To substitute either party with a computer system and assume that the existing law still makes sense may be assuming too much.

rsynnott1y ago

There are certainly publicly available weights with restrictive licenses (eg some of the StableDiffusion stuff). I’d agree that it’d seem fairly perverse to say “our process for making this by slurping in a ton of copyright content was not copyright theft, but your use of it outside our restrictive license is”, but then I’m not a lawyer.

ANewFormation1y ago

Now that you mention it, I'm quite surprised that none of the typical fanatical IP lawsuiters had sued arguing (reasonably I think) that the output of the LLMs is strongly suggestive that they have been trained on copyrighted materials. Get the lawsuit to discovery, and those data centers become fair game.

Perhaps 'strongly suggestive' isn't enough.

Onawa1y ago

Wasn't that the goal of both the New York Times lawsuit and other class action lawsuits from authors?

https://harvardlawreview.org/blog/2024/04/nyt-v-openai-the-t...

https://www.publishersweekly.com/pw/by-topic/industry-news/p...

dragonwriter1y ago

> Now that you mention it, I'm quite surprised that none of the typical fanatical IP lawsuiters had sued arguing (reasonably I think) that the output of the LLMs is strongly suggestive that they have been trained on copyrighted materials. Get the lawsuit to discovery, and those data centers become fair game.

No, there have been lawsuits, and the data centers have not been fair game because whether or not the models were trained on copyright-protected works is not generally in dispute. Discovery only applies to evidence relevant to facts in dispute.

cabalamat1y ago

> strongly suggestive that they have been trained on copyrighted materials

Given that everything -- including this comment -- is copyrighted unless it is (1) old or (2) deliberately put into the public domain, this is almost certainly true.

rusk1y ago

Isn’t this comment in the public domain? I presume that’s what I’m doing when I’m posting on a forum. If somebody copied and pasted something I wrote on here could I in theory use copyright law to restrict distribution? I think the law would say I published it on a public forum and thus it is in the public domain.

2 more replies

mikewarot1y ago

>models are locked away in data centers as trade secrets

The architecture and the weights in a model are the secret process used to make a commercially valuable output. It makes the most sense to treat them as a trade secret, in a court of law.

j / k navigate · click thread line to collapse

0 comments

dragonwriter1y ago

> If companies train on data they don't own and expect to own their model weights, that's hypocritical.

It maybe wrong, and it may be convenient for the interests of the firms involved, but it is not self-inconsistent in the way required for it to be hypocrisy.

Mordisquitos1y ago

AI models can't have their ©ake and eat it.

dragonwriter1y ago

drdeca1y ago

(Such a model/statistical-summary, along with a dictionary, could be used to generate nonsensical texts which have similar patterns in terms of just word lengths.)

Should the resulting work be protected by copyright? I’m not entirely sure…

Should such data artifacts be protected by copyright or something like it?

Well, if copyright requires creative human decision making, then they wouldn’t be.

But maybe copyright isn’t the best way to do that? Idk.

3 more replies

OkayPhysicist1y ago

The model weights are the result of an automated process, by definition, and thus not protected by copyright.

The list of what traing data to use is probably protected by copyright if hand-picked, otherwise just whatever web-crawler they wrote to gather it.

The AI models, as in, the inference and training applications are protected by copyright, like any other application.

The architecture of a particular AI model can be protected by patents.

The weights, as the result of an automated process, are probably not protected by copyright.

dragonwriter1y ago

> The model weights are the result of an automated process, by definition, and thus not protected by copyright.

Object code is the result of an automated process and is covered by the copyright on the source code.

tjr1y ago

I think it is fair to say that existing copyright law was not written to handle all of this. It was written for people who created works, and for other people who were using those works.

To substitute either party with a computer system and assume that the existing law still makes sense may be assuming too much.

rsynnott1y ago

ANewFormation1y ago

Perhaps 'strongly suggestive' isn't enough.

Onawa1y ago

Wasn't that the goal of both the New York Times lawsuit and other class action lawsuits from authors?

https://harvardlawreview.org/blog/2024/04/nyt-v-openai-the-t...

https://www.publishersweekly.com/pw/by-topic/industry-news/p...

dragonwriter1y ago

cabalamat1y ago

> strongly suggestive that they have been trained on copyrighted materials

Given that everything -- including this comment -- is copyrighted unless it is (1) old or (2) deliberately put into the public domain, this is almost certainly true.

rusk1y ago

2 more replies

mikewarot1y ago

>models are locked away in data centers as trade secrets

The architecture and the weights in a model are the secret process used to make a commercially valuable output. It makes the most sense to treat them as a trade secret, in a court of law.

j / k navigate · click thread line to collapse