Better HN

0 points CamperBob2 9mo ago 0 comments

Anyone who doesn't train on all material available, legal or otherwise, will be outcompeted by teams that do, including those based in countries that don't respect Western copyright law. It's that simple.

Either this practice is judged (or legislated) to be fair use, or copyright is done. It's also that simple.

atrettel 9mo ago

I'm not convinced that LLMs and other AI models need to train on all material available. A representative sample is better.

I'll ignore the legality aspects in my response. I think coming up with a representative sample of all relevant information would be better in the long term (teams will not be outcompeted on long time horizons). Why don't the companies do this? Because it is easier to just "carpet bomb the parameter space" and worry about the potential confounding [1] and sampling bias [2] later. Coming up with a representative sample requires domain expertise and that is expensive in terms of time and money. But it reduces the total amount of training data and should reduce the amount of time and resources it takes to build the models. That may matter now that models are quite large.

This is definitely a design decision with tradeoffs on both sides. I can entertain the notion that we don't have time to sample things, but I think we are all too often dismissing the long-term benefits of proper sampling.
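To make the sampling idea concrete, here is a minimal sketch in Python of proportional stratified sampling: group documents by domain, then draw the same fraction from each group so the smaller corpus keeps the domain mix of the full one. The `stratified_sample` helper, the domain labels, and the toy corpus sizes are all hypothetical and chosen only for illustration; a real pipeline would need domain expertise to define the strata, which is the expensive part.

```python
import random
from collections import defaultdict


def stratified_sample(docs, fraction, seed=0):
    """Draw roughly `fraction` of the documents from each domain.

    `docs` is a list of (domain, text) pairs. At least one document is
    kept per domain so that small domains are not dropped entirely.
    This is an illustrative helper, not part of any library.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the texts by their domain label.
    by_domain = defaultdict(list)
    for domain, text in docs:
        by_domain[domain].append(text)

    sample = []
    for domain in sorted(by_domain):
        texts = by_domain[domain]
        k = max(1, round(len(texts) * fraction))
        # Sample without replacement within each domain.
        for text in rng.sample(texts, k):
            sample.append((domain, text))
    return sample


# Toy corpus with made-up proportions: 80 web pages, 15 books, 5 papers.
corpus = (
    [("web", f"web-{i}") for i in range(80)]
    + [("books", f"book-{i}") for i in range(15)]
    + [("papers", f"paper-{i}") for i in range(5)]
)
subset = stratified_sample(corpus, fraction=0.2)
```

With these toy numbers the subset holds 20 documents instead of 100 while preserving the 80/15/5 split, which is the sense in which a representative sample shrinks the training set without skewing it.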

(In terms of the legality aspects, judges are trying to "split the baby" [3] in my opinion by saying that training on stuff you got legally is OK but training on pirated material isn't. So nobody is going to recommend training on pirated material in the first place.)

[1] https://en.wikipedia.org/wiki/Confounding

[2] https://en.wikipedia.org/wiki/Sampling_bias

[3] https://www.404media.co/judge-rules-training-ai-on-authors-b...

CamperBob2 OP 9mo ago

Perhaps, but it seems safe to assume that the most valuable training material will be the 'illegal' material that is copyright-encumbered.

spaceport 9mo ago

Quality. The transformable value in all data is not equal.

9dev 9mo ago

Or neither happens and the corporations will just continue to evade laws and taxes to their benefit.

sigseg1v 9mo ago

Outcompeted in the competition of what, exactly? How quickly they can produce inaccurate garbage?

alfalfasprout 9mo ago

So, what? Authors and rights holders are supposed to just take it?

Copyright law exists for a reason. Trying to improve an LLM doesn't give you the right to flout our legal system. Yes, other countries might have an advantage in LLM training as a result but so be it.

crazygringo 9mo ago

> Authors and rights holders are supposed to just take it?

If it's judged as fair use, then yes. And then it's not flouting anything.

Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.

For example, nonfiction authors already "just take it" when reviews describe the main points of their book without paying them a cent. The justification is that it's for the greater good, and rights are limited.

atrettel 9mo ago

Judges have recently ruled [1] that training on legally obtained materials constitutes fair use, but we will have to see in the long term if that ruling holds up.

[1] https://www.404media.co/judge-rules-training-ai-on-authors-b...

dns_snek 9mo ago

> Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.

That's a rather bastardized and twisted representation of copyright and fair use.

The "whole point" of copyright was to promote the authorship of original creative works by legally protecting the financial income of those authors. The "whole point" of fair use was to make exceptions in cases where it's clear that the usage doesn't result in a market substitute and deprive original authors of their income.

The end goal of LLMs is to ingest all of that original content and reproduce it with expert-level accuracy, promising to be the know-all, end-all product. If the wildly optimistic predictions of LLM proponents turn out to be correct, then people will never buy a book again; they will have no reason to. And this is precisely what copyright was designed to protect authors against.

Night_Thastus 9mo ago

>the whole point of fair use is to benefit society

I'll stop you right there - I really don't think that applies at all. Does 'society' really benefit when the whole thing is a funnel for enormous amounts of wealth to go to already-gigantic companies like Microsoft?

bfrankline 9mo ago

> Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.

How do you think masked language models work?
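For anyone unfamiliar with the term, a rough sketch in Python of how training inputs for a masked language model are prepared: some tokens are hidden, and the model is trained to predict the original tokens at those positions. The whitespace tokenization, the `mask_tokens` helper, and the fixed seed are illustrative assumptions, not any particular library's API; the 15% default follows the rate commonly used for BERT-style models.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a fraction of tokens and record the originals as targets.

    Returns (masked_tokens, targets), where `targets` maps each masked
    position to the token the model would be trained to reconstruct.
    Illustrative helper only; real pipelines use subword tokenizers.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so short inputs still yield a target.
    k = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), k))

    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the training label is the original text
        masked[pos] = MASK
    return masked, targets


tokens = "copyright law exists for a reason".split()
masked, targets = mask_tokens(tokens)
```

The point relevant to the quoted claim is that the training target at each masked position is the original token itself.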

kmoser 9mo ago

If I were a writer, I'd consider publishing my works under a license that explicitly bans AI training. What happens when those works inevitably get ingested by an LLM?

bee_rider 9mo ago

It seems like it could conceivably be fair in some sense, as long as the models were actually released as open-weights (for the benefit of society).

hyperman1 9mo ago

Copyright law indeed exists for a reason. And that reason was that church and crown felt threatened by the power of printing presses to distribute ideas they couldn't control. 'To promote the useful arts' has always been a way to sell the idea to the masses.

CamperBob2 OP 9mo ago

"...but so be it."

That phrase is carrying a lot of water, isn't it? Trillions of dollars worth by some estimates.
