Meta pirated at least 101 of my books, and others (opens in new tab)

(garymarcus.substack.com)

50 pointstu70011y ago55 comments

55 comments

The penalty for willful infringement in the US is pretty steep per instance.

Good point to think about. What specific act of training is infringing? Looks like none in this case, but happy to be corrected.

dfedbeef1y ago

Downloading and reproducing the pirated copy of the work. It's called a COPYright for a reason.

thedevilslawyer1y ago

Downloading is completely legal towards fair use (and hence non infringing).

No reproduction happens in the act of training an LLM.

1 more reply

puppycodes1y ago

i really dont get this and i personally beleive the world would be a better place without IP in any form.

But also no one is selling "your book", the product is completely different in literally every conceivable way.

you have never (and no one ever should) own words arranged in a certain way. You own the right to sell a book. Not the words themselves.

meta does bad things and im not a fan, but this really pales in comparison.

bernb1y ago

Put 10,000 hours into writing a book. Watch somebody with more resources or media coverage take full credits for it and/or make money instead of yourself. Copyright is a good thing. Same principle applies to the core of similar laws.

bigyabai1y ago

Copyright can be a good or bad thing, it doesn't stop businesses from arguing fair/transformative use, and winning in many cases: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

puppycodes1y ago

but literally no one is doing that in this scenario so... huh? Also the number of hours you put into a thing does not make it valuable.

pllbnk1y ago

No one is taking credit, but no one is giving credit either, which is nearly the same.

Also, to your second point, when you (speaking of any regular person) put in the number of hours, it might not be valuable. When the domain expert puts in the same number of hours, it might be very valuable.

1 more reply

zzo38computer1y ago

I agree with you. I think that copyright is bad, and patents is also bad.

It is a different issue if they steal your private data or your power (I mean the electrical power for the computers, in case that isn't already clear).

Making copies of published books, music, etc (and doing what you want with them) is not the bad things.

bernb1y ago

How would you replace the incentives for R&D for medication, for example? That often costs billions of dollars.

puppycodes1y ago

I take deep issue with this argument as if improving the lives of others isn't enough of an incentive. Ask most scientists whether they got into it for the money and you'll have your answer.

card_zero1y ago

Well, if you own copyright for a song, you can claim licensing fees for any public performance of those lyrics.

I wonder if an equivalent to Performance Rights Organizations will emerge as a channel for LLM publishers (so to speak) to pay fees.

puppycodes1y ago

and that is equally atrocious and should be eliminated from a society that wants to share ideas freely.

Idk if your in the US but you also massively oversimplify in your example, copyright law is waaaaaaaay more complex than that and it would take a set of special circumstances way beyond doing what you say it siphon money from an infringment claim

philistine1y ago

Copyright law is enshrined in a knotted web of trade deals across the world, meaning it will never change for as long as anyone on this board is alive. It's so tiring on here with the constant copyright law shouldn't exist. We know! How do we get rid of it? What's your argument for everyone to break all the world's trade deals to give you this one thing that would unmake our whole artistic sector?

1 more reply

MattPalmer10861y ago

A world without any legal protection for novel ideas or creative works is a world that does not share them freely.

That is why copyright law was originally created. Without it, there is little incentive to invest in the creation or disemmination of the works in the first place.

Edit: I am not defending the current form of those laws. The time period is too long for one thing. But removing it entirely would be a bad idea, IMHO.

1 more reply

card_zero1y ago

Yes, of course. I struggle to have an opinion here since I don't like either side in this fight, but I eventually squeezed one out, and there it is.

bernb1y ago

spatialspice1y ago

It feels like there are two equally valid sides to this argument that get muddied because of our current model’s/regulations inability to differentiate one over the other.

On the free-information side, I don’t think anyone would argue that AI shouldn’t be allowed to offer a general synopsis of a given book / series. From an author/creator’s POV, it feels like extortion to be able to summarize/recreate any given chapter/subsection to the point that the entire work could be reproduced near-verbatim.

IMO the question is, can we meaningfully draw a line between the two, and if so, how?

rich_sasha1y ago

I don't think anyone is stopping AI learning on the synopses of books. Or learning on books having paid licensing costs. It's the wanting to have cake, eat cake and for free that is falling.

3eb7988a16631y ago

In contrast to typical corporate crime, it seems there is documentation of upper management signing off on the decision.

Are there other juicy examples where the C-suite can be directly implicated? Always assumed that management knew how to leave instructions vague enough so as to keep their hands clean (a la meddlesome priests). The bad actor was always some middle-manager gone rogue.

acomjean1y ago

I think the main issue is that authors published books with the intention of human not machine consumption. Nobody though to put a contract in a book saying "human consumption only, not to be used to Train AI". Meta pirated the books in question, but what if they had bought a copy. Oddly cracking the encryption, a violation of the DMCA might be the infraction..

The courts have some tough questions to answer here.

CamperBob21y ago

(Shrug) We'll see what the courts say, Gary.

If training AI doesn't constitute fair use, you will lose more than you could ever possibly hope to gain. As will the rest of us.

Meanwhile, sublimate your dudgeon towards advocating for free access to the resulting models. That's what's important. Meta is not the company you want to go after here, since they released the resulting model weights.

ebiederm1y ago

To point out the obvious.

Unauthorized copying (aka pirating) is definitely a copyright violation.

That appears to be a huge problem with the large models and training. They don't secure legal access to the materials they train on, and thus fail to compensate authors for their work.

AKA students are required to buy or otherwise obtain legal access to their text books(like checking the book out of the library).

Training AI should play the same rules humans students have to follow.

MacsHeadroom1y ago

Obtaining copies of pirated works is not infringement. Unauthorized sharing is infringement but being on the receiving end of sharing is not (even if one is an active participant).

thedevilslawyer1y ago

And to point out the obvious - it seems training is not unauthorized copying. (At least this is the current legal status quo)

anoncow1y ago

This. I am not asking for a special 1000x fee for AI. Just pay the normal fee a human would have paid, but at least pay that.

verzali1y ago

Are you also willing to work for OpenAI for free then? Have you contacted them with such an offer?

CamperBob21y ago

As long as I have access to the resulting model, sure. I thought I made that clear. Copyright is not as important as reaching the next stage of our intellectual evolution. Current-gen AI may not be sufficient to reach that stage, but I believe it is a necessary step.

Like the author of this screed, my work went into training every major model. I get paid back every time one of those models helps me learn or do something. The injustice, if it happens, will occur when a few well-heeled players like OpenAI succeed in locking the technology up with regulatory capture or (worse) if a few greedy, myopic assholes render it illegal or uneconomical to continue development by advocating copyright maximalism.

thedevilslawyer1y ago

It's like saying that because a student reads a textbook, they now have to work for the author for free?

alanfranz1y ago

Does fair use imply that pirating copyrighted material is ok?

I mean, it’s a serious question; I don’t see this as really connected.

As long as an AI can “understand” the content of a book and spit out a summary of it, or even leverage what it learned to perform further inference, I’d be inclined to say that this is fair use; a human would do the same.

But this has nothing to do with using pirated material for training, especially for some kind of commercial purpose (even if llama is free, they’re building on top of it) - I don’t see why it should be legal.

thedevilslawyer1y ago

Fair use is literally that:

"Fair use" in copyright law allows limited, specific uses of copyrighted material without permission.

Hence, by definition, not "pirating".

ricardobeat1y ago

I get the commercial/legal angle, but from the viewpoint of AI being something we as a society have an interest in developing, how should this work?

Do you want to severely limit evolution of models by having them pick (and buy) a tiny subset of all books?

Should every training run put money into a pool that gets paid out to every rights holder of every book that has ever been published?

Should Meta buy a physical or electronic copy of every book they want to use for training? That has zero impact on revenue for individual authors.

Would they be paid by word, by token, by book? This makes little sense. We don’t charge people for the knowledge they acquired while going to the library over 50 years, AI just squeezes this into weeks. Our legal framework simply doesn’t fit.

card_zero1y ago

> Should every training run put money into a pool that gets paid out to every rights holder of every book that has ever been published?

That could actually work. Bearing in mind that all copyright laws are messy and terrible, this proposal is at least not impossible.

"Ever been published" means in the last 100 years.

1 more reply

alanfranz1y ago

> Should Meta buy a physical or electronic copy of every book they want to use for training?

Yes, and probably, if training in parallel, multiple copies, just as multiple people will need multiple books.

Multiply this by the amount of GPUs and AI model providers, and the revenue impact is not zero.

OneDeuxTriSeiGo1y ago

Why should it be fair use? Why would being a derivative work not be OK? There is a massive corpus of public domain and FOSS works. Likewise plenty of permissively licensed government created datasets. There is no reason why any corpus created from these sources is insufficient.

throwaway1501y ago

> Why would being a derivative work not be OK?

That's not even the real problem. It's a problem, yes, but not the real problem. The problem is that before they could train the model on the book, they had to copy the book from somewhere. Is it ok to make illegal pirated copies of a copyrighted book to train your model? I think that's the issue we are dealing with here.

Whether it is ok to create a derivative work or not is beside the point.

CamperBob21y ago

The problem is that before they could train the model on the book, they had to copy the book from somewhere.

That, in itself, raises kind of an interesting point.

Right now there's a post on the front page where people are exercising conspicuous outrage because ChatGPT rendered a good Indiana Jones likeness in response to a vague query asking for a 1930s archaeologist with a bullwhip. Was that particular response generated by ChatGPT because it "copied" Indiana Jones? Or because it was influenced by the same pulp fiction stories and deeply-embedded cultural archetypes that led Spielberg and Lucas to create the character in the first place?

card_zero1y ago

The illustration shows a page from Matter by Iain M. Banks. I don't suppose that's an IP violation, but it implies a human artist with attention to detail.

Mind you, it's page 1 and the book is not on page 1.

spudlyo1y ago

Speaking broadly, the publishers who hold the copyrights on these materials have often behaved poorly. From overbroad DMCA takedown demands that chill fair use, to threatening libraries, students, and scholars with lawsuits and stiff penalties for minor infringements, to "copyright trolling" campaigns sending mass settlement demands to alleged infringers -- I have little sympathy for copyright holders.

I'm still angry about how publishers and the Authors Guild sued Google over Google Books. Intellectual property is why we as a society can't have nice things. While I'm not a fan of Meta, their open weight models are probably one of the best things they've ever done, and I'll back big tech over publishers every time.

moscoe1y ago

The idea that you can’t train on copyrighted materials is ludicrous, imho. So apparently you don’t want humanity and the future of intelligence to benefit from your work? You just want it to keep it locked up in some archive that virtually no one ever reads?

Might as well say the people who read your books aren’t allowed to teach the concepts or theories. Completely asinine argument. If you don’t want the knowledge to proliferate, then don’t publish. They’re not copying and redistributing.

Meanwhile, jurisdictions outside of us copyright protection will leapfrog us because we can’t get out of our own way.

sepositus1y ago

I’m not sure that’s the whole picture. Followed to the logical conclusion, everyone should have the right to pirate whatever books they want and then feed them into a local LLM. Which leads to less kickback to the author, which means they can’t sustainably write, and we end up in a worse off situation.

throwaway1501y ago

> The idea that you can’t train on copyrighted materials is ludicrous, imho.

Let us for a minute accept that it is ok to train on copyrighted materials. I don't believe that but I'll humor you. So let's accept it.

To train on copyrighted materials, they need to purchase the copyrighted materials, correct? If you wanted to train a model on all O'reilly books, you'd purchase the O'reilly books first, wouldn't you?

Do you think it is ok to make illegal pirated copies of the book to do your training?

philistine1y ago

People on this board have mischaracterized this case so much it looks like Facebook employees trying to astroturf.

entangledqubit1y ago

Meta's more open models aside, are the other copyright abusers giving out their model for free?

fsckboy1y ago

what would be interesting to learn is,

"Mega can regurgitate virtually any excerpt from any of my books, there for they have stolen them"

versus what is not interesting such as

"my books are in libgen therefore they stole my work, even though I can't find direct evidence of the theft"

>The most damning thing? It appears that Meta knew exactly what they are doing, and chose to proceed anyway.

that is not the most damning thing. It might trigger worse damages or elevation of the severity of an infraction, but it is not evidence of guilt per se, which is what I would call "damning"

CaffeineLD501y ago

I guess Zuck's new found interest in manly combat sports is based on his expectation of seeing PDiddy, SBF, and Luigi in club fed over his l33+ pir8cy. It all makes sense now.

#freezuck

1 more reply

j / k navigate · click thread line to collapse

55 comments

dfedbeef1y ago

The penalty for willful infringement in the US is pretty steep per instance.

thedevilslawyer1y ago

Good point to think about. What specific act of training is infringing? Looks like none in this case, but happy to be corrected.

dfedbeef1y ago

Downloading and reproducing the pirated copy of the work. It's called a COPYright for a reason.

thedevilslawyer1y ago

Downloading is completely legal towards fair use (and hence non infringing).

No reproduction happens in the act of training an LLM.

1 more reply

puppycodes1y ago

i really dont get this and i personally beleive the world would be a better place without IP in any form.

But also no one is selling "your book", the product is completely different in literally every conceivable way.

you have never (and no one ever should) own words arranged in a certain way. You own the right to sell a book. Not the words themselves.

meta does bad things and im not a fan, but this really pales in comparison.

bernb1y ago

bigyabai1y ago

Copyright can be a good or bad thing, it doesn't stop businesses from arguing fair/transformative use, and winning in many cases: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

puppycodes1y ago

but literally no one is doing that in this scenario so... huh? Also the number of hours you put into a thing does not make it valuable.

pllbnk1y ago

No one is taking credit, but no one is giving credit either, which is nearly the same.

1 more reply

zzo38computer1y ago

I agree with you. I think that copyright is bad, and patents is also bad.

It is a different issue if they steal your private data or your power (I mean the electrical power for the computers, in case that isn't already clear).

Making copies of published books, music, etc (and doing what you want with them) is not the bad things.

bernb1y ago

How would you replace the incentives for R&D for medication, for example? That often costs billions of dollars.

puppycodes1y ago

I take deep issue with this argument as if improving the lives of others isn't enough of an incentive. Ask most scientists whether they got into it for the money and you'll have your answer.

card_zero1y ago

Well, if you own copyright for a song, you can claim licensing fees for any public performance of those lyrics.

I wonder if an equivalent to Performance Rights Organizations will emerge as a channel for LLM publishers (so to speak) to pay fees.

puppycodes1y ago

and that is equally atrocious and should be eliminated from a society that wants to share ideas freely.

philistine1y ago

1 more reply

MattPalmer10861y ago

A world without any legal protection for novel ideas or creative works is a world that does not share them freely.

That is why copyright law was originally created. Without it, there is little incentive to invest in the creation or disemmination of the works in the first place.

Edit: I am not defending the current form of those laws. The time period is too long for one thing. But removing it entirely would be a bad idea, IMHO.

1 more reply

card_zero1y ago

Yes, of course. I struggle to have an opinion here since I don't like either side in this fight, but I eventually squeezed one out, and there it is.

bernb1y ago

spatialspice1y ago

It feels like there are two equally valid sides to this argument that get muddied because of our current model’s/regulations inability to differentiate one over the other.

IMO the question is, can we meaningfully draw a line between the two, and if so, how?

rich_sasha1y ago

I don't think anyone is stopping AI learning on the synopses of books. Or learning on books having paid licensing costs. It's the wanting to have cake, eat cake and for free that is falling.

3eb7988a16631y ago

In contrast to typical corporate crime, it seems there is documentation of upper management signing off on the decision.

acomjean1y ago

The courts have some tough questions to answer here.

CamperBob21y ago

(Shrug) We'll see what the courts say, Gary.

If training AI doesn't constitute fair use, you will lose more than you could ever possibly hope to gain. As will the rest of us.

ebiederm1y ago

To point out the obvious.

Unauthorized copying (aka pirating) is definitely a copyright violation.

That appears to be a huge problem with the large models and training. They don't secure legal access to the materials they train on, and thus fail to compensate authors for their work.

AKA students are required to buy or otherwise obtain legal access to their text books(like checking the book out of the library).

Training AI should play the same rules humans students have to follow.

MacsHeadroom1y ago

Obtaining copies of pirated works is not infringement. Unauthorized sharing is infringement but being on the receiving end of sharing is not (even if one is an active participant).

thedevilslawyer1y ago

And to point out the obvious - it seems training is not unauthorized copying. (At least this is the current legal status quo)

anoncow1y ago

This. I am not asking for a special 1000x fee for AI. Just pay the normal fee a human would have paid, but at least pay that.

verzali1y ago

Are you also willing to work for OpenAI for free then? Have you contacted them with such an offer?

CamperBob21y ago

thedevilslawyer1y ago

It's like saying that because a student reads a textbook, they now have to work for the author for free?

alanfranz1y ago

Does fair use imply that pirating copyrighted material is ok?

I mean, it’s a serious question; I don’t see this as really connected.

thedevilslawyer1y ago

Fair use is literally that:

"Fair use" in copyright law allows limited, specific uses of copyrighted material without permission.

Hence, by definition, not "pirating".

ricardobeat1y ago

I get the commercial/legal angle, but from the viewpoint of AI being something we as a society have an interest in developing, how should this work?

Do you want to severely limit evolution of models by having them pick (and buy) a tiny subset of all books?

Should every training run put money into a pool that gets paid out to every rights holder of every book that has ever been published?

Should Meta buy a physical or electronic copy of every book they want to use for training? That has zero impact on revenue for individual authors.

card_zero1y ago

> Should every training run put money into a pool that gets paid out to every rights holder of every book that has ever been published?

That could actually work. Bearing in mind that all copyright laws are messy and terrible, this proposal is at least not impossible.

"Ever been published" means in the last 100 years.

1 more reply

alanfranz1y ago

> Should Meta buy a physical or electronic copy of every book they want to use for training?

Yes, and probably, if training in parallel, multiple copies, just as multiple people will need multiple books.

Multiply this by the amount of GPUs and AI model providers, and the revenue impact is not zero.

OneDeuxTriSeiGo1y ago

throwaway1501y ago

> Why would being a derivative work not be OK?

Whether it is ok to create a derivative work or not is beside the point.

CamperBob21y ago

The problem is that before they could train the model on the book, they had to copy the book from somewhere.

That, in itself, raises kind of an interesting point.

card_zero1y ago

The illustration shows a page from Matter by Iain M. Banks. I don't suppose that's an IP violation, but it implies a human artist with attention to detail.

Mind you, it's page 1 and the book is not on page 1.

spudlyo1y ago

moscoe1y ago

Meanwhile, jurisdictions outside of us copyright protection will leapfrog us because we can’t get out of our own way.

sepositus1y ago

throwaway1501y ago

> The idea that you can’t train on copyrighted materials is ludicrous, imho.

Let us for a minute accept that it is ok to train on copyrighted materials. I don't believe that but I'll humor you. So let's accept it.

Do you think it is ok to make illegal pirated copies of the book to do your training?

philistine1y ago

People on this board have mischaracterized this case so much it looks like Facebook employees trying to astroturf.

entangledqubit1y ago

Meta's more open models aside, are the other copyright abusers giving out their model for free?

fsckboy1y ago

what would be interesting to learn is,

"Mega can regurgitate virtually any excerpt from any of my books, there for they have stolen them"

versus what is not interesting such as

"my books are in libgen therefore they stole my work, even though I can't find direct evidence of the theft"

>The most damning thing? It appears that Meta knew exactly what they are doing, and chose to proceed anyway.

that is not the most damning thing. It might trigger worse damages or elevation of the severity of an infraction, but it is not evidence of guilt per se, which is what I would call "damning"

CaffeineLD501y ago

I guess Zuck's new found interest in manly combat sports is based on his expectation of seeing PDiddy, SBF, and Luigi in club fed over his l33+ pir8cy. It all makes sense now.

#freezuck

1 more reply

j / k navigate · click thread line to collapse