Nvidia contacted Anna's Archive to access books (opens in new tab)

(torrentfreak.com)

249 pointsantonmks4mo ago158 comments

158 comments

> In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models.

Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

ThrowawayR24mo ago

Yes, it's been discussed many times before. All the corporations training LLMs have to have done a legal analysis and concluded that it's defensible. Even one of the white papers commissioned by the FSF ( "Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model" at https://www.fsf.org/licensing/copilot/copyright-implications... ), concluded that using copyrighted data to train AI was plausibly legally defensible and outlined the potential argument. You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

jkaplowitz4mo ago

> Even one of the white papers commissioned by the FSF

Quoting the text which the FSF put at the top of that page:

"This paper is published as part of our call for community whitepapers on Copilot. The papers contain opinions with which the FSF may or may not agree, and any views expressed by the authors do not necessarily represent the Free Software Foundation. They were selected because we thought they advanced the discussion of important questions, and did so clearly."

So, they asked the community to share thoughts on this topic, and they're publishing interesting viewpoints that clearly advance the discussion, whether or not they end up agreeing with them. I do acknowledge that they paid $500 for each paper they published, which gives some validity to your use of the verb "commissioned", but that's a separate question from whether the FSF agrees with the conclusions. They certainly didn't choose a specific author or set of authors to write a paper on a specific topic before the paper was written, which a commission usually involves, and even then the commissioning organization doesn't always agree with the paper's conclusion unless the commission isn't considered done until the paper is updated to match the desired conclusion.

> You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

This would be consistent with them agreeing with this paper's conclusion, sure. But that's not the only possibility it's consistent with.

It could alternatively be because they discovered or reasonably should have discovered the copyright infringement less than three years ago, therefore still have time remaining in their statute of limitations, and are taking their time to make sure they file the best possible legal complaint in the most favorable available venue.

Or it could simply be because they don't think they can afford the legal and PR fight that would likely result.

ThrowawayR24mo ago

Since I very specifically wrote "commissioned by the FSF" instead of "represents the opinion of the FSF" to avoid misrepresenting the paper, you're arguing against something I have not said.

1 more reply

grayhatter4mo ago

> Even one of the white papers commissioned by the FSF [...] concluded that using copyrighted data to train AI was plausibly legally defensible [...] notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

I agree with jkaplowitz, but for a different reason I still believe that your description feels a bit misleading to me. The FSF commissioned paper makes the argument that Microsoft's use of code FROM GITHUB, FOR COPILOT is likely non-infringing, because of the additional github ToS. This feels like critical context to provide given in the very next statement, you widened it to LLMs generally, and the FSF which likely cares about code, not on github as well.

All of that said, I'm not sure it matters, because while I don't find the argument from the that whitepaper very compelling, because it's based critically on additional grants in the ToS. IIRC (going only from memory) the ToS requires that you grant github a license as it's needed to provide the service. Github can provide the services the user reasonably understood github to provide, without violating the additional clauses specified in the existing FOSS license covering the code. That being from a while ago, and I'd say it's very murky now, because everyone knows Microsoft provides copilot, so "obviously" they need it.

Unfortunately, and importantly, when dealing with copyrights, the paper also covers the transformative fair use arguments in depth. And I do find those following arguments very compelling. The paper, (and likely others) are making the argument that the code output from an LLM is likely transformative. And thus can't be infringing compelling, (or is unlikely to be). I think in many cases, the output is clearly transformative in nature.

I've also seen code generated by claude (likely others as well?) to copy large sections from existing works. Where it's clearly "copy/paste" which clearly can't be fair use, nor transformative. The output clearly copies the soul of the work. Thus given I have no idea what dataset they're copying this code from, it's scary enough to make me unwilling to take the chance on any of it.

reorder96954mo ago

So it's legal to train an "intelligence" on everything for free based on fair use, but it's not legal to train another intelligence (my brain) on it?

grayhatter4mo ago

No, it's also not illegal to train your brain. If you break into a store, and read all the books, you'll get arrested for breaking and entering. Not for reading the books. My (superficial) take on the argument is that they're hoping by saying "it's not illegal to read" no one will notice, and no one will ask how they got into the book store to begin with.

1 more reply

Arnt4mo ago

You're close to an important point.

Our current laws are written to make it legal for you to copy the Quran via your brain — some people learn it by rote and can stand up and speak the entire work from one end to the other. This is intended to be legal. Fair use of the Quran.

I went to a concert recently where someone copied every word and (as far as I could hear) every note from a copyrighted work by Bruce Springsteen. Singing and playing. This too is intended to be fair use.

You can learn how to play and sing Springsteen songs verbatim, and you can use his records to learn to sound like him when you sing, and that's intended to be legal.

Since the law doesn't say "but you cannot write a program to do these things, or run such a program once written", why would it be illegal to do the same thing using some code?

The people who want the law to differentiate have a difficult challenge in front of them. As I see it, they need to differentiate between what humans do to learn from what machines do, and that implies really knowing what humans do. And then they need to draw boundaries, making various kinds of computer-assisted human learning either legal or illegal.

Some of them say things like "when an AI draws Calvin and Hobbes in the style of Breughel, it obviously has copied paintings by Breughel" but a court will ask why that's obvious. Is it really obvious that the way it does that drawing necessarily involves copying, when you as a human can do the same thing without copying?

1 more reply

general14654mo ago

Did you pirated this movie? No I did not, it is fair use because this movie is nothing more than a statistical correlation to my dopamine production.

earthnail4mo ago

The movie played on my screen but I may or may not have seen the results of the pixels flashing. As such, we can only state with certainty that the movie triggered the TV's LEDs relative to its statistical light properties.

gruez4mo ago

>Did you pirated this movie? No I did not, [...]

You're probably being sarcastic but that's actually how the law works. You'll note that when people get sued for "pirating" movies, it's almost always because they were caught seeding a torrent, not for the act of watching an illegal copy. Movie studios don't go after visitors of illegal streaming sites, for instance.

aucisson_masque4mo ago

> Movie studios don't go after visitors of illegal streaming sites, for instance.

They absolutely do, in France we have Hadopi that tracks torrent leecher. Hadopi had been heavily pushed by the movie and music industry.

1 more reply

bmitc4mo ago

It's how the law works for those at the top of the oligarchy.

thaumasiotes4mo ago

Note that what copyright law prohibits is the action of producing a copy for someone else, not the action of obtaining a copy for yourself.

codedokode4mo ago

If I am not mistaken, the law prohibits producing any unauthorized copies. So if you download a pirated book on a computer, you produce an illegal copy: [1]. If I am not missing anything, ML companies are galaxy-scale infringers.

> 106. Exclusive rights in copyrighted works

> Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:

> (1) to reproduce the copyrighted work in copies or phonorecords;

> 501. Infringement of copyright

> (a) Anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 or of the author as provided in section 106A(a), or who imports copies or phonorecords into the United States in violation of section 602, is an infringer of the copyright or right of the author, as the case may be.

[1] https://www.copyright.gov/title17/92chap5.html

bulbar4mo ago

Training of an LLM however is a lossy compressing algorithm to provide a copy of a variant of the data to the user later on.

ErroneousBosh4mo ago

Did you pirate this movie?

No, I acquired a block of high-entropy random numbers as a standard reference sample.

JKCalhoun4mo ago

I saw the movie, but I don't remember it now.

machomaster4mo ago

I saw the movie, but I did not watch it.

Ferret74464mo ago

Indeed, the "copy" of the movie in your brain is not illegal. It would be rather troublesome and dystopian if it were.

visarga4mo ago

The problem is when you use your "copy" as inspiration and actually create and publish something. It is very hard to be certain you are safe, besides literal expression close paraphrasing is also infringing, using world building elements, or using any original abstraction (AFC test). You can only know after a lawsuit.

It is impossible to tell how much AI any creator used secretly, so now all works are under suspicion. If copyright maximalists successfully copyright style (vibes), then creativity will be threatened. If they don't succeed, then copyright protection will be meaningless. A catch 22.

1 more reply

SoftTalker4mo ago

Not yet, anyway.

NitpickLawyer4mo ago

> Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

It makes some sense, yeah. There's also precedent, in google scanning massive amounts of books, but not reproducing them. Most of our current copyright laws deal with reproductions. That's a no-no. It gets murky on the rest. Nvda's argument here is that they're not reproducing the works, they're not providing the works for other people, they're "scanning the books and computing some statistics over the entire set". Kinda similar to Google. Kinda not.

I don't see how they get around "procuring them" from 3rd party dubious sources, but oh well. The only certain thing is that our current laws didn't cover this, and probably now it's too late.

musicale4mo ago

> There's also precedent, in google scanning massive amounts of books,

Except that Google acquired the books legally, and first sale doctrine applies to physical books.

> but not reproducing them

See also: "Extracting books from production language models"

https://news.ycombinator.com/item?id=46569799

olejorgenb4mo ago

> I don't see how they get around "procuring them" from 3rd party dubious sources

Yeah, isn't this what Anthropic was found guilty off?

bulbar4mo ago

Is they don't reproduce the data of any kind, how could the LLM be of any use?

The whole/main intention of an LLM is to reproduce knowledge.

masfuerte4mo ago

Scanning books is literally reproducing them. Copying books from Anna's Archive is also literally reproducing them. The idea that it is only copyright infringement if you engage in further reproduction is just wrong.

As a consumer you are unlikely to be targeted for such "end-user" infringement, but that doesn't mean it's not infringement.

NitpickLawyer4mo ago

https://cases.justia.com/federal/appellate-courts/ca2/13-482...

This is the conclusion of the saga between the author's guild v. google. It goes through a lot of factors, but in the end the conclusion is this:

> In sum, we conclude that: (1) Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use. (2) Google’s provision of digitized copies to the libraries that supplied the books, on the understanding that the libraries will use the copies in a manner consistent with the copyright law, also does not constitute infringement. Nor, on this record, is Google a contributory infringer.

amanaplanacanal4mo ago

It seems like they pretty much don't care unless you distribute the copy. There is certainly precedent for it, going back to the Betamax case in the 1980s.

Ferret74464mo ago

Private reproductions are allowed (e.g. backups). Distributing them non-privately is not.

2 more replies

threethirtytwo4mo ago

It does make sense. It’s controversial. Your memory memorizes things in the same way. So what nvidia does here is no different, the AI doesn’t actually copy any of the books. To call training illegal is similar to calling reading a book and remembering it illegal.

Our copyright laws are nowhere near detailed enough to specify anything in detail here so there is indeed a logical and technical inconsistency here.

I can definitely see these laws evolving into things that are human centric. It’s permissible for a human to do something but not for an AI.

What is consistent is that obtaining the books was probably illegal, but say if nvidia bought one kindle copy of each book from Amazon and scraped everything for training then that falls into the grey zone.

ckastner4mo ago

> To call training illegal is similar to calling reading a book and remembering it illegal.

Perhaps, but reproducing the book from this memory could very well be illegal.

And these models are all about production.

roblabla4mo ago

To be fair, that seems to be where some of the IA lawsuits are going. The argument goes that the models themselves aren't derivative works, but the output they produce can absolutely be - in much the same way that reproducing a book from memory could be copyright violation, trademark infringement, or generally go afoul of the various IP laws.

threethirtytwo4mo ago

Models don’t reproduce books though. It’s impossible for a model to reproduce something word for word because the model never copied the book.

Most of the best fit curve runs along a path that doesn’t even touch an actual data point.

3 more replies

lelanthran4mo ago

> To call training illegal is similar to calling reading a book and remembering it illegal.

A type of wishful thinking fallacy.

In law scale matters. It's legal for you to possess a single joint. It's not legal to possess 400 tons of weed in a warehouse.

kalap_ur4mo ago

It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke yourself. With 400, you very possibly want to sell it to others. Scale in itself doesnt matter, scale matters only as to the extent it changes what your intention may be.

3 more replies

threethirtytwo4mo ago

Er no. I’ve read and remember hundreds of books in my life time. It’s not any more illegal based off scale. The law doesn’t differentiate whether I remember one book or a hundred then there’s no difference for thousands or millions.

No wishful thinking here.

2 more replies

kalap_ur4mo ago

You can only read the book, if you purchased it. Even if you dont have the intent to reproduce it, you must purchase it. So, I guess NVDA should just purchase all those books, no?

threethirtytwo4mo ago

Yep, I agree. That’s the part that’s clearly illegal. They should purchase the books, but they didn’t.

1 more reply

ThrowawayR24mo ago

Obviously not; one can borrow books from libraries and read them as well.

1 more reply

_trampeltier4mo ago

But to train the models they have to download it first (make a copy)

threethirtytwo4mo ago

You had to do this for reading too. The words were burned onto your retina as volatile memory before getting processed by your brain.

You retina likely overwrote it's "memory" as soon as you looked at something else, but that's no different than copying and deleting or the more apt analogy: streaming.

2 more replies

Nursie4mo ago

But it’s not just about recall and reproduction. If they used Anna’s Archive the books were obtained and copied without a license, before they were fed in as training data.

godelski4mo ago

You need to pay for the books before you memorize them

threethirtytwo4mo ago

Partially true. I can pay for a book then lend it out to people for free.

The government is in full support of this "lending" concept, in fact they have created entire facilities devoted to this very concept of lending out books.

1 more reply

laterium4mo ago

You can sit down at a library or Barnes and Noble and memorize for free.

1 more reply

Bombthecat4mo ago

Who cares? Only Disney had the money to fight them.

Everything else will be slurped up for and with AI and be reused.

nancyminusone4mo ago

When you're responsible for 4% of the global GDP, they let you do it.

qingcharles4mo ago

They let you just grab any book you want.

HillRat4mo ago

It's not settled law as it pertains to LLMs, but, yes, creating a "statistical summary" of a book (consider, e.g., a concordance of Joyce's "Ulysses") is generally protected as fair use. However, illegally accessing pirated books to create that concordance is still illegal.

HWR_144mo ago

Copyright laws are so undefined and NVIDIAs lawyers so plentiful that the statement works in their favor. You're allowed to copy part of a work in many cases, the easiest example is you can quote a line from a book in a review. The line is fuzzy.

tobwen4mo ago

Books are databases, chars their elements. We have copyright for databases in EU :)

bulbar4mo ago

Of course it does not make sense, it's just the framing of a multi billion dollar industry and people tend to buy those.

lencastre4mo ago

I would love to see these nvidia designs as mere statistical correlations of graphic card design.

RGamma4mo ago

The chicken is trying to become the egg.

postexitus4mo ago

A quite good explanation of what copyright laws cover and should (and should not) cover is here by Cory Doctorow: https://www.theguardian.com/us-news/ng-interactive/2026/jan/...

Elfener4mo ago

It seems so, stealing copyrighted content is only illegal if you do it to read it or allow others to read it. Stealing it to create slop is legal.

(The difference, is that the first use allows ordinary poeple to get smarter, while the second use allows rich people to get (seemingly) richer, a much more important thing)

poulpy1234mo ago

I'm not saying it will change anything but going after Anna's archive while most of the big AI players intensely used it is quite something

gizajob4mo ago

Library Genesis worked pretty great and unmolested until news came out about Meta using it, at which point a bunch of the main sites disappeared off the net. So not only do these companies take ALL the pirated material, their act of doing so even borks the pirates, ruining the fun of piracy for everyone else.

pjc504mo ago

NVIDIA are "legitimate", so anything they do is fine, while AA are "illegitimate", so it's not.

countWSS4mo ago

Short-term thinking, they don't care about where the data comes from but how easy is to get it. Its probably decided at project-manager level.

haritha-j4mo ago

Just to clarify, the most valuable company in the world refuses to pay for digital media?

rpdillon4mo ago

I see this sentiment posted quite a bit, but have the publishers made any products available that would allow AI training on their works for payment? A naive approach would be to go to an online bookstore and pay $15 for every book, but then you have copyrighted content that is encrypted, that it's a violation of the DMCA to decrypt.

I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.

I think they're going to the pirate libraries because the product they want doesn't exist.

haritha-j4mo ago

Perhaps because authors don't want their content to be used for this purpose? Because Microsoft refuses to give me a copy of the source code to Windows to 'inspire' my vibe-coded OS, Windowpanes 12, of which I will not give microsoft a single cent of revenue, its acceptable for me to pirate it? Someone doesn't want to sell me their work, so I'm justified in stealing it?

rpdillon4mo ago

Oh, I'm certain that authors want to take away every right the reader has. This is easy to see empirically: we used to have the first-sale doctrine with physical books, but every digital platform has made reselling either impossible, or a violation of their terms. And yet, courts have rendered judgements that say training an AI model on a book is fair use if the book was obtained legally, meaning no additional license is needed. You assume the authors' permission is needed, but I'm not sure that assumption holds. I think your argument is mostly emotional, rather than legal.

Your use of the word "stealing" is incorrect, but regardless, I'm not condoning piracy, merely examining the incentives we've set up to lead massive, multi-billion dollar corporations from engaging in it.

a4564634mo ago

Exactly! ALL of the LLM companies are complicit in this bullying and theft and so are the AI grifters on LinkedIn.

1 more reply

zaptheimpaler4mo ago

> I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.

If this is the only legal way for them to train, then yes that is what they should do instead of breaking the law... just because its not easy doesn't mean piracy is fine.

rpdillon4mo ago

My comment is being misread as my support for piracy; my comment isn't meant to discuss anything at all about piracy. It's instead intended to look at everything that's not piracy, and examining their costs, and why the industry chose the path they did.

Existing rulings are beginning to suggest that if the books can be obtained legally, a separate license is not required for training. So I'm naturally interested in legal ways folks training models would get a lot of books, and whether the publishing industry has even considered the value there.

g947o4mo ago

Hmm, didn't Anthropic buy a bunch of used books (like, physical ones), scanned them, and then destroyed them? If Anthropic can do that, surely can NVIDIA

rpdillon4mo ago

Yes! And it was ruled legal by the courts, but the media spun it as "Anthropic destroys a million books to build AI". This is the only legal bulk approach I know of, hence my inquiry about such a product. I didn't expect such a harsh response from some of these comments.

dns_snek4mo ago

Do you believe in private property rights? If the product they want doesn't exist then they're shit out of luck and they must either make one or wait for one to get made. You're arguing that it's okay for them to break the law because doing business legally is really inconvenient.

That would be the end of discussion if we lived in a world governed by the rule of law but we're repeatedly reminded that we don't.

rpdillon4mo ago

Not arguing it's ok to break the law, but rather examining their incentives and alternatives, along with their associated costs.

trueismywork4mo ago

The product i want doesnt exist too. But if I pirate, straight to Alcataraz I go.

rpdillon4mo ago

Yeah, I wasn't discussing legality, simply the incentives and alternatives.

kelnos4mo ago

That's not relevant went it comes to copyright law. The copyright holder has the sole legal right to decide how the work is distributed.

If it isn't distributed in a manner to your liking, the only legal thing you can do is not have a copy of it at all.

rpdillon4mo ago

I was trying to find out if any product that was legal can bridge that gap other than buying books in print, in bulk, and scanning them and destroying them. From the responses here, it sounds like the answer is a vehement "no".

Wasn't asking for advice on copyright, but since we're here, your statement is slightly too strict, at least with respect to US copyright law. The copyright holder has sole distribution authority over the first sale of the work in the United States, but thereafter the first-sale doctrine allows it to be distributed by anyone thereafter. It is limited to the US, though, as far as I know. This is what allowed anthropic to train on printed books, which they then destroyed: they were able to purchase them in bulk because of the first-sale doctrine, as the publishers and authors would likely try to destroy the first-sale doctrine if they could, as evidenced by what's happened in the world of digital books.

nexle4mo ago

they already paid 10x more to their lawyers to ensure that torrenting for LLM training is perfectly legal, why they want to pay more?

1over1374mo ago

Not spending money (vs spending money) helps make one rich!

machomaster4mo ago

Not in the case of Nvidia. Famously, "the more you pay, the more you save".

GeorgeOldfield4mo ago

this is good. down with copyright.

NekkoDroid4mo ago

Well... you don't want the good guys (Nvidia) giving money to the bad guys (Anna's Archive) right??? /s

flipped4mo ago

Considering AA gave them ~500TB of books, which is astonishing (very expensive to even store for AA), I wonder how much nvidia paid them for it? It has to be atleast close to half a million?

qingcharles4mo ago

I have a very large collection of magazines. AI companies were offering straight cash and FTP logins for them about a year or so ago. Then when things all blew up they all went quiet.

uncivilized4mo ago

How did AI companies find your collection?

antonmksOP4mo ago

NVIDIA executives allegedly authorized the use of millions of pirated books from Anna's Archive to fuel its AI training. In an expanded class-action lawsuit that cites internal NVIDIA documents, several book authors claim that the trillion-dollar company directly reached out to Anna's Archive, seeking high-speed access to the shadow library data.

derelicta4mo ago

I feel like Nvidia's CEO would be the kind to snatch off sugary sachets from his local deli just to save up some more.

songodongo4mo ago

“Yes officer, it was the goober thinking he looked cool in the leather jacket.”

musicale4mo ago

The reason why it's legal for Nvidia to download Anna's Archive and illegal for you is that they have well-paid lawyers and deep pockets to kick the can down the road for years before negotiating a settlement, making billions in the meantime. The settlement itself becomes a moat.

They also pay millions of dollars to lobbyists to encourage favorable regulation and enforcement.

utopiah4mo ago

People HAVE to somehow notice how hungry for proper data AI companies are when one of the largest companies propping the fastest growing market STILL has to go to such length, getting actual approval for pirated content while they are hardware manufacturer.

I keep hearing how it's fine because synthetic data will solve it all, how new techniques, feedback etc. Then why do that?

The promises are not matching the resources available and this makes it blatantly clear.

2OEH8eoCRo04mo ago

I've always wondered about some of the torrent whales with multiple petabytes on private trackers. A lot of the whales auto dl every single new torrent that's uploaded. Perhaps even the sites themselves are allowed to operate as a way to get users to crowd source media.

SanjayMehta4mo ago

I'm wondering what Amazon is planning to do with their access to all those Kindle books.

quinncom4mo ago

I was curious:

• Anna’s Archive: ~61.7 million “books” (plus ~95.7M papers) as of January 2026 https://en.wikipedia.org/wiki/Anna%27s_Archive • Amazon Kindle: “over 6 million titles” as of March 2018 https://en.wikipedia.org/wiki/Anna%27s_Archive

Hard to compare because AA contains duplicates, and the Kindle number is old, but at a glance it seems AA wins.

philipwhiuk4mo ago

What do you mean 'planning'. You think they haven't already been sucked up?

embedding-shape4mo ago

What do you mean 'sucked up'? It's data on their machines already, people willingly give them the data, so Amazon can process and offer it to readers. No sucking needed, just use the data people uploaded to you already.

sib4mo ago

There's definitely a legal & contractual difference between (1) storing the books on your servers in order to provide them to end users who have purchased licenses to read them and (2) using that same data for training a model that might be used to create books that compete with the originals. I'm pretty sure that's why GP means by "sucking up."

This is analogous the difference between Gmail using search within your mail content to find messages that you are looking for vs Gmail providing ads inside Gmail based on the content of your email (which they don't do).

2 more replies

1over1374mo ago

A great retaliation to Trump tariffs would be just cancelling copyright for American works in your country.

ronsor4mo ago

This would likely mean America canceling copyright for works in that country as well. I'm OK with that. Destroy copyright.

rtbruhan004mo ago

It's generous of them to ask for permission.

gizajob4mo ago

They wanted access to a faster pipe to slurp 500 terabytes, and that access comes at a cost. It wasn’t about permission.

And yeah they should be sued into the next century for copyright infringement. $4Trillion company illegally downloading the entire corpus of published literature for reuse is clearly infringement, its an absurdity to say that it’s fair use just to look for statistical correlations when training LLMs that will be used to render human authors worthless. One or two books is fair use. Every single book published is not.

empath754mo ago

Whatever they get sued for would be pocket change.

breakingcups4mo ago

It wasn't about permission, it was about high-speed access. They needed Anna's Archive to facilitate that for them, scraping was too slow. It's incredible that they were allowed to continue even after Anna's Archive themselves explicitly pointed out that the material was acquired illegally.

kristofferR4mo ago

That's just normal US modus operandi. The court case against Maduro is allowed to continue even after everyone has acknowledged he was acquired illegally.

kristofferR4mo ago

It's not permission, it's a service they offer:

https://annas-archive.li/llm

hollow-moe4mo ago

whatever, laws are for the poor anyways, you ought to think it would be common knowledge by now but nope

wosined4mo ago

Sounds like BS. Why would nvidia need the books. Do they even have a chatbot? I doubt the books help with framegen.

johndough4mo ago

From the top of the linked article:

    > NVIDIA is also developing its own models, including NeMo, Retro-48B, InstructRetro, and Megatron. These are trained using their own hardware and with help from large text libraries, much like other tech giants do.

You can download the models here: https://huggingface.co/nvidia

utopiah4mo ago

The same reason Intel worked on OpenCV : they want to sell more hardware by pushing the state of the art of what software can do on THEIR hardware.

It's basically just a sales demonstrator, that optionally, if incredibly successful and costly they can still sell as SaaS, if not just offer for free.

Think of it as a tech ad.

voidUpdate4mo ago

I cant see the whole relevant section in the article, but there is a screenshot of part of the legal documents that states "In response, NVIDIA sought to develop and demonstrate cutting edge LLMs at its fall 2023 developer day. In seeking to acquire data for what it internally called "NextLargeLLM", "NextLLMLarge" and-" (cuts off here)

j / k navigate · click thread line to collapse

158 comments

skilled4mo ago

> In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models.

Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

ThrowawayR24mo ago

jkaplowitz4mo ago

> Even one of the white papers commissioned by the FSF

Quoting the text which the FSF put at the top of that page:

> You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

This would be consistent with them agreeing with this paper's conclusion, sure. But that's not the only possibility it's consistent with.

Or it could simply be because they don't think they can afford the legal and PR fight that would likely result.

ThrowawayR24mo ago

Since I very specifically wrote "commissioned by the FSF" instead of "represents the opinion of the FSF" to avoid misrepresenting the paper, you're arguing against something I have not said.

1 more reply

grayhatter4mo ago

reorder96954mo ago

So it's legal to train an "intelligence" on everything for free based on fair use, but it's not legal to train another intelligence (my brain) on it?

grayhatter4mo ago

1 more reply

Arnt4mo ago

You're close to an important point.

You can learn how to play and sing Springsteen songs verbatim, and you can use his records to learn to sound like him when you sing, and that's intended to be legal.

Since the law doesn't say "but you cannot write a program to do these things, or run such a program once written", why would it be illegal to do the same thing using some code?

1 more reply

general14654mo ago

Did you pirated this movie? No I did not, it is fair use because this movie is nothing more than a statistical correlation to my dopamine production.

earthnail4mo ago

gruez4mo ago

>Did you pirated this movie? No I did not, [...]

aucisson_masque4mo ago

> Movie studios don't go after visitors of illegal streaming sites, for instance.

They absolutely do, in France we have Hadopi that tracks torrent leecher. Hadopi had been heavily pushed by the movie and music industry.

1 more reply

bmitc4mo ago

It's how the law works for those at the top of the oligarchy.

thaumasiotes4mo ago

Note that what copyright law prohibits is the action of producing a copy for someone else, not the action of obtaining a copy for yourself.

codedokode4mo ago

> 106. Exclusive rights in copyrighted works

> Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:

> (1) to reproduce the copyrighted work in copies or phonorecords;

> 501. Infringement of copyright

[1] https://www.copyright.gov/title17/92chap5.html

bulbar4mo ago

Training of an LLM however is a lossy compressing algorithm to provide a copy of a variant of the data to the user later on.

ErroneousBosh4mo ago

Did you pirate this movie?

No, I acquired a block of high-entropy random numbers as a standard reference sample.

JKCalhoun4mo ago

I saw the movie, but I don't remember it now.

machomaster4mo ago

I saw the movie, but I did not watch it.

Ferret74464mo ago

Indeed, the "copy" of the movie in your brain is not illegal. It would be rather troublesome and dystopian if it were.

visarga4mo ago

1 more reply

SoftTalker4mo ago

Not yet, anyway.

NitpickLawyer4mo ago

> Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

I don't see how they get around "procuring them" from 3rd party dubious sources, but oh well. The only certain thing is that our current laws didn't cover this, and probably now it's too late.

musicale4mo ago

> There's also precedent, in google scanning massive amounts of books,

Except that Google acquired the books legally, and first sale doctrine applies to physical books.

> but not reproducing them

See also: "Extracting books from production language models"

https://news.ycombinator.com/item?id=46569799

olejorgenb4mo ago

> I don't see how they get around "procuring them" from 3rd party dubious sources

Yeah, isn't this what Anthropic was found guilty off?

bulbar4mo ago

Is they don't reproduce the data of any kind, how could the LLM be of any use?

The whole/main intention of an LLM is to reproduce knowledge.

masfuerte4mo ago

As a consumer you are unlikely to be targeted for such "end-user" infringement, but that doesn't mean it's not infringement.

NitpickLawyer4mo ago

https://cases.justia.com/federal/appellate-courts/ca2/13-482...

This is the conclusion of the saga between the author's guild v. google. It goes through a lot of factors, but in the end the conclusion is this:

amanaplanacanal4mo ago

It seems like they pretty much don't care unless you distribute the copy. There is certainly precedent for it, going back to the Betamax case in the 1980s.

Ferret74464mo ago

Private reproductions are allowed (e.g. backups). Distributing them non-privately is not.

2 more replies

threethirtytwo4mo ago

Our copyright laws are nowhere near detailed enough to specify anything in detail here so there is indeed a logical and technical inconsistency here.

I can definitely see these laws evolving into things that are human centric. It’s permissible for a human to do something but not for an AI.

ckastner4mo ago

> To call training illegal is similar to calling reading a book and remembering it illegal.

Perhaps, but reproducing the book from this memory could very well be illegal.

And these models are all about production.

roblabla4mo ago

threethirtytwo4mo ago

Models don’t reproduce books though. It’s impossible for a model to reproduce something word for word because the model never copied the book.

Most of the best fit curve runs along a path that doesn’t even touch an actual data point.

3 more replies

lelanthran4mo ago

> To call training illegal is similar to calling reading a book and remembering it illegal.

A type of wishful thinking fallacy.

In law scale matters. It's legal for you to possess a single joint. It's not legal to possess 400 tons of weed in a warehouse.

kalap_ur4mo ago

3 more replies

threethirtytwo4mo ago

No wishful thinking here.

2 more replies

kalap_ur4mo ago

You can only read the book, if you purchased it. Even if you dont have the intent to reproduce it, you must purchase it. So, I guess NVDA should just purchase all those books, no?

threethirtytwo4mo ago

Yep, I agree. That’s the part that’s clearly illegal. They should purchase the books, but they didn’t.

1 more reply

ThrowawayR24mo ago

Obviously not; one can borrow books from libraries and read them as well.

1 more reply

_trampeltier4mo ago

But to train the models they have to download it first (make a copy)

threethirtytwo4mo ago

You had to do this for reading too. The words were burned onto your retina as volatile memory before getting processed by your brain.

You retina likely overwrote it's "memory" as soon as you looked at something else, but that's no different than copying and deleting or the more apt analogy: streaming.

2 more replies

Nursie4mo ago

But it’s not just about recall and reproduction. If they used Anna’s Archive the books were obtained and copied without a license, before they were fed in as training data.

godelski4mo ago

You need to pay for the books before you memorize them

threethirtytwo4mo ago

Partially true. I can pay for a book then lend it out to people for free.

The government is in full support of this "lending" concept, in fact they have created entire facilities devoted to this very concept of lending out books.

1 more reply

laterium4mo ago

You can sit down at a library or Barnes and Noble and memorize for free.

1 more reply

Bombthecat4mo ago

Who cares? Only Disney had the money to fight them.

Everything else will be slurped up for and with AI and be reused.

nancyminusone4mo ago

When you're responsible for 4% of the global GDP, they let you do it.

qingcharles4mo ago

They let you just grab any book you want.

HillRat4mo ago

HWR_144mo ago

tobwen4mo ago

Books are databases, chars their elements. We have copyright for databases in EU :)

bulbar4mo ago

Of course it does not make sense, it's just the framing of a multi billion dollar industry and people tend to buy those.

lencastre4mo ago

I would love to see these nvidia designs as mere statistical correlations of graphic card design.

RGamma4mo ago

The chicken is trying to become the egg.

postexitus4mo ago

A quite good explanation of what copyright laws cover and should (and should not) cover is here by Cory Doctorow: https://www.theguardian.com/us-news/ng-interactive/2026/jan/...

Elfener4mo ago

It seems so, stealing copyrighted content is only illegal if you do it to read it or allow others to read it. Stealing it to create slop is legal.

(The difference, is that the first use allows ordinary poeple to get smarter, while the second use allows rich people to get (seemingly) richer, a much more important thing)

poulpy1234mo ago

I'm not saying it will change anything but going after Anna's archive while most of the big AI players intensely used it is quite something

gizajob4mo ago

pjc504mo ago

NVIDIA are "legitimate", so anything they do is fine, while AA are "illegitimate", so it's not.

countWSS4mo ago

Short-term thinking, they don't care about where the data comes from but how easy is to get it. Its probably decided at project-manager level.

haritha-j4mo ago

Just to clarify, the most valuable company in the world refuses to pay for digital media?

rpdillon4mo ago

I think they're going to the pirate libraries because the product they want doesn't exist.

haritha-j4mo ago

rpdillon4mo ago

a4564634mo ago

Exactly! ALL of the LLM companies are complicit in this bullying and theft and so are the AI grifters on LinkedIn.

1 more reply

zaptheimpaler4mo ago

If this is the only legal way for them to train, then yes that is what they should do instead of breaking the law... just because its not easy doesn't mean piracy is fine.

rpdillon4mo ago

g947o4mo ago

Hmm, didn't Anthropic buy a bunch of used books (like, physical ones), scanned them, and then destroyed them? If Anthropic can do that, surely can NVIDIA

rpdillon4mo ago

dns_snek4mo ago

That would be the end of discussion if we lived in a world governed by the rule of law but we're repeatedly reminded that we don't.

rpdillon4mo ago

Not arguing it's ok to break the law, but rather examining their incentives and alternatives, along with their associated costs.

trueismywork4mo ago

The product i want doesnt exist too. But if I pirate, straight to Alcataraz I go.

rpdillon4mo ago

Yeah, I wasn't discussing legality, simply the incentives and alternatives.

kelnos4mo ago

That's not relevant went it comes to copyright law. The copyright holder has the sole legal right to decide how the work is distributed.

If it isn't distributed in a manner to your liking, the only legal thing you can do is not have a copy of it at all.

rpdillon4mo ago

nexle4mo ago

they already paid 10x more to their lawyers to ensure that torrenting for LLM training is perfectly legal, why they want to pay more?

1over1374mo ago

Not spending money (vs spending money) helps make one rich!

machomaster4mo ago

Not in the case of Nvidia. Famously, "the more you pay, the more you save".

GeorgeOldfield4mo ago

this is good. down with copyright.

NekkoDroid4mo ago

Well... you don't want the good guys (Nvidia) giving money to the bad guys (Anna's Archive) right??? /s

flipped4mo ago

Considering AA gave them ~500TB of books, which is astonishing (very expensive to even store for AA), I wonder how much nvidia paid them for it? It has to be atleast close to half a million?

qingcharles4mo ago

I have a very large collection of magazines. AI companies were offering straight cash and FTP logins for them about a year or so ago. Then when things all blew up they all went quiet.

uncivilized4mo ago

How did AI companies find your collection?

antonmksOP4mo ago

derelicta4mo ago

I feel like Nvidia's CEO would be the kind to snatch off sugary sachets from his local deli just to save up some more.

songodongo4mo ago

“Yes officer, it was the goober thinking he looked cool in the leather jacket.”

musicale4mo ago

They also pay millions of dollars to lobbyists to encourage favorable regulation and enforcement.

utopiah4mo ago

I keep hearing how it's fine because synthetic data will solve it all, how new techniques, feedback etc. Then why do that?

The promises are not matching the resources available and this makes it blatantly clear.

2OEH8eoCRo04mo ago

SanjayMehta4mo ago

I'm wondering what Amazon is planning to do with their access to all those Kindle books.

quinncom4mo ago

I was curious:

Hard to compare because AA contains duplicates, and the Kindle number is old, but at a glance it seems AA wins.

philipwhiuk4mo ago

What do you mean 'planning'. You think they haven't already been sucked up?

embedding-shape4mo ago

sib4mo ago

2 more replies

1over1374mo ago

A great retaliation to Trump tariffs would be just cancelling copyright for American works in your country.

ronsor4mo ago

This would likely mean America canceling copyright for works in that country as well. I'm OK with that. Destroy copyright.

rtbruhan004mo ago

It's generous of them to ask for permission.

gizajob4mo ago

They wanted access to a faster pipe to slurp 500 terabytes, and that access comes at a cost. It wasn’t about permission.

empath754mo ago

Whatever they get sued for would be pocket change.

breakingcups4mo ago

kristofferR4mo ago

That's just normal US modus operandi. The court case against Maduro is allowed to continue even after everyone has acknowledged he was acquired illegally.

kristofferR4mo ago

It's not permission, it's a service they offer:

https://annas-archive.li/llm

hollow-moe4mo ago

whatever, laws are for the poor anyways, you ought to think it would be common knowledge by now but nope

wosined4mo ago

Sounds like BS. Why would nvidia need the books. Do they even have a chatbot? I doubt the books help with framegen.

johndough4mo ago

From the top of the linked article:

    > NVIDIA is also developing its own models, including NeMo, Retro-48B, InstructRetro, and Megatron. These are trained using their own hardware and with help from large text libraries, much like other tech giants do.

You can download the models here: https://huggingface.co/nvidia

utopiah4mo ago

The same reason Intel worked on OpenCV : they want to sell more hardware by pushing the state of the art of what software can do on THEIR hardware.

It's basically just a sales demonstrator, that optionally, if incredibly successful and costly they can still sell as SaaS, if not just offer for free.

Think of it as a tech ad.

voidUpdate4mo ago

j / k navigate · click thread line to collapse