Meta torrented & seeded 81.7 TB dataset containing copyrighted data

938 comments

Based on the encyclopedic knowledge LLMs have of written works I assume all parties did the same. But I think there is a broader point to make here. Youtube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it. Google itself got big by indexing other people's data without compensation. Spotify's music library was also pirated in the early days. The contracts with the music labels came later. GPL violations by commercial products fit the theme too.

Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.

throwawaygmbno 1y ago

Everyone on here is smart enough. Just do not participate and save your money. Do not pay for digital goods. If Netflix raises their prices, it doesn't matter because there is a torrent of all of their shows. If Spotify raises their prices, it doesn't matter because your favorite artist has their entire library in a torrent. If some game company asks you to pay real life prices for a digital costume, find the crack online and play on a private server. If YouTube wants to interrupt your video with an ad in the middle of the sentence, download one of the many options that blocks all ads. Billion dollar companies have shown they do not care about you. The people who complain about losing their salary should just get replies thanking them for paying.

All the sad poor people who might be hurt were already paid. The caterer on your favorite show is not getting residuals. NBC also isn't going to stop making TV shows because that is all they can do. Content creators also existed on the internet long before that was a job. They just did it because they cared about it not for ad money. If you really want to support the artist directly go to a concert or just mail them a check. If you can't actually identify a person who might be hurt, then do not care.

ericyd 1y ago

I just can't get behind the sentiment that the unethical behavior by big companies means I get to access all the content I want for free.

hnpolicestate 1y ago

I'm not paying for Led Zeppelin IV after having probably bought 3 copies in my lifetime. I agree with you.

dingnuts 1y ago

if you want to support an artist go to the show and BUY MERCH at the table! almost all of their income comes from that. the importance of buying a T-shirt at the show cannot be overstated and sometimes you get to say hi to your idol, too

MourYother 1y ago

I sometimes think my adblocker should very much lie to the page that "yeah, watched that, totally" in an undetectable way.

Cub3 1y ago

If buying isn't owning, piracy isn't stealing

swozey 1y ago

lol I absolutely do not want non digital goods nor pirating. Ever. It's 2025. I don't have a CD player, a tape player, a Blu-ray player, I don't even know what the most modern "Blu-ray" disc would be. I have $2k worth of vinyls that are just unique copies I display as art I'll never put in my record player, that's also never been used. I don't want to constantly worry about 60 GB of mp3 files.

Oh no, that TV show I'll forget about in a year cost me $15/mo instead of $60 of Blu-rays.

I jump in my cars and hit a button and music plays. Almost any music I want. That's amazing.

I'm also not pirating games. I'm not 12 without a job. I have a job. I pay developers for their work. I want more games, like Kingdom Come 3, to come out.

Weird ass comment. You seriously think we're going to put our lives on hold to.. what, fight "digital media"? You think I care about Netflix? Or society's use of it? I haven't used Netflix in years. I don't know anybody under 40 with a Netflix account. Everyone on your end of the pirate spectrum uses debrid nowadays, anyway.

Next you're going to tell people to install the "Black XP Windows" edition to not support Microsoft and they all get malware and their credit cards stolen because they installed some pirated and modified cracked Windows. Genius.

MSNBC just cancelled Andrea Mitchell's TV show, today, because she brought in no younger audiences. So yes, shows do get cancelled by not being watched.

This comment was upvoted? HN needs a break. This is some "I'm 14 and edgy" bullshit that sounds like it belongs on an eastern European piracy forum.

lolinder 1y ago

Yes. And the problem here isn't that companies get away with doing things like this, the problem is that individuals don't. Attempting to lock information behind a nightmarish legal system is the problem.

I'm pretty much at the point now where I don't buy the "copyright incentivizes creation" argument any more. Copyright, like advertising, incentivizes creation by enormous corporations, but also like advertising it incentivizes creations that overwhelmingly have little value.

Creative individuals don't need copyright to be incentivized to create—they need a safety net that gives them the freedom to spend time on the creativity that naturally wants to bubble out. If the goal is to encourage creativity, copyright is a lousy and enormously expensive substitute for Universal Basic Income.

post-it 1y ago

Also, in Canada, it's basically impossible to protect your IP as an individual due to the astronomical cost and lack of options to recover that cost. So copyright will never incentivize my creations, or those of any small creator.

derektank 1y ago

Sure, creative people will always create, but the scope of that creativity will be limited if we do away with intellectual property. Steven Spielberg would probably always have created movies, but he wouldn't have been able to make Jurassic Park, Saving Private Ryan, or Indiana Jones without capital from the studio system, and the studio system wouldn't have provided him with that capital if they couldn't extract economic rents from the copyright for those films.

startupsfail 1y ago

Nothing stops you from downloading Anna’s Archive and training a model on it, right? The likelihood that you, as an individual, get sued over it is virtually zero.

This is what Meta tried to do: quietly download and use the data to do research and advance their LLMs, without trying to establish any legal precedents or pick fights.

teaearlgraycold 1y ago

Individuals do get away with it all of the time.

Lucasoato 1y ago

> If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.

In case anybody here doesn't know, that's a reference to Aaron Swartz, an activist (and Reddit co-founder) who was facing 35 years in prison and a $1 million fine just for downloading a lot of academic papers from JSTOR. He eventually took his own life because of the pressure. May his soul rest in peace.

vel0city 1y ago

> once people started uploading copyrighted TV shows to it

End users, not YouTube employees, right? And they would take things down following DMCA requests and what not, right? So, pretty much following the law?

> Google itself got big by indexing other people's data without compensation

Scraping public websites to build a search index isn't the same as making LLMs that can recreate the source verbatim devoid of even attribution. I do agree there's an argument to be had about the LLM's transformative nature in the end though.

> Spotify's music library was also pirated in the early days

Not any version generally available to the public, and with the copyright holder's permission to do so.

ysofunny 1y ago

the english empire once tried to maintain a monopoly over steam loom machines

the americans cheated their way to competition,

heck, even before that, the english empire got jumpstarted by stealing gold from the spanish (who were themselves exploiting it away from aztec and other mexican natives)

I'm saying it's business as usual, but also, culture doesn't work like tangible physical widgets so we must stop letting a few steal this boon of digital copying by means of silly ideas like DRM, copyright, patents. all means to cause scarcity

m4rtink 1y ago

The textile industry in Brno here in the Czech Republic (sometimes called "Moravian Manchester") was hugely helped by a local noble posing as a worker in England and smuggling detailed self-drawn plans of industrial machinery back:

"Brno’s fortunes were changed forever when a young freemason called Franz Hugo Salma set out for England in 1801. He intended to steal the plans for the most modern textile machinery in the world. His crime, the first recorded act of industrial espionage, boosted the competitiveness of Moravian textiles. Soon after smuggling the plans out disguised as a worker, and handing them over to Brno’s fledgling textile industry, Brno became the most important textile centre in the Habsburg empire."

You can even go see some of the original plans in a museum:

"Eleven designs are still preserved in the library of the Rájec chateau. They form a unique set of documents demonstrating both the level of wool processing technology at the turn of the late 18th and early 19th centuries, as well as the aims and means of the relatively rare business of industrial espionage at that time."

https://www.gotobrno.cz/en/brno-phenomenon/this-is-brno-kate... https://www.gotobrno.cz/en/place/salm-reifferscheidt-palace/

choult 1y ago

Hollywood became popular for filmmaking because they were literally the opposite side of the country from Thomas Edison and his patents...

miltonlost 1y ago

People criming in the past is not an excuse for companies committing crimes today. You’re excusing lawlessness.

Cain killed Abel and got away with it!! I can kill someone today too!!!

nottorp 1y ago

Interesting, if we're to trust what NotOpenAI and Facebook say about their IP, the US should pay the UK reparations for IP theft based on textile industry profits starting in the 1850s until today?

portaouflop 1y ago

Why do I get sued when I share some BitTorrents but $bigcorp can just do it at 1000x the scale without problems?

The issue here is not copyright/patents/etc. The issue is that the law is applied selectively: Aaron Swartz is dead for sharing knowledge with the public and Zuccborg is a billionaire building his torment nexus.

sebzim4500 1y ago

I don't think I've heard the term "English empire". Is it an attempt by the Scottish to pretend they weren't involved?

earthnail 1y ago

In Spotify’s defense, they used the pirated data only to show a proof of concept to the copyright holders, and that use was sanctioned by the local rights holders organization STIM.

The copyright holders then approved their concept, and subsequently Spotify got the rights to offer their service to customers. Everybody won.

tanjtanjtanj 1y ago

That’s not entirely true, in Spotify’s early days you could upload files to the service and listen to songs uploaded by other people. I think the majority of any song I wanted to listen to before they went Europe-only for a time was “pirated”.

pockmarked19 1y ago

> Spotify's music library was also pirated in the early days.

I want to know more, please enlighten me (anyone who knows). I read the book "The Spotify Play" and it made it seem like the pirated music was an internal-only thing and not something available to customers. Is that true?

mzl 1y ago

Before the launch, Spotify had a deal with the music rights holders association in Sweden (STIM) that they could use a merged collection of friends' and families' music libraries. All this was removed before Spotify went out of beta.

So while it was using pirated media, it was sanctioned by the rights holders for the experiment of building Spotify.

arwineap 1y ago

Users would upload their copies of the music and Spotify would replay them. This was obvious to early users, even if they were only consumers, because of the pirate-shout-out-overlays that were in a lot of the poorer quality releases.

Another interesting note: in the early days of Spotify, the app would saturate your upload bandwidth while using it. Given their close ties to uTorrent, I always assumed that's how they were affording the bandwidth as well.

Pretty brilliant way to bootstrap I guess; they didn't have to pay for bandwidth or content until they already had contracts in place

billdybas 1y ago

"Mood Machine: The Rise of Spotify and the Costs of the Perfect Playlist" by Liz Pelly goes into more detail about their origins and the culture around piracy in Sweden at the time.

https://lizpelly.info/book

marcosdumay 1y ago

> If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.

Just to point out, the material in question was public domain, so nobody even had a copyright claim over it.

scottbez1 1y ago

Do you have a citation for that claim? I've not seen a claim that none of the material had copyright before.

Cthulhu_ 1y ago

Crunchyroll started off as a straight up piracy site, it now has millions of paying subscribers and was sold to Sony for over a billion a few years ago.

gnfargbl 1y ago

I think if Google attempted to download the entirety of JSTOR with the express intent of making the full dataset freely available, then Google would also face legal consequences.

It's true, and relevant, that Google would feel those consequences much less sharply than Swartz did.

vintermann 1y ago

Don't buy into the rhetoric and call it "consequences". It's always a choice to sue, a choice to prosecute, and this would be true even if these choices were made consistently and impartially (which they certainly aren't).

dekhn 1y ago

Google Scholar explicitly made direct deals with publishers to scrape their content, with the constraint that while they can use the content to serve search results in Scholar, they cannot show the content of the papers on the site, just titles and short fragments that match. The deals were tenuous and I had to step carefully around my plan to use that database to implement large-scale scientific search over the literature (this was a long time before anybody was seriously considering using LLMs on research data).

I've spoken to several very wealthy/powerful people and tried to get them to negotiate a large-scale content license with the various publishers that would allow researchers and individuals to access more research in lower-friction ways. None of them (NIH, Schmidt, etc) were really interested.

josefx 1y ago

Google book search was declared fair use and copyright holders ended up having to explicitly request removal of their works.

Apparently he would have gotten away with downloading the JSTOR database if he made it clear that he intended to only publish half of each paper.

coliveira 1y ago

Yes, these companies are based on massive IP and copyright theft. And they still want to lecture others about their "property rights".

immibis 1y ago

Something to understand about capitalist competition (also in politics) is that it's a war. Not one with guns and bombs, but more like a cold war, with espionage and hacking and just generally doing anything you can to gain an advantage without bringing negative consequences on yourself.

The limit is what you can actually get away with, not what the rules say you can get away with, and the system aggressively selects players who recognize this. It's amoral - there is no "ought", only "is". An actor gets punished or not, with absolutely no regard to whether it "should" get punished. One thing is consistent: following the rules as written means you lose.

You can see it in Y Combinator (and other) startups. The biggest ex-startups are things like AirBNB (hotels but we don't follow the rules but we don't get punished for not following them) and Uber (taxis but we don't follow the rules but we don't get punished for not following them).

One way to not get punished for not following the rules is to invent a variation of the game where the rules haven't been written yet. I again refer you to AirBNB and Uber; Omegle also comes to mind, although they didn't monetize.

Viewed in this light, Aaron Swartz's mistake was not the part where he downloaded journal articles, but the part where he got caught downloading journal articles. Shadow library sites are doing the same thing, minus the getting caught. So are Meta and Google and OpenAI. sci-hub is only involved in a lawsuit because it got caught and is now in the stage where it finds out whether it gets punished or not.

oblio 1y ago

> Something to understand about capitalist competition (also in politics) is that it's a war.

Turns out there are 2 simultaneous wars there. One where companies and individuals compete ruthlessly.

And another one where if non profit associations of individuals form, guns come out.

soheil 1y ago

Aaron committed suicide, and the FBI going after him was meant more as a lesson to the other kids at MIT than anything.

MegaUpload did the same, and Kim Dotcom got raided in his sleep by the FBI in New Zealand! So no, I don't buy your reductionist argument; there are forces at play that allow companies with founders like Google's to get away with it but not others.

yowzadave 1y ago

> Youtube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it

To this day, there are a huge number of videos that show copyrighted content on YouTube; they are usually crappy clips, reversed and with different music playing in the background to avoid automated detection.

belter 1y ago

"Zuckerberg was at White House for meetings on Thursday" - https://www.reuters.com/world/us/zuckerberg-was-white-house-...

Wowfunhappy 1y ago

> Based on the encyclopedic knowledge LLMs have of written works I assume all parties did the same.

I don't understand why you wouldn't just buy copies of the books. Seems like such a relatively inexpensive way to strengthen your legal case.

freeone3000 1y ago

Buying a copy of the book doesn’t grant you the right to copy it. That is what copyright is for.

londons_explore 1y ago

Pretty sure that even if you gave a purchasing team enough money for retail price and a list of all books ever published, they wouldn't be able to buy even a quarter of them.

jokethrowaway 1y ago

Buying the books won't automatically give you permission to use the content commercially

gosub100 1y ago

thanks to the byzantine copyright system, you can't easily do it. Plus, just speculating, but maybe by paying, it establishes "consideration" for some implied contract? "You implicitly entered a contract with us by purchasing the book, then violated the contract by 'distributing' the material for commercial use" ?

jml7c5 1y ago

Anna's Archive has 40 million books and 100 million papers. It's unlikely they could achieve similar coverage.

cess11 1y ago

Too much paperwork, too much effort. These are important people, doing much more important stuff than whatever book authors do.

Or so they think, I think.

plasticbugs 1y ago

I briefly worked for Crunchyroll, which began life as an anime pirating service with subtitles. The contracts with the Japanese anime publishers came later. Now they vigorously protect their content from "pirates".

electriclove 1y ago

Some can pirate on a large scale and see no repercussions.

Some can steal from stores and see no repercussions.

Some can steal from others and see no repercussions.

Some can violently harm others and see no repercussions.

Some can damage property and see no repercussions.

Some can’t. This world is not right.

1vuio0pswjnm7 1y ago

"Spotify's music library was also pirated in the early days."

"Ek, who had been the CEO of the piracy platform uTorrent, founded Spotify with his friend, another entrepreneur named Martin Lorentzon. Both - Ek at 23 and Lorentzon 37 - were already millionaires from the sales of previous businesses. The name Spotify had no particular meaning, and was not associated with music. According to Spotify Teardown, the company developed a software for improved peer-to-peer network sharing, and the founders spoke of it as a general "media distribution platform." The initial choice to focus on music, the founders said at the time, was because audio files are smaller than video files, not because of a dream of saving music.

In 2007, when Spotify first publicly tested its software, it allowed users to stream songs downloaded from The Pirate Bay, a service for unlicensed downloads. By late 2008, Spotify would convince music labels in Sweden to license music to the site, and unlicensed music was removed. From there, Spotify would take off across Europe and then the world."

https://qz.com/1683609/how-the-music-industry-shifted-from-n...

sylario 1y ago

And Hollywood was created on the west coast because for intellectual property it was still the far west and it allowed them to ignore patents on movie technologies.

cess11 1y ago

It's roughly the Spotify story too. They had an extremely impressive catalog very early, way before they were bought by the entertainment cartel. The founders had background in torrenting and the initial product was quite similar to The Pirate Bay but with clearly capitalist ambitions and branding, in contrast to the anarchist leanings of the Pirate Bureau and rather anarchic attitude of The Pirate Bay.

bko 1y ago

The thing is Google, Meta and YouTube weren't giant entities when they did this stuff. I think it's good no one cracked down on them for copyright stuff. Now they're developing an LLM that will generate potentially trillions in value to humanity and it looks like they're not exactly playing by the rules. But I prefer looser intellectual property rights anyway so I'm ok with it

ziddoap 1y ago

>But I prefer looser intellectual property rights anyway so Im ok with it

I think more people, potentially anyways, would feel similarly about this if it applied even somewhat equally.

Instead, companies can seemingly do whatever they please whereas lawyers will send letters to your home for downloading a single episode of Game of Thrones.

ok123456 1y ago

DRM for thee not for me.

lofaszvanitt 1y ago

Well, we'll see how it will generate value and for whom.

BrenBarn 1y ago

Exactly. We need leaders with the political will to apply a "financial death penalty" to companies that engage in this kind of brazen behavior. That means all assets seized, the company dissolved, personal assets of executives seized, executives jailed. People running companies should live in mortal fear of ever doing the things that they routinely do today.

wcfrobert 1y ago

VC and startups are fundamentally about disruption. You can't make an omelette without breaking a few eggs (laws). The incumbent players are not going to sit still and let things be "disrupted". A common response is to make sure the public knows about the broken eggs. I would say YouTube, Google, Spotify, Uber, DoorDash, etc. all have made my life much better.

vkou 1y ago

> Google itself got big by indexing other people's data without compensation.

So in other words, it got big by providing free user traffic to people's websites without asking for compensation?

You generally don't charge the phone book money to include you in it. It's actually the other way around.

sandeepkd 1y ago

Reminds me of recent discussions about a similar topic: what may clearly look like a crime can be treated differently depending on whether you do it as an individual or as a company. Somewhere down the line it's all about understanding the limits and boundaries of the system; it's a skill in itself.

yurlungur 1y ago

I think the difference may be that LLMs may not be laundered clean of copyrighted data anytime soon. Even if ChatGPT got big and profitable, it's not so clear that it won't contain copyrighted data, as that may simply be necessary to train the best models.

dcchambers 1y ago

I guess the solution is to create a shell company for your illegal activities?

georgemcbay 1y ago

The modern solution has been to grow so fast that by the time anyone can go after you legally you've already amassed so much money/power that you can have the laws rewritten (or at least enforced) around your existence.

IMO part of the reason the SV tech bros are embracing right wing grift culture so publicly now is that this method, which had been serving them well for decades, doesn't really work without the infinite free money lending spigot being wide open.

Cumpiler69 1y ago

You must be new to billionaire business practices: break the rules first, ask for forgiveness later.

By the time the cheque comes, your illicit venture either went bust or you built a billion dollar empire capable of buying the best lawyers and lobbying to walk away clean.

sneak 1y ago

> If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.

I’m opposed to copyright and pro-aaronsw, but the state did not kill him.

modzu 1y ago

I know of a company that poisoned an entire town! That's terrorism if done by an individual. The company still exists, just paid a settlement and carried on...

JKCalhoun 1y ago

I agree with your point, but will split hairs on using the word "terrorism". I think that should be reserved for people that commit atrocities for some political aim. I'm fairly sure the company in question (I assume Union Carbide) did not poison the town to advance a political agenda.

jodrellblank 1y ago

Russia has a town called Asbest which has an open-pit asbestos mine half the size of Manhattan where they mine with explosives.

https://en.wikipedia.org/wiki/Asbest

https://www.youtube.com/watch?v=cy3piCUPIkc - VICE documentary and visit video. I think it contains an interview with an American woman who suffered from WR Grace and Company's asbestos mining and manufacturing in the USA; she says "they knew, they knew". WRG faced 129,000 personal injury claims and set aside $3 Bn for settling asbestos related lawsuits.

dahart 1y ago

Are you talking about Bhopal in 1984? If so it would be an understatement to refer to half a million people as a “town”, and an overstatement to imply it was terrorism. Willful negligence, yes, but terrorism, no.

pbh101 1y ago

> Google itself got big by indexing other people's data without compensation

Weird framing given how much value was and is still placed on Google driving traffic to you

mrkeen 1y ago

For Google's case the order was reversed.

Google used to send customers to your site. Now they try to show you the information on their site so that the customer doesn't need to go to your site.

joshstrange 1y ago

Even before the LLM-craze Google was showing their Answers box or whatever it was called at the top of the results that told you the answer (sometimes) so that you didn’t have to visit any website.

newsclues 1y ago

Comprehensive intellectual property needs to happen for the modern (digital) era.

Basically the entire legal system needs to be retooled and rethought for computers.

actionfromafar 1y ago

Looks like the entire legal system is being retooled at the moment.

threeseed 1y ago

No we just need to enforce the existing laws.

And the legal system is for humans not computers.

yard2010 1y ago

RIP Aaron Swartz

soheil 1y ago

So be a company? Last I checked it costs a couple of hundred dollars to form an LLC, what am I missing?

cyanydeez 1y ago

Mmm, the broader point is: laws are as real as the cash you can pay a lawyer to fight.

smugma 1y ago

Spotify was born as a response to piracy. Why do you say their catalog was pirated?

mrtesthah 1y ago

Don’t forget the original developers of Skype also created Kazaa first.

djmips 1y ago

Doesn't Google have their own internal scanning of books?

ctrlp 1y ago

The sooner people learn this lesson, the sooner it might change.

chanux 1y ago

Corporations are people. Just a notch above the regular kind.

Izikiel43 1y ago

So, might makes right, a tale as old as humanity

whatever1 1y ago

How does that prosecutor sleep at night?

observationist 1y ago

This frames Google's indexing of the web in a totally, abjectly wrong fashion. It wasn't "other people's data", it was data people published to the public internet, implicitly and explicitly granting permission to download through the act of serving that data without restriction to whoever navigated to a particular URL.

That's how the internet works. If you want private content, you need to put up a gate mechanism of some sort with authentication or other methods of restricting access. Without that, you are literally having your server "serve" the content to whoever asks for it, without restriction or exception, without ToS or meaningful contract or agreements.

You can't have it both ways. "But they didn't know" or other post-hoc claims of innocent people publishing content to the web being misled or confused or abused are infantilizing nonsense.

The web wouldn't have been as amazing and revolutionary and liberating if the fundamental public and open nature of its systems was private and walled off by default.

Your take on YouTube going viral initially over copyrighted content isn't correct, either - it was ease of use and access. It was fairly popular by the time Google bought it, and once it was reachable and advertised by google itself, it exploded, because by that time, everyone had defaulted to using google for search.

Other people corrected your Spotify take.

The reason they pirated is because it is functionally impossible to gain access to the data in any other way. For consumers, there are lots of old shows, music, and other content that aren't accessible, so they turn to piracy. A vast majority of the time, if content is accessible, people will pay and do the technically legal and "right" thing.

Publishers exploit authors and content creators in the name of "platforming" and "marketing", effectively doing as little as possible to take 90%+ of the value of a product and providing as little as possible to the producer of content or books or music. They get by on technicalities and have captured the legal arena entirely, with any attempt at reform or revolution meeting a messy death at the hands of lawyers and big money publishers.

Screw those people. They lie, cheat, and steal, and somehow have gotten away with fooling the world into thinking they're the good guys.

Copying bits and bytes is not stealing, and the ones trying to shill that narrative are trying to fool as many people as possible into giving them more money without any return of value in kind. I'd download the hell out of a car. Pirate everything.

larodi 1y ago

The most outrageous thing about the whole story is that smart people (like here and not only) knew all this since day one. They've been uncovering this the whole time.

And in their face, with all the fierce ignorance, broligarchs deny, evade and totally pretend this never happened. The most non open company of all even went to lengths to accuse others of stealing their IP - not theirs to begin with.

Just think of it - why did all major content platforms close their APIs the day after GPT-2 got the word going…? Cause they knew all this very well - the content is precious and needed. They've been doing it all along. Distilling the essence of the world’s writing and digital imagery they had no right to.

We have a saying where I come from - no mercy for the chicken, no laws for the millions. I thought it was a local thing at first; it turns out it is how the world goes. Nothing new under the sun, indeed.

qup1y ago

Speaking of GPT2, I remember that nobody gave a shit what it was trained on, because it sucked then.

nostrademons1y ago

A bigger lesson might be "don't get caught until you're big enough to destroy the people suing you."

Napster got shut down for widespread enabling of copyright infringement. So did numerous other filesharing startups, including Travis Kalanick's first startup, Scour. Lots of small startups get put out of business all the time for being sued and not having the money to defend themselves.

Likewise, individuals like Donald Trump or Elon Musk get away with all sorts of illegal shit, because they are big enough to shut down the court systems prosecuting them.

Google's genius was in staying under the radar and aligning their incentives with everyone that might dislike them, until they were big enough that they could simply crush anyone that might dislike them.

illegalmemory1y ago

"If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life."

This is exactly what I immediately thought while reading the article. It almost feels like the legal system only punishes the general public, while most of these guys are above it.

rchaud1y ago

Airbnb and Uber have shown us that laws matter only to the extent that the political will to enforce them exists. Throw enough lawyers and lobbying money at the problem and the laws can simply be re-written to be friendlier to your business model.

veggieroll1y ago

Wilhoit’s law:

> There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.

rahton1y ago

The legal system is built to favor large corps and capital owners. See Katharina Pistor's books, for instance.

jamesbfb1y ago

RIP Aaron

arp2421y ago

If you do something wrong then you, as a person, are held responsible and accountable.

If you do something wrong as "part of your job" then you're typically not held responsible and accountable but the company is (the exceptions being spectacular fraud: Enron, VW diesel).

It's not hard to see how this can go off the rails.

nico1y ago

> the legal system only punishes general public, while most of these guys are above it

It’s because the legal system is not about justice, it’s about money

Most people can’t afford lawyers or expensive legal battles

On the other hand, individuals and organizations with a lot of money get to weaponize and exploit the legal system to their advantage

“To my friends, anything; to my enemies, the law”

artyom1y ago

> the legal system only punishes general public.

In more general terms, the legal system punishes whatever can be turned into a profit or made an example of.

Also, I don't think the legal system itself wants to get too much into "big institutions against the work of others". Save for the fictional TV representations of smart lawyers and clever arguments, 99.9% of the legal system's output is copy/paste.

jimmySixDOF1y ago

> MIT

I think Aaron Swartz went to Harvard, not MIT

https://en.wikipedia.org/wiki/United_States_v._Swartz

meeech1y ago

At this point, I think it's safe to say it doesn't 'feel' that way. It is that way. Sorry if you were being facetious and I didn't pick up on it.

censorfree1y ago

>This is exactly what I immediately thought while reading the article. It almost feels like the legal system only punishes general public, while most of these guys are above it.

Welcome to the modern day aristocracy. Beyond what you mentioned, this world is also divided into a group of insiders who can get capital at 0-2%, while the rest of us face a cost of 17%, 22% or 30%.

isaacremuant1y ago

It doesn't "seem". The entire system in most countries works, by design, that way because the people in power trade in influence at a different plane.

That's why democracy often feels "failed" in that no change can be achieved because "it's just more of the same". A few lobbyists representing the interests of a few people have more power than millions voting differently.

jmount1y ago

They may have just been the friendly step A. We didn't end up seeing where that was going to go.

G_o_D1y ago

Money speaks! Money buys!

yoyohello131y ago

It's not "almost" like that. The legal system IS that.

TZubiri1y ago

How so? It is still illegal if Meta does it; they will face trial.

quaintdev1y ago

I read the same thing earlier today on Reddit, weird!

devwastaken1y ago

If you get a group of people and call it an LLC, then criminal elements are largely eliminated.

bayindirh1y ago

As Venus Theory elaborates on the issue in his video [0]:

"This problem will be solved in favor of the (party) which has the most money to throw at the problem" (paraphrase mine).

So, yeah.

[0]: https://www.youtube.com/watch?v=LrkAORPiaEA

kordlessagain1y ago

When individuals are assigned heroic status despite clear evidence of mental illness and crimes, such as “breaking and entering”, it prevents society from having rational discussions about both law enforcement and mental health support. This dynamic repeats across multiple high-profile cases.

People often elevate deeply flawed figures to heroic status when those figures seem to challenge authority or "the system." This happens especially with individuals who present themselves as outsiders fighting the establishment, have a compelling personal struggle narrative, or voice grievances that resonate with public frustrations.

Trump fits this pattern - his supporters overlook concerning behaviors and statements because they see him as fighting a system they distrust. Like Manning and Swartz, his mental state and fitness are often ignored in favor of the "hero against the system" narrative.

This dynamic creates a feedback loop where legitimate criticism becomes harder to discuss rationally.

jeffwask1y ago

Welcome to the two-tier legal system of the modern world. Why obey the law when the penalty is a rounding error?

ossobuco1y ago

It's an oligarchy, always has been. I don't know how colossal the pile of evidence supporting this has to get before people finally accept it.

gscott1y ago

It is more a money thing. Meta can pay x billion like pocket change. Regular people are run through the wringer to teach the plebs to not get out of line.

bmitc1y ago

It's not a feeling. It's exactly what happens. It's completely blatant.

For some reason, whenever you're a billionaire or company, things suddenly get so difficult that you can claim that it's impossible to be held accountable for anything. Murder, insider trading, laundering, treason, etc.

OpenAI complained about this, as did Google and everyone else. If your company can't exist without stealing data, then it's not a viable company. Companies don't have a constitutional right to exist.

threeseed1y ago

> Google itself got big by indexing other people's data without compensation

Wrong.

a) Robots.txt, which defines what content you wish to make available to third parties, predates every search engine including Google. Web site owners chose to make it available to Google, and search engines have respected their wishes despite it not being in their best interest.

b) The difference here is that OpenAI, Meta etc have not even tried to honour the wishes of copyright holders. They just considered everything as theirs.

c) Google grew big because it had no ads, it had a fast interface, and PageRank was significantly better. It wasn't because it had the most comprehensive index.

karamanolev1y ago

> Web site owners chose to make it available to Google.

Strong disagree. Since robots.txt is optional and the default is "crawl me as you please", website owners don't "choose to make it available", they just don't choose to make it non-available.

RALaBarge1y ago

To your first point, the op said without compensation, not without permission.

tobyhinloopen1y ago

a) If you don't have a robots.txt, you're indexed by default. It's opt-out, not opt-in. If you do nothing, you're being indexed.
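
To make the opt-out point concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper `may_crawl`, the bot name `ExampleBot`, and the example URL are made up for illustration; this only shows how one common parser treats a missing rule set, not how Google's or any other crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the text directly
    instead of fetching it over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed example values, not taken from any real site.
URL = "https://example.com/article.html"
NO_RULES = ""                              # site owner did nothing
OPT_OUT = "User-agent: *\nDisallow: /\n"   # site owner explicitly opted out

print(may_crawl(NO_RULES, "ExampleBot", URL))  # no rules: crawling is allowed
print(may_crawl(OPT_OUT, "ExampleBot", URL))   # explicit Disallow: crawling is refused
```

With no rules at all the parser grants access, and it only refuses once the owner has written a `Disallow` line, which is the "do nothing and you're indexed" default described above.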

veggieroll1y ago

Robots.txt is irrelevant after hiQ Labs v. LinkedIn (2019)

fredgrott1y ago

Point c is wrong... they've had ads since the original Yahoo contract.

boesboes1y ago

Wrong. Google ignores robots.txt entirely

peterbonney1y ago

The more I learn about how AI companies trained their models, the more obvious it is that the rest of us are just suckers. We're out here assuming that laws matter, that we should never misrepresent or hide what we're doing for our work, that we should honor our own terms of use and the terms of use of other sites/products, that if we register for a website or piece of content we should always use our work email address so that the person or company on the other side of that exchange can make a reasonable decision about whether we can or should have access to it.

What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.

Suckers. All of us.

wrs1y ago

And if you were in any doubt before, this lesson is now exemplified by the holder of the highest office in the land and approved by popular vote. The rewards of acting ethically are, unfortunately, sometimes only personal. This must be a hard environment to raise children in, given the examples they see around them.

Barrin921y ago

>What we should have been doing all along is YOLO-ing everything

No it isn't. The actual sucker attitude is copying what they do. You should act morally and with integrity out of respect for yourself. I never had any illusions that large tech companies act with respect towards the law, but it also has nothing to do with me.

afandian1y ago

If you have a spare few hours, the Acquired podcast episode on Meta is enlightening. They just stumbled through growth hack experiment after experiment without seemingly any risk assessment or ethics.

77pt771y ago

> It's only illegal if you get caught

Not quite. It's only illegal if you get caught and you are the wrong kind of person.

For the right kind of person, not even a slap on the wrist.

clueless1y ago

yep, pretty much.

callc1y ago

This sort of mindset is devoid of morals and honor. Don’t fall into this mindset trap.

Like when Trump said he is “smart” for evading taxes during the presidential debates (IIRC the first ones, not recent ones).

It’s absolutely despicable. Have a moral compass. Treat people fairly. Be nice. Let’s be better than toddlers who haven’t learned yet that hitting is bad, and you shouldn’t do it even if mommy and daddy aren’t in the room.

hall0ween1y ago

<Tether's ears burning>

JW_000001y ago

I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:

> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.

Following that reference:

> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).

(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)

Furthermore, they state they trained on GitHub, web pages, and ArXiv, which all contain copyrighted content.

Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material? That it was trained on copyrighted material is certain.

[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971

[Gao et al., 2020] https://arxiv.org/pdf/2101.00027

gameshot911OP1y ago

Critically, by torrenting they also directly distributed the copyrighted material itself. That is a standalone infringement separate from any argument about trained LLMs.

Workaccount21y ago

There are two different things when it comes to discussing training LLMs on "copyright" protected data, and I almost never see people differentiate.

1.) Training on copyrighted work that is publicly available. You write a poem and publish it online for the world to read. That is your IP; no one else can take it and sell it, but they are free to read and be inspired by it. The legality of training on this is in the courts, but so far seems to be going in favor of LLMs.

2.) Training on copyright that is not publicly available. These are pretty much pirated works or works obtained by backdoor to avoid paying for them. Your poem is behind a paywall and you never got paid, yet the poem is known by the LLM. This is just straight illegal, as you legally must pay to view the work. However there might be conditions here too like paying for access to an archive and then training on everything in it.

unraveller1y ago

Trained on doesn't mean significant inclusion in the final state.

Is it truly a violation of copyright when a user hacks out bits and pieces of easily restyled raw data points from a model to look samey? what about if it takes two models? Might be time to accept humans are just cooked in their ability to discern attempts at direct plagiarism - just as it is hard to discern Sky voice from Her voice.

peterclary1y ago

I strongly urge people to read Thomas Babington Macaulay's speeches on copyright, its aims, terms, and hazards. Very well reasoned and explained.

In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.

kshri241y ago

> Thomas Babington Macaulay

The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But I have done what I could to form a correct estimate of their value. I have read translations of the most celebrated Arabic and Sanskrit works. I have conversed both here and at home with men distinguished by their proficiency in the Eastern tongues. I am quite ready to take the Oriental learning at the valuation of the Orientalists themselves. I have never found one among them who could deny that a single shelf of a good European library was worth the whole native literature of India and Arabia."

This chap will educate us on copyright?

No thanks!

demosthanos1y ago

This is the corollary of the fallacy of appeal to authority: the rejection of an argument on the grounds that the speaker was horribly wrong on an unrelated or very loosely related topic.

If you reject Macaulay on copyright because he was an imperialist, you can use the exact same logic to reject the arguments of essentially every person who ever lived. Very few humans who ever wrote anything important will perfectly align with your morality, and most will be horribly misaligned in at least one way.

fL0per1y ago

Edit.: OMIT THIS FIRST PARAGRAPH¹.

Very nice of you to omit the following sentences of that excerpt, where it proceeds to develop its point in the argument for instituting an English-language based education system in British India. He praised how superior in quantity and quality the Sanskrit and Arabic corpora were, compared to European works, in lyric/poetry. But he held that no technical or didactic literature amounted to even the most mundane of the European manuals, like those used at the time in England's humble schools (and it seems completely plausible).

He was a fierce abolitionist. So much for accomplishing the mission of allegedly, judging by comments in this thread, 'deranged imperialist destruction and chaos imposition over the lesser ones'.

I'm not much versed in his speeches/stance on copyright, but I can vouch for the fact that the most honest and well-intended moves (not by him, by other figures) in defence of everyone's intellectual property were made in the same century. From the Twentieth onwards, it has only been twisted for the interest of a select few, and there is no need to ask where we are today in terms of caring about anybody's intellectual property.

[1] Just saw your other comment where you go on with his nauseating words. One just cannot comprehend that framing the past on the actual status quo is as futile as to not being even wrong, I guess?

Terr_1y ago

I kind of hate it that the auto-complete in my brain launched off in this direction:

> The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But

... Here's what they mean, from ChatGPT."

bbor1y ago

I’m a huge IP hater and am sure that happens, but to be fair, letting copyright extend past death also increases the amount the author can sell it for in the first place.

ttyprintk1y ago

The current workaround is to attribute footnotes to your beneficiaries, or quote them in the dedication. Those become derivative works subject to the lifetime of your beneficiary.

golergka1y ago

> in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher

He was able to sell it because it is something valuable, exactly because of the copyright protections. Regardless of whether the author sells the rights or not, he and his family would equally be better off with copyright.

grayhatter1y ago

Why does this argument remind me so much of slavery apologist arguments?

Copyright as written serves the interests of publishers who don't create valuable works more than the creators of the work...

arresin1y ago

This one example does not make stealing acceptable which is what you’re implying.

mik19981y ago

Libgen is a civilizational project that should be endorsed, not prosecuted. I hope one day people will look at it and think how stupid we were today to shun the largest collection of literary works in human history.

greeniskool1y ago

Anna's Archive encourages (and monetizes!!) the use of their shadow library for LLM training. They have a page dedicated to it on their site. You pay them, and they give you high download speeds to entire datasets.

adamsb61y ago

I wonder how much more libgen traffic can be attributed to the lawsuit.

When Metallica sued Napster, for many people the reaction was, "wait I can download music for free?"

luqtas1y ago

Libgen turns into a problem when you have a company developing generative AI with it, either giving money to GPU manufacturers or themselves with paid services (see OpenAI)

qup1y ago

What are we actually worried about happening?

Are AI-written books getting published?

If they start out-competing humans, is that bad? According to most naysayers, they can't do anything original.

Are people asking the AI for books? And then hoping it will spit out a human-written book word for word?

bbor1y ago

…why? Will people buy fewer books because we have intuitive algorithms trained on old books?

Personally, I strongly believe that the aesthetic skills of humanity are one of our most advanced faculties — we are nowhere close to replacing them with fully-automated output, AGI or no.

rafram1y ago

I think you’re overstating its importance. The internet already makes it possible to order almost any book in existence and have it arrive at your doorstep within a week or so, or often on your ebook reader instantly. And your local library probably participates in an interlibrary loan system that lets you request any book held by any library in the country for free.

LibGen gives you access to a much smaller body of works than either of those. It’s a little more convenient. But the big difference is that it doesn’t compensate the author at all.

Just go to a real library.

intotheabyss1y ago

And what about the other billions of people on the planet who don't even have a library, let alone a doorstep to receive a first world delivery service?

Cyph0n1y ago

1. We are not talking about physical books.

2. DRM is built in to most purchased ebooks, which means you can’t consume the book on just any device. “Illegal” tools exist to circumvent this.

3. Large ebook stores - like other digital stores - essentially lend you a copy of the book. So when they are forced to pull a book, they’ll pull your access too.

Of course, now that the big players have consumed/archived the entire book dump, they can go ahead and kill it to prevent others from doing the same thing.

ALittleLight1y ago

It is *much* more convenient. When a research path takes me to an article or book - I could buy or order or go to a physical library, but that would take hours or days. I could also open it as a PDF in seconds. If you need to read a chapter from a book, or an article, or skim one to check whether it's worthwhile, 20-30 times to figure something out, then libgen is the difference between finishing in a day or a month.

thfuran1y ago

There are a whole lot of books that are out of print, and if a book went out of print before ebooks were a thing, it probably doesn't have a legal digital edition either.

mik19981y ago

No one sells scans of older books, which are often sparsely available in obscure (often private) libraries.

sva_1y ago

Libraries can burn down (see Library of Alexandria), civilizations end (see various). LibGen makes it possible for an individual to backup a snapshot of cumulative human knowledge, and I think that's commendable.

greenavocado1y ago

> LibGen gives you access to a much smaller body of works than either of those.

> Just go to a real library.

The thrill of waiting a week for a book to arrive or navigating the labyrinthine interlibrary loan system is truly a privilege that many can afford. And who needs instant access to knowledge when you can have the pleasure of paying for shipping or commuting to a physical library?

It's also fascinating that you mention compensating authors, as if the current publishing model is a paragon of fairness and equity. I'm sure the authors are just thrilled to receive their meager royalties while the rest of the industry reaps the benefits.

LibGen, on the other hand, is a quaint little website that only offers access to a vast, sprawling library of texts, completely free of charge and accessible to anyone with an internet connection. I'm sure it's totally insignificant compared to the robust and equitable systems you mentioned.

Your suggestion to "just go to a real library" is also a brilliant solution, assuming that everyone has the luxury of living near a well-stocked library, having the time and resources to visit it, and not having any other obligations or responsibilities. I'm sure it's not at all a tone-deaf, out-of-touch recommendation.

yoavm1y ago

We all like hating big corporations, especially Meta, and people seem to use this as an opportunity to advocate for punishing them. I think it's wiser to advocate for changing our IP laws.

_Algernon_1y ago

We're sick of the double standards.

https://en.wikipedia.org/wiki/Aaron_Swartz#United_States_v._...

https://en.wikipedia.org/wiki/Aaron_Swartz#Death

While Aaron Swartz was bullied to suicide, these corporations will walk free and make billions. I say give every tech CEO the Swartz treatment, then change the law.

nashashmi1y ago

The lesson here is to make sure you only break the rules within the limits of severity that your wealth class allows.

MIT students will get away with breaking bigger rules than community college students will.

IncreasePosts1y ago

Swartz committed suicide because he was mentally ill. He also attempted suicide multiple times in his life while not being "bullied".

If he was acting rationally and came to the conclusion that dying was better than spending X years in jail, he would have committed suicide after sentencing, not before any trial had even happened.

scotty791y ago

Double standards are how the law has been practiced since time immemorial. Copyright is a Disney-Sony law made up a few decades ago for no reason other than money. Pick your battles.

crazygringo1y ago

Why not change the law first?

Two wrongs don't make a right. If a law is unjust, then what good is there in continuing to punish people who have broken it, just because other people have been punished in the past?

Either you think the law is just or unjust. If you think it's unjust, I can't possibly see how you think people should be punished for it. Meta wasn't responsible for what happened to Aaron Swartz.

palata1y ago

You're conflating different problems.

Big corporations are too big, they should just not exist. When you have corporations more powerful than the government of the biggest states, it's a bug, not a feature.

The IP laws may need rethinking. Saying that they should disappear because big corporations are above the law doesn't help, though. First kill the big corporations, then think about fair laws. Changing the law now would not change anything since those corporations are already above the law.

alickz1y ago

> First kill the big corporations, then think about fair laws.

It's not possible to kill big corporations before fair laws, because as you said yourself "corporations are already above the law"

Unfair laws don't apply to big corporations, they only apply to the people opposed to big corporations

It's akin to hamstringing a horse and saying you'll fix it when they win

larodi1y ago

Perhaps they just did, or we are doing it - basically this should lead to the abolition of copyright on any published article there is. Not sure how it’d impact open source; we’ll either have all of it open, or none at all.

qudat1y ago

> When you have corporations more powerful than the government of the biggest states, it's a bug, not a feature.

The only distinction between corporations and governments is that one of them is a morally bankrupt arbiter of force.

qup1y ago

How do you suggest making them smaller?

For instance, what if Google were still just serving search results w/ ads, and they never expanded that? How would you make them smaller?

Nuzzerino1y ago

> When you have corporations more powerful than the government of the biggest states

I don’t know how you define powerful, but I highly doubt it is at that point.

BeetleB1y ago

> Big corporations are too big, they should just not exist.

Nor should big governments.

Nor should big countries, for that matter.

therealdrag01y ago

Economies of scale generate value

lrvick1y ago

I truly hope Meta has a serious security issue that burns their company to the ground.

That said, I want them to burn for the right reasons.

Downloading data that should be available to the public is not one of them.

lblume1y ago

Exactly. Everyone should have the right to have access to this.

yodsanklai1y ago

Big corporations don't have morals or ethics. They'll break any laws as long as it's profitable. There's no point complaining about Meta or Zuck. Meta does what it's designed to do. If people aren't happy, they should vote for more regulations.

JKCalhoun1y ago

...and boycott the offender's products.

Ekaros1y ago

First punish them. Then change the laws.

anticensor1y ago

In many countries, that'd trigger an automatic release/repayment of unjustly imposed fines.

DaSHacka1y ago

I bet you and my "first build the product, then worry about security" manager would get along.

blueboo1y ago

We may in retrospect find that the moment has already passed when "big corporations" became more powerful and impactful on our lives than the IP laws on the books. After all, we can already plainly see those laws only come into effect when useful to the powerful.

aprilthird20211y ago

I think most of the public is probably in favor of stronger IP laws now that big corps are threatening to make them jobless with IP-disrespecting AIs

rchaud1y ago

Something tells me stronger IP laws will be drafted by holders of that IP, with little if any regard to the potential for job losses for regular people from AI.

Nasrudith1y ago

Most of the public has jobs based upon IP? While it is probably a bigger share than farming, I doubt that. The actual drivers appear to be a mixture of hysteria, and reflexive anti-corporate sentiment as we see even self-proclaimed leftists going "WTF, I love copyright now!".

freeAgent1y ago

The point is about the hypocrisy and double-standards evinced by this behavior.

jillyboel1y ago

First we must prosecute Meta into committing suicide like was done to Aaron Swartz. After justice is served, we should change IP laws.

boesboes1y ago

They broke the law and should be punished for that. Whether the law should change is a separate discussion.

Also, change the law so this is legal for poor meta? smh..

miltonlost1y ago

Big corporations all like hating their consumers and the law. You love committing crimes, it seems.

DaSHacka1y ago

I fail to see how you arrived at GP being a hobbyist criminal based on their suggestion that IP laws need to be modernized.

fimdomeio1y ago

It really makes you think about those crazy internet folks from back in the day who thought copyright law was too strict and that restricting humanity's access to knowledge in such a way was holding us all back for the benefit of a tiny few.

jeroenhd1y ago

I'm all for chopping up copyright law. But until we do so, companies like Meta need to be treated just like everyone else.

That means lawsuits, prison sentences, and millions in fines. And that's just the piracy part, there's also the lying/fraud part.

Interestingly, a Dutch LLM project was sent a cease and desist after the local copyright lobby caught wind of it being trained on a bunch of pirated eBooks. The case unfortunately wasn't fought out in court, because I would be very interested to see if this could make that copyright lobby take down ChatGPT and the other AI companies for doing the same.

Workaccount21y ago

>need to be treated just like everyone else.

So a copyright warning letter in the mail from their ISP? Maybe someone should tell them about VPNs...

stefan_1y ago

The more concerning thing is that the best thing these overpaid people could come up with was.. download the torrent, like everyone else. Here you are, billions of resources, and no one is willing to spend a part of it to at least digitize some new data? Like even Google did?

dietr1ch1y ago

I think they are morally required to improve the current state.

- Seed the torrent and publicly promote piracy pushing lawmakers.

- Contribute with digitisation and open access like Google did in the past.

- Make the part of their dataset that was pirated publicly accessible.

- Fight stupid copyright laws. I can't believe that copyright lasts more than 20 years. No field moves that slowly, and there should be tighter limits on faster moving fields.

fsflover1y ago

> crazy internet folks from back in the day

You mean Electronic Frontier Foundation? https://www.eff.org/issues/innovation

Workaccount21y ago

Probably the single biggest thing I learned growing up is that you can safely live by "Everyone is in it for themselves".

It's incredibly rare to find people who hold ideals that are detrimental to their own life.

gameshot911OP1y ago

Beyond illegal downloading and distribution of copyrighted content, the article also describes how Meta staff seemingly lied about it in depositions (including, potentially, Mark Zuckerberg himself).

malfist1y ago

Huh, a big tech CEO lied to us?

Flippant response I know, but too many people worship at the altar of the job creator and believe these folks are moral, upstanding citizens

bmsleight_1y ago

So if I torrented and seeded, I would be doing it for my own entertainment, not commercially. I expect big copyright holders to come after me. If Meta does it - I guess they have better lawyers?

Could make interesting case law.

unification_fan1y ago

> Could make interesting case law.

Yeah, to perpetuate this system where only those who can afford lawyers get to benefit

echoangle1y ago

Since it’s case law, everyone would benefit from the precedent

nyoomboom1y ago

Remembering Aaron Swartz in this moment

stingraycharles1y ago

Which was arguably more innocent — scientific papers.

piyuv1y ago

Meta is not “innocent”, and comparing this instance with Swartz is a huge offense to his legacy.

qup1y ago

Would Aaron have preferred us to download the material and train the AI?

zackmorris1y ago

Is there a concept in the legal system of first-come-first-served that could be used as precedent?

What I mean is: when someone is prosecuted for copyright infringement, but Meta isn't, then could the case be put on hold until Meta is found guilty and pays a fine?

Also maybe the fine on the later case would have to be proportional to the prior case. So if Meta pays $1 per infringement, the penalty might be $1 for torrenting something else (which is immaterial and not worth the justice system's time) so pretty much all copyright infringement cases would get thrown out.

It reminds me of how mainstream drug addicts get convicted and spend years in prison, while celebrities get off with a warning or monetary fine.

hnfong1y ago

Lawyers (and hence, judges) are really good at arguing why the earlier case does not apply in a present case, even if most reasonable people would think the two cases are essentially the same.

It's a fundamental part of lawyer training, and if they want to let BigCorp go and bring the hammer down on the little guy, they can make up a hundred reasons for it.

Ekaros1y ago

Considering prices for single work, this must be multi-billion dollar compensation.

Take for example the 675k paid for 31 songs. So roughly 20k a song. If we estimate a book to be, say, 10 MB, that would be about 8 million works. So I think reasonable compensation is something along the lines of 163 billion. Not even 10 years of net income. Which I think is entirely fair punishment.
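
For anyone who wants to check the arithmetic in the estimate above, here is a minimal sketch in Python. Every input is one of the comment's own round numbers (the 10 MB average book size in particular is a guess, not a measured value), and the constant names are made up for illustration.

```python
# Back-of-the-envelope check of the estimate above.
# All inputs are the thread's round numbers, not legal findings.

DATASET_BYTES = 81.7e12   # 81.7 TB (decimal terabytes), from the headline
BOOK_BYTES = 10e6         # ASSUMPTION: average book size of 10 MB
AWARD_USD = 675_000       # award cited in the comment
SONGS = 31                # number of songs covered by that award

per_song = AWARD_USD / SONGS             # unrounded per-work figure
works = DATASET_BYTES / BOOK_BYTES       # estimated number of works
total_rounded = works * 20_000           # using the rounded 20k per work
total_unrounded = works * per_song       # using the unrounded per-song figure

print(f"per song: ${per_song:,.0f}")
print(f"estimated works: {works:,.0f}")
print(f"total at $20k per work: ${total_rounded / 1e9:,.1f} billion")
print(f"total at unrounded rate: ${total_unrounded / 1e9:,.1f} billion")
```

The 163 billion figure corresponds to about 8.17 million works at the rounded 20k each. With the unrounded per-song amount the total comes out somewhat higher, and it scales linearly with whatever average book size you assume.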

ricardobeat1y ago

Beyond the absurdity of those amounts, the funny thing is that the authors wouldn’t ever see a dime of that money. Not in the music case, not in this one either. Fairness?

karel-3d1y ago

Meta argues that it's fair use, and that they just downloaded, and never seeded, all the torrents.

qiqitori1y ago

They have so much bandwidth and never seeded anything? Damn leechers! That is not fair use of torrents at all!

TheJoeMan1y ago

The article is purposely conflating the downloading with the seeding statistics. Saying "just 0.008%" the size resulted in big punishments is confusing when Meta is also saying they set their client to be leechers.

larodi1y ago

No they never seeded the essence of it ALL :;))

HeatrayEnjoyer1y ago

Seeding and downloading are in the same protocol. You can't do one without the other

pinoy4201y ago

For creating a backup of library genesis. No. They should be awarded a philanthropic prize.

striking1y ago

There's evidence of them seeding back as little as possible. I'm not sure how that's "creating a backup".

panki271y ago

They could have at the very least seeded some more, to give something back to the, uh, community.

RobotToaster1y ago

Before I decide my opinion on this I need to know their ratio.

adamsocrat1y ago

Article states: Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur"

malfist1y ago

Big tech taking and not giving back, where have I heard this before?

MaKey1y ago

Damn leechers!

RobotToaster1y ago

In that case, throw the book at them.

wnevets1y ago

My ISP will shut off my internet if it catches me torrenting copyrighted material, but if you're a massive corporation that steals TBs of data it's barely a blip in the news.

freeAgent1y ago

Wouldn't it be amazing if all of Meta's ISPs cut them off for torrenting? One can dream...

gkbrk1y ago

You should look into changing your ISP, or at least get a VPN.

lrvick1y ago

This should be legal. Copyright law does more harm than good.

The only ethical problem here is that only Meta sized companies can afford to pay the "damages" for such blatant law violations at worst, or the fees of their lawyers at best.

maronato1y ago

Copyright law does more harm than good to individuals who just want to learn and enjoy content without profiting from it.

Companies like Meta and OpenAI, however, should definitely have to pay to use the hard work of humans to train their AI.

pleeb1y ago

If an individual was the one torrenting almost 82 TB of copyrighted books, the damages they would have to pay would be in the trillions (mostly because of how broken the copyright law system is)

moffkalast1y ago

If only these corporations with vested interests in permissive copyright would put their money where their mouth is with lobbying for a change. Or is that only allowed when they're trying to do something scummy? I forget.

belter1y ago

"Supposedly, Meta tried to conceal the seeding by not using Facebook servers while downloading the dataset to "avoid" the "risk" of anyone "tracing back the seeder/downloader" from Facebook servers, an internal message from Meta researcher Frank Zhang said, while describing the work as in "stealth mode." Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur," a Meta executive in charge of project management, Michael Clark, said in a deposition..."

They will be getting a lot of Frommer Legal letters...

bigmattystyles1y ago

The question is, if they could and would have paid for each book, would it be ok to train the LLM on them? I'm talking about prior books, I'm sure new books have language forbidding their use to train LLMs at the point of sale. But legally, how does using a book to train a LLM differ from a teacher learning from a book and teaching its contents to their pupils. Obviously, the LLM can do so at scale, but is there a legal difference?

dragonwriter1y ago

> The question is, if they could and would have paid for each book, would it be ok to train the LLM on them?

Whether training an AI model on an array of different works, many of which are copyright protected, is itself a copyright violation, in addition to or distinct from any copyright violation that goes on in gathering the dataset for training (and separate from any copyright violation in the actual or intended use of the LLM), remains to be resolved as a legal question, and may or may not have a simple yes or no answer (or the same answer under every system of copyright laws globally).

My inclination is that it is probably generally not a violation in US law, but that's not something I am very confident in; how the definitions of copy and derivative work apply to determine if it would be without fair use, and how fair use analysis applies, are not clear from the available precedent.

> But legally, how does using a book to train a LLM differ from a teacher learning from a book and teaching its contents to their pupils.

It is very clear, by looking at how US copyright law is written and even more clear in its history of application, that information stored in the brains of people is without exception neither a copy nor a new work that can be a derivative work under US law, and so cannot be infringing, no matter how you gain it. It’s also very clear in the statute itself and the case law that data in media used by artificial digital computers, on the other hand, can constitute copies or derivative works that can be infringing. Even if the process is arguably similar in legally relevant manners, copyright law is critically focussed on the result and whether it is a particular kind of thing which can be infringing, not just the process.

CryptoBanker1y ago

A LLM is not a person. That is the legal difference...until we have Citizens United v2

liendolucas1y ago

For some mysterious reason I can't see Zuckerberg in front of a judge facing 50 years imprisonment. Can anyone?

I truly hope that whoever takes the case goes after Meta with 1000 times the pressure that was put on Swartz, but honestly I don't expect much, just as the top comment precisely expressed.

And if we are going to be fair, please let's also not forget about the other usual suspects, or does anyone think they are falling behind?

impossiblefork1y ago

There are other countries than the US though and if rightsholders wish to sue, lawsuits can happen there too.

Several EU countries, Switzerland, South Korea, Japan, etc. are viable countries to sue from. Even in Japan which has a law specifically permitting training on copyrighted material you must still obtain it legally-- i.e. you must license it.

Havoc1y ago

Really curious what the judges are going to do here.

Horse has functionally bolted on this already

I’m guessing slap on wrist despite courts going after individuals for a couple of movies torrented pretty hard

aprilthird20211y ago

Is there any other possible outcome than a fine? That too one which will not really affect Meta's overall earnings

Havoc1y ago

Ideally we have a conversation about how we as a society have ended up in a situation where we have a two tier justice system.

At a minimum the starting point of discussion here should be that if life ruining $80,000 per item is an acceptable fine for individuals then why is it not the same for corporations. Which would probably get you a number in the trillions at which point we could have a discussion about reforming this entire system.

But yes realistically slap on wrist is what is going to happen here.

hnfong1y ago

> Is there any other possible outcome than a fine?

Yes, of course.

It's quite possible that judges realize that if they restrict training data to licensed materials, LLMs will become stupid and China will overtake the US to become the leader in AI, and because that can't happen, they'll make up some reason to make training on unlicensed data legal. It's definitely fair use!

I'm not even joking. Last time the US Supreme Court basically said "Android is too important, we have to declare its use of Java API fair use."

empath751y ago

The reality of the situation is that the economic value and utility of AI is going to cause the laws to be restructured around them.

woadwarrior011y ago

I wonder what happened to the related OpenAI training GPT3 on the books3 dataset story[1] from ~2 years ago?

[1]: https://www.wired.com/story/battle-over-books3/

gundmc1y ago

I think this one is different because the legality of training on copyrighted material is an open legal question while distributing/seeding copyrighted material is decidedly illegal.

ksynwa1y ago

A good chance for federal prosecutors to "send a message" as they did with Aaron Swartz but I don't see things going that way.

acomjean1y ago

If you were wondering why Meta was making a lot of donations to the new government (including settling a lawsuit for 25 million with the new president, 1 million to the inauguration)…. I suspect there will be no federal charges.

The rules have always seemed different for corporations regardless.

https://www.businessinsider.com/trump-settles-lawsuit-meta-m...

Nasrudith1y ago

Well of course, bullies always prefer targets that can't fight back. That itself is unfortunately a basis of the legal system from it being run on flawed monkey brains. Why else is hitting vulnerable children okay but getting into a consensual bar fight illegal?

courseofaction1y ago

Even after JSTOR declined to press charges in that case. Despicable. The US has dug the hole it's going down.

openplatypus1y ago

Something tells me uncle Donald will exonerate his new favourite lapdog from any criminal or civil liability.

Terr_1y ago

IANAL but the pardon power (A) only extends to criminal punishments, not civil liabilities and (B) copyright lawsuits can be launched by anybody, not just the Department of Justice.

So, barring further Might Makes Right shit--which I'm not willing to fully rule out--Trump can't fully shield Zuckerberg et al.

HPsquared1y ago

If you owe the bank $1,000 it's your problem; if you owe the bank $1,000,000,000 it's the bank's problem.

651y ago

I'm more interested in piracy not being highly prosecuted than I am in Meta getting punished for this. I'm not trying to spend 20 years in jail for pirating a TV show.

fsflover1y ago

Support EFF if you think that the copyright laws should be changed and also applied equally to all: https://www.eff.org/issues/innovation

sva_1y ago

> By September 2023, Bashlykov had seemingly dropped the emojis, consulting the legal team directly and emphasizing in an email that "using torrents would entail ‘seeding’ the files—i.e., sharing the content outside, this could be legally not OK."

I'm pretty sure you can theoretically download torrents without seeding, although this is frowned upon. If they really seeded (with full bandwidth?) that's indeed pretty brazen.

It is sort of strange that Meta is being singled out here though, and sort of sad considering they at least release the model weights. What's the signal? Do illegal shit to be competitive, but make sure there is no evidence?

voidUpdate1y ago

You can, in transmission for example you can just set the seed percentage to 0%. I recognise that this makes me a bad torrenter, but I've been told in the past that my ISP wont be too happy about me seeding, and they already do something screwy to torrents I access through the surface web, so I'm just playing it safe

jokethrowaway1y ago

Great, can we get the full Kim Dotcom treatment for Zuckerberg now?

I'm also ok with abolishing copyright altogether if he's too untouchable

mnsu1y ago

So according to some AI, the damages awarded per infringed work is ~$750 minimum in the US. 80TB of books, each let's say 10MB on average, would be 8 million works. So Meta should pay 6 billion USD for their copyright infringement?

gorbachev1y ago

Minimum doesn't cover willful copyright infractions, for which maximum penalty is $150K per work. That comes out to quite a different number.
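
To put the two comments above side by side, here is a rough sketch in Python of the range they imply. The per-work amounts are the ones quoted in this thread (not independently checked here), the work count rests on the assumed 10 MB average book size, and the constant names are only illustrative.

```python
# Range of statutory damages implied by the figures quoted in the thread.
# ASSUMPTIONS: 80 TB of books at an average of 10 MB each; the per-work
# amounts are the thread's quoted minimum and willful maximum.

DATASET_BYTES = 80e12
BOOK_BYTES = 10e6
MIN_PER_WORK_USD = 750
MAX_WILLFUL_PER_WORK_USD = 150_000

works = DATASET_BYTES / BOOK_BYTES          # estimated number of works
low = works * MIN_PER_WORK_USD              # everything at the minimum
high = works * MAX_WILLFUL_PER_WORK_USD     # everything at the willful maximum

print(f"estimated works: {works:,.0f}")
print(f"low end:  ${low / 1e9:,.1f} billion")
print(f"high end: ${high / 1e12:,.1f} trillion")
```

Under these assumptions the low end matches the 6 billion figure, and the high end lands at about 1.2 trillion, which is the "quite a different number". A court would not simply multiply like this, so treat both ends as bounds on a thought experiment rather than a prediction.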

oersted1y ago

Nice calculation, that’s actually quite doable for them, they have already been paying similar fines for a while.

timeon1y ago

Prosecutors filed for Swartz 50 years of imprisonment and $1 million in fines.

Can you calculate how many years that would be for Mark and his people?

qup1y ago

I ran it, it came out to zero

perihelions1y ago

Best way to "punish" Meta is to slash the Gordian knot and abolish copyright. Level the playing field, incrementally, for everyone else who isn't a trillion-dollar corporation.

The alternative is a futile legalistic attack against a monopoly entity too powerful to be meaningfully punished. That won't accomplish anything useful. It would, rather, help cement this status quo, where copyright infringement is selectively legal or illegal, for different entities at the same time; and companies like Meta thrive arbitraging that difference. You can't defeat Meta—but you can help dig them a moat.

miltonlost1y ago

Ridding copyright would level the playing field for individuals and companies????!!!! Getting rid of laws that protect the individual only will help the larger empowered businesses.

Workaccount21y ago

>only will help the larger empowered businesses.

I'm pretty sure I could list ten megacorps that would collapse overnight if copyright was abolished. The music groups, movie studios, streaming platforms...

nkrisc1y ago

What's the alternative to copyright then? Anything I create will be instantly reproduced and sold for less than I can afford to by some entity far larger and more efficient than me.

> Level the playing field, incrementally, for everyone else who isn't a trillion-dollar corporation.

There is no level playing field when you have individuals and trillion-dollar companies in the same market.

clueless1y ago

Right, all this talk about getting rid of copyright and no one is talking about what should replace it? How would we incentivize people to write good books? To pour 1000s of hours of their time into producing new knowledge?

9999000009991y ago

"Say they hood robin, ain't that a b*, take from the poor and give to the rich."

- Ice Cube.

Meta will face no consequences. Say you're a small publisher and you'd like a bit of compensation. If you dare sue, Meta can just blacklist your books on its platforms. Even if they don't, you probably don't have the money to sue one of the biggest companies on earth.

I think copyrights should be limited to 25 years after first publication. This would fix plenty of issues and give the AIs of the world plenty to learn from.

Who am I kidding, Meta will take what they will. For that author making 20k a year, be honored to be of use to Meta.

bwfan1231y ago

can people vote with their feet, and leave the platform ?

but the masses are addicted to the slop that meta feeds them.

rvz1y ago

Maybe you should go after the worst offender (OpenAI) first before going after Meta, since the latter already gave back their model away for free for everyone and the architecture.

We will know why OpenAI isn't getting investigated.

hruzgar1y ago

So true. It seems like there is a controlled operation to shut open models down starting with Meta. Obviously they can't go after deepseek atm

unraveller1y ago

Could be why OpenAI paid them so much, to go after their open-source competition hardest of all.

postepowanieadm1y ago

That's horrible! Magnet anyone?

addandsubtract1y ago

Anna's Archive: https://annas-archive.org

immibis1y ago

specifically https://annas-archive.se/torrents - this is a meta-project which aggregates illegal copyrighted material from other illegal projects. You absolutely should not download any material this page links to, although you can use it for the purpose of researching about shadow libraries.

pinoy4201y ago

Library genesis

ykonstant1y ago

Weird shenanigans are happening in libgen at the moment; better go through Anna's Archive to look for the items you want, it will link you to the corresponding mirrors more reliably.

At least this has been the recent experience of a friend who used libgen and anna's archive to download legal, public domain works!

kelseyfrog1y ago

The usual copyright cartel is up in arms, crying theft. But here’s the truth: intellectual property is a state-enforced monopoly, not real property.

Property is based on scarcity - if you take my car, I no longer have a car. But if you copy my book, I still have my book. No loss, no theft, just an outdated legal fiction designed to stifle innovation and enrich rent-seeking middlemen. And no, loss of potential sales doesn't count - it's like being able to claim a lottery ticket has real value.

Copyright was never about protecting creators—it’s about locking down ideas, preventing competition, and extracting endless fees. Shakespeare borrowed, tech companies iterate, and science thrives on free exchange. The idea that knowledge should be locked away indefinitely is absurd.

Meta’s mistake wasn’t using the data - it was pretending copyright still matters. AI is exposing the system for what it is: obsolete. The future belongs to those who create without asking permission.

abigail951y ago

This reminds me of Peter Sunde's "Kopimashin"

https://www.engadget.com/2015-12-21-peter-sunde-kopimashin.h...

It's obviously absurd to enforce copyright as bytes are copied around instead of as it is used. Training an LLM is a different thing than re-hosting and giving away copies to other people.

If you don't want people to transform your works - keep them private. You don't own ideas.

golly_ned1y ago

As the article says, Meta /was/ giving away copies to other people by seeding the libgen torrents. This isn't the usual case of "should companies be allowed to train on books".

henriquemaia1y ago

Thanks for the link. I wondered what that word meant.

From the article: Kopimashin, as in Copy Machine.

caterwhal1y ago

Really strange how much torrenting is demonized by all of these companies and ISPs when individuals want to use it but when a company like Meta uses it there is so little scrutiny.

seydor1y ago

We have at least 4 types of ill-defined concepts of property in the 21st century , largely due to our laziness, intellectual inertia and lack of motivation to make forward-thinking definitions for the coming age of AI and ubiquitous access to all information and all communication.

1) the concept of copyright is as old as the word suggests (copies are the least of our worries going forward - it should be possible to define processes for exploitation of ideas in a fair way)

2) we allow humans to learn from other people's ideas and transform them to commercial products and the same should happen for AIs in the future

3) we have an ill-defined concept of "personally identifying information" which gives people ownership to information that others have created via their own means - there should be better ways to ensure a level of privacy (but not absolute privacy) without overly-broad, nonsensical definitions of what is personally protected information

4) We allow social media and other telecommunications media to arbitrarily censor people's speech without recourse. This turns people's speech to property of the social media companies and imposes absolute power on it. This makes zero sense and is abusive towards the public at large. We need legal protections of speech in all media, not just state-owned media.

thfuran1y ago

>we have an ill-defined concept of "personally identifying information" which gives people ownership to information that others have created via their own means - there should be better ways to ensure a level of privacy (but not absolute privacy) without overly-broad, nonsensical definitions of what is personally protected information

What information about me could a corporation create via its own means that would be legally protected but shouldn't be? PII is generally information that a corporation collects. Unless you mean that my cellphone provider creates the association between my name and phone number and should therefore be able to do with it as they please?

seydor1y ago

It's not just about corporations. Banking and government services e.g. are required to keep your personal information stored for years and years even against your will

ofou1y ago

Who would have known that BitTorrent, shadow libraries, and seeders will help to train the best AI models out there, that adds a whole new meaning to a "seed".

gorbachev1y ago

Previous: https://news.ycombinator.com/item?id=42673628

z71y ago

How about a consequentialist argument? In some fields, AI has already surpassed physicians in diagnosing illnesses. If breaking copyright laws allows AI to access and learn from a broader range of data, it could lead to earlier and more accurate diagnoses, saving lives. In this case, the ethical imperative to preserve human life outweighs the rigid enforcement of copyright laws.

KolmogorovComp1y ago

There’s nothing particular to AI about your comment, it’s a general downside of IP.

z71y ago

No, the development of an artificial general intelligence does seem like a special case compared to usual IP debates, particularly in the potential multiplicative positive-sum effects on society overall.

nprateem1y ago

If you're an author with a book likely to have been hoovered up, I wonder what you'd get from the fb models if you asked "complete this in the style of [author] in [book]: [quite a long excerpt]"

If you get a direct quote then you're good with your claim, surely.

Nemo_bis1y ago

That's the NYT's case. Not necessarily very strong. https://www.techdirt.com/2024/03/05/openais-motion-to-dismis...

unraveller1y ago

The way it works counts if you bring prompting into it. It could easily have learned enough style chops of [author] from other sources to mimic/predict those stanzas from raw data points.

Whatever the ruling one thing is for sure, plagiarism is no longer the sincerest form of flattery. The human authors are out for AI blood on this.

aprilthird20211y ago

I believe that is part of this lawsuit pretty much

aucisson_masque1y ago

You wouldn't download a car.

nickpsecurity1y ago

That they’d focus on file sharing over transformation or outputs is exactly the risk I warned the companies about in my AI report. Most datasets, like RefinedWeb and The Pile, also require sharing copyrighted works between people who are not licensed to do that. Many works also prohibit commercial use or have patents on them.

They need to make datasets which don’t have this problem or have entities in Singapore train the foundation models within their rules. The latter has a TDM exemption that would let AI’s use much of the Internet, maybe GPL code, licensed/purchased works they digitize, etc. Very flexible.

nullfield1y ago

I think everyone can see that whatever

(imo not in accordance with the Constitution, after absurdities like deciding “limited time” the way mathematicians might define something of some order of infinity)

the alleged social contract was is not functional the way it was intended, and we see who benefits and who loses.

mass dynamic editing for vitriol and profanity occurred while writing this comment in order to remain within site rules

stevage1y ago

Wow, I'm actually a bit shocked that senior levels of management at Meta were fine with torrenting pirated books. WTaF.

Meta does a lot of stuff I disagree with, but they're usually not just straight breaking the law.

passwordoops1y ago

Eye for an eye. Meta loses rights to 81.7 TB of IP. Transcribed into a text file

cma1y ago

Meta already does that to themselves every year or so, deleting all internal communications.

They've thrown away a huge amount of communication to source code commit reinforcement training data as a result. They do it to avoid emails making it into trials like this.

zaik1y ago

No large company will ever consider training a public LLM on all their internal communications.

yodsanklai1y ago

> Meta already does that to themselves every year or so, deleting all internal communications.

Aren't they obligated by law to keep all internal communication?

scotty791y ago

Seeding it was probably most societally useful thing Meta ever did.

yalogin1y ago

LLMs are worse than search for figuring out what value a specific asset provides to the LLM. At least with search, your work or page is not lost and still gets a click/user interaction, and may give you a chance to monetize the interaction. However, LLMs just don’t have any such option. Gemini adds links but the links they add are completely editorialized by the LLM and need not reflect the original at all. So how does anyone ask for compensation even if they sue?

pjfin1231y ago

Copyright law needs major reform. We need to figure out a way to let authors monetize their work while not making complying with the law so burdensome. We've created a system where people who (understandably) ignore the law benefit at the expense of people trying to do the right thing.

ngneer1y ago

Sounds just like how Facebook got started, harvesting photos without permission. From the Wikipedia article, the Facebook precursor was known as Facemash. On Zuckerberg, "He hacked into the online intranets of Harvard Houses to obtain photos, developing algorithms and codes along the way. He referred to his hacking as "child's play.""

If I were younger, I would be livid.

toss11y ago

>>"vastly smaller acts of data piracy—just .008 percent of the amount of copyrighted works Meta pirated—have resulted in Judges referring the conduct to the US Attorneys’ office for criminal investigation.".....While Meta may be confident in its legal strategy despite the new torrenting wrinkle...

Zuckerberg has paid the vig several times [0,1,2], which is evidently the best legal strategy under this administration. OFC, considering there are already multiple payments, there is no assurance the vig payments won't substantially increase as the Capo sees more opportunity for profit.

[0] https://en.wikipedia.org/wiki/Vigorish

[1] https://www.politico.com/news/2025/01/29/meta-settles-trump-...

[2] https://www.bbc.com/news/articles/c8j9e1x9z2xo

buyucu1y ago

I love this. Large corpos should torrent more. Maybe we'll get better copyright law as a result.

thunder-blue-31y ago

You know the weird thing is - I've never used Meta AI. I've never thought of using it. The only product of FB I use is WhatsApp; however, I've not seen/heard any of my friends using Meta AI for FB, IG, WhatsApp. I really don't understand what their ROI here is...

asjir1y ago

I thought about it for a full day, and I have one idea for how to handle copyrighted data training. It would need to be open / regulated and training till double descent would need to be disallowed, to make sure that the model is not memorizing the data.

kpgraham1y ago

Damn! One of my old books can be found in the Anna's Archive search. The book has been out of print for years. I pity the Meta users who get results based on something that I wrote. (Check Anna's for 'Keith P. Graham', and the first book listed is mine.)

srameshc1y ago

At OpenAI we have seen some employees express their concern publicly about the moral grounds on which the company was acting. We never heard about it from anyone at Meta, but there were some jokes of course. I guess everything is fair in AI and Corporates.

api1y ago

One of the largest businesses of the Internet to date has been piracy. Individual informal piracy has been the smallest component of this. By far the largest has been corporate mass-scale piracy, and LLMs are probably the largest heist to date. They've literally downloaded the sum total of all human thought and knowledge, compressed it into queryable lossy compression models (which is what LLMs are), and are selling it back to us.

Meta, with its "open weights" models, is one of the least guilty parties, since at least they've made the resulting blobs of mass piracy available to us. Same with Mistral, Deepseek, etc.

ClosedAI, Google, and others have all probably done this and more and refuse to make even the model available.

I think the way to deal with this is very simple:

If you have trained your model on works to which you do not have rights or permission, the resulting model is not copyrightable and cannot be sold. It must either be kept for research purposes only or released free of charge and in the public domain. All these models that have been trained on pirated works should become public domain.

Of course now that we have full capture of the US Federal Government I'm sure any suggestion like that would be neutralized with one bribe to Trump.

flojo1y ago

Did they at least seed back?

lvl1551y ago

I’d think people can get together to put this on a public space strictly for training purposes and have the consortium of some sort get paid per use.

But we live in this stupid society where you have to move mountains to change things an inch.

StefanBatory1y ago

As an individual, I would be liable for ~$1000 in damages if I downloaded a movie in Germany or Poland and the publisher tracked me down.

I'm going to assume that since it's a corporation, the laws no longer apply.

Anamon1y ago

That's okay, they should just charge The Zuck with it personally; I'd be fine with that.

Der_Einzige1y ago

The only bad thing about this is that small-time players who do it are treated poorly (Aaron Swartz). IP de facto not existing for AI companies is a feature, not a bug.

The fact that most of the world embraced hardcore copyright-troll Luddism once the means of economic production for their (badly paying, creative) jobs were democratized implies that most people do not believe in any "egalitarianism", and especially not the left-wing form many profess to believe in. Certainly not "information wants to be free" or any of the other idealist shit that I or Aaron Swartz believed in. What Meta did was software communism, full stop. They literally released their models to the public! I support all of this 10000%. The only issue is that they're not open enough (they should fully open source the dataset).

So, unironically, good! Thank you, please pirate more! Please destroy the US IP system while you're at it. Copyright abolitionism is good and thank you Zuckerberg!

pilimi_anna1y ago

We're grateful to Meta for helping seed and backup our torrents. The more copies the better. Thank you Meta, for helping preserve humanity's legacy! :)

djyaz12001y ago

“Behind every great fortune lies a great crime” -Honoré de Balzac

antirez1y ago

Copy-right is not learn/train-right. That said, Meta fills its mouth with talk of open source while releasing models that are neither SOTA nor usable for commercial purposes.

black_puppydog1y ago

Wouldn't it be a real shame if the entirety of the US constitution, laws, and legal precedent went out the window these days, and the only thing left unscathed was the rotten mess that is copyright law? Just saying, this might be the moment to burn it to the ground. Not that it makes up for any of the other stuff going on, but why waste a perfectly good crisis?

maxwell1y ago

I'm sure they'll throw the book at them.

cratermoon1y ago

We're starting to find out that Meta ruined LibGen for the rest of us who used it like a library. Just like how Google screwed over libraries by sending interns to the Stanford library to check out books that they then scanned into Google Books. Not to increase shared knowledge or preserve human artifacts, but to put them all in a museum and, to paraphrase Joni Mitchell, charge the people a dollar and a half just to see 'em.

ezekiel681y ago

Unless Meta 'fessed up to this (which seems unlikely), the headline here is missing the word "allegedly".

papercrane1y ago

Meta admitted to the torrenting more than a month ago. The reason this is in the news is because some of the emails discussing it have been unsealed.

esarbe1y ago

It's okay - they are a multi-billion company. Rules don't apply to them.

Rules are just for us peasants.

dansitu1y ago

I'm fine with them using my books to train an open source model, but it would have been nice to be asked.

lewdev1y ago

It's okay when large corporations download cars. But when you do it, you'll be in trouble.

iimaginary1y ago

We need better laws that would create a better way to do this legally whilst compensating rights holders.

miltonlost1y ago

We need a better justice system that enforces the laws we have on the books, which would help compensate rights owners when big companies, as their own emails show, pirate terabytes of data.

SketchySeaBeast1y ago

I really don't think that Meta did this because the alternative would have been too onerous; they are a huge org, and they could have worked through whatever loopholes were required. They did it because the alternative would have cost money and there will be no penalty for not paying.

impossiblefork1y ago

So, if they're sued in Japan, or France, do you think that the courts will take any special measures because it's a valuable American corporation?

I suspect that if the case is reasonable they will just convict, and quickly-- appeal denied and all simply because the laws are so straightforward.

breppp1y ago

Yes it smells bad but facebook did the right thing (at least for facebook)

After OpenAI trained their models on the famed books2 dataset, and seeing the technological implications of ChatGPT, there was a good chance they would let them get away with it.

Would the USA really surrender its AI technological advantage for trivial matters like copyright? They would make some royalty arrangement and get it over with

mrinterweb1y ago

Remember people getting sued for insane amounts of money per song they torrented? If we applied that precedent to Meta, Meta would need to declare bankruptcy. https://www.cbsnews.com/news/file-sharing-mom-fined-19-milli...

ofslidingfeet1y ago

Yeah well, OpenAI compressed the whole internet into proprietary weights and is now providing access via paid subscription while the original internet gets deleted from our culture.

josefritzishere1y ago

Zuckerberg did more copyright infringement? Shocking!

losvedir1y ago

Hooray! Or wait, are we not doing that anymore?

waltercool1y ago

Based. Free knowledge to the people

zelphirkalt1y ago

Come on, publishers! This is your chance! Now you can really show how you will treat all copyright infringements equally and not only go after easy targets. Show us how you spend all that money on a lawsuit against Meta!

bloopbloopscoop1y ago

Death to intellectual property!

tremarley1y ago

Ebooks are 1-2 MB each, max. 81.7 TB is a lot of books, something like 42-85 million.

weberer1y ago

The article says they got datasets from Anna's Archive. It was most likely the scihub/libgen torrent which is 96.0 TB right now and contains 92,872,581 files. That's about 1 megabyte per file.

https://annas-archive.org/datasets
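
The two estimates above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the function names are made up for illustration, and it assumes every file is one book. It also shows that the 42-85 million range only falls out if "TB" and "MB" are read as binary units (TiB/MiB); with decimal units the same 1-2 MB assumption gives roughly 41-82 million.

```python
# Back-of-the-envelope check of the book-count estimates in this subthread.
# All helper names are hypothetical; the only inputs are numbers quoted above.

TB, MB = 10**12, 10**6      # decimal units
TIB, MIB = 2**40, 2**20     # binary units


def estimate_range(total_tb, min_mb, max_mb, binary=False):
    """Return (low, high) book counts for a dataset of total_tb terabytes,
    assuming each book is between min_mb and max_mb megabytes."""
    tera, mega = (TIB, MIB) if binary else (TB, MB)
    total_bytes = total_tb * tera
    low = total_bytes / (max_mb * mega)   # bigger books -> fewer of them
    high = total_bytes / (min_mb * mega)  # smaller books -> more of them
    return low, high


def mean_file_size_mb(total_tb, n_files):
    """Average file size in decimal megabytes."""
    return total_tb * TB / n_files / MB


if __name__ == "__main__":
    for binary in (False, True):
        low, high = estimate_range(81.7, 1, 2, binary=binary)
        label = "binary" if binary else "decimal"
        print(f"{label}: {low / 1e6:.1f}-{high / 1e6:.1f} million books")
    # 96.0 TB across 92,872,581 files, as quoted from the datasets page
    print(f"mean file size: {mean_file_size_mb(96.0, 92_872_581):.2f} MB")
```

Either way the order of magnitude is the same: tens of millions of files at about a megabyte each.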

southernplaces71y ago

Where does one find these torrent datasets? Did they download the books in bits and pieces or as a single huge multi-TB file?

thunkingdeep1y ago

I’ve got pirated books that are 70-80 MB, I think because of the illustrations. Guess it depends on the book.

mateus11y ago

I don’t think they’re using picture-heavy books for LLM training, no?

squigz1y ago

It could be anywhere from a few million to a hundred million

https://annas-archive.org/datasets

Refusing231y ago

Their whole business is stealing data...

So it's quite funny to see they freely share it too.

ocean_moist1y ago

At least they seeded!

snapcaster1y ago

The powerful do what they can, the weak suffer what they must

jfbaro1y ago

They are getting shittier and shittier

reverendsteveii1y ago

So they're gonna go through every book that was stolen and apply the appropriate penalty, right? Each copyrighted work carries minimum statutory damages of $750 under US copyright law (17 U.S.C. § 504(c)). That will be applied fairly in order to ensure that the rights holder is made whole by the infringer, right?

It's so funny to see the law blatantly ignored by the overlords. Like, there isn't even a pretext anymore. They just steal what they want and budget for the fines and campaign donations to make the consequences go away.

uncomplexity_1y ago

did they not seed enough, is that the crime? lol

Pxtl1y ago

Laws are for poor people.

TZubiri1y ago

I love it. This plotline feels straight out of Cryptonomicon or the Silicon Valley series.

hackerbeat1y ago

One of the many reasons why Zuck’s been sucking up to Trump. He’s in desperate need of some Get-Out-Of-Jail-Free cards.

Same for all the other sleazy tech bros.

lazycog5121y ago

abolish knowledge rentiers

imgabe1y ago

Boo hoo.

We are trying to advance civilization here. To accumulate and make available all human knowledge to date. And you stand there with your hand out to stop this? You are a villain. There is no sympathy for you.

palata1y ago

Good, we know it. Nothing will happen, because nothing happens to billionaires and their companies. Musk is proving it every day now.

jokethrowaway1y ago

This is why we need to abolish the government. If the government doesn't have any power, it can't give preferential treatment to its cronies.

Enough with laws for thee but not for me!

ArnoVW1y ago

I was having difficulty figuring out if this was parody or not. But I guess the username checks out.

palata1y ago

The problem is precisely that those billionaires are too powerful. If anything, we need to abolish the billionaires.

swozey1y ago

I deleted my facebook account about 10 years ago. Downloaded data, deleted. Not deactivated.

Nothing in my life ever made me want to go back until a few months ago, when I got back into playing hockey and found that all the hockey leagues use Facebook to communicate.

I made a new account and had to literally upload a picture of my face to pass verification. A few days later I was banned and couldn't use my account. I assume that's because they searched previous data and compared my face to find out I have a "deleted" (lol) account and matched me. I've assumed they'll only let me log in if I use my original account, the one I deleted 10 years ago.

Fuck meta. Fuck zuck.

1970-01-011y ago

And they're going to get away with it simply because if you or I openly did this the DMCA fines would be for a million trillion dollars. Since Meta shareholders can't stomach a million trillion dollars in fines, their lawyers will wave their magic wands and poof! No laws were broken!

elzbardico1y ago

Nothing is gonna happen. Just a slap on the wrist. And all of us in the intellectual working class (writers, journalists, programmers) will be proletarianized by LLMs that have been:

a) Financed via inflation/"Cantillon effect" due to ZIRP/stimulus that absolutely flooded the market with funny money in the hands of the sharks.
b) Trained upon copyrighted work without compensation.
c) Trained upon open source without even asking politely for authorization.

The Robber Barons from the last century can't even get close to our modern Feudal Tech Lords.

Unless you're one of those of us who amassed multi-generational wealth in an exit in the last 20 years, you're completely fucked.
