Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.
All the sad poor people who might be hurt were already paid. The caterer on your favorite show is not getting residuals. NBC also isn't going to stop making TV shows because that is all they can do. Content creators also existed on the internet long before that was a job. They just did it because they cared about it not for ad money. If you really want to support the artist directly go to a concert or just mail them a check. If you can't actually identify a person who might be hurt, then do not care.
I'm pretty much at the point now where I don't buy the "copyright incentivizes creation" argument any more. Copyright, like advertising, incentivizes creation by enormous corporations, but also like advertising it incentivizes creations that overwhelmingly have little value.
Creative individuals don't need copyright to be incentivized to create—they need a safety net that gives them the freedom to spend time on the creativity that naturally wants to bubble out. If the goal is to encourage creativity, copyright is a lousy and enormously expensive substitute for Universal Basic Income.
In case anybody here doesn't know, that's a reference to Aaron Swartz, an activist (and Reddit co-founder) who was facing 35 years in prison and a $1 million fine just for downloading a lot of academic papers from JSTOR. He eventually took his own life because of the pressure. May his soul rest in peace.
End users, not YouTube employees, right? And they would take things down following DMCA requests and what not, right? So, pretty much following the law?
> Google itself got big by indexing other people's data without compensation
Scraping public websites to build a search index isn't the same as making LLMs that can recreate the source verbatim devoid of even attribution. I do agree there's an argument to be had about the LLM's transformative nature in the end though.
> Spotify's music library was also pirated in the early days
Not any version generally available to the public, and with the copyright holder's permission to do so.
the americans cheated their way to competition,
heck, even before that, the english empire got jumpstarted by stealing gold from the spanish (who were themselves exploiting it away from aztec and other mexican natives)
I'm saying it's business as usual, but also, culture doesn't work like tangible physical widgets so we must stop letting a few steal this boon of digital copying by means of silly ideas like DRM, copyright, patents. all means to cause scarcity
The copyright holders then approved their concept, and subsequently Spotify got the rights to offer their service to customers. Everybody won.
I want to know more, please enlighten me (anyone who knows). I read the book "The Spotify Play" and it made it seem like the pirated music was an internal-only thing and not something available to customers. Is that true?
Just to point out, the material in question was public domain, so nobody even had a copyright claim over it.
It's true, and relevant, that Google would feel those consequences much less sharply than Swartz did.
The limit is what you can actually get away with, not what the rules say you can get away with, and the system aggressively selects players who recognize this. It's amoral - there is no "ought", only "is". An actor gets punished or not, with absolutely no regard to whether it "should" get punished. One thing is consistent: following the rules as written means you lose.
You can see it in Y Combinator (and other) startups. The biggest ex-startups are things like AirBNB (hotels but we don't follow the rules but we don't get punished for not following them) and Uber (taxis but we don't follow the rules but we don't get punished for not following them).
One way to not get punished for not following the rules is to invent a variation of the game where the rules haven't been written yet. I again refer you to AirBNB and Uber; Omegle also comes to mind, although they didn't monetize.
Viewed in this light, Aaron Swartz's mistake was not the part where he downloaded journal articles, but the part where he got caught downloading journal articles. Shadow library sites are doing the same thing, minus the getting caught. So are Meta and Google and OpenAI. sci-hub is only involved in a lawsuit because it got caught and is now in the stage where it finds out whether it gets punished or not.
MegaUpload did the same, and Kim Dotcom got raided in his sleep by the FBI in New Zealand! So no, I don't buy your reductionist argument. There are forces at play that allow companies with founders like Google's to get away with it, but not others.
To this day, there are a huge number of videos that show copyrighted content on YouTube; they are usually crappy clips, reversed and with different music playing in the background to avoid automated detection.
I don't understand why you wouldn't just buy copies of the books. Seems like such a relatively inexpensive way to strengthen your legal case.
Some can steal from stores and see no repercussions.
Some can steal from others and see no repercussions.
Some can violently harm others and see no repercussions.
Some can damage property and see no repercussions.
Some can’t. This world is not right.
"Ek, who had been the CEO of the piracy platform uTorrent, founded Spotify with his friend, another entrepreneur named Martin Lorentzon. Both (Ek at 23 and Lorentzon 37) were already millionaires from the sales of previous businesses. The name Spotify had no particular meaning, and was not associated with music. According to Spotify Teardown, the company developed a software for improved peer-to-peer network sharing, and the founders spoke of it as a general "media distribution platform." The initial choice to focus on music, the founders said at the time, was because audio files are smaller than video files, not because of a dream of saving music.
In 2007, when Spotify first publicly tested its software, it allowed users to stream songs downloaded from The Pirate Bay, a service for unlicensed downloads. By late 2008, Spotify would convince music labels in Sweden to license music to the site, and unlicensed music was removed. From there, Spotify would take off across Europe and then the world."
https://qz.com/1683609/how-the-music-industry-shifted-from-n...
So in other words, it got big by providing free user traffic to people's websites without asking for compensation?
You generally don't charge the phone book money to include you in it. It's actually the other way around.
I’m opposed to copyright and pro-aaronsw, but the state did not kill him.
Weird framing given how much value was and is still placed on Google driving traffic to you
Basically the entire legal system needs to be retooled and rethought for computers.
That's how the internet works. If you want private content, you need to put up a gate mechanism of some sort with authentication or other methods of restricting access. Without that, you are literally having your server "serve" the content to whoever asks for it, without restriction or exception, without ToS or meaningful contract or agreements.
You can't have it both ways. "But they didn't know" or other post-hoc claims of innocent people publishing content to the web being misled or confused or abused is infantilizing nonsense.
The web wouldn't have been as amazing and revolutionary and liberating if the fundamental public and open nature of its systems was private and walled off by default.
Your take on YouTube going viral initially over copyrighted content isn't correct, either - it was ease of use and access. It was fairly popular by the time Google bought it, and once it was reachable and advertised by google itself, it exploded, because by that time, everyone had defaulted to using google for search.
Other people corrected your Spotify take.
The reason they pirated is because it is functionally impossible to gain access to the data in any other way. For consumers, there are lots of old shows, music, and other content that aren't accessible, so they turn to piracy. A vast majority of the time, if content is accessible, people will pay and do the technically legal and "right" thing.
Publishers exploit authors and content creators in the name of "platforming" and "marketing", effectively doing as little as possible to take 90%+ of the value of a product and providing as little as possible to the producer of content or books or music. They get by on technicalities and have captured the legal arena entirely, with any attempt at reform or revolution meeting a messy death at the hands of lawyers and big money publishers.
Screw those people. They lie, cheat, and steal, and somehow have gotten away with fooling the world into thinking they're the good guys.
Copying bits and bytes is not stealing, and the ones trying to shill that narrative are trying to fool as many people as possible into giving them more money without any return of value in kind. I'd download the hell out of a car. Pirate everything.
And in their face, with all the fierce ignorance, broligarchs deny, evade and totally pretend this never happened. The most non open company of all even went to lengths to accuse others of stealing their IP - not theirs to begin with.
Just think of it - why did all major content platforms close their APIs the day after GPT-2 got the word going…? Cause they knew all this very well - the content is precious and needed. They've been doing it all along. Distilling the essence of the world’s writing and digital imagery they had no right to.
We have a saying where I come from - no mercy for the chicken, no laws for the millions. I thought it was a local thing at first, but it turns out that's how the world goes. Nothing new under the sun, indeed.
Napster got shut down for widespread enabling of copyright infringement. So did numerous other filesharing startups, including Travis Kalanick's first startup, Scour. Lots of small startups get put out of business all the time for being sued and not having the money to defend themselves.
Likewise, individuals like Donald Trump or Elon Musk get away with all sorts of illegal shit, because they are big enough to shut down the court systems prosecuting them.
Google's genius was in staying under the radar and aligning their incentives with everyone that might dislike them, until they were big enough that they could simply crush anyone that might dislike them.
This is exactly what I immediately thought while reading the article. It almost feels like the legal system only punishes the general public, while most of these guys are above it.
Wrong.
a) Robots.txt, which defines what content you wish to make available to third parties, dates from 1994 and predates Google and most of today's search engines. Web site owners chose to make their content available to Google, and search engines have respected their wishes despite it not being in their best interest.
b) The difference here is that OpenAI, Meta etc have not even tried to honour the wishes of copyright holders. They just considered everything as theirs.
c) Google grew big because it had no ads, a fast interface, and PageRank, which was significantly better. It wasn't because it had the most comprehensive index.
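For anyone who hasn't looked at how the robots.txt opt-out in (a) works in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the bot names (`ExampleSearchBot`, `ExampleTrainingBot`) and the URLs are all made up for illustration; this shows only what a well-behaved crawler is supposed to check before fetching, not what any particular company actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot names and paths are invented for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /private/

User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the hypothetical robots.txt above lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/1"
    for agent in ("ExampleSearchBot", "ExampleTrainingBot", "SomeOtherBot"):
        print(agent, allowed(agent, article))
```

Note that nothing here is enforced by the server: the file is a published wish, and honouring it is entirely up to the crawler, which is the whole point of (b).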
What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.
Suckers. All of us.
No it isn't. The actual sucker attitude is copying what they do. You should act morally and with integrity out of respect for yourself. I never had any illusions that large tech companies act with respect towards the law, but it also has nothing to do with me.
Not quite. It's only illegal if you get caught and you are the wrong kind of person.
For the right kind of person not even a pat on the wrist.
Like when Trump said he is “smart” for evading taxes during the presidential debates (IIRC the first ones, not recent ones).
It’s absolutely despicable. Have a moral compass. Treat people fairly. Be nice. Let’s be better than toddlers who haven’t learned yet that hitting is bad, and you shouldn’t do it even if mommy and daddy aren’t in the room.
> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.
Following that reference:
> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)
Furthermore, they state they trained on GitHub, web pages, and ArXiv, which all contain copyrighted content.
Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material. That it was trained on copyrighted material is certain.
[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971
[Gao et al., 2020] https://arxiv.org/pdf/2101.00027
1.) Training on copyrighted work that is publicly available. You write a poem and publish it online for the world to read. That is your IP; no one else can take it and sell it, but they are free to read and be inspired by it. The legality of training on this is in the courts, but so far seems to be going in favor of LLMs.
2.) Training on copyright that is not publicly available. These are pretty much pirated works or works obtained by backdoor to avoid paying for them. Your poem is behind a paywall and you never got paid, yet the poem is known by the LLM. This is just straight illegal, as you legally must pay to view the work. However there might be conditions here too like paying for access to an archive and then training on everything in it.
Is it truly a violation of copyright when a user hacks out bits and pieces of easily restyled raw data points from a model to look samey? what about if it takes two models? Might be time to accept humans are just cooked in their ability to discern attempts at direct plagiarism - just as it is hard to discern Sky voice from Her voice.
In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.
The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But I have done what I could to form a correct estimate of their value. I have read translations of the most celebrated Arabic and Sanskrit works. I have conversed both here and at home with men distinguished by their proficiency in the Eastern tongues. I am quite ready to take the Oriental learning at the valuation of the Orientalists themselves. I have never found one among them who could deny that a single shelf of a good European library was worth the whole native literature of India and Arabia."
This chap will educate us on copyright?
No thanks!
He was able to sell it because it is something valuable, exactly because of the copyright protections. Regardless of whether the author sells the rights or not, he and his family would equally be better off with copyright.
When Metallica sued Napster, for many people the reaction was, "wait I can download music for free?"
LibGen gives you access to a much smaller body of works than either of those. It’s a little more convenient. But the big difference is that it doesn’t compensate the author at all.
Just go to a real library.
https://en.wikipedia.org/wiki/Aaron_Swartz#United_States_v._...
https://en.wikipedia.org/wiki/Aaron_Swartz#Death
While Aaron Swartz was bullied to suicide, these corporations will walk free and make billions. I say give every tech CEO the Swartz treatment, then change the law.
Big corporations are too big, they should just not exist. When you have corporations more powerful than the government of the biggest states, it's a bug, not a feature.
The IP laws may need rethinking. Saying that they should disappear because big corporations are above the law doesn't help, though. First kill the big corporations, then think about fair laws. Changing the law now would not change anything since those corporations are already above the law.
That said, I want them to burn for the right reasons.
Downloading data that should be available to the public is not one of them.
Also, change the law so this is legal for poor meta? smh..
That means lawsuits, prison sentences, and millions in fines. And that's just the piracy part, there's also the lying/fraud part.
Interestingly, a Dutch LLM project was sent a cease and desist after the local copyright lobby caught wind of it being trained on a bunch of pirated eBooks. The case unfortunately wasn't fought out in court, because I would be very interested to see if this could make that copyright lobby take down ChatGPT and the other AI companies for doing the same.
You mean Electronic Frontier Foundation? https://www.eff.org/issues/innovation
It's incredibly rare to find people who hold ideals that are detrimental to their own life.
Flippant response, I know, but too many people worship at the altar of the job creator and believe these folks are moral, upstanding citizens.
Could make interesting case law.
Yeah, to perpetuate this system where only those who can afford lawyers get to benefit
What I mean is: when someone is prosecuted for copyright infringement, but Meta isn't, then could the case be put on hold until Meta is found guilty and pays a fine?
Also maybe the fine on the later case would have to be proportional to the prior case. So if Meta pays $1 per infringement, the penalty might be $1 for torrenting something else (which is immaterial and not worth the justice system's time) so pretty much all copyright infringement cases would get thrown out.
It reminds me of how mainstream drug addicts get convicted and spend years in prison, while celebrities get off with a warning or monetary fine.
It's a fundamental part of lawyer training, and if they want to let BigCorp go and bring the hammer down on the little guy, they can make up a hundred reasons for it.
Take for example the 675k paid for 31 songs. So roughly 20k a song. If we estimate a book to be, say, 10 MB, that would be about 8 million works. So I think reasonable compensation is something along the lines of 163 billion. Not even 10 years of net income. Which I think is an entirely fair punishment.
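For anyone who wants to check that back-of-the-envelope math, here is a rough sketch. The only inputs are the figures from the comment above (675k, 31 songs, 20k per work, 10 MB per book, 163 billion); the function names are made up, and the dataset size is just what those figures imply, not a number stated in the comment.

```python
# Figures quoted in the comment above.
JUDGMENT_USD = 675_000          # total damages in the cited music case
SONGS = 31                      # songs covered by that judgment
ROUNDED_PER_WORK_USD = 20_000   # the comment's rounded per-work figure
BOOK_SIZE_MB = 10               # the comment's guess for one book
CLAIMED_TOTAL_USD = 163_000_000_000


def per_song_damages() -> float:
    """Unrounded damages per song in the cited case."""
    return JUDGMENT_USD / SONGS


def implied_works(total_usd: float, per_work_usd: float) -> float:
    """Number of works needed to reach total_usd at per_work_usd each."""
    return total_usd / per_work_usd


def implied_dataset_tb(works: float, book_size_mb: float = BOOK_SIZE_MB) -> float:
    """Dataset size in decimal terabytes (1 TB = 1,000,000 MB)."""
    return works * book_size_mb / 1_000_000


if __name__ == "__main__":
    works = implied_works(CLAIMED_TOTAL_USD, ROUNDED_PER_WORK_USD)
    print(f"per song:        ${per_song_damages():,.0f}")
    print(f"implied works:   {works:,.0f}")
    print(f"implied dataset: {implied_dataset_tb(works):.1f} TB")
    print(f"total at a flat 8 million works: ${8_000_000 * ROUNDED_PER_WORK_USD:,}")
```

So the 163 billion only works out if "8 million works" is rounded down from about 8.15 million; with a flat 8 million at 20k each you land on 160 billion, and with the unrounded per-song figure the total goes up rather than down.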
The only ethical problem here is that only Meta sized companies can afford to pay the "damages" for such blatant law violations at worst, or the fees of their lawyers at best.
Companies like Meta and OpenAI, however, should definitely have to pay to use the hard work of humans to train their AI.
They will be getting a lot of Frommer Legal letters...
Whether training an AI model on an array of different works, many of which are copyright protected, is itself a copyright violation, in addition to or distinct from any copyright violation that goes on in gathering the dataset for training (and separate from any copyright violation in the actual or intended use of the LLM), remains to be resolved as a legal question, and may or may not have a simple yes or no answer (or the same answer under every system of copyright laws globally).
My inclination is that it is probably generally not a violation in US law, but that's not something I am very confident in; how the definitions of copy and derivative work apply to determine if it would be without fair use, and how fair use analysis applies, are not clear from the available precedent.
> But legally, how does using a book to train a LLM differ from a teacher learning from a book and teaching its contents to their pupils.
It is very clear, by looking at how US copyright law is written and even more clear in its history of application, that information stored in people's brains is, without exception, neither a copy nor a new work that can be a derivative work under US law, and so cannot be infringing, no matter how it was gained. It’s also very clear in the statute itself and the case law that data in media used by artificial digital computers, on the other hand, can constitute copies or derivative works that can be infringing. Even if the process is arguably similar in legally relevant ways, copyright law is critically focused on the result and whether it is a particular kind of thing which can be infringing, not just the process.
I truly hope that whoever takes the case goes after Meta with 1000 times the pressure that was put on Swartz, but honestly I don't expect much, just as the top comment precisely expressed.
And if we are going to be fair, let's also not forget about the other usual suspects, or does anyone think they are falling behind?
Several EU countries, Switzerland, South Korea, Japan, etc. are viable countries to sue from. Even in Japan, which has a law specifically permitting training on copyrighted material, you must still obtain it legally, i.e. you must license it.
Horse has functionally bolted on this already
I’m guessing slap on wrist despite courts going after individual for a couple of movies torrented pretty hard
The rules have always seemed different for corporations regardless.
https://www.businessinsider.com/trump-settles-lawsuit-meta-m...
So, barring further Might Makes Right shit--which I'm not willing to fully rule out--Trump can't fully shield Zuckerberg et al.
I'm pretty sure you can theoretically download torrents without seeding, although this is frowned upon. If they really seeded (with full bandwidth?) that's indeed pretty brazen.
It is sort of strange that Meta is being singled out here though, and sort of sad considering they at least release the model weights. What's the signal? Do illegal shit to be competitive, but make sure there is no evidence?
I'm also ok with abolishing copyright altogether if he's too untouchable
The alternative is a futile legalistic attack against a monopoly entity too powerful to be meaningfully punished. That won't accomplish anything useful. It would, rather, help cement this status quo, where copyright infringement is selectively legal or illegal, for different entities at the same time; and companies like Meta thrive arbitraging that difference. You can't defeat Meta—but you can help dig them a moat.
> Level the playing field, incrementally, for everyone else who isn't a trillion-dollar corporation.
There is no level playing field when you have individuals and trillion-dollar companies in the same market.
- Ice Cube.
Meta will face no consequences. Say you're a small publisher and you'd like a bit of compensation. If you dare sue, Meta can just blacklist your books on its platforms. Even if they don't, you probably don't have the money to sue one of the biggest companies on earth.
I think copyrights should be limited to 25 years after first publication. This would fix plenty of issues and give the AIs of the world plenty to learn from.
Who am I kidding, Meta will take what they will. For that author making 20k a year, be honored to be of use to Meta.
but the masses are addicted to the slop that meta feeds them.
We will know why OpenAI isn't getting investigated.
Property is based on scarcity - if you take my car, I no longer have a car. But if you copy my book, I still have my book. No loss, no theft, just an outdated legal fiction designed to stifle innovation and enrich rent-seeking middlemen. And no, loss of potential sales doesn't count - it's like being able to claim a lottery ticket has real value.
Copyright was never about protecting creators—it’s about locking down ideas, preventing competition, and extracting endless fees. Shakespeare borrowed, tech companies iterate, and science thrives on free exchange. The idea that knowledge should be locked away indefinitely is absurd.
Meta’s mistake wasn’t using the data - it was pretending copyright still matters. AI is exposing the system for what it is: obsolete. The future belongs to those who create without asking permission.
https://www.engadget.com/2015-12-21-peter-sunde-kopimashin.h...
It's obviously absurd to enforce copyright as bytes are copied around instead of as it is used. Training an LLM is a different thing than re-hosting and giving away copies to other people.
If you don't want people to transform your works - keep them private. You don't own ideas.
From the article: Kopimashin, as in Copy Machine.
1) the concept of copyright is as old as the word suggests (copies are the least of our worries going forward - it should be possible to define processes for exploitation of ideas in a fair way)
2) we allow humans to learn from other people's ideas and transform them to commercial products and the same should happen for AIs in the future
3) we have an ill-defined concept of "personally identifying information" which gives people ownership to information that others have created via their own means - there should be better ways to ensure a level of privacy (but not absolute privacy) without overly-broad, nonsensical definitions of what is personally protected information
4) We allow social media and other telecommunications media to arbitrarily censor people's speech without recourse. This turns people's speech to property of the social media companies and imposes absolute power on it. This makes zero sense and is abusive towards the public at large. We need legal protections of speech in all media, not just state-owned media.
What information about me could a corporation create via its own means that would be legally protected but shouldn't be? PII is generally information that a corporation collects. Unless you mean that my cellphone provider creates the association between my name and phone number and should therefore be able to do with it as they please?
If you get a direct quote then you're good with your claim, surely.
Whatever the ruling one thing is for sure, plagiarism is no longer the sincerest form of flattery. The human authors are out for AI blood on this.
They need to make datasets which don’t have this problem or have entities in Singapore train the foundation models within their rules. The latter has a TDM exemption that would let AI’s use much of the Internet, maybe GPL code, licensed/purchased works they digitize, etc. Very flexible.
(imo not in accordance with the Constitution, after absurdities like deciding “limited time” the way mathematicians might define something of some order of infinity)
the alleged social contract is not functional the way it was intended, and we see who benefits and who loses.
mass dynamic editing for vitriol and profanity occurred while writing this comment in order to remain within site rules
Meta does a lot of stuff I disagree with, but they're usually not just straight breaking the law.
They've thrown away a huge amount of communication to source code commit reinforcement training data as a result. They do it to avoid emails making it into trials like this.
If I were younger, I would be livid.
Zuckerberg has paid the vig several times [0,1,2], which is evidently the best legal strategy under this administration. OFC, considering there are already multiple payments, there is no assurance the vig payments won't substantially increase as the Capo sees more opportunity for profit.
[0] https://en.wikipedia.org/wiki/Vigorish
[1] https://www.politico.com/news/2025/01/29/meta-settles-trump-...
Meta, with its "open weights" models, is one of the least guilty parties, since at least they've made the resulting blobs of mass piracy available to us. Same with Mistral, Deepseek, etc.
ClosedAI, Google, and others have all probably done this and more and refuse to make even the model available.
I think the way to deal with this is very simple:
If you have trained your model on works to which you do not have rights or permission, the resulting model is not copyrightable and cannot be sold. It must either be kept for research purposes only or released free of charge and in the public domain. All these models that have been trained on pirated works should become public domain.
Of course now that we have full capture of the US Federal Government I'm sure any suggestion like that would be neutralized with one bribe to Trump.
But we live in this stupid society where you have to move mountains to change things an inch.
I'm going to assume as it's a corporation, then the laws no longer apply.
The fact that most of the world embraced hardcore copyright troll ludditism when the means of their (badly paying creative) jobs economic production was democratized implies that most people do not believe in any "egalitarianism" and especially not the left-wing form many profess to believe in. Certainly not "information wants to be free" or any of the other idealist shit that I or Aaron Swartz believed in. What meta did was software communism - full stop. They literally released their models to the public! I support all of this 10000%. The only issue is that they're not open enough (fully open source the dataset)
So, unironically, good! Thank you, please pirate more! Please destroy the US IP system while you're at it. Copyright abolitionism is good and thank you Zuckerberg!
Rules are just for us peasants.
After OpenAI trained their models on the famed books2 dataset, and seeing the technological implications of ChatGPT, there was a good chance they would let them get away with it.
Would the USA really surrender its AI technological advantage for trivial matters like copyright? They would make some royalty arrangement and get it over with
so it's quite funny to see they freely share it too.
It's so funny to see the law blatantly ignored by the overlords. Like, there isn't even a pretext anymore. They just steal what they want and budget for the fines and campaign donations to make the consequences go away.
Same for all the other sleazy tech bros.
We are trying to advance civilization here. To accumulate and make available all human knowledge to date. And you stand there with your hand out to stop this? You are a villain. There is no sympathy for you.
Nothing in my life ever made me want to go back, except when I got back into playing hockey a few months ago, and all the hockey leagues use Facebook to communicate.
I made a new account, had to literally upload a picture of my face to pass verification, and then a few days later I was banned and couldn't use my account. I assume it's because they searched previous data and compared my face to find out I have a "deleted" (lol) account and matched me. I assume they'll only let me log in if I use the original account I deleted 10 years ago.
Fuck meta. Fuck zuck.
a) Financed via inflation/"Cantillon effect" due to ZIRP/stimulus that absolutely flooded the market with funny money in the hands of the sharks. b) Trained upon copyrighted work without compensation. c) Trained upon open source without even asking politely for authorization.
The Robber Barons from the last century can't even get close to our modern Feudal Tech Lords.
Unless you're one of those of us who have amassed multi-generational wealth in an exit in the last 20 years, you're completely fucked.