No snark intended; I’m seriously asking. If the answer is “no” then where do you draw the line?
A few factors that come to mind would be:
- scale
- informed consent, of which there was none in this case
- how you are going to use that data. For example, using everybody else's work so the world's richest company can make more money from it while giving back nothing in return is a bullshit move.
So here's the question:
Does a person reading a comment destroy the incentive for the author to post it? No. In fact, it is the only thing that produces the incentive for someone to post. People post here when they want that thing to be read by someone else.
Does a model sucking up all the artistic output of the last 400 years and using that to produce an image generator model destroy the incentive of producing and sharing said artistic output? Yes. At least, that is the goal of such a model -- to become so good it is competitive with human artists.
Of course you have plenty of people positioned to benefit from this incentive-destruction claiming it does no such thing. I personally tend to put more credence in the words of people who have historically actually been incentivized by said incentives (i.e. artists), who generally seem to perceive this as destructive to their desire to create and share their work.
Scale: Many companies (e.g. Google, Bing) have been scraping at scale for decades without issue. Why does scale become an issue when an LLM is thrown into the mix?
Informed consent: I’m not sure I fully understand this point, but I’d say most people posting content on the public internet are generally aware that people and bots might view it. I guess you think it’s different when the data is used for an LLM? But why?
Data usage: Same question as above.
I just don’t see how ingestion into an LLM is fundamentally different than the existing scraping processes that the internet is built on.
Not something big, not something you can enforce, but you'd feel very annoyed that I'm making good money on something you wrote while you get nothing. I think?
If a human reads it that would be a reproduction of the work, but if you serve that page as a cache to a human you're okay, usually.
If you compile all that information in a database and use it to answer search queries that's also okay, and nothing forbids you from using machine learning on that data to better answer those search queries.
Both of the above are actually being challenged right now but for the time being they're fine.
But that database is a derivative work, in that it contains copyrighted material and so how you use it matters if you want to avoid infringement — for example a Google employee SSHing to a server to read NYT articles isn't kosher.
What isn't clear is whether the model is a derivative work. Does it contain the information, or is it new information created from the training data? Sure, if you're clever you could probably encode information in the weights and use it as a fancy zip file, but that's a matter of intent. If you use Rewind or Windows Recall and it captures a screenshot of a NYT article and then displays it back to you later, is that a reproduction? Surely not. And that's an autonomous system that stores copyrighted data and regurgitates it verbatim.
So if it's impractical to actually use it for piracy and it very obviously isn't anyone's intent for it to be used as such then I think it's hard to argue it shouldn't be allowed, even on data that was acquired through back channels.
But copyright is more political than logical so who knows what the legal landscape will be in 5 years, especially when AI companies have every incentive to use their lawyers to pull the ladder up behind them.
AI is a unique third case in which we have billions of creators and no idea who contributed what parts of the model or any specific outputs. So we can't pay in exposure, aside from a brutally long list of unwilling data subjects that will never be read by anyone. Some of the training data is being regurgitated unmodified and needs to be attributed in full, some of it is just informing a general understanding of grammar and is probably being used under fair use, and yet more might not even wind up having any appreciable effect on the model weights.
None of this matters because nobody actually agreed to be paid in exposure, nor was it ever any AI company's intent - including Apple's - to pay in exposure. Data is free purely because it would be extraordinarily inconvenient if anyone in this space had to pay.
And, for the record, this applies far wider than just image or text generators. Apple is almost surely not the worst offender in the space. For example: all that facial recognition tech your local law enforcement uses? That was trained on your Facebook photos.
And if you run a website and want to opt-out then simply add a robots.txt.
The standard way of preventing bots for 30 years.
Also the AppleBot was known about before it appeared in Siri.
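For anyone who wants to see what that opt-out looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The `Applebot` user-agent token and the `example.com` URL are assumptions for illustration, so check the crawler's own documentation for the exact token it honors. Keep in mind that robots.txt is a voluntary convention: it only works for crawlers that choose to respect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
# "Applebot" is an assumed token; verify it against the crawler's docs.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post/1"  # placeholder URL
    print("Applebot:", is_allowed(ROBOTS_TXT, "Applebot", page))
    print("SomeOtherBot:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

The same check can be pointed at a live site by calling `set_url()` and `read()` on the parser instead of `parse()`.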