A similar instance that bugs me is on the documentation page for their GPTBot scraper (https://platform.openai.com/docs/gptbot), where they say "Allowing GPTBot to access your site can help AI models become more accurate". Strange wording, given that it is specifically OpenAI's models you're allowing, not "AI models" in general.
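For context, OpenAI's documentation says GPTBot is controlled the usual way, through robots.txt. A minimal sketch of opting a site out (the `GPTBot` user-agent token is from their docs; the paths are placeholders):

```robots.txt
# Block OpenAI's GPTBot crawler from the entire site
User-agent: GPTBot
Disallow: /
```

This only works to the extent the crawler honors robots.txt, which is voluntary; it is an opt-out, not an enforcement mechanism.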
The goal in both cases is to make you feel like you're standing in the way of progress by objecting.
OpenAI is actively receiving money from funders and will (potentially, maybe, eventually) make money by using others' copyrighted content at a much larger scale than anything the Internet Archive was doing.
OpenAI should not have permission to soullessly suck up copyrighted material and use it to make money.
On the other hand, countries that don't place ethical/moral/fiscal priority on creating and protecting copyrighted works will eat the West's lunch when it comes to AI, as there's no limitation preventing them from consuming the content.
Not sure what the answer is - maybe copyright is an archaic idea/belief built and maintained by a once well-intended, now corrupted economic system that needs a bit of a shakeup anyways...
> "Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today's leading AI models without using copyrighted materials," the company wrote in the evidence filing. "Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
> OpenAI went on to insist in the document, submitted before the House of Lords' communications and digital committee, that it complies with copyright laws and that the company believes "legally copyright law does not forbid training."
Why not just license them like everyone else?
> but would not provide AI systems that meet the needs of today’s citizens.
Needs is doing a lot of work here.
They need a new market. This is precisely the kind of AI system I'd love to use.
They are arguing that the current copyright laws do not forbid training. And they are arguing that they need to train on copyrighted data in order to be able to make an effective tool (and make money).
That second part of the argument is there because, so far as I know, nobody has ruled (in any country) on the legality of using copyrighted material as training for LLMs that will then produce commercially-available output. So the first part is a claim, but it's not a ruled-upon claim. It's not a claim that OpenAI can count on a court agreeing with. So they add the second argument, which amounts to "please interpret copyright law that way, and if the courts don't, please change copyright law that way, or else we can't sell what we make (and therefore can't make any money)".
I take no position on the first claim. All I'm saying is that the appropriate response to the second claim is, "So what? The world doesn't owe you a living."
I know that copyright covers blog posts and generally every immaterial creation published by humans that is reproducible and above a fuzzily defined threshold of "original creativity".
The other day, I was downvoted here for criticizing the often-cited "freeware" claim put out by MS.
The argument was: copyright already covers all this, I must lack knowledge about copyright law.
Now, the argument seems to have shifted to: copyright law doesn't apply the way it used to?
At this point, I think as a society we need to just say copyright as a concept and law has completely failed and scrap the whole thing.
The 0.01% of powerful copyright-cartel publishers get rich while harming 99.99% of people, because we've seen further erosion of fair use rights, absurdly lengthy extensions of copyright terms to prop up Disney's profits, expansive interpretations of how much control copyright holders have, and zero punishment for abuse of the DMCA, among other things.
Students should be able to learn from books, music, film. So should AI training models.
If there is any ambiguity about this, we should immediately write laws making it clear that training and education of all forms is explicitly allowed under fair use. Ideally, we also send anyone trying to prevent this to the guillotines.
I think it should be legal to train a model on anything that is legal to scrape (which is almost everything).
Then, if someone uses a generative AI output in a way that infringes someone's existing IP, go after the person trying to monetize that output, whether it's software, an image, or writing.
The thing is, if you limit what these things can be trained on, it creates a huge power imbalance. The wealthy and nation states are still going to scrape everything under the sun and train AIs with that data along with whatever else their surveillance has gathered. If businesses are neutered from being able to do the same, we all lose.
> Students should be able to learn from books, music, film. So should AI training models.
An AI model is a thing. It is owned and fully controlled by some agent. A student is a sentient, thinking being. Both can be trained, but only one can be educated. Treating the two as comparable is misleading and, in my view, wrong.
Then it shouldn't exist. Bloody profiteers.