This is not even remotely true.
There is an astronomical amount of data siloed by publishers, professional journals, etc. that has yet to be tapped.
OpenAI is making inroads by making deals with these content owners for access to all that juicy data.
You seem to think these models haven't already been trained on pirated versions of this content, for some reason.
That’s not even considering AI crawlers or all the copyrighted text on archive.org.