To the contrary, this just means companies can't make money from these models.
Those using models for research and personal use wouldn't be infringing under the fair use tests.
Maybe the strategy is something like this:
1) Survive long enough/get enough users that killing the generative AI industry is politically infeasible.
2) Negotiate a compromise similar to the compulsory mechanical-royalty system used in the music business to “compensate” the rights holders whose content is used to train the models.
The biggest AI companies could even run the enforcement cartels a la BMI/ASCAP to compute and collect royalties owed.
If you take this to its logical conclusion, the AI companies wouldn’t have to pre-license anything; they would just pay out all the royalties to the biggest rights holders (more or less what happens in the music industry), on the basis that figuring out what IP went into what model output is just too hard, so instead they agree to distribute it to whomever is on the New York Times best-seller list at any given moment.
the long tail exists, and there will always be a threshold for payments due to rights holders.
it used to be (like 10 years ago, so i might not remember the details exactly) that if you earned less than £1 in youtube performing rights in a quarter, any money you earned was put back into the pot and redistributed to those earning over £1.
it just wasn’t worth the cost to keep track of £0.00001 earnings for all the rights holders at the bottom of the long tail each quarter, or to pay the bank fees when they eventually earn £0.01 that can be paid to them.
definitely not perfect, but at least some people were getting paid, instead of none.
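the mechanics were roughly this (a sketch only — the threshold, figures, and rights-holder names here are all made up, not real data):

```python
# Sketch of a minimum-payout threshold with redistribution of the
# sub-threshold pot, as described above. All figures are hypothetical.
THRESHOLD = 1.00  # minimum payout per quarter, in pounds

quarterly_earnings = {
    "big_label": 1250.40,
    "mid_indie": 3.72,
    "long_tail_artist": 0.00001,
    "another_tiny": 0.004,
}

# Split accounts into payable and below-threshold.
payable = {k: v for k, v in quarterly_earnings.items() if v >= THRESHOLD}
pot = sum(v for v in quarterly_earnings.values() if v < THRESHOLD)

# Redistribute the sub-threshold pot pro-rata among those above the line,
# so total money out still equals total money in.
total_payable = sum(payable.values())
payouts = {k: v + pot * (v / total_payable) for k, v in payable.items()}
```

the long-tail accounts get nothing that quarter, but the pot isn’t kept by anyone — it just tops up everyone over the line.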
also, the data youtube gave us was fairly shit (video title, url), so that didn’t help. nor did the lack of compute/data processing infrastructure/skills. it was historically a manual spreadsheet job trying to work out who to cut.
i had to do it a few times :/
edit —
> The biggest AI companies could even run the enforcement cartels ala BMI/ASCAP to compute and collect royalties owed.
what could happen, for music at least, is the same thing that happened with youtube, mashed up with live music analogies.
a licensing negotiation with BMI/ASCAP/PRS, and maybe major publishers directly if they get frustrated with the PROs. then PROs will use sampling of other revenue streams to work out what the likely popular things are for AI. then divvy up whatever the lump sum is between the most popular songs.
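the divvy-up itself is simple pro-rata arithmetic over the sampled popularity data — a sketch, with invented song names and numbers:

```python
# Sketch: split a negotiated lump sum pro-rata by sampled popularity,
# as PROs do for live/broadcast revenue. All values are hypothetical.
lump_sum = 10_000_000.00  # negotiated licence fee

# Popularity estimated by sampling other revenue streams
# (radio, streaming, setlists), not by tracking actual AI usage.
sampled_plays = {
    "hit_song_a": 90_000,
    "hit_song_b": 8_000,
    "album_track_c": 2_000,
}

total_plays = sum(sampled_plays.values())
royalties = {
    song: lump_sum * plays / total_plays
    for song, plays in sampled_plays.items()
}
```

the obvious flaw is the one mentioned above: whatever is popular in the sampled streams hoovers up the money, whether or not it was what the AI actually used.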
we used to do this for live music. i had to generate the sampled dataset in microsoft access each year and weed out all the radio stings.
sorry for costing you a million pounds that one year ed sheeran :/
Check out this one cool trick companies found for skirting copyright restrictions.
Lawyers HATE them!
They don't need every copyrighted work, and getting a fraction is entirely practical. They would go to some large conglomerate like Getty Images, or to large publishers, or to social media sites whose terms give the site a license to what you post; the middlemen would get a vig and the original authors would get peanuts, if anything at all.
But in aggregate it would price out the little guy from creating a competing model, because each creator getting $3 is nothing to the creator but is real money to a small entity when there are a billion creators.
What is needed instead (I doubt politicians read HN, but someone go and tell them) is a new law that regulates the training of these models, if we want them to exist and be used in a legally safe way. This is needed because, for example, most jurisdictions have copyright laws that differ from one another, while software travels globally.
It would make sense to make all books available for non-commercial, perhaps even commercial, R&D in AI, if society elected that to be beneficial, in the same way that publishers must deposit one copy of each new work in a copyright library (the Library of Congress in the US; the Oxford and Cambridge university libraries and the British Library in the UK; the Frankfurt and Leipzig Nationalbibliotheken in Germany; etc.). Just add a provision that they also need to send a plain-text copy to the Linguistic Data Consortium (LDC), which manages datasets for NLP. As with fair use, there can be provisions to compensate for that use that operate automatically in the background (in some countries the price of a photocopying machine includes a fee that gets passed on to copyright holders).
Otherwise you'll have one LLM being legal in one country but illegal in another because more than 15% of one book was in the training data, and other messy situations.
Oh no. Anyway.