Of course, some of those books will be AI-generated or of garbage quality, and we all know many journal articles can be worth less than the paper they're printed on.
Yet even if we cut it down to 100,000 books and half a million scientific papers, that's a lot of training data each year... And that's just print media; there are other ways to get more content too.
For example, there's transcription of video, podcasts, TV shows, movies, and so on, along with scene descriptions for video, which could be used to generate a lot more material.
With people speaking to their devices and using speech-to-text more often, that's another source too. I wouldn't be surprised if some devices just start recording conversations and transcribing them.
Seems like a ton of potential data sources to me. It will certainly get harder to cull AI-generated content to prevent feedback loops, but I'm sure the tooling will evolve to make detecting and excluding AI content easier.
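To make the culling idea concrete, here's a minimal sketch of one kind of heuristic such tooling might use: scoring documents by how repetitive their word trigrams are, since repetitive filler is one common failure mode of machine-generated text. The function names and the threshold are hypothetical, and a real pipeline would combine many signals (trained classifiers, perplexity scores, provenance metadata) rather than a single heuristic like this.

```python
# Hypothetical sketch: filter a corpus by trigram repetition.
# Names and thresholds are illustrative, not a real tool's API.
from collections import Counter


def repeated_trigram_ratio(text: str) -> float:
    """Fraction of word trigrams that occur more than once.

    Highly repetitive text scores near 1.0; varied prose scores near 0.0.
    """
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(trigrams)


def keep_document(text: str, max_ratio: float = 0.3) -> bool:
    """Keep a document only if its repetition score is below the cutoff."""
    return repeated_trigram_ratio(text) < max_ratio
```

On its own this would only catch the crudest machine-generated filler, but it shows the general shape: compute a cheap quality signal per document, then threshold it before the text ever reaches the training set.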