Of course, some of those books will be AI-generated or of garbage quality, and we all know many journal articles can be worth less than the paper they're printed on.
Yet even if we cut it down to 100,000 books and half a million scientific papers, that's a lot of training data each year... And that's just print media; there are other ways to get more content too.
For example, there's transcription of video, podcasts, TV shows, movies, and so on, along with scene descriptions for video, which could be used to generate a lot more material.
With people speaking to their devices and using speech-to-text more often, that's another source too. I wouldn't be surprised if some devices just start recording conversations and transcribing them.
Seems like a ton of potential data sources to me. It will certainly get harder to cull AI-generated content to prevent feedback loops, but I'm sure the tooling will evolve to make detecting and excluding AI content easier.
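To make the culling idea concrete, here's a minimal sketch of one kind of heuristic such tooling might use: scoring documents by how repetitive their word trigrams are, since repetitive filler is one common failure mode of machine-generated text. The function names and the threshold are hypothetical, and a real pipeline would combine many signals (trained classifiers, perplexity scores, provenance metadata) rather than a single heuristic like this.

```python
# Hypothetical sketch: filter a corpus by trigram repetition.
# Names and thresholds are illustrative, not a real tool's API.
from collections import Counter


def repeated_trigram_ratio(text: str) -> float:
    """Fraction of word trigrams that occur more than once.

    Highly repetitive text scores near 1.0; varied prose scores near 0.0.
    """
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(trigrams)


def keep_document(text: str, max_ratio: float = 0.3) -> bool:
    """Keep a document only if its repetition score is below the cutoff."""
    return repeated_trigram_ratio(text) < max_ratio
```

On its own this would only catch the crudest machine-generated filler, but it shows the general shape: compute a cheap quality signal per document, then threshold it before the text ever reaches the training set.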