slidehero 1y ago

>We are basically out of new, non-synthetic text to train models

This is not even remotely true.

There is an astronomical amount of data siloed by publishers, professional journals etc. that is yet to be tapped.

OpenAI is making inroads by making deals with these content owners for access to all that juicy data.

Even assuming there is a ton of data that companies are only now getting access to, the logarithmic curve of LLM improvements is clearly visible (granted, our LLM evaluation frameworks are not very good).

staticman2 1y ago

>There is an astronomical amount of data siloed by publishers, professional journals etc. that is yet to be tapped.

You seem to think these models haven't already been trained on pirated versions of this content, for some reason.

dartos 1y ago

Yep, Books3 is what LLaMA was famously trained on, before the dataset was taken down.

That’s not even considering AI crawlers or all the copyrighted text on archive.org.
