I assumed we were talking about logistics, not tech. I'm sure it will be technically possible to use less training data over time (DeepSeek is more or less demonstrating that in real time). Maybe there's some copyrighted data in there, but I'd be surprised if it used anything close to 80 TB like competitors.
I know hindsight is 20/20, but I always felt the earlier approaches were absurdly brute-forced.
>I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet"
There isn't. So they'd need to do it the old-fashioned way, with agreements. Or build some incentive model where media outlets submit their works with the understanding that they'll be used for training. Or any number of other marketing ideas.
I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses and building up that precedent (some of which was covered under fair use).
>and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
And now they're being slowed down, if not litigated out of the market. Public trust in AI is falling. The lack of oversight into hallucinations may have even cost a few lives. Content creators now need to take extra precautions so they aren't stolen from, because the scrapers don't even bother trying to respect robots.txt. A few posts here on HN note how the scraping is so rampant that it can spike hosting costs on their websites (so now we need more captchas. And I hate myself for uttering such a sentence).
Was all that velocity worth it? Who benefited from this outside of a few billionaires? We can't even say we beat China on this.
>I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all
Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.
We both know it's instead a rat race, and the goal isn't survival and passing on knowledge (and genes) to the next generation. AI could have evolved organically, but it instead devolved into a thieves' den.
I take an approach more like Bell's spaceship paradox: if they had started gathering data ethically, by the time they'd gathered a decent chunk they probably would have already optimized a model that needs less data. It'd be slower, but not actually much slower in the long run. But they aren't exactly going for quality here.
>I believe lawsuits launched by or fuss kicked up by model developers will typically be on a contract basis (i.e "you agreed to our ToS then broke it") rather than a copyright basis.
I suppose we'll see. Too early to tell. This lawsuit will definitely set precedent for other ongoing cases, but others may shift to a copyright infringement argument anyway. Unlike with other LLMs, there was some human tailoring going on here, so it's not fully comparable to something like the NYT case.