That's where I'd draw the line for it to be suspect, IMO.
And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.
Obviously licensing needs to be respected, and that shouldn’t be a hard problem to solve. But 99.9% of code isn’t some unique algorithm, it’s gluing libraries together and setting up basic structures.
Most of the examples I’ve seen line up with the reality of code completion tools. Code is rarely valuable when broken up into its small parts.
Even copying a full codebase is rarely enough to draw value from… there’s way more to a software business than the raw code. But that’s a different problem.
You just described open source software.
That's the whole heart of this lawsuit, and equally of Copilot. It was trained on OSS which is explicitly licensed for free use.
> It was trained on OSS which is explicitly licensed for free use.
That's not what the lawsuit is about. It's not about money, it's about licensing. OSS licenses have specific requirements and restrictions for using them, and Copilot explicitly ignores those requirements, thus violating the license agreement.
The GPL, for example, requires you to release your own source code under the GPL if you distribute a product that incorporates GPL-licensed code. If you don't do that, you're committing copyright infringement, since you're copying someone's work without permission.
The same cannot be said for Copilot: there have been prior examples here on HN showing that it can emit large chunks of copyrighted code (without the license).
Most open-source software is not licensed for free use. MIT and GPL, the two most common licenses, both require attribution.
It's a bit like how GPT-3, Stable Diffusion and all those generative models use extensive amounts of copyrighted material in training to get as good as they do.
In those cases however the output space is so vast that plagiarism is very unlikely.
With code, not so much.
https://hyperallergic.com/766241/hes-bigger-than-picasso-on-...
The interesting thing is that the names get explicitly attached to these styles. It isn't exactly a copyright issue, but I'm sure it will get litigated regardless.
Telling apart what's public domain or not is not a trivially automatable task.
If you rely only on curated libraries of vetted public-domain content, you get nowhere near the expected amount of variability and diversity.