nrb 3y ago

Does anyone have a problem with it, so long as the material it trained on was with explicit permission/license and not potentially in violation of copyright?

That's where the line is for it to be suspect IMO.

bogwog 3y ago

This is what I hope comes out of the lawsuit. If a company wants to sell an AI model, they need to own all of the training data. It can't be "fair use" to take other people's works at zero cost and use them to build a commercial product without compensation.

And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.

dmix 3y ago

There has to be a reasonable context here. Even if it’s trained on proprietary code it rarely ever is inserting that code directly in a way that is at all relevant to how it was used in the past.

Obvious licensing needs to be respected and it shouldn’t be hard to solve that problem. But 99.9% of code isn’t some unique algorithm, it’s gluing libraries and setting up basic structures.

Most of the examples I’ve seen don’t line up with the reality of code completion tools. Code is rarely valuable when broken up into its small parts.

Even copying a full codebase is rarely enough to draw value from… there’s way more to a software business than the raw code. But that’s a different problem.

bpicolo 3y ago

> It can't be "fair use" to take other peoples' works at zero cost, and use it to build a commercial product without compensation.

You just described open source software.

That's the whole heart of this lawsuit, and equally Copilot. It was trained on OSS which is explicitly licensed for free use.

bogwog 3y ago

Ok you got me, that wording was lazy on my part. But that's a really bad take on yours:

> It was trained on OSS which is explicitly licensed for free use.

That's not what the lawsuit is about. It's not about money, it's about licensing. OSS licenses have specific requirements and restrictions for using them, and Copilot explicitly ignores those requirements, thus violating the license agreement.

The GPL, for example, requires you to release your own source code if you use it in a publicly-released product. If you don't do that, you're committing copyright infringement, since you're copying someone's work without permission.

deathanatos 3y ago

Most companies building commercial products on top of FOSS are obeying the license requirements. (I have been through due diligence reviews where we had to demonstrate that, for each library/tool/package.)

The same cannot be said for Copilot: there have been prior examples here on HN showing that it can emit large chunks of copyrighted code (without the license).

xigoi 3y ago

> That's the whole heart of this lawsuit, and equally Copilot. It was trained on OSS which is explicitly licensed for free use.

Most open-source software is not licensed for free use. MIT and GPL, the two most common licenses, both require attribution.

chlorion 3y ago

A FOSS license does not mean "do whatever you want". The GPL, for example, requires all derived work to also be licensed under a GPL-compatible license.

michaelmrose 3y ago

It being permissively licensed is virtually irrelevant, because only a minority of code is licensed so permissively that you can do whatever you like with it and release the result under any license. Far more is "do what you like within the scope of the license." With the GPL, for example, you can do what you like with the code so long as any derivative work is also GPL.

adlpz 3y ago

I guess I'm just afraid that it might not be as good as it is that way.

It's a bit like how GPT-3, Stable Diffusion and all those generative models use extensive amounts of copyrighted material in training to get as good as they do.

In those cases however the output space is so vast that plagiarism is very unlikely.

With code, not so much.

jacobr1 3y ago

GPT-3 and Stable Diffusion might not copy things exactly, but they certainly do copy "style". There are many articles like this:

https://hyperallergic.com/766241/hes-bigger-than-picasso-on-...

The interesting thing is that the names get explicitly attached to these styles. It isn't exactly a copyright issue, but I'm sure it will get litigated regardless.

bjourne 3y ago

I think the prompt "GPT-3, tell me what the lyrics for the song Stan by Eminem is" is very likely to output copyrighted material. The same copyrighted material is, of course, already republished without permission on google.com.

odessacubbage 3y ago

there are literally thousands of years of artwork that fall under public domain, the idea that the dataset isn't big enough to make good images without copyright infringement and attribution laundering is frankly laughable.

adlpz 3y ago

My guess is that it's not so much about the amount of available data as about how accessible it is. Scraping the internet seems to be one of the preferred ways of gathering vast amounts of text and images in particular.

Telling apart what's public domain or not is not a trivially automatable task.

If one just relies on curated libraries of vetted public domain content, one doesn't get anywhere near the expected amount of variability and diversity.
