0 points · peytoncasper · 3y ago · 0 comments

What happens if GitHub didn't use GPL-licensed code but still generated code that was identical to GPL-licensed code?

tedivm · 3y ago

We know that isn't the case because we can see code being reproduced even with comments, and GitHub has been open about the fact that they used everything they had in training.

That said, let's say there's a new model that explicitly excluded closed-source and copyleft licenses. Well, MIT, MPL, Apache, BSD: they all say you can't strip their licensing off.

Okay, so to get to the spirit of your question, let's say GitHub managed to build a model trained only on their own code or code that was explicitly put in the public domain. If that model reproduced code that wasn't in the training set, then it can't be accused of copying it. At that point the argument could be made that it independently created it.

At the same time, algorithms can't be copyrighted, but implementations of an algorithm can be. So if the model was basically just spitting out an algorithm that happened to be implemented similarly to some other code it wasn't trained on, then I would say there was no copyright violation.

bryanrasmussen · 3y ago

>We know that isn't the case because we can see code being reproduced even with comments

If the comment is something like

//check fromIndex is greater than toIndex

then that is no more distinctive than the actual function. Sadly, many people comment like this. On the other hand, if it reproduced a comment with typos, or something more complicated like

/* this hack is because Firefox's implementation of SVG z-indexing does not match how Chrome or Safari does it - please read this article ...url...*/

then yeah, you would have something.

marginalia_nu · 3y ago

Well yeah, we've already seen exactly this:

https://twitter.com/StefanKarpinski/status/14109710611816816...

theRealMe · 3y ago

In almost any other scenario this would be evidence. But Fast Inverse Square Root isn’t some tightly held secret. That exact code, with those specific comments included, is found on the Wikipedia page for that algorithm:

https://en.m.wikipedia.org/wiki/Fast_inverse_square_root

bryanrasmussen · 3y ago

OK, that tracks as more than just lazy lookalike comments.

visarga · 3y ago

How about rewording a code snippet so it doesn't exactly replicate the source but is functionally identical? That could be applied before training. Could we then say the LLM learned only the ideas, not the expression? Copyright should protect expression and not restrict reusing ideas.

janoc · 3y ago

Except that's not how an LLM works. An LLM has no idea about "ideas", only the probabilities of how certain words string together.

So you literally can't make it produce functionally identical but not verbatim identical code. It doesn't understand that the two are equivalent.

Also, such a "functionally identical but not violating copyright" transformation is not possible, given both the complexity of the problem and the sheer volume of the data.

And training it on some simplistically obfuscated code wouldn't help: all it would learn is how to produce obfuscated code, which is not useful for the intended purpose.

chii · 3y ago

> It doesn't understand that the two are equivalent.

It doesn't need to understand in the way a human might.

The pattern that the LLM managed to extract could include the structure, rather than the pure text. And in reproducing the structure, the LLM can replace the variable names but keep the structure intact.

I am not sure if Copilot is able to do this, but ChatGPT was somewhat able to (if imperfectly at the moment).
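To make "replace the variable names but keep the structure intact" concrete, here is a minimal sketch using Python's standard `ast` module. It is only an illustration of what structural equivalence under renaming can mean, not a description of how Copilot or ChatGPT works internally. The helper `structure` and the three sample snippets are hypothetical names made up for this example.

```python
import ast


class _Renamer(ast.NodeTransformer):
    """Replace every identifier with a placeholder based on first appearance."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # The first distinct name becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        # Note: builtins such as max/min are renamed too, but consistently.
        node.id = self._alias(node.id)
        return node


def structure(source):
    """Return a string describing the syntax tree with names normalized away."""
    tree = _Renamer().visit(ast.parse(source))
    return ast.dump(tree)


SNIPPET_A = """
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
"""

# Same structure as SNIPPET_A, different identifiers.
SNIPPET_B = """
def bound(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""

# Same behavior for lo <= hi, but a different structure.
SNIPPET_C = """
def bound(x, lo, hi):
    return max(lo, min(x, hi))
"""

print(structure(SNIPPET_A) == structure(SNIPPET_B))
print(structure(SNIPPET_A) == structure(SNIPPET_C))
```

The first two snippets differ as text but collapse to the same normalized tree, while the third computes the same result with a different tree. That is roughly the gap between "same structure, renamed" and "same idea, different expression" that this subthread is arguing about.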

nextaccountic · 3y ago

> So you literally can't make it produce functionally identical but not verbatim identical code. It doesn't understand that the two are equivalent.

But it does: similar but not identical pieces of code sit closer together in the embedding space.
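As a rough picture of what "closer in the embedding space" means, the sketch below maps snippets to vectors and compares them with cosine similarity. The big assumption: it uses a plain bag-of-tokens count vector, which is far cruder than the learned embeddings of an actual LLM, so it only illustrates the geometry and not the model. The functions `tokens` and `cosine` and the sample snippets are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(source):
    """Split source into identifiers and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\S", source)


def cosine(a, b):
    """Cosine similarity between the token-count vectors of two snippets."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)


ORIGINAL = "def clamp(value, low, high):\n    return max(low, min(value, high))"
RENAMED = "def bound(x, low, high):\n    return max(low, min(x, high))"
UNRELATED = "print('hello world')"

# The renamed variant should score much higher than the unrelated snippet.
print(cosine(ORIGINAL, RENAMED))
print(cosine(ORIGINAL, UNRELATED))
```

Even this crude vector places the renamed variant far nearer to the original than the unrelated line. A learned embedding would be expected to capture more than token overlap, but that goes beyond what this toy example shows.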

NoZebra120vClip · 3y ago

> Copyright should protect expression and not restrict reusing ideas.

That's what patents are for.

layer8 · 3y ago

They’d have to prove to the court that the former is true despite the latter happening, which I imagine would be difficult to do in practice.
