Better HN

az226 3y ago

But Copilot doesn't take your code. At best it has learned from a fraction of a fraction of your code and synthesized it with tens or thousands of like examples, and the output may look similar to your code because it's trying to achieve the same thing. It's not like Copilot takes your entire repo, clones it, and says "we washed the onerous license requirements away for ya".


david_allison 3y ago

There's a minimum level of complexity and creativity that copied code must reach before it constitutes a copyright violation. It's up to a legal professional to draw the line, but I believe it can be a single line of code (`i = 0x5f3759df - ( i >> 1 );`).

If I saw 100 LOC which was very similar to something I wrote, AND it contained a log statement copied verbatim, it would be very easy to infer that the entire piece of code is a derivative work.

Let's say I write FizzBuzz:

    # Copyright (c) 2022 David Allison. All rights reserved.

    for num in range(100):
        if num % 3 == 0 and num % 5 == 0:
            print("DA: fizzbuzz")
        elif num % 3 == 0:
            print("DA: fizz")
        elif num % 5 == 0:
            print("DA: buzz")
        else:
            print(num)

Say I found the modified FizzBuzz algorithm in the wild with one line containing the "DA" prefix. It may have been learned from a fraction of a fraction of my code, but it still contains my 'unique' creativity. Is that a copyright violation?
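To make the "log statement copied verbatim" signal concrete, here is a rough sketch that compares the string literals of two Python snippets. The function names, the `min_length` cutoff, and the `suspect` snippet are all made up for illustration. A shared literal is only a hint worth a closer look, not a finding of infringement.

```python
import ast
from typing import Set


def string_literals(source: str) -> Set[str]:
    """Collect every string constant that appears in a Python snippet."""
    tree = ast.parse(source)
    return {
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    }


def shared_literals(original: str, suspect: str, min_length: int = 4) -> Set[str]:
    """Return string literals (at least min_length long) found verbatim in both snippets."""
    common = string_literals(original) & string_literals(suspect)
    return {s for s in common if len(s) >= min_length}


# Abbreviated version of the FizzBuzz above.
original = '''
for num in range(100):
    if num % 3 == 0 and num % 5 == 0:
        print("DA: fizzbuzz")
    elif num % 3 == 0:
        print("DA: fizz")
    else:
        print(num)
'''

# Hypothetical code found "in the wild": different structure, one copied string.
suspect = '''
for i in range(1, 101):
    if i % 15 == 0:
        print("DA: fizzbuzz")
    elif i % 3 == 0:
        print("Fizz")
    else:
        print(i)
'''

print(shared_literals(original, suspect))  # {'DA: fizzbuzz'}
```

The loop variable, range, and conditions all differ, yet the one distinctive literal still gives the origin away, which is the scenario the question is about.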

Aside: Due to some uniquely named code I've contributed to, I strongly suspect that Copilot would output my GitHub username. I don't really want to open Pandora's box here, but I'd be curious.

visarga 3y ago

Replicating copyrighted code from the training set only happens 1% of the time; it's the exception, not the rule. And when it happens, it's usually because the same text appears multiple times in the training set. So it will memorize boilerplate and popular code snippets, not unique stuff. Even a replicated piece of code 100 lines long is no big deal in my opinion, unless it contains some kind of unique thing never seen before, like an optimized matrix multiplication function. Certainly not FizzBuzz.

On the practical side, it is actually easy to filter out sequences of words that are too similar to the training set from the output of the model. You just generate another snippet until it is "original" enough.
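A minimal sketch of that filter, assuming a hypothetical `generate` callable standing in for the model and a toy in-memory "training set". The token n-gram size and the overlap threshold are illustrative guesses, not anything Copilot is known to use.

```python
import re
from typing import Callable, Iterable, List, Optional, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(code: str) -> List[str]:
    """Split code into word tokens and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", code)


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Index every n-gram that occurs anywhere in the training corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        index |= ngrams(tokenize(doc), n)
    return index


def overlap_ratio(snippet: str, index: Set[NGram], n: int) -> float:
    """Fraction of the snippet's n-grams that also occur in the corpus."""
    grams = ngrams(tokenize(snippet), n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)


def generate_original(
    generate: Callable[[], str],
    index: Set[NGram],
    n: int = 6,
    max_overlap: float = 0.5,
    max_tries: int = 10,
) -> Optional[str]:
    """Resample until a candidate is 'original' enough, or give up and return None."""
    for _ in range(max_tries):
        candidate = generate()
        if overlap_ratio(candidate, index, n) <= max_overlap:
            return candidate
    return None


# Toy demo: the first candidate is a verbatim copy, the second is not.
index = build_index(['print("DA: fizzbuzz")'], n=3)
candidates = iter(['print("DA: fizzbuzz")', 'print("fizz" + "buzz")'])
result = generate_original(lambda: next(candidates), index, n=3)
print(result)  # print("fizz" + "buzz")
```

A filter like this only catches near-verbatim token overlap. Renaming a variable or reformatting the output would slip past it, so it speaks to exact replication rather than to whether something is a derivative work.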

david_allison 3y ago

I have ~400KLOC changed on GitHub. At that scale, "1% of the time" means it happens multiple times a day.

Pragmatically, people are already knowingly committing commercially viable copyright violations of my work. I'd rather it wasn't encouraged further by a US-based 'big tech' company, especially if the people using my code aren't aware that they're doing anything questionable.

Some months, I earn less than 1/100th from OSS of what I would in industry. I don't want people taking advantage any more than I'm comfortable with, especially for commercial purposes.


ranguna 3y ago

Copilot used (for training) copyrighted code without respecting the license and can generate pieces of copyrighted code verbatim without respecting the original license as well.

I pay for Copilot and this is very much the truth, but let's see how the court rules.

simion314 3y ago

So why, in your opinion, did Microsoft not have the courage to also train Copilot on proprietary code, including their own? From my perspective, MS knows that things are not that simple, so they did not want to "upset" some companies, while they can afford to screw over the open source people.

Btw, I would have been behind MS if they had done one of these two things:

1. Use all the code they have access to, including MS code and private code on GitHub, because that would show they actually believe that the AI works as advertised.

2. Make the model open: let people use it locally, improve it, test it for copyright issues, do whatever they want.
