If I saw 100 LOC that was very similar to something I wrote, AND it contained a log statement copied verbatim, it would be very easy to infer that the entire piece of code is a derivative work.
Let's say I write FizzBuzz:
    # Copyright (c) 2022 David Allison. All rights reserved.
    for num in range(100):
        if num % 3 == 0 and num % 5 == 0:
            print("DA: fizzbuzz")
        elif num % 3 == 0:
            print("DA: fizz")
        elif num % 5 == 0:
            print("DA: buzz")
        else:
            print(num)
If I found the modified FizzBuzz algorithm in the wild with one line containing the "DA" prefix, it may have been learned from a fraction of a fraction of my code, but it still contains my 'unique' creativity. Is that a copyright violation?
Aside: Due to some uniquely named code I've contributed to, I strongly suspect that Copilot would output my GitHub username. I don't really want to open Pandora's box here, but I'd be curious.
On the practical side, it is actually easy to filter out sequences of words that are too similar to the training set from the output of the model. You just generate another snippet until it is "original" enough.
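To make that concrete, here is a rough sketch of what such a filter could look like. It is only an illustration under my own assumptions, not how Copilot or any real product does it: the whitespace tokenizer, the n-gram size, the 0.5 threshold, and every function name (`build_index`, `overlap_ratio`, `generate_original`) are hypothetical, and `generate` stands in for whatever produces a snippet from the model.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Split on whitespace and return the set of n-token windows."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen in the (hypothetical) training set."""
    index: Set[NGram] = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index


def overlap_ratio(candidate: str, index: Set[NGram], n: int) -> float:
    """Fraction of the candidate's n-grams that also occur in the index."""
    grams = ngrams(candidate, n)
    if not grams:
        # Candidates shorter than n tokens have no n-grams and pass trivially.
        return 0.0
    return len(grams & index) / len(grams)


def generate_original(
    generate: Callable[[], str],
    index: Set[NGram],
    n: int = 4,
    max_overlap: float = 0.5,  # assumed threshold, chosen arbitrarily
    max_attempts: int = 10,
) -> Optional[str]:
    """Resample until a candidate is 'original' enough, or give up."""
    for _ in range(max_attempts):
        candidate = generate()
        if overlap_ratio(candidate, index, n) <= max_overlap:
            return candidate
    return None
```

Note that with a ratio threshold like this, a single line copied verbatim inside otherwise new code would still pass, so the choice of threshold and n-gram size decides what counts as "too similar".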
Pragmatically, people are already knowingly committing commercially viable copyright violations of my work. I'd rather it wasn't encouraged further by a US-based 'big tech', especially if the people using my code aren't aware that they're doing anything questionable.
Some months, I earn over 100x less from OSS than I would in industry. I don't want people taking advantage any more than I'm comfortable with, especially for commercial purposes.
I pay for Copilot and this is very much the truth, but let's see how the court rules.
Btw, I would have been behind MS if they had done one of these two things:
1. Use all the code they have access to, including MS code and private code on GitHub, because that would show they actually believe that the AI works as advertised.
2. Make the model open: let people use it locally, improve it, test it for copyright issues, do whatever they want.