I wonder if a court would think that Microsoft in this case has done its due diligence to verify that the license grants it got from users are correct and in order.
e.g. 4. [..] You grant us [..] the right to [..] parse, and display Your Content [..] as necessary to provide the Service, This license includes [...] show it to [...] other users; parse it into a search index or otherwise analyze it
As the Service now includes Copilot, publishing anything on GitHub seems to give them the right to use it in Copilot. Maybe even for private repos.
Besides the issue we're currently discussing, I also wonder about:
5. [..] you grant each User of GitHub a [..] license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
So if you find GPLed content on github, you might be allowed to violate the GPL as long as it happens only on github. I don't know how bad this is in practice. Their CI presumably allows you to run code for other people without granting them the rights the GPL should give them, but that might be a violation of the Github TOS as this might be abuse of the CI servers.
This might also mean you violate the GPL when publishing someone else's GPLed code on github, as you now granted Microsoft and others rights not included in the GPL.
Clearly, IANAL and don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.
https://docs.github.com/en/site-policy/github-terms/github-t...
Yes. This was one of the legal theories behind why Apple refused to allow GPL software in the Mac App Store. The TOS that Apple required from developers gives Apple specific rights which the GPL does not grant, and thus any software that gets uploaded must be assumed to be provided under two separate licenses. Given that many free and open source projects have multiple authors, it is a rather large assumption that the person who uploads the software has the complete authority to provide the software under multiple conflicting licenses.
It is after all the distributor that has to do the due diligence to confirm that they are in the right to distribute.
That doesn't sound right. Licences can allow sublicensing, and I think all the popular open-source ones do.
There are also additional problems specific to sublicenses. In the United States, only exclusive licensees are assumed by statute to have a right to sublicense. The theory is that holders of exclusive licenses are assumed to have control/authority similar to that of the author. Nonexclusive licensees are not assumed to be granted such a monopoly by the licensor.
In practice, I think the entire open source world knows that people post each other's open source code on GitHub. Even projects that have very purposefully chosen to primarily use other services or self-host their source code are well aware that their code gets mirrored on GitHub and/or included in other people's repos on GitHub. Up until now, I don't think this has been controversial and I don't think GitHub gets a lot of takedown requests for this practice. I think most developers see this as a feature, not a bug. Copilot might make people rethink whether or not they want to start sending take-down requests but that'll be a tough call for a lot of people because withholding code from GitHub to avoid its usage in Copilot also effectively means making their code less easily available to the rest of the world. It may be very disruptive to other projects that include the copyright owner's code in their own projects.
If my code was uploaded on GitHub, I would DMCA it because of Copilot, but it wouldn't matter because the information is already in the model. So the DMCA does not help here.
The only way it would help is if I could DMCA the entire model and force them to retrain without my code. As it stands, this lawsuit is the only way for GitHub to be reined in; I don't have the resources to do so on my own.
IANAL.
Also, about high impact, suppose Copilot has 1 million users that use it on average 10 times a day, 5 days a week. You claim that less than 1% of uses of Copilot would result in copyright violation. Let's assume 0.1%. How many times would copyright violation happen per day? It would happen 10,000 times per day. For five days a week.
It would take a mere twenty weeks (less than six months) to reach a million violations.
That seems impactful.
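For anyone who wants to check the arithmetic above, here is a minimal sketch. Every input is the hypothetical figure from the comment (1 million users, 10 uses a day, 5 days a week, a 0.1% violation rate), not measured data, and the function name `violation_estimate` is made up for illustration.

```python
from fractions import Fraction


def violation_estimate(users, uses_per_user_per_day, days_per_week,
                       violation_rate, target):
    """Return (violations per day, violations per week, weeks to reach target)."""
    uses_per_day = users * uses_per_user_per_day
    per_day = uses_per_day * violation_rate
    per_week = per_day * days_per_week
    weeks_to_target = target / per_week
    return per_day, per_week, weeks_to_target


# Hypothetical inputs taken from the comment, not real usage statistics.
per_day, per_week, weeks = violation_estimate(
    users=1_000_000,
    uses_per_user_per_day=10,
    days_per_week=5,
    violation_rate=Fraction(1, 1000),  # 0.1%, kept exact to avoid float noise
    target=1_000_000,
)
print(per_day, per_week, weeks)  # 10000 50000 20
```

Changing `violation_rate` shows how sensitive the conclusion is to that one guess: at 1% the same million would be reached in two weeks.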
See Marvin Minsky’s comment regarding “suitcase words”.
If I ask Stable Diffusion to create a picture of Elon Musk wielding lightning and riding a giant blue sparrow over a desert during a storm, the result would be more creative than what could be produced by most humans. I believe that counts as proof.
And that is what Copilot's AI mostly does.
It doesn't "understand the concepts and reproduce something alike" in the sense a human does. It might understand some concepts here and there, but it also does a lot of heavy lifting by verbatim "remembering" (i.e. copy-pasting) code.
This is also why some people argue that the cases for Copilot and some of the image generation networks are different, as some of the image generation networks get much closer to "understanding and reproducing a style". (Though potentially just because it is much easier to blend copy-pasted snippets in images to the point that they're unrecognizable.)
One of the main problems GitHub has IMHO is that anyone who has studied such generative methods knows that:
1) they are prone to copy-pasting
2) you don't know what they remembered (i.e. stored copies of in an obscure, human-unreadable encoding, so just distributing such a network can be a copyright infringement)
3) you don't know when they copy-paste
4) the copy pasted code often is a bit obscured, ironically (and coincidentally) often comparable with how someone who knowingly commits copyright theft would obscure the code to avoid automated detection
Which means GitHub knowingly accepted and continued with tricking its Copilot users into committing copyright infringement under the assumption that such infringement is most times obscured enough to evade automatic detection....
There is no equal sign between a person and a program.
There is also that thing called "scale" that is critical to the interpretation of the action.
Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...
This argument is hardly less flawed than the one you are criticizing. And your statement that 'there is no equal sign ...' is also unconvincing, as we're not equating these two, but the process of learning, which is quite similar.
The question is whether a person is doing the same as Copilot for this particular case, i.e. reading source code to learn.
You have not really given any argument why this is not the case. Or maybe your reference to scale? So only because Copilot has read more code than a human possibly could, that makes it different? But why exactly is reading a bit of code fine w.r.t. copyright, but reading more code suddenly violates copyright?
Note that the reason why Copilot needs more code to learn is just because the learning currently is not as efficient as for humans.
Even novelists do not sit all day long in a closed room reading other people's work and then do a collage of what they've read. Otherwise no books would have been written in the first place.
Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.
Once (if ever?) an AI is capable of producing an actual original work, I'm fine with other AIs stealing from the first one. Please leave humans alone.
That "experiment" could just as well be done on humans, though, cut them off of any work that any human has done before and you may get simple cave paintings, if you're lucky.
That's correct, but it misses the point.
This is about reconciling 1. being allowed but 2. not being allowed:
1. The human uses a machine, where the machine is an organic one it grew itself.
2. The human uses a machine, where the machine is one it made or acquired.
To a lot of us, there's no difference.
... yet.
Please study the series of events that unfolded in the music industry after folk began incorporating recordings made by other artists in their own work and proceeded to sell the result.
Spoiler: The deeply nuanced question of feeding a mechanical recording through a series of complex physical and mathematical apparatus and whether that constituted a transformational creative act did not come up during the proceedings or final judgements!
Is Copilot just trained on OSS, or on private repos too?
Am I violating copyright? Yes
Imagine they change the character names in those paragraphs. Am I still violating copyright? Yes
At some point you can change enough of the text to not violate copyright. The grey area involves the courts.
It feels very simple to me so I might be missing something.
> It feels very simple to me so I might be missing something.
In my opinion, you are missing something subtle:
In continental Europe, there is a different law tradition - civil law (https://en.wikipedia.org/wiki/Civil_law_(legal_system) ) - that is different from the Anglo-American common law tradition. To quote from the wikipedia article:
"The civil law system is often contrasted with the common law system, which originated in medieval England, whose intellectual framework historically came from uncodified judge-made case law, and gives precedential authority to prior court decisions. [...] Conceptually, civil law proceeds from abstractions, formulates general principles, and distinguishes substantive rules from procedural rules. It holds case law secondary and subordinate to statutory law."
So if you are attached to the civil law system, you seriously want to avoid this grey area involving the courts (which is much more accepted in common law) and instead want to codify into laws what you mean by this grey area.
The simple layman's version of copyright is that copyright applies to a specific form of a thing and not to the ideas behind that thing.
So, no, George Lucas was not infringing anything. Nor is hip hop music making use of samples infringing anything. Or Andy Warhol integrating photos into his works. Nor is it illegal to paraphrase or refer to other authors. And as Oracle found out by challenging it in court, trying to claim ownership over APIs to prevent third party implementations is also not going to work.
All of that falls under fair use. Fair use is what makes copyright useful. Without it you'd have to live in fear that legal copyright holders might come after you if you apply the ideas that you might have been exposed to via their copyrighted work. Fair use exists such that you can make use of information provided to you via a copyrighted work.
It's an interesting test of open source licensing because I'm not aware of any other area of copyright where works come with an explicit "if you use this somewhere else you must credit me as the initial author" in the implied/provided license.
Comparing music, literature, etc. to code is difficult because of both this difference and the existence of software patents. The manner in which infringement happens (and the scale) is often different as well.
I don't want Github or any other megacorp-backed entity abusing the open source community in the way micro$oft is here, it's as simple as that. If they wish to train it on entirely proprietary Microsoft code, then by all means go nuts, but to take the work of open source projects and to hide behind the pretense of the mathematical model behind the A"I" learning something is simply ridiculous to me.
I find it quite curious that they're not doing that (training it on their own codebase). Perhaps they're afraid of their little intelligence spitting out proprietary code verbatim like it's been shown to do many times with licensed open source code.
Next hypothetical.
These can be quite inventive works; nevertheless, no-one seriously argues that the video content does not breach the original animators' copyright.
The video content of an amv is a much better analogy for what copilot does to third parties' code than anything else I've seen in this post's discussion so far.
What if I build an AGPL-licenced service, using GitHub to coordinate development? According to the ToS, MS could offer a version of my service because I posted the code on GitHub and they are using it to improve their service to me. According to my AGPL licence, they would need to share their source.
So which takes precedence: the licence or the ToS?
The most they could do is transfer any liability back to you for posting it in breach of some term in their ToS. But that would be absurd since posting someone else's code, licensed under a common (eg. OSI-approved) license, is an established and normal use case for GitHub. If their ToS really did ban the posting of some AGPL code, they really ought to have pointed it out, and of course it'd render GitHub useless for hosting AGPL code.
This would only apply when posting someone else's code. But of course you could always arrange that.
This is very similar to what happens when you sign a contributor agreement before contributing code to an open source project. When you sign the contributor agreement, you're granting a very broad license to your work to the project maintainers. They can then license your work out under any license they want. But likewise, because you are not granting them an exclusive license, you're free to put your contribution license out into the world under any license of your choosing separate and apart from the project that you contributed it to.
Technically, I think the scenario you're describing with AGPL code may well be possible and legal. But practically, I think people would stop using GitHub if they felt that doing so would lead to GitHub/Microsoft undercutting their projects, stealing their customers, or essentially stripping the project of any AGPL obligations. I think that from a business perspective, they're really gambling on the idea that developers will see Copilot as a big boon rather than a value suck. Time will tell whether their gamble has paid off.
Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.
I imagine there would be no problem training it on e.g. the Unreal Engine code which is not open source but is available to read.
The big practical issue is that there's no warning when Copilot produces code that might violate copyright, so you're taking a bit of a risk if you use it to generate huge chunks of code as-is. I imagine they are working on a solution to that though. It's not difficult conceptually.
Other parties are not granted license under the ToS, and so will have to abide by the AGPL.
That's why it's GitHub, not CodeScribe (or something).
Over many years it has now mostly become a tool for large companies to accumulate rights (on works they didn't create themselves) and monetize them.
Maybe a reform is needed, to find a way back to the original purpose.
No, it wasn’t. Copyright was originally intended to protect the publishers of a work. It was later transformed to nominally focus on the creators, but even this was lobbied for by publishers in their own self-interest after the old law directly protecting them was allowed to lapse, and because it still had the same net effect since realizing value meant licensing to a publisher in most practical cases, so the publishers were still major beneficiaries.
And, of course, US copyrights under the Constitution do not exist for the purpose of protecting creators. A private benefit for creators is the mechanism, but the purpose is expressly to “promote the progress of science and useful arts”.
And sometimes the network improves, depending on the quality or direction of the output; the client can be a valuable critic even without being an expert in the field.
> [Github's Terms of Service] specifically identifies “GitHub” to include all of its affiliates (like Microsoft) and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates.
"It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions."
i'm not paying for copilot right now because i'm waiting for this to shake out. but i'd be happy to pay (even their current asking price) if i knew the model was also open source and could be self hosted.
maybe this is the wrong way to ask the question, but hopefully it makes sense
As it stands, Copilot is a black-box which strips copyright from a piece of code.
I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.
I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.
I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.
I'm not fine that Copilot allows people to take my code, 'change the variable names' and remove the license. Especially because I have no visibility of the fact that this has occurred.
As is, it's EEE applied to open source-- Microsoft's ultimate play against the ethic that brought us Linux among other things. When your brainchild gets gobbled up faster than you can blink, pushed to people who never learn about your existence, and a megacorp that you are ethically opposed to profits from the process, the need for self-actualization is no longer addressed. The fundamental incentive that pushes us to publish in the open, to have other humans acknowledge you and your work and feel pride in it, is being eliminated.
I used to be a little more agreeable with Copilot, with training money and all, but seeing that Stable Diffusion is willing to open up hundreds of thousands in training, and more in engineering, and thereby create an active community dedicated to improving it every day, I just can’t help but be annoyed when one of the world’s biggest tech companies pulls such a petty move.
For example,
> That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts.
The only legal way you can use copyrighted code is due to the license attached to it by the copyright holder.
If a license specifically prohibits copying the code for a purpose, then it is a violation of the copyright to copy the code for that purpose. You have no other legal way to do it.
These aren't magic words, they are legal obligations. Ok, well maybe legal obligations are magic words. But it is magic that works :). Otherwise things like GPL could not function.
(Section 6 here: https://opensource.org/osd)
Let's say I added a clause to my BSD license that prohibits the copying of this code to train ML models.
Would that not immediately make GitHub in violation of this license?
Or do they only train it where the license is explicitly one of the ones it knows about?
https://medium.com/@6StringMerc/artificial-intelligence-mach...
This one sentence threw off my entire opinion of the article as it demonstrates the author's clear bias in favor of Copilot, not just specifically in this case but in principle.
Legal opinion on Copilot and generative AI in general hinges entirely on metaphors. If the AI is understood to behave like a human being building knowledge and drawing from it for inspiration, Copilot is just another way to write code. But we've already established legal precedent that machines can not hold copyright, which suggests that they can not be deemed to be creative, which could be used to argue that they are therefore just creating an inventory of copyright works and creating mechanical mashups.
The author's dismissal also ignores that this would not JUST result in attribution. If Copilot indexed copyleft code and were required to provide attribution when using this code, the output might also be affected and this could in turn affect the entire code base. Worse yet, Copilot may output code with conflicting licenses. The author considers only the possibility that Copilot itself might have to inherit the license, and the dismissal that it would "help no one" because it runs on a server ignores both the existence of a (presumably self-hosted) enterprise service and the existence of licenses like AGPL, which would still apply. But it seems most people's concerns are with the output instead.
I also fail to understand how the argument that it doesn't reproduce the code exactly 99% of the time is helpful. If I copy code and rename the variables and run an autoformatter on it, it's still a copy of the code. It's odd to see a lawyer use what is essentially obfuscation as a defense against copyright claims. Also 1% is an incredibly large number given how Copilot is supposed to be used and how large the potential customer base is. Given the direction GitHub is heading with "Hello GitHub" (demoed at GitHub Universe yesterday) it's not unlikely that Copilot would in some cases be used to generate hundreds, thousands or tens of thousands of lines of code in a single project.
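To make the rename-and-reformat point concrete, here is a rough sketch using Python's standard `tokenize` module. It replaces every identifier with a placeholder and ignores comments and indentation width, then compares the resulting token streams. The helper name `normalized_tokens` and the sample snippets are invented for illustration; this is not how Copilot, GitHub, or any court detects copying, and token equality is not a legal test of anything.

```python
import io
import keyword
import tokenize


def normalized_tokens(source: str) -> list[str]:
    """Token stream with identifiers anonymized and layout details dropped."""
    ignored = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
    structural = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in ignored:
            continue
        if tok.type in structural:
            # Keep block structure, but not the exact whitespace used.
            out.append(tokenize.tok_name[tok.type])
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # every variable/function name looks the same
        else:
            out.append(tok.string)
    return out


original = (
    "def total(prices):\n"
    "    s = 0\n"
    "    for p in prices:\n"
    "        s += p\n"
    "    return s\n"
)
# Same code with names changed, a comment added, and different indentation.
renamed = (
    "def sum_all(xs):\n"
    "  acc = 0  # running sum\n"
    "  for x in xs:\n"
    "      acc += x\n"
    "  return acc\n"
)
different = "def total(prices):\n    return sum(prices)\n"

same_after_rename = normalized_tokens(original) == normalized_tokens(renamed)
same_as_different = normalized_tokens(original) == normalized_tokens(different)
print(same_after_rename, same_as_different)  # True False
```

The renamed version is byte-for-byte different from the original yet identical once names and layout are ignored, which is the sense in which "not an exact reproduction" does not say much.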
The question isn't just whether Copilot is violating the law or not, the question is why it is or isn't because that could have wide implications outside GitHub itself. But as the author points out, sadly the lawsuit doesn't try to settle this for copyright, which might be the most impactful question.
https://docs.github.com/en/site-policy/github-terms/github-t...
I'm actually surprised they allowed Copilot to happen, given this section:
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
One could make the argument they had no intrinsic right to use the software for Copilot except under the terms laid out under the respective softwares' licenses. This means any GPL code they copied by error is now in violation of the GPL by default. But IANAL.
edit: The exact quote was “Training machine learning models on publicly available data is considered fair use across the machine learning community” if you want to search for it.
edit 2: https://web.archive.org/web/20210629142841/http://copilot.gi...
> Frequently Asked Questions -> Training Set -> Why was GitHub Copilot trained on data from publicly available sources?
> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.
You have apparently misunderstood copyright licenses as being something that attaches to a copyrighted work and now must be respected by all users of that work. But that is totally incorrect.
Licenses are individual agreements between copyright holders (or licensees who have been granted the right to re-license) and people who want to exercise one of the rights normally withheld under copyright. A LICENSE file is nothing but an offer to grant a license with specified terms to anyone who might want to use the work, without having to nag the licensor to sign an agreement. The existence of that offer doesn't have anything to do with any other agreement the licensor and a (potential) licensee might make.
In the GitHub case, GitHub has negotiated a different license with the uploader. (That negotiation happened to take the form of a ToS, which is another kind of binding offer.) The LICENSE file has nothing to do with it. It hasn't been overridden, it's just irrelevant. It doesn't add or subtract any terms from the separate and distinct license GitHub negotiated.
Other parties will still have to abide by your license.
Back when I was young, graph pathfinding algorithms were called AI. A few decades later they are a well understood commodity and I haven't seen anyone call them AI for a while. Maybe that'll happen to LLMs too, given a few years?
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."
https://news.ycombinator.com/newsguidelines.html