An open source lawyer’s view on the Copilot class action lawsuit

A hypothetical question: imagine a filmmaker who has studied a lot of obviously copyrighted movies by renowned directors. This means he has trained his neural network on their copyrighted content. Does he breach copyright when he composes and films a scene? Are visual quotes copyright theft? Homages? Did George Lucas infringe copyright when he borrowed compositions from "Triumph of the Will"?

ssivark 3y ago

Just because machine learning uses the word “learning” doesn’t mean it “learns” in the same way a human mind does — that analogy is doing a lot of load bearing in your argument, and needs proving why the program’s nature of creative remixing (for lack of a better word) is the same as a human’s. Right now it seems like you’re just reusing the same word for two phenomena we don’t understand, and therefore claiming they’re equivalent.

See Marvin Minsky’s comment regarding “suitcase words”.

albertzeyer 3y ago

But effectively learning here really means the same thing: Based on the input (source code), you will adapt the synaptic weights between neurons, in a similar way for humans and for the artificial neural networks. Of course, it's not exactly the same. There are some differences in the details, and the artificial neural network is really much more simplified, and thus also less efficient at learning. But why is this relevant for the copyright question?

> and needs proving why the program’s nature of creative remixing (for lack of a better word)

If I ask Stable Diffusion to create a picture of Elon Musk wielding lightning and riding a giant blue sparrow over a desert during a storm, the result would be more creative than what most humans could produce. I believe that counts as proof.

dathinab 3y ago

If his "composing a scene" means copy-pasting clips of the movies he studied and smoothing things over, then yes, that would be obvious infringement.

And that is what Copilot's AI mostly does.

It doesn't "understand the concepts and reproduce something alike" in the sense a human does. It might understand some concepts here and there, but it also does a lot of heavy lifting by verbatim "remembering" (i.e. copy-pasting) code.

This is also why some people argue that the cases for Copilot and some of the image generation networks are different, as some of the image generation networks get much closer to "understanding and reproducing a style". (Though potentially just because it is much easier to blend copy-pasted snippets in images to the point that they're unrecognizable.)

One of the main problems GitHub has IMHO is that anyone who has studied such generative methods knows that:

1) they are prone to copy-pasting

2) you don't know what they remembered (i.e. stored copies of in an obscure, human-unreadable encoding, which means just distributing such a network can be copyright infringement)

3) you don't know when they copy-paste

4) the copy-pasted code is often a bit obscured, ironically (and coincidentally) often comparable to how someone who knowingly commits copyright theft would obscure the code to avoid automated detection

Which means GitHub knowingly accepted and continued with tricking its Copilot users into committing copyright infringement under the assumption that such infringement is most times obscured enough to evade automatic detection...

jackdaniel 3y ago

I see this argument over and over again, and it is so flawed that it is hard to bear.

There is no equal sign between a person and a program.

There is also that thing called "scale" that is critical to the interpretation of the action.

Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...

> Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...

This argument is hardly less flawed than the one you are criticizing. And your statement that 'there is no equal sign ...' is also unconvincing, as we're not equating these two, but the process of learning, which is quite similar.

albertzeyer 3y ago

The question is not whether a person is equal to a program.

The question is whether a person is doing the same as Copilot for this particular case, i.e. reading source code to learn.

You have not really given any argument why this is not the case. Or maybe your reference to scale? So only because Copilot has read more code than a human possibly could, that makes it different? But why exactly is reading a bit of code fine w.r.t. copyright, but reading more code suddenly violates copyright?

Note that the reason why Copilot needs more code to learn is just because the learning currently is not as efficient as for humans.

uklgrant 3y ago

Humans are not neural networks, that's just a thesis.

Even novelists do not sit all day long in a closed room reading other people's work and then do a collage of what they've read. Otherwise no books would have been written in the first place.

Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.

Once (if ever?) an AI is capable of producing an actual original work, I'm fine with other AIs stealing from the first one. Please leave humans alone.

carpenecopinum 3y ago

> Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.

That "experiment" could just as well be done on humans, though, cut them off of any work that any human has done before and you may get simple cave paintings, if you're lucky.

nomilk 3y ago

> Humans are not neural networks

That's correct, but it misses the point.

This is about reconciling 1. being allowed but 2. not being allowed:

1. The human uses a machine, where the machine is an organic one it grew itself.

2. The human uses a machine, where the machine is one it made or acquired.

To a lot of us, there's no difference.

polaris64 3y ago

A difference is that I can't just spin up a copy of George Lucas on my GPU in seconds and request it to produce something from a prompt like "a disappointing prequel".

> A difference is that I can't just spin up a copy of George Lucas

... yet.

orangesite 3y ago

Your magic box is not a film maker and the inputs you are encoding with it are verbatim file content. Said content belonging to someone else.

Please study the series of events that unfolded in the music industry after folk began incorporating recordings made by other artists in their own work and proceeded to sell the result.

Spoiler: The deeply nuanced question of feeding a mechanical recording through a series of complex physical and mathematical apparatus and whether that constituted a transformational creative act did not come up during the proceedings or final judgements!

nomilk 3y ago

> Said content belonging to someone else.

Is Copilot just trained on OSS, or on private repos too?

badcppdev 3y ago

I like the scenario: Imagine I've hired an assistant with an eidetic memory who has read loads of books. I pay them to help me write a book and they reproduce a few paragraphs from a different book into my book.

Am I violating copyright? Yes

Imagine they change the character names in those paragraphs. Am I still violating copyright? Yes

At some point you can change enough of the text to not violate copyright. The grey area involves the courts.

It feels very simple to me so I might be missing something.

q-big 3y ago

> At some point you can change enough of the text to not violate copyright. The grey area involves the courts.

> It feels very simple to me so I might be missing something.

In my opinion, you are missing something subtle:

In continental Europe, there is a different legal tradition - civil law (https://en.wikipedia.org/wiki/Civil_law_(legal_system)) - that differs from the Anglo-American common law tradition. To quote from the Wikipedia article:

"The civil law system is often contrasted with the common law system, which originated in medieval England, whose intellectual framework historically came from uncodified judge-made case law, and gives precedential authority to prior court decisions. [...] Conceptually, civil law proceeds from abstractions, formulates general principles, and distinguishes substantive rules from procedural rules. It holds case law secondary and subordinate to statutory law."

So if you are attached to the civil law system, you seriously want to avoid this grey area involving the courts (which is much more accepted in common law) and instead want to codify into laws what you mean by this grey area.

jillesvangurp 3y ago

The beauty of the law is that it does not take such philosophical things into consideration. The only thing that matters is the text of the law and its documented interpretation in various court cases. That's why copyright is excluded from this court case: there are a lot of documented interpretations of fair use, which also apply here.

The simple layman's version of copyright is that copyright applies to a specific form of a thing and not to the ideas behind that thing.

So, no, George Lucas was not infringing anything. Nor is hip hop music making use of samples infringing anything. Or Andy Warhol integrating photos into his works. Nor is it illegal to paraphrase or refer to other authors. And as Oracle found out by challenging it in court, trying to claim ownership over APIs to prevent third party implementations is also not going to work.

All of that falls under fair use. Fair use is what makes copyright useful. Without it you'd have to live in fear that legal copyright holders might come after you if you apply the ideas that you might have been exposed to via their copyrighted work. Fair use exists such that you can make use of information provided to you via a copyrighted work.

BeefWellington 3y ago

All those examples you give are transformative in some way or other.

It's an interesting test of open source licensing because I'm not aware of any other area of copyright where works come with an explicit "if you use this somewhere else you must credit me as the initial author" in the implied/provided license.

Comparing music, literature, etc. to code is difficult because of both this difference and the existence of software patents. The manner in which infringement happens (and the scale) is often different as well.

sensanaty 3y ago

Philosophical bullshitting aside (and it really is philosophical bullshitting), I just genuinely don't care if a human or a machine "think" or "learn" in the same way.

I don't want Github or any other megacorp-backed entity abusing the open source community in the way micro$oft is here, it's as simple as that. If they wish to train it on entirely proprietary Microsoft code, then by all means go nuts, but to take the work of open source projects and to hide behind the pretense of the mathematical model behind the A"I" learning something is simply ridiculous to me.

I find it quite curious that they're not doing that (training it on their own codebase). Perhaps they're afraid of their little intelligence spitting out proprietary code verbatim like it's been shown to do many times with licensed open source code.

jules 3y ago

I bet he would if his movie scene is pixel for pixel identical to the scene he watched.

6stringmerc 3y ago

No.

Next hypothetical.

tsukikage 3y ago

The production of anime music videos is a fan activity where tiny clips from animated shows are pasted together, with a piece of music replacing the audio track. The result is typically 3-4 minutes long. The audio may or may not be original; regardless, the video content never is, barring some very very light editing.

These can be quite inventive works; nevertheless, no-one seriously argues that the video content does not breach the original animators' copyright.

The video content of an amv is a much better analogy for what copilot does to third parties' code than anything else I've seen in this post's discussion so far.

steve_gh 3y ago

Hmmm. I'm interested in the GitHub ToS, which (if I understand correctly) basically says that GitHub and its affiliates (MS) can use anything you post on GitHub to improve their service.

What if I build an AGPL licenced service, using GitHub to coordinate development? According to the ToS, MS could offer a version of my service because I posted the code on GitHub, and they are using it to improve their service to me. According to my AGPL licence, they would need to share their source.

So which takes precedence: the licence or the ToS?

rlpb 3y ago

Consider that you can post somebody else's code to GitHub, and that may be licensed AGPL (or anything else). In that case, somebody else is the copyright holder so clearly the ToS doesn't magically give GitHub any additional rights and the licence applies.

The most they could do is transfer any liability back to you for posting it in breach of some term in their ToS. But that would be absurd since posting someone else's code, licensed under a common (eg. OSI-approved) license, is an established and normal use case for GitHub. If their ToS really did ban the posting of some AGPL code, they really ought to have pointed it out, and of course it'd render GitHub useless for hosting AGPL code.

This would only apply when posting someone else's code. But of course you could always arrange that.

robinsonb5 3y ago

The ToS do give GitHub an indemnity against the consequences of that scenario - so if the actual copyright holder complains about Copilot spitting out their code without proper attribution and license, they could indeed transfer the liability to the uploader. (That scenario could apply to GPL and MIT code, too, not just AGPL.)

lindenksv1 3y ago

OP here. If you own the copyright to a work, you can license it in any way you like. You can offer it to some people under a commercial license and to other people under an open source license. Many entities practice dual (or tri or whatever) licensing. When you post things on GitHub, you are essentially dual licensing your work. You're providing it under a very broad license to GitHub and you are providing it under an OSS license (or whatever you like) to other GitHub users. Neither license takes precedence. One license applies to one group of people and the other license applies to the other group of people.

This is very similar to what happens when you sign a contributor agreement before contributing code to an open source project. When you sign the contributor agreement, you're granting a very broad license to your work to the project maintainers. They can then license your work out under any license they want. But likewise, because you are not granting them an exclusive license, you're free to put your contribution out into the world under any license of your choosing, separate and apart from the project that you contributed it to.

Technically, I think the scenario you're describing with AGPL code may well be possible and legal. But practically, I think people would stop using GitHub if they felt that doing so would lead to GitHub/Microsoft undercutting their projects, stealing their customers, or essentially stripping the project of any AGPL obligations. I think that from a business perspective, they're really gambling on the idea that developers will see Copilot as a big boon rather than a value suck. Time will tell whether their gamble has paid off.

david_allison 3y ago

As a follow-on, what if you're mirroring code which is under an AGPL license? Are you allowed to post it on GitHub if you can't grant those rights under the ToS due to the license of the code?

VBprogrammer 3y ago

An interesting thought experiment is how keen Microsoft would be to allow Copilot to be trained on the Office or Windows source code. If the output is truly free of copyright from its training materials, then they should be happy to; if not, why not?

IshKebab 3y ago

The output isn't guaranteed to be free of copyright from its training materials. It just usually is. There have been clear demonstrations of it regurgitating code from the training set verbatim, which would of course still be covered by the original license.

Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.

I imagine there would be no problem training it on e.g. the Unreal Engine code which is not open source but is available to read.

The big practical issue is that there's no warning when Copilot produces code that might violate copyright, so you're taking a bit of a risk if you use it to generate huge chunks of code as-is. I imagine they are working on a solution to that though. It's not difficult conceptually.
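
A minimal sketch of what such a warning could look like, assuming the service can build an index of token n-grams over the code it was trained on. The helper names (`tokens`, `shingles`, `overlap`) and the n-gram length are illustrative assumptions, not anything GitHub has described:

```python
import re


def tokens(code: str) -> list[str]:
    # Split into identifiers, digit runs, and single punctuation characters,
    # so whitespace and line breaks do not affect the comparison.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(code: str, n: int = 8) -> set[tuple[str, ...]]:
    # Every run of n consecutive tokens in the snippet.
    toks = tokens(code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap(suggestion: str, corpus_index: set, n: int = 8) -> float:
    # Fraction of the suggestion's n-grams that also occur in the index
    # (hypothetically built by calling shingles() on the training code).
    s = shingles(suggestion, n)
    if not s:
        return 0.0
    return len(s & corpus_index) / len(s)
```

A tool could warn whenever the score passes some chosen threshold. The catch is that plain n-gram matching misses copies whose identifiers were renamed.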

jansommer 3y ago

It's likely already trained on Windows source code unless they have specifically excluded the repositories of the leaked Win 2000 code. Perhaps someone who's never going to contribute to ReactOS and Wine can verify?

insanitybit 3y ago

Why would they do that, regardless of whether the output could be restricted via Copyright or not? Also, this case isn't about copyright, as the lawyer clearly explains.

amarant 3y ago

Probably the ToS. You've specifically granted GitHub a license to use your code under the terms of the ToS, so they effectively have 2 licenses. They can therefore choose under which licence they want to use your code, and will choose the most permissive one, or the one they have the best understanding of: in this case the ToS.

Other parties are not granted license under the ToS, and so will have to abide by the AGPL.

World177 3y ago

License takes precedence when you don’t own the copyright. ToS takes precedence when legally allowed and you do own the copyright.

NicoJuicy 3y ago

Their service is hosting code, not writing code.

That's why it's GitHub, not CodeScribe (or something)

Xylakant 3y ago

It’s definitely more than just hosting code - GitHub offers issue/PR management, light weight project management, an online IDE for collaborative editing and CI services at least. Arguing that GitHub provides services that aim to improve developer/development team productivity is not a stretch. And arguing that ML-assisted development support is part of that definition isn’t particularly far out either.

visarga 3y ago

I think copyright itself might be on its way out. What meaning does a copyright have when I can click "Variations" on anything and get 4 suggestions in 10 seconds? Imagine how good they will be by 2030.

hooby 3y ago

Over many years it has now mostly become a tool for large companies to accumulate rights (on works they didn't create themselves) and monetize them.

Maybe a reform is needed, to find a way back to the original purpose.

dragonwriter 3y ago

> Copyright was originally intended to protect the creators of a work.

No, it wasn’t. Copyright was originally intended to protect the publishers of a work. It was later transformed to nominally focus on the creators, but even this was lobbied for by publishers in their own self-interest after the old law directly protecting them was allowed to lapse, and because it still had the same net effect since realizing value meant licensing to a publisher in most practical cases, so the publishers were still major beneficiaries.

And, of course, US copyrights under the Constitution do not exist for the purpose of protecting creators; a private benefit for creators is the mechanism, but the purpose is expressly to “promote the progress of science and useful arts”.

izacus 3y ago

There has never been more support for tightening and enforcing copyright than there is today. This is very unlikely to change due to megacorps like Microsoft, Disney, Apple et al. having a massive vested interest in using it to extract maximum profits.

classified 3y ago

Copyright protection for the rich and powerful, while those who cannot afford armies of lawyers get their stuff stolen by machine learning models. Sounds credible to me.

LesZedCB 3y ago

there's a great youtube doc about everything being a remix which i highly recommend

https://www.everythingisaremix.info/

esalman 3y ago

Have you tried that on any kind of music?

throwaway290 3y ago

Copyright becomes especially important and valuable in these circumstances. Remember, original works are how your variation suggestion engine is trained. With the remaining incentives taken away there is no more new stuff to train on, networks get trained on their own output, the snake eats its own tail.

guhayun 3y ago

> networks get trained on own output

And sometimes the network improves, depending on the quality or direction of the output. The client can be a valuable critic even without being an expert in the field.

mjw1007 3y ago

I think this is the most interesting part:

> [Github's Terms of Service] specifically identifies “GitHub” to include all of its affiliates (like Microsoft) and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates.

tryre 3y ago

No, the misinterpretation of the ToS is not the most interesting part. The part that clearly shows her colors is:

"It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions."

1MachineElf 3y ago

Ah, so she is an "open source lawyer" in an OSI Foundation sense...

LesZedCB 3y ago

out of curiosity, would anybody else cease to have an issue with copilot if it was an open source model?

i'm not paying for copilot right now because i'm waiting for this to shake out. but i'd be happy to pay (even their current asking price) if i knew the model was also open source and could be self hosted.

maybe this is the wrong way to ask the question, but hopefully it makes sense

david_allison 3y ago

It's not the license of the model, it's the license of the output.

As it stands, Copilot is a black-box which strips copyright from a piece of code.

I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.

I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.

I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.

I'm not fine that Copilot allows people to take my code, 'change the variable names' and remove the license. Especially because I have no visibility of the fact that this has occurred.

rwmj 3y ago

You could also imagine different Copilot models, eg Copilot-GPL, Copilot-MIT etc. Each would be trained only on GPL or MIT code from github. Then which model gets used depends on the license of the file being written at the time.

az226 3y ago

But Copilot doesn't take your code. At best it has learned from a fraction of a fraction of your code and synthesized it with tens or thousands of like examples, and the output may look similar to your code because it's trying to achieve the same thing. It's not like Copilot takes your entire repo and clones it and says "we washed the onerous license requirements away for ya".

throwaway290 3y ago

If it was a true OSS project, first it would not clearly benefit a single near-monopoly by using my code (as in, that wouldn't be its purpose), and second I'm sure its contributors would be well placed to understand the issue and from the start bake in a reliable, transparent mechanism for opting out.

As is, it's EEE applied to open source-- Microsoft's ultimate play against the ethic that brought us Linux among other things. When your brainchild gets gobbled up faster than you can blink, pushed to people who never learn about your existence, and a megacorp that you are ethically opposed to profits from the process, the need for self-actualization is no longer addressed. The fundamental incentive that pushes us to publish in the open, to have other humans acknowledge you and your work and feel pride in it, is being eliminated.

NoboruWataya 3y ago

I agree - it's problematic enough that licensing information gets lost in the Copilot process, but as is we basically have developers contributing their time and expertise, for free, to the development of Microsoft's new paid proprietary product. Worse still, if Copilot is as revolutionary as some people make it out to be, those same developers are inadvertently helping Microsoft build a monopoly in a new market, with all the disastrous consequences that entails.

runnerup 3y ago

If it was GPL it could use GPL code and legally there would be no debate.

comice 3y ago

One of the requirements of the GPL is that credit is given (and indeed this is needed for enforcement to work because the GPL leverages copyright).

Jweb_Guru 3y ago

The project could, yes. It wouldn't necessarily change the legality of using it in non-GPL projects, though. If people were only using it in license-compatible projects and it was license-compatible with GPL, I doubt anyone would have any complaints (even though in theory it could also be picking up stuff from other incompatible licenses).

synapse26 3y ago

Yes, I’d be one too. I have no legal opinions about this, but morally, Copilot just doesn’t hit me right. One of the purposes of open source is for it to be, well, open. It’s so annoying seeing this tool specifically use only open source code and then have the audacity to close source + paywall access to it.

I used to be a little more agreeable toward Copilot, what with the training money and all, but seeing that Stable Diffusion is willing to open up hundreds of thousands in training, and more in engineering, and thereby create an active community dedicated to improving it every day, I just can’t help but be so annoyed when one of the world’s biggest tech companies pulls such a petty move.

Has anyone produced a legally watertight license or clause for other licenses that prevents code being used for training of copilot-like services?

insanitybit 3y ago

The article addresses this in a number of ways.

For example,

> That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts.

I'm not sure I understand your point.

The only legal way you can use copyrighted code is due to the license attached to it by the copyright holder.

If a license specifically prohibits copying the code for a purpose, then it is a violation of the copyright to copy the code for that purpose. You have no other legal way to do it.

These aren't magic words, they are legal obligations. Ok, well maybe legal obligations are magic words. But it is magic that works :). Otherwise things like GPL could not function.

rwmj 3y ago

It would be a Field of Endeavor restriction so the resulting license wouldn't be open source, and I don't think (?) Copilot is trained on proprietary code.

(Section 6 here: https://opensource.org/osd)

I don't really care if a license meets some arbitrary definition.

Let's say I added a clause to my BSD license that prohibits the copying of this code to train ML models.

Would that not immediately make GitHub in violation of this license?

Or do they only train it where the license is explicitly one of the ones it knows about?

6stringmerc 3y ago

I have a companion piece talking about music and training AI/ML:

https://medium.com/@6StringMerc/artificial-intelligence-mach...

terminal_d 3y ago

If this isn't enough incentive to move away from github, then I don't know what is.

hnbad 3y ago

> It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions.

This one sentence threw off my entire opinion of the article as it demonstrates the author's clear bias in favor of Copilot, not just specifically in this case but in principle.

Legal opinion on Copilot and generative AI in general hinges entirely on metaphors. If the AI is understood to behave like a human being building knowledge and drawing from it for inspiration, Copilot is just another way to write code. But we've already established legal precedent that machines cannot hold copyright, which suggests that they cannot be deemed to be creative, which could be used to argue that they are therefore just creating an inventory of copyrighted works and creating mechanical mashups.

The author's dismissal also ignores that this would not JUST result in attribution. If Copilot indexed copyleft code and were required to provide attribution when using this code, the output might also be affected and this could in turn affect the entire code base. Worse yet, Copilot may output code with conflicting licenses. The author considers only the possibility that Copilot itself might have to inherit the license (and the dismissal that it would "help no one" because it runs on a server ignores both the existence of a (presumably self-hosted) enterprise service and the existence of licenses like AGPL, which would still apply) but it seems most people's concerns are with the output instead.

I also fail to understand how the argument that it doesn't reproduce the code exactly 99% of the time is helpful. If I copy code and rename the variables and run an autoformatter on it, it's still a copy of the code. It's odd to see a lawyer use what is essentially obfuscation as a defense against copyright claims. Also 1% is an incredibly large number given how Copilot is supposed to be used and how large the potential customer base is. Given the direction GitHub is heading with "Hello GitHub" (demoed at GitHub Universe yesterday) it's not unlikely that Copilot would in some cases be used to generate hundreds, thousands or tens of thousands of lines of code in a single project.
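
The rename-and-reformat point can be illustrated with a small sketch. It assumes a crude regex tokenizer for Python snippets and maps every non-keyword identifier to a positional placeholder; `normalize` is a hypothetical helper, and real clone detectors are considerably more careful (strings, comments, and reordered statements are not handled here):

```python
import keyword
import re


def normalize(code: str) -> list[str]:
    # Replace each distinct identifier with ID0, ID1, ... in order of first
    # appearance, and drop all whitespace, so that renaming variables or
    # running a formatter does not change the result.
    names: dict[str, str] = {}
    out: list[str] = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if re.match(r"[A-Za-z_]", tok) and not keyword.iskeyword(tok):
            tok = names.setdefault(tok, f"ID{len(names)}")
        out.append(tok)
    return out


original = """
def total(items):
    result = 0
    for item in items:
        result += item
    return result
"""

renamed = "def s(xs):\n  r = 0\n  for x in xs: r += x\n  return r"

print(normalize(original) == normalize(renamed))
```

Under this normalization the two snippets compare as identical even though no line of text matches, which is why "not reproduced exactly" says little about whether something is a copy.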

The question isn't just whether Copilot is violating the law or not, the question is why it is or isn't because that could have wide implications outside GitHub itself. But as the author points out, sadly the lawsuit doesn't try to settle this for copyright, which might be the most impactful question.

iLoveOncall 3y ago

This lawsuit is open-source developers destroying open-source.

Havoc 3y ago

What’s the point of licenses if the ToS overrides them?

junon 3y ago

Github's TOS doesn't infringe on any licenses.

https://docs.github.com/en/site-policy/github-terms/github-t...

I'm actually surprised they allowed Copilot to happen, given this section:

> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.

One could make the argument they had no intrinsic right to use the software for Copilot except under the terms laid out under the respective softwares' licenses. This means any GPL code they copied by error is now in violation of the GPL by default. But IANAL.

World177 3y ago

In my memory, when GitHub released it, they were explicit that using data like this “is common practice in machine learning.” Though, I tried to find the quote and couldn’t, so maybe my memory is wrong and I am remembering a blog post from another organization.

edit: The exact quote was “Training machine learning models on publicly available data is considered fair use across the machine learning community” if you want to search for it.

edit 2: https://web.archive.org/web/20210629142841/http://copilot.gi...

> Frequently Asked Questions -> Training Set -> Why was GitHub Copilot trained on data from publicly available sources?

> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.

puffoflogic3y ago

Nothing is being overridden.

You have apparently misunderstood copyright licenses as being something that attaches to a copyrighted work and now must be respected by all users of that work. But that is totally incorrect.

Licenses are individual agreements between copyright holders (or licensees who have been granted the right to re-license) and people who want to exercise one of the rights normally withheld under copyright. A LICENSE file is nothing but an offer to grant a license with specified terms to anyone who might want to use the work, without having to nag the licensor to sign an agreement. The existence of that offer doesn't have anything to do with any other agreement the licensor and a (potential) licensee might make.

In the GitHub case, GitHub has negotiated a different license with the uploader. (That negotiation happened to take the form of a ToS, which is another kind of binding offer.) The LICENSE file has nothing to do with it. It hasn't been overridden, it's just irrelevant. It doesn't add or subtract any terms from the separate and distinct license GitHub negotiated.

amarant3y ago

The ToS only applies to GitHub (which includes Microsoft, apparently).

Other parties will still have to abide by your license.

baby3y ago

This is why we can’t have nice things. Copilot is the future

nomilk3y ago

If organic neural networks are allowed to read and learn from open source code, why should an artificial one be any different?

geysersam3y ago

1. Humans are not neural networks. 2. Humans are not allowed to directly copy even rather short snippets of licenced code. 3. Humans do not have the capacity to memorize the entirety of GitHub.

fhd23y ago

I can't shake the feeling that a lot of the logic around ML models having more or less the same "rights" as humans comes from misleading marketing that they, in any shape or form, resemble human intelligence. AI is a buzzword applied to any kind of algorithm for an activity that people previously thought couldn't be automated.

Back when I was young, graph pathfinding algorithms were called AI. A few decades later they are a well understood commodity and I haven't seen anyone call them AI for a while. Maybe that'll happen to LLMs too, given a few years?

https://news.ycombinator.com/newsguidelines.html

throwaway2903y ago

For one, an organic network (for the sake of the argument I'll play along if you want to reduce a human to this) has rights, freedoms and ethical values and is not controlled by a single entity and has not specifically been instantiated to generate profit for such.

insanitybit3y ago

HN is so insanely frustrating, so many comments demonstrate that the user didn't read this article at all. Just immediately jumping into a "but what about this argument that I made?".

robocat3y ago

  Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."

insanitybit3y ago

Yeah, I'm aware, this is just so extreme at this point it feels worth pointing out.

175 comments

belorn3y ago

I wonder if a court would think that Microsoft in this case has done its due diligence to verify that the license grants it got from users are correct and in order.

hyperman13y ago

I also wondered about this when I read the TOS.

As the Service now includes Copilot, publishing anything on GitHub seems to give them the right to use it in Copilot. Maybe even for private repos.

Besides the issue we're currently discussing, I also wonder about this:

This might also mean you violate the GPL when publishing someone else's GPLed code on GitHub, as you have now granted Microsoft and others rights not included in the GPL.

Clearly, IANAL, don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.

https://docs.github.com/en/site-policy/github-terms/github-t...

belorn3y ago

> Clearly, IANAL, don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.

It is after all the distributor that has to do the due diligence to confirm that they are in the right to distribute.

dathinab3y ago

It also falls under the aspect of "hidden surprises" which could mean that this part of the TOS wrt. this specific aspect might not be legally binding/valid. At least in the EU. Or it might.

TazeTSchnitzel3y ago

That doesn't sound right. Licences can allow sublicensing, and I think all the popular open-source ones do.

belorn3y ago

lindenksv13y ago

ghoward3y ago

I am an Open Source developer. My code is not on GitHub and never will be.

If my code was uploaded on GitHub, I would DMCA it because of Copilot, but it wouldn't matter because the information is already in the model. So the DMCA does not help here.

IANAL.

It would take a mere twenty weeks (less than six months) to reach a million violations.

That seems impactful.

dathinab3y ago

If him "composing a scene" means copy pasting clips of the movies he studied and smooth things over, then yes that would be obvious infringement.

And that is what Copilot's AI mostly does.

One of the main problems GitHub has IMHO is that anyone who has studied such generative methods knows that:

1) they are prone to copy-pasting

2) you don't know what they remembered (i.e. stored copies of in an obscure, human-unreadable encoding, i.e. just distributing such a network can be a copyright infringement)

3) you don't know when they copy-paste

jackdaniel3y ago

I see this argument over and over again, and it is so flawed that it is hard to bear.

There is no equal sign between a person and a program.

There is also that thing called "scale" that is critical to the interpretation of the action.

Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...

> Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...

albertzeyer3y ago

The question is not whether a person is equal to a program.

The question is whether a person is doing the same as Copilot for this particular case, i.e. reading source code to learn.

Note that the reason why Copilot needs more code to learn is just because the learning currently is not as efficient as for humans.

uklgrant3y ago

Humans are not neural networks, that's just a thesis.

Even novelists do not sit all day long in a closed room reading other people's work and then do a collage of what they've read. Otherwise no books would have been written in the first place.

Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.

Once (if ever?) an AI is capable of producing an actual original work, I'm fine with other AIs stealing from the first one. Please leave humans alone.

carpenecopinum3y ago

> Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.

That "experiment" could just as well be done on humans, though: cut them off from any work that any human has done before and you may get simple cave paintings, if you're lucky.

nomilk3y ago

> Humans are not neural networks

That's correct, but it misses the point.

This is about reconciling 1. being allowed but 2. not being allowed:

1. The human uses a machine, where the machine is an organic one it grew itself.

2. The human uses a machine, where the machine is one it made or acquired.

To a lot of us, there's no difference.

polaris643y ago

A difference is that I can't just spin up a copy of George Lucas on my GPU in seconds and request it to produce something from a prompt like "a disappointing prequel".

> A difference is that I can't just spin up a copy of George Lucas

... yet.

orangesite3y ago

Your magic box is not a film maker and the inputs you are encoding with it are verbatim file content. Said content belonging to someone else.

Please study the series of events that unfolded in the music industry after folk began incorporating recordings made by other artists in their own work and proceeded to sell the result.

nomilk3y ago

> Said content belonging to someone else.

Is Copilot just trained on OSS, or on private repos too?

badcppdev3y ago

Am I violating copyright? Yes

Imagine they change the character names in those paragraphs. Am I still violating copyright? Yes

At some point you can change enough of the text to not violate copyright. The grey area involves the courts.

It feels very simple to me so I might be missing something.

q-big3y ago

> At some point you can change enough of the text to not violate copyright. The grey area involves the courts.

> It feels very simple to me so I might be missing something.

In my opinion, you are missing something subtle:

jillesvangurp3y ago

The simple layman's version of copyright is that copyright applies to a specific form of a thing and not about the ideas behind that thing.

BeefWellington3y ago

All those examples you give are transformative in some way or other.

sensanaty3y ago

Philosophical bullshitting aside (and it really is philosophical bullshitting), I just genuinely don't care if a human or a machine "think" or "learn" in the same way.

jules3y ago

I bet he would if his movie scene is pixel for pixel identical to the scene he watched.

6stringmerc3y ago

No.

Next hypothetical.

tsukikage3y ago

These can be quite inventive works; nevertheless, no-one seriously argues that the video content does not breach the original animators' copyright.

The video content of an AMV is a much better analogy for what Copilot does to third parties' code than anything else I've seen in this post's discussion so far.

steve_gh3y ago

Hmmm. I'm interested in the GitHub ToS, which (if I understand correctly) basically says that GitHub and its affiliates (MS) can use anything you post on GitHub to improve their service.

So which takes precedence: the licence or the ToS?

rlpb3y ago

This would only apply when posting someone else's code. But of course you could always arrange that.

robinsonb53y ago

lindenksv13y ago

david_allison3y ago

As a follow-on, what if you're mirroring code which is under an AGPL license? Are you allowed to post it on GitHub if you can't grant those rights under the ToS due to the license of the code?

VBprogrammer3y ago

IshKebab3y ago

Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.

I imagine there would be no problem training it on e.g. the Unreal Engine code which is not open source but is available to read.

jansommer3y ago

insanitybit3y ago

Why would they do that, regardless of whether the output could be restricted via Copyright or not? Also, this case isn't about copyright, as the lawyer clearly explains.

amarant3y ago

Other parties are not granted license under the ToS, and so will have to abide by the AGPL.

World1773y ago

License takes precedence when you don’t own the copyright. ToS takes precedence when legally allowed and you do own the copyright.

NicoJuicy3y ago

Their service is hosting code, not writing code.

That's why it's GitHub, not CodeScribe (or something).

Xylakant3y ago

visarga3y ago

hooby3y ago

Over many years it has now mostly become a tool for large companies to accumulate rights (on works they didn't create themselves) and monetize them.

Maybe a reform is needed, to find a way back to the original purpose.

dragonwriter3y ago

> Copyright was originally intended to protect the creators of a work.

izacus3y ago

classified3y ago

Copyright protection for the rich and powerful, while those who cannot afford armies of lawyers get their stuff stolen by machine learning models. Sounds credible to me.

LesZedCB3y ago

there's a great youtube doc about everything being remixing which i highly recommend

https://www.everythingisaremix.info/

esalman3y ago

Have you tried that on any kind of music?

throwaway2903y ago

guhayun3y ago

>networks get trained on own output

And sometimes the network improves, depending on the quality or direction of the output. The client can be a valuable critic even without being an expert in the field.

mjw10073y ago

I think this is the most interesting part:

tryre3y ago

No, the misinterpretation of the ToS is not the most interesting part. The part that clearly shows her colors is:

1MachineElf3y ago

Ah, so she is an "open source lawyer" in an OSI Foundation sense...

LesZedCB3y ago

out of curiosity, would anybody else cease to have an issue with copilot if it was an open source model?

maybe this is the wrong way to ask the question, but hopefully it makes sense

david_allison3y ago

It's not the license of the model, it's the license of the output.

As it stands, Copilot is a black-box which strips copyright from a piece of code.

I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.

I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.

I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.

I'm not fine that Copilot allows people to take my code, 'change the variable names' and remove the license. Especially because I have no visibility of the fact that this has occurred.

rwmj3y ago

az2263y ago

throwaway2903y ago

NoboruWataya3y ago

runnerup3y ago

If it was GPL it could use GPL code and legally there would be no debate.

comice3y ago

One of the requirements of the GPL is that credit is given (and indeed this is needed for enforcement to work because the GPL leverages copyright).

Jweb_Guru3y ago

synapse263y ago

Has anyone produced a legally watertight license or clause for other licenses that prevents code being used for training of copilot-like services?

insanitybit3y ago

The article addresses this in a number of ways.

For example,

I'm not sure I understand your point.

The only legal way you can use copyrighted code is due to the license attached to it by the copyright holder.

If a license specifically prohibits copying the code for a purpose, then it is a violation of the copyright to copy the code for that purpose. You have no other legal way to do it.

These aren't magic words, they are legal obligations. Ok, well maybe legal obligations are magic words. But it is magic that works :). Otherwise things like GPL could not function.

rwmj3y ago

It would be a Field of Endeavor restriction so the resulting license wouldn't be open source, and I don't think (?) Copilot is trained on proprietary code.

(Section 6 here: https://opensource.org/osd)

I don't really care if a license meets some arbitrary definition.

Let's say I added a clause to my BSD license that prohibits the copying of this code to train ML models.

Would that not immediately make GitHub in violation of this license?

Or do they only train it where the license is explicitly one of the ones it knows about?

6stringmerc3y ago

I have a companion piece talking about music and training AI/ML:

https://medium.com/@6StringMerc/artificial-intelligence-mach...

terminal_d3y ago

If this isn't enough incentive to move away from GitHub, then I don't know what is.

hnbad3y ago

This one sentence threw off my entire opinion of the article as it demonstrates the author's clear bias in favor of Copilot, not just specifically in this case but in principle.

fhd23y ago