We've filed a lawsuit against GitHub Copilot

781 comments

Seems important to point out that the announcement on this page (https://githubcopilotlitigation.com/) is a followup to https://githubcopilotinvestigation.com/ previously discussed here: https://news.ycombinator.com/item?id=33240341 (with 1219 comments)

Cort3z3y ago

I’m not a lawyer, but here is why I believe a class action lawsuit is correct;

“AI” is just fancy speak for “complex math program”. If I make a program that’s simply given an arbitrary input and then, through math operations, outputs Microsoft copyrighted code, am I in the clear just because it’s “AI”? I think they would sue the heck out of me if I did that, and I believe the opposite should be true as well.

I’m sure my own open source code is in that thing. I did not see any attributions, thus they break the fundamentals of open source.

In the spirit of Rick Sanchez; It’s just compression with extra steps.

williamcotton3y ago

I read most of the complaint. The only examples of supposed copyright infringement are isEven and isPrime functions. Here's what Copilot gives me in a Typescript file:

  function isPrime(n: number): boolean {
    for (let i = 2; i < n; i++) {
      if (n % i === 0) {
        return false;
      }
    }
    return n > 1;
  }
  
  function isEven(n: number): boolean {
    return n % 2 === 0;
  }

These are clearly not covered by copyright in the first place. This case is really quite pathetic.

D13Fd3y ago

Correct legally, morally, or both?

Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others. The DMCA claim in particular strikes me as strong-ish at first glance, though.

Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.

njharman3y ago

Say you read a bunch of code, say over years of a developer career. What you write is influenced by all that. It will include similar patterns, similar code and identical snippets, knowingly or not. How large does a snippet have to be before it's copyrightable? "x"? "x==1"? "if x==1\n print('x is one')"? [obviously, replace with actual common code like if not found return 404].

Do you want to be vulnerable to copyright litigation for code you write? Can you afford to respond to every lawsuit filed by a disgruntled wingbat, or by a large corp wanting to shut down an open source / competing project?

cdrini3y ago

I haven't heard anyone saying that copilot is legal "just because it's AI." That's a pretty bad faith, reductive, and disingenuous representation. The core argument I've seen is that the output is sufficiently transformative and not straight up copying.

benlivengood3y ago

Humans are just compression with extra steps by that logic.

There's a fairly simple technical fix for codex/copilot anyway; stick a search engine on the back end and index the training data and don't output things found in the search engine.
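
A minimal sketch of what such a filter could look like, assuming the "search engine" is just an in-memory index of token runs from the training data. Every name here (`TrainingIndex`, `filterCompletions`, the run length of 8) is hypothetical and chosen for illustration; nothing below describes how Codex or Copilot is actually built.

```typescript
// Hypothetical sketch: these names do not come from Codex or Copilot.

/** Split text into whitespace-separated tokens so layout differences are ignored. */
function tokenize(source: string): string[] {
  return source.split(/\s+/).filter((t) => t.length > 0);
}

/** Return every run of `n` consecutive tokens, joined into one string key. */
function tokenRuns(source: string, n: number): string[] {
  const tokens = tokenize(source);
  const runs: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    runs.push(tokens.slice(i, i + n).join(" "));
  }
  return runs;
}

/** The "search engine": an index of token runs seen in the training data. */
class TrainingIndex {
  private readonly seen = new Set<string>();
  private readonly runLength: number;

  constructor(runLength: number) {
    this.runLength = runLength;
  }

  /** Index one training document. */
  add(document: string): void {
    tokenRuns(document, this.runLength).forEach((run) => this.seen.add(run));
  }

  /**
   * True if any run of `runLength` tokens in `candidate` appears verbatim in
   * the index. Candidates shorter than `runLength` tokens are never flagged.
   */
  containsVerbatim(candidate: string): boolean {
    return tokenRuns(candidate, this.runLength).some((run) => this.seen.has(run));
  }
}

/** Keep only the completions that do not overlap the indexed training data. */
function filterCompletions(index: TrainingIndex, completions: string[]): string[] {
  return completions.filter((c) => !index.containsVerbatim(c));
}

const index = new TrainingIndex(8);
index.add("function isEven(n: number): boolean { return n % 2 === 0; }");

const kept = filterCompletions(index, [
  "function isEven(n: number): boolean { return n % 2 === 0; }",
  "const parity = (value: number) => (value & 1) === 0;",
]);
console.log(kept); // only the second completion survives
```

An exact-match filter like this would only catch verbatim regurgitation (modulo whitespace); renamed variables or reordered statements would pass straight through, so it addresses the narrowest version of the complaint.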

ugh1233y ago

Attributions are fundamental to open source? I thought having source openly available was fundamental to open source (and allowed use without liability/warranty) as per apache, mit, and other licenses.

If they just stick to using permissive-licensed source code then i'm not sure what the actual 'harm' is with co-pilot.

If they auto-generate an acknowledgement file for all source repos used in co-pilot, and then asked clients of co-pilot to ship that file with their product, would that be enough? Call it "The Extended Github Co-Pilot Derivative Use License" or something.

neongreen3y ago

Apparently they are using GPL-licensed code as well, see https://twitter.com/DocSparse/status/1581461734665367554

After five minutes of googling I'm still not sure if using MIT code requires an attribution, but many people claim it does, see https://opensource.stackexchange.com/a/8163 as one example

TAForObvReasons3y ago

Attributions are fundamental to permissive licenses as well. It's worth reading the licenses in question. MIT:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

This is the "attribution" requirement that even a Copilot trained on only-MIT code would miss.

If it were just about sharing code, there are public domain declarations and variants like CC0 licenses

Cort3z3y ago

People would likely not share any code if they could not trust that their work would be respected, and attributed. So yes, I believe it to be fundamental to open source.

heavyset_go3y ago

Attribution and inclusion of copies of licenses are stipulations in almost all of the popular open source licenses, including BSD and MIT licenses.

smoldesu3y ago

> “AI” is just fancy speak for “complex math program”

Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand. Comparing it to traditional computation is a trap, same as treating it like a human mind. They're very different, under the surface.

IMO, if this is a data problem then we should treat it like one. Simple fix - find a legal basis for which licenses are permissive enough to allow for ML training, and train your models on that. The problem here isn't developers crying out in fear of being replaced by robots, it's more that the code that it is reproducing is not licensed for reproduction (and the AI doesn't know that). People who can prove that proprietary code made it into Copilot deserve a settlement. Schlubs like me who upload my dotfiles under BSD don't fall under the same umbrella, at least the way I see it.

galaxyLogic3y ago

Who should be sued? Microsoft who produces an application known as "Copilot" which itself contains nobody else's code but Microsoft's? OR the person who USES Copilot, to produce code which contains somebody else's copyrighted code?

Using Copilot is a bit like using a shotgun, can be very illegal depending on what you shoot at. Creating and distributing the app Copilot is like creating and selling a shotgun.

drvortex3y ago

Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.

And that is why this lawsuit is dead on arrival.

klabb33y ago

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.

The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.

The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.

xtracto3y ago

Say you publish a song and copyright it. Then I record it and save it in a .xz format. It's not an MP3, it is not an audio file. Say I split it into N chunks and I share it with N different people. Or with the same people, but I share it at N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.

Am I violating your copyright? Are you entitled to do that?

To make it funnier: Say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I illegally violating your copyrights by sharing that?

[1] https://github.com/philipl/pifs

andrewmcwatters3y ago

This is demonstrably false. It is a system outputting character-for-character repository code.[1]

[1]: https://news.ycombinator.com/item?id=33457517

Cort3z3y ago

Just to be clear: I cannot prove that they have used my code, but for the sake of argument, let's assume so.

They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.

heavyset_go3y ago

Neural nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.

You can easily see this happen, the regurgitation of training data, in an overfitted neural net.

vkou3y ago

> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.

So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?

Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.

The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.

I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.

civilized3y ago

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.

If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.

moralestapia3y ago

Whatever you say man :^)

https://twitter.com/docsparse/status/1581461734665367554

NicoleJO3y ago

You're wrong. See exposed code. https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...

lamontcg3y ago

> but snippets of it cannot

Yeah they can, and the whole functions that Copilot spits out are quite obviously covered by copyright.

> especially when they are used in a different context.

That doesn't matter.

ouid3y ago

It is essentially a weighted sum of your code and other copyright holders' code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.

tevon3y ago

I agree.

If I read JRR Tolkien and then go and write a fantasy novel following an unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.

Ptchd3y ago

I agree with you, but what if you share 1% of a copyrighted binary file that is completely useless when shared partially, is that infringement?

williamcotton3y ago

Here's what actual lawyers will focus on:

Abstraction-Filtration-Comparison

The AFC test is a three-step process for determining substantial similarity of the non-literal elements of a computer program. The process requires the court to first identify the increasing levels of abstraction of the program. Then, at each level of abstraction, material that is not protectable by copyright is identified and filtered out from further examination. The final step is to compare the defendant's program to the plaintiff's, looking only at the copyright-protected material as identified in the previous two steps, and determine whether the plaintiff's work was copied. In addition, the court will assess the relative significance of any copied material with respect to the entire program.

Abstraction

The purpose of the abstraction step is to identify which aspects of the program constitute its expression and which are the ideas. By what is commonly referred to as the idea/expression dichotomy, copyright law protects an author's expression, but not the idea behind that expression. In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program. The abstractions test was first developed by the Second Circuit for use in literary works, but in the AFC test, they outline how it might be applied to computer programs. The court identifies possible levels of abstraction that can be defined. In increasing order of abstraction; these are: individual instructions, groups of instructions organized into a "hierarchy of modules", the functions of the lowest-level modules, the functions of the higher-level modules, the "ultimate function" of the code.

Filtration

The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.

The court explains that elements dictated by efficiency are removed from consideration based on the merger doctrine which states that a form of expression that is incidental to the idea cannot be protected by copyright. In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright.

Eliminating elements dictated by external factors is an application of the scènes à faire doctrine to computer programs. The doctrine holds that elements necessary for, or standard to, expression in some particular theme cannot be protected by copyright. Elements dictated by external factors may include hardware specifications, interoperability and compatibility requirements, design standards, demands of the market being served, and standard programming techniques.

Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis.

Comparison

The final step of the AFC test is to consider the elements of the program identified in the first step and remaining after the second step, and for each of these compare the defendant's work with the plaintiff's to determine if the one is a copy of the other. In addition, the court will look at the importance of the copied portion with respect to the entire program.

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

sigzero3y ago

> I’m not a lawyer, but

Should have stopped there.

rowanG0773y ago

The brain is also just a "complex math program". Since math is just the language we use to describe the world. I don't feel this argument has any weight at all.

kadoban3y ago

The legal world tends to be less interested in these kind of logical gotchas than engineering types would like. I don't see a judge caring about that brain framing at all.

Not to mention, if your brain starts outputting Microsoft copyright code, they're going to sue the shit out of you and win, so I'm not sure how that would help even so.

yoyohello133y ago

So if I read the Windows Explorer source code, then later produced a line-for-line copy (without referring back to the source), Microsoft couldn't sue me?

Supermancho3y ago

> The brain is also just a "complex math program".

This is not a fact.

fsflover3y ago

It might be. If your brain generated verbatim someone's code without following its license, you would also break copyright, wouldn't you?

bombolo3y ago

> The brain is also just a "complex math program"

Source?

lisper3y ago

Somewhere in the complex math is the origin of whatever it is in intellectual property that we deem worthy of protection. Because we are humans, we take the complex math done by human brains as worthy of protection by fiat. When a painter paints a tree, we assign the property interest in the painting to the human painter, not the tree, notwithstanding that the tree made an essential contribution to the content. The whole point is to protect the interests of humans (to give them an incentive to work). There is no other reason to even entertain the concept of "property".

willhslade3y ago

Also, to pile on, if I use my brain to read GPL source code and then type it in again on my own, I'm pretty sure I'm guilty of violating copyright.

willhslade3y ago

So how and why do we solve the halting problem?

kyruzic3y ago

No it's actually not.

spiralpolitik3y ago

At this point we are back in the territory that the idea and the expression of the idea are inseparable, therefore the conclusion will be that copyright protection does not apply to code.

Personally I think this has the potential to blow up in everyone's faces.

bloppe3y ago

The problem with the class action lawsuit against GitHub is this: if you host your code on GitHub, it doesn't matter what license you use. Microsoft can do whatever they want with it. You agreed to this by agreeing to their terms and conditions.

blackbrokkoli3y ago

I am sorry for not bringing any kind of legal perspective here, but:

*Jesus Christ*, I hope I live long enough to see copyright die. Here we are at the cusp of a new paradigm of commanding computers to do stuff for us, right at the beginning of the first AI development which actually impresses me.

And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.

I am also deeply disappointed in HackerNews; where is that deep hatred of patent trolls and smug satisfaction whenever something gets cracked or pirated now?

commoner3y ago

On piracy, HN users defend Sci-Hub to protest against the academic publishing industry, which involves large corporations such as Elsevier charging publishing and subscription fees that are much more than the value that these corporations bring to the actual research, review, and publication. Academics need to publish in order to survive, and they individually do not have enough power to subvert the existing academic publishing system. Since academics do not receive royalties, Sci-Hub enables academics to pay less into the same system that exploits them for profit. By supporting Sci-Hub, HN users take a populist stance by supporting individuals against the system.

The situation with Microsoft and Copilot is the exact opposite. Here, Microsoft is misusing its acquisition of GitHub to repackage the work of individual free and open source contributors into a proprietary product in violation of the authors' software licenses. These licenses do not even require Microsoft to pay. They only require attribution and redistribution under a compatible license. Supporting Microsoft's misuse of GitHub is an anti-populist stance that puts the interests of the corporation over the interests of the individuals.

Guid_NewGuid3y ago

Yes, the passion to defend copyright trolling for isEven functions (an example included in the filing) from people here is bizarre.

I can't decide if people just hate Microsoft enough that a future where you must pay to include an iseven function in your code is a price worth paying to give them a bloody nose, or if there is just a large contingent of users making millions off their GPL code who are put out.

More seriously yes, copilot damages copyright (or is perceived to) and that is a good outcome irrespective of the actor. I will never see eye to eye with people defending the existing legal framework.

joe_the_user3y ago

1) This "copilot is great 'cause copyright is evil" argument breaks when you look at the fact that copilot is a copyrighted, closed software tool for producing closed, copyrighted software. If you trained copilot on GPL'd software and specified that copilot's output was also GPL'd maybe you'd have some reasonable claim (but even then, the attribution claim would come in).

2) So far, these tools are "better search" schemes, not actual intelligence. Sure, many find them very useful. But given this, the (voluntary or involuntary) providers of data ought to get credit/benefit for/from this phenomenon, along with the tool creators. Especially given that the current situation is Microsoft/OpenAI selling to commercial software developers who sell to the general public.

D13Fd3y ago

Wow, thank you. This is exactly right.

The entire response to this suit on this site is mind-blowing to me. Everyone is up in arms that someone trained an AI model that could potentially spit out tiny, twisted fragments of public, open-source code. This response is nothing but selfish behavior that runs counter to the core principles of open source development and the free software movement.

chlorion3y ago

Nobody wants to completely ban AI or training. You can have progress in this area and still respect people's intellectual property. Framing this as destroying progress is silly.

Why can't they train on the code they own, such as Windows sources for example?

Or even better, why can't they release CoPilot itself under an open source license that is compatible with the licenses of code they would like to train on?

Also, I don't think anyone cares about the monetary aspects. The idea behind the GPL style license is to make sure that code remains free, regardless of what or who uses it. Freedom in this context refers to the ability to study the code, modify the code, and distribute any modifications. Without the GPL the code can be used in a proprietary product which strips those rights away from users of the product.

Copyleft uses copyright laws to attempt to guarantee freedom for users. This is the inverse of what normal copyright does, which is allow a single entity to sit on the ideas and not allow others to benefit from them.

If we can just strip copyleft licenses from projects, we are giving up those guarantees that GPL code will remain free for all users.

The GPL is trying to do its job here, not slow down progress. Progress would be everyone benefiting from the technology behind CoPilot, rather than just Microsoft sitting on the project and selling it as a service.

RunSet3y ago

> And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.

I just hope Microsoft AutoPlagiarist is not the Final Solution to Free Software they have been seeking since before the millennium's turn.

Seems to me this discussion is likely to pivot on a fulcrum located between "old enough to remember Microsoft before Bill Gates began spending his ill-gotten gains on philanthropy" and "young enough to see Microsoft primarily as the Xbox people".

MarcellusDrum3y ago

I'm one of those who are against copyrights, but fully support this movement, for 2 reasons:

1) If GitHub Copilot were free software licensed under the GPL, I'd be all for it. Microsoft is using others' collective hard labour to benefit itself.

2) It's Microsoft. The king of dark patterns, monopolization, and the enemy of software freedom. They can't have their cake and eat it too.

franciscop3y ago

I am only slightly hopeful for a lawsuit so that it loses spectacularly and sets a precedent that this is legal; however I don't trust the legal system enough to think they'll logically reach that conclusion.

My code is 100% in GitHub Copilot. Is there any way to publicly say that I'm against the lawsuit even if they pretend to represent me?

trention3y ago

>where is that deep hatred of patent trolls

Is it patent trolling when you are defending your future labor from being made obsolete by megacorps and singularitarians using your past labor without permission?

No, it's justice.

matjet3y ago

Patent trolls that are hated look like: You develop a package to do X from first principles, then get sued because someone patented using a known algorithm for the purpose of X.

Copyright working in a supported/non hated way: You develop a package to do X by cribbing off someone else's package X. They sue you for stealing their work, not to make money off you. Situation at hand is case 2, hence the lack of interest in financial gain.

Why is this case 2, when it does not always reproduce the copyrighted works exactly? Situation: You realise that rather than cribbing off of one person's package X, you can crib off two other package X's and mix/average their contents. Scale this to hundreds of packages.

Eventually, ML should avoid this by developing to work from first principles, writing in its own style, with public code used only for validation of its ability to understand and write code.

chillfox3y ago

I don't think this is really about the money at all. To me, it looks like this is about consent.

truetraveller3y ago

Nope. It's about fairness. Until Microsoft/Google/Apple/BigCorp all release their software/designs/maps, count me in favor of copyright for the small guy. And I'm someone who especially hates copyright/patents.

throwaway6753093y ago

Agreed. My only consolation is that as the technology improves and it becomes easier to train these types of models on modest hardware at home, the detractors of this technology have already lost, but rather than a mercy killing they prefer to bleed out slowly.

pelagicAustral3y ago

I stand behind you on this one. I just hope they fail spectacularly at their attempt to hinder innovation.

erezsh3y ago

The only thing worse than not having AI, is having all AI owned by a small group of people.

obiefernandez3y ago

Amen. This lawsuit is sickening me at a “fuck this entire industry” magnitude.

fsflover3y ago

> I hope I live long enough to see copyright die.

I actually agree. However this is not what's happening here.

https://news.ycombinator.com/item?id=33463441

https://news.ycombinator.com/item?id=33463800

az2263y ago

Not just copyright, what Copilot does is fair use. If Google can take tens of thousands of lines directly and get away with it, it's going to be impossible to see any logic under which AI being trained on millions of lines of code is not fair use with respect to any individual open source copyright holder.

an1sotropy3y ago

You haven’t actually thought through what kind of world it would be if there was no copyright law, have you? I don’t know what your political leanings are, but I’ve met some libertarians who are blissfully naive about the extent to which their world and worldview is buttressed by laws and the governments that enforce them, and your comment reminds me of that.

pjfin1233y ago

This is ridiculous; we created AIs that can program themselves and people are worried about incidental copyright infringement.

CobrastanJorji3y ago

As a non-lawyer, I am very suspicious of the claim that "Plaintiffs and the Class have suffered monetary damages as a result of Defendants’ conduct." Flagrant disregard for copyright? Sure, maybe. The output of the model is subject to copyright? Who knows! But the copyright holders being damaged in some way? Seems doubtful. The best argument I could think of would be "GitHub would have had to pay us for this, and they didn't pay us, so we lost money," but that'd presumably work out to pennies per person.

belorn3y ago

The common practice in copyright cases is to calculate damages based on the theoretical cost that the infringer would have paid if they had bought the rights in the first place. This method was used during the Pirate Bay case to calculate damages caused by the site's founders.

They did not actually calculate damages in terms of lost movie tickets or estimated vs. actual sales numbers of sold game copies. When it came to pre-releases where such product wouldn't have been sold legally in the first place, they simply added a multiplier to indicate that the copyright owner wouldn't have been willing to sell.

For software code, another practice I have read about is to use the man-hours that rewriting copyrighted code would cost. Using such calculations they would likely estimate the man hours based on number of lines of code and multiply that with the average salary of a programmer.
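
A rough sketch of that rewrite-cost arithmetic, just to show the shape of the calculation. All of the figures (1,200 lines, 10 lines per hour, $50 per hour) are made-up assumptions for illustration, and the lines-per-hour productivity input is my own addition; no court or filing supplied any of them.

```typescript
// Hypothetical figures throughout: none of these numbers come from a case or filing.
interface RewriteEstimate {
  linesOfCode: number;   // size of the copied code
  linesPerHour: number;  // assumed productivity when rewriting from scratch
  hourlyRateUsd: number; // assumed average programmer rate
}

/** Estimated cost of rewriting the code: man-hours times hourly rate. */
function rewriteCostUsd(estimate: RewriteEstimate): number {
  const manHours = estimate.linesOfCode / estimate.linesPerHour;
  return manHours * estimate.hourlyRateUsd;
}

const example = rewriteCostUsd({
  linesOfCode: 1200,
  linesPerHour: 10,
  hourlyRateUsd: 50,
});
console.log(example.toFixed(2)); // 6000.00 (120 man-hours at $50/hour)
```

The result is only as good as the two assumed rates, which is presumably where most of the arguing would happen.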

karaterobot3y ago

The one thing we can say with complete certainty is that most programmers who had their code used without permission will not receive very much money at all if this class action lawsuit is decided in their favor.

whiddershins3y ago

I believe there are statutory damages or penalties in many cases. At least with music and images.

michaelmrose3y ago

So for iseven, can we take what a student might accept, say 20 an hour, multiply that by the one minute required to create it, and offer them 33 cents?

pmoriarty3y ago

"Using such calculations they would likely estimate the man hours based on number of lines of code and multiply that with the average salary of a programmer."

The average salary of a programmer in which country?

So much programming is outsourced these days, and in some places programmers are very cheap.

kube-system3y ago

Those damages are enumerated on pages 50-52. Remember, "damages" is being used in a legal sense here -- for a non-lawyer, you can interpret it more like "a dollar value on something someone did that was wrong". This is more broad than the colloquial use of the word.

Sometimes damages are statutory, i.e. they have a fixed dollar amount written right into the law. This lawsuit references one such law: https://www.law.cornell.edu/uscode/text/17/1203

citilife3y ago

Say I produce a licensed library. Someone can pay me $5/year per license. I keep the code private and compile the code before sending it to customers.

If you have co-pilot trained on my code base (which was private), that then reproduces near replica's of my code then they sell it for $5/year...

Well, I'm eligible for damages.

cheriot3y ago

> that then reproduces near replica's of my code

Copying a few lines is not the same as copying the whole thing. Sharing quotes from a book is not copyright infringement.

yawnxyz3y ago

I don't think this is possible for co-pilot to do?

(If it was, please tell me how, since that would save me $5/year across multiple libraries..!)

sigzero3y ago

I don't believe it does anything with private repos and that isn't what is being alleged.

1 more reply

joxel3y ago

But that isn’t what is being alleged

toomuchtodo3y ago

The parallels to music sampling are somewhat humorous. Where is fair use vs misappropriation? To be discovered!

schappim3y ago

Soon we'll have to use Mechanical Turk[0] to identify existing opensource code similar to what Girl Talk did with "Feed the Animals"[1].

Unrelated, how is it that Mechanical Turk was never truly integrated w/ AWS?

[0] https://www.mturk.com/

[1] https://waxy.org/2008/09/girl_turk/

TheCoelacanth3y ago

Aren't there statutory damages for copyright infringement, i.e. there is a presumption that each work infringed is worth at least a certain amount without proving actual damages?

BenjiWiebe3y ago

Well, the code I write is under GPL, at least it is when I remember to put an explicit license to it or if anyone asks me for permission to use it.

If someone wants to use it commercially without complying with the GPL, I have no problem with allowing that, for a price.

Either use the code freely and openly, or pay me so you can make money on my code.

Copilot could conceivably allow someone to use my code commercially (and in a closed manner) without negotiating with me, the copyright holder.

r3trohack3r3y ago

I'm not confident in this stance - sharing it to have a conversation. Hopefully some folks can help me think through this!

The value of copyleft licenses, for me, was that we were fighting back against the notion of copyright. That you couldn't sell me a product that I wasn't allowed to modify and share my modifications back with others. The right to modify and redistribute transitively through the software license gave a "virality" to software freedom.

If training a NN against a GPL licensed code "launders" away the copyleft license, isn't that a good thing for software freedom? If you can launder away a copyleft license, why couldn't you launder away a proprietary license? If training a NN is fair use, couldn't we bring proprietary software into the commons using this?

It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft. Tools like copilot seem to be an exceptionally powerful tool (perhaps more powerful than the GPL) for liberating software.

What am I missing?

flatline3y ago

Nobody is laundering away proprietary licenses, because that code is not open source and not in public github repos. And OSS capabilities are now present in copilot, which is neither free nor open. Furthermore these contributions are making their way into proprietary code and the OSS licensing becomes even further watered down. This is the epitome of what copyleft is against!

yjk3y ago

Indeed, the ability to 'launder away' proprietary licenses when source is available means that companies in the future (that would otherwise provide source under a non-permissive license) will shift in favour of not providing source code at all.

TheCoelacanth3y ago

Code published on Github is not necessarily open source. There is a lot of code there that has no particular license attached, which means that all rights are reserved except for those covered in the Github TOS, which I believe just covers viewing the code on Github.

1 more reply

r3trohack3r3y ago

I'm not sure this is true. Proprietary source code gets leaked and that can be used to train a NN. I find it likely that Copilot was trained against at least one non-OSS code base hosted on GitHub.

Second, if copyright is being laundered away we can get increasingly clever with how we liberate proprietary software. Today, decompiling and reverse engineering is a labor intensive process. That's the whole point of "open source" - that working in source is easier than working in bytecode. Given the hockey-stick of innovation happening in AI right now, I'd be surprised if we don't see AI assisted disassembly happening in the next decade. If you can go from bytecode to source code, that unlocks a lot. Even more so if you can go from bytecode to source code and feed that into a NN to liberate the code from its original license.

blackbrokkoli3y ago

I follow your explanation but not your end statement.

What I think GP is getting at in my understanding is that all this OSS/licensing stuff was a cautious attempt to assert a radical idea into an atmosphere of extreme secrecy: That information wants to be free.

Now we have a fat corporation making a public statement of putting the value of advancing humanity over the value of honoring weird old Victorian ideas of "intellectual property" - which is what we have always tried to do, no?

Not that there is nothing to criticize, but I think that's a good thing on the whole.

1 more reply

thomastjeffery3y ago

It looks like you're missing the entire purpose of copyleft vs public domain.

The point is that copyleft source code cannot be used to improve proprietary software. That limitation is enforced with copyright.

Proprietary software is closed source. You can't train your NN on it, because you can't read it in the first place.

If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain. The entire purpose of copyleft is to compel that person to "pay it forward", by publishing their code as copyleft. This is why Stallman is a proponent of copyright law. Without copyright, there is no copyleft.

Gigachad3y ago

Copyleft wouldn’t need to exist without copyright because there would be no proprietary software to fight against.

Sure, there would be software with code not published, but if it was ever leaked which it often is, you could do whatever you want with it.

But in a world where copyright does exist, copyleft is a tool to fight back.

1 more reply

r3trohack3r3y ago

> If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain.

And then if we can close that loop by taking their proprietary software and feeding it into a NN to re-liberate it isn't that a net win for software freedom?

Today crossing the sourcecode->bytecode veil effectively obfuscates the implementation beyond most humans' ability to modify the software. Humans work best in sourcecode. Nothing says our AI overlords won't be able to work well in bytecode or take it in the other direction.

I guess what I'm saying is, today a compiler is a one-way door for software freedom. Once it goes through the compiler, we lose a lot of freedom without a massive human investment or the original source code. Maybe that door is about to become a two way door with copyright law supporting moving back and forth through that door?

1 more reply

an1sotropy3y ago

I think (1) you're mainly missing that copyleft vs non-copyleft is actually irrelevant for the copilot case. You also (2) may be missing the legal footing of copyleft licenses.

(1) The problem with copilot is that when it blurps out code X that is arguably not under fair use (given how large and non-transformed the code segment is), copilot users have no idea who owns copyright on X, and thus they are in a legal minefield because they have no idea what the terms of licensing X are.

Copilot creates legal risk regardless of whether the licensing terms of X are copyleft or not. Many permissive licenses (MIT, BSD, etc) still require attribution (identifying who owns copyright on X), and copilot screws you out of doing that too.

(2) Whatever legal power copyleft licenses have, it is ultimately derived from copyright law, and people who take FOSS seriously know that. The point of "copyleft" licenses is to use the power of copyright law to implement "share and share alike" in an enforceable way. When your WiFi router includes info about the GPL code it uses, that's the legal power of copyright at work. The point of copyleft licenses is not to create a free-for-all by "liberating" code.

1 more reply

bjourne3y ago

You can "launder" away the license of any source code you have copied simply by deleting it! No snazzy neural network needed. The litigants' argument is that this is what GitHub CoPilot does. It allows others to publish derivative works of copyrighted works with the license deleted. Given that it apparently is trivial to get CoPilot to spit out nearly verbatim copies of the code that it was trained on, I don't think it satisfies the "transformative" requisite of the (American) Fair use doctrine.

cactusplant73743y ago

Is stable diffusion any different when including a famous artwork or artist in the prompt? The images produced are eerily similar to training data.

1 more reply

krono3y ago

Farmers plant their crops out in the open too. Should Boston Dynamics be allowed to have their robots rob those fields empty and sell the produce without having to at least pay the farmer? They'd be walking and plucking just like any human would be.

Some source code might be published but not open source licensed. At least some such code has been taken with complete disregard of their licenses and/or other legal protections, and it's impossible to find and properly map out any similar violations for the purposes of a legal response.

blackbrokkoli3y ago

This is literally the "you wouldn't steal a car" meme.

To spell it out: No, this analogy does not hold. "Stealing" data does not deprive the owner of anything, so it should not be treated remotely the same as physical stealing (usually not even of potential revenue, as piracy studies show).

2 more replies

swhalen3y ago

> It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft.

Whether this was the original motivation depends on whom you are asking.

You may disagree, but the "Free Software" movement (RMS and the people who agree with him) essentially wants everything to be copyleft. The "Open Source" movement is probably more aligned with your views.

adgjlsfhk13y ago

the problem is you can't launder copyrighted code with this because you don't see the copyrighted code in the first place.

zeven73y ago

The only thing you're missing is that some people lost the plot and think it is all about copy left.

adlpz3y ago

It feels weird saying this but, for once, I hope the big evil corporation gets to keep selling their big bad product.

I find the pattern matching and repetitive code generation really helpful. And the library autocomplete on steroids, too.

Meh. Tricky subject.

dmix3y ago

TabNine has absolutely improved my life as a programmer. There's something really rewarding about having a robot read your mind for entire blocks of code.

It's not just functions either, one of the most common things that it helps me with daily is simple stuff like this:

Typing

    const x = {
        a: 'one',
        b: 'two',
        ...
    }

And later I'll be typing

    y = [
      a['one'], 
      b[' <-- it auto-completes the rest here
    ]

It's really amazing the amount of busy-work typing in programming that a smart pattern matching algo could help with.

Aeolun3y ago

That autocomplete was sort of ok in tabnine, but Copilot completely blows it out of the water. Resource consumption for Copilot is also much more restrained.

Which reminds me I have to cancel my tabnine subscription. Been paying them for a year without using it.

1 more reply

bogwog3y ago

I don't think this is a good example of the value of these things. You can just as easily do that same thing with advanced text editor features. Sublime for example supports multi-cursor editing. Just hold alt+shift+arrow keys to add a cursor, then type in the brackets you want. Ctrl+D can be used to select the next occurrence of the current selection with multiple cursors, built-in commands from the command palette can do anything to your current selection (e.g. convert case), etc.

All of that efficiency without having to pay a monthly subscription, wasting electricity on some AI model, and worrying about the legal/moral implications.

2 more replies

nrb3y ago

Does anyone have a problem with it, so long as the material it trained on was with explicit permission/license and not potentially in violation of copyright?

That's where the line is for it to be suspect IMO.

bogwog3y ago

This is what I hope comes out of the lawsuit. If a company wants to sell an AI model, they need to own all of the training data. It can't be "fair use" to take other peoples' works at zero cost, and use it to build a commercial product without compensation.

And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.

2 more replies

michaelmrose3y ago

It being permissively licensed is virtually irrelevant because only a minority of code is so permissively licensed that you can just do what you like under any license. Far more is "do what you like within the scope of the license". For example, GPL: do with it what you like so long as any derivative work is also GPL.

adlpz3y ago

I guess I'm just afraid that it might not be as good as it is that way.

It's a bit like how GPT-3, Stable Diffusion and all those generative models use extensive amounts of copyrighted material in training to get as good as they do.

In those cases however the output space is so vast that plagiarism is very unlikely.

With code, not so much.

3 more replies

sireat3y ago

I feel like Charlie Gordon from Algernon with and without Copilot.

Literally 10x faster development.

Case in point: had an unexpected project and no time to complete it. Within an hour Copilot helped me:

* Write a couple of tricky matplotlib plots

* Do some extensive analysis with Pandas

* Write a couple of SQL queries

* Write a Flask back-end and deploy it

* Write a bit of a front-end

* All this with extra comments, links to documentation and pretty reasonable style

I have experience with all of the above mentioned but the speed increase was considerable.

This would be a good day's work without Copilot and there would be less commenting and hackier code.

Before Copilot I would be cursing a lot more reading various docs...

The key thing that Copilot does is reduce latency for your thoughts-action-results loop.

Does open source really suffer if fewer people read documentation directly? Would you really be less likely to create an open source library if you knew someone can now use your library at 10x speed?

The inference ability has crossed uncanny valley so many times.

I find myself wondering whether there is a speech recognition component at times.

When teaching a lecture I will start saying something and write a prompt at the same time and the sentence produced by Copilot will be spot on what I've just said.

Ideally there would be an open source version of Copilot that respects everyone's wishes. I fear that is impossible.

odessacubbage3y ago

why not just train it on your own code or an opt-in data base of voluntarily contributed code? why does everyone else have to make your life easier [and generate enormous wealth for a third party with zero compensation for their work] involuntarily?

albertzeyer3y ago

I really don't understand how there can be a problem with how Copilot works. Any human just works in the same way. A human is trained on lots and lots of copyrighted material. Still, what a human produces in the end is not automatically derived work from all the human has seen in his life before.

So, why should an AI be treated different here? I don't understand the argument for this.

I actually see quite some danger in this line of thinking, that there are different copyright rules for an AI compared to a human intelligence. Once you allow for such arbitrary distinction, it will get restricted more and more, much more than humans are, and that will just arbitrarily restrict the usefulness of AI, and effectively be a net negative for the whole humanity.

I think we must really fight against such undertaking, and better educate people on how Copilot actually works, such that no such misunderstanding arises.

S2013y ago

I think there's a parallel in surveillance systems. For example, it's perfectly reasonable for a police officer conducting an investigation to follow a suspect as they drive around town. After all, it's happening in public and it's not illegal to watch what someone does in public (caveat being taking it to the level of stalking).

However, is it reasonable to write an AI system that monitors the time and location of all license plates seen around town, puts them into a database, and then that same officer can simply put in the suspect's license plate instead of actually following them around? Maybe, maybe not, that's not my point here. But the creation of that functionality can easily lead to its abuse.

Is this exactly the same case as Copilot? Of course not, these are two wildly different systems. But I think it's an interesting parallel to consider when discussing the point of "it's okay when a human does it" because humans and algorithms operate at two very different levels of scale. The potential for abuse of the latter being far higher and far easier than something a human has to do manually.

layer83y ago

Humans are able to recognize when they are plagiarizing someone else’s work. AIs currently aren’t.

2 more replies

trention3y ago

>So, why should an AI be treated different here? I don't understand the argument for this.

Because the AI is not a human and only humans have rights, including the right to learn.

1 more reply

theamk3y ago

AI is not treated differently here. If a human produced this kind of code: https://twitter.com/DocSparse/status/1581461734665367554 they would be sued as well

I am not sure how anyone can root for AI after seeing those kinds of outputs. It's like high-school level plagiarism.

1 more reply

herpderperator3y ago

The title of the submitted PDF document: "Microsoft Word - 2022-11-02 Copilot Complaint (near final)"[0]

I've noticed this a lot and it's quite funny seeing what the actual filename of the document was. Does this just get included as metadata by default when you export to PDF?

[0] https://githubcopilotlitigation.com/pdf/1-0-github_complaint...

bombcar3y ago

In Word you can go to document properties or whatever and set the Title and some other fields to control what gets into the PDF.

tasuki3y ago

The typography on that document is not great. Perhaps they should read Matthew Butterick's book?

senkora3y ago

It does, yes. It’s very annoying and I have occasionally stripped it off of PDFs I’ve made, using exiftool.

mirekrusin3y ago

They should use github instead of sending "(final, 2nd revision, really final, amended)" emails.

D13Fd3y ago

If only you could, with Word docs. Sadly you can't in any meaningful way.

deanjones3y ago

This will fail very quickly. The licence that project owners publish with their code on Github applies to third parties who wish to use the code, but does not apply to Github. Authors who publish their code on Github grant Github a licence under the Github Terms: https://docs.github.com/en/site-policy/github-terms/github-t...

Specifically, sections D.4 to D.7 grant Github the right "to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."

acdha3y ago

I don’t see that being “quickly” - they’d have to get a judge to agree that passing your code off without attribution for other people to use as their own work is a normal service improvement. Given that it’s a separate feature with different billing terms, I’m skeptical that it’s anywhere near the given that you’re portraying it as.

1 more reply

mldq3y ago

This is the standard content display license that everyone uses. Even in your quoted text I don't see any hint that snippets can be shown without attribution or the code license.

It also says they can't sell the code, which CoPilot is doing.

Also, in a very high number of cases it isn't the author who uploads.

Repeating your line of argumentation (which occurs in every CoPilot thread) does not make it true.

1 more reply

saurik3y ago

So, it isn't clear to me which of these clauses you are citing grants them the forced right to "Copilot" (which I'm using as a verb to avoid defining what stage of production we are talking about) that wasn't granted by the license of the code, but let's assume for a moment that you are correct: that just means that GitHub as a service makes no sense, right? Like, there are a ton of people using GitHub to develop using code I've published in the past... code which is under various of these example licenses, and which I've never myself (as the copyright holder) published to GitHub (and, in fact, would never as I despise GitHub). There are also a number of very popular projects--such as the Linux kernel--which people not only upload to GitHub but which have official mirrors on GitHub where no party even owns the copyright in order to agree to these terms of service. Meaning, if you are correct, GitHub is often being used illegally and a ton of the source code they are training against wasn't legally provided to them in the first place.

1 more reply

maxloh3y ago

How about codebases that were uploaded to GitHub by someone other than the original copyright owner?

e.g. I can clone the GNU codebase and publish it to GitHub. Clearly I don't own the code and do not have any rights to grant GitHub a license.

1 more reply

klabb33y ago

> Authors who publish their code on Github grant Github a licence under the Github Terms: https://docs.github.com/en/site-policy/github-terms/github-t...

This sounds unenforceable in the general case. How could github know whether someone pushes their own code or not? Is it a license violation to push someone's FOSS code to github because the author didn't sign up with GH?

1 more reply

sigzero3y ago

If that is pretty much verbatim under their terms, then yes the lawsuit is going nowhere.

karaterobot3y ago

Does everybody credit the author when using Stack Overflow code? I have, but don't always. Not that I'm trying to steal, I just don't take the time, especially in personal projects.

This isn't exactly the same thing, but it seems to me that three of the biggest differences are:

1. Stack Overflow code is posted for people to use it (fair enough, but they do have a license that requires attribution anyway, so that's not an escape)

2. Scale (true; but is it a fundamental difference?)

3. People are paying attention in this case. Nobody is scanning my old code, or yours, but if they did, would they have a case?

I dunno. I'm more sympathetic to visual artists who have their work slurped up to be recapitulated as someone else's work via text to image models. Code, especially if it is posted publicly, doesn't feel like it needs to be guarded. I'm not saying this is correct, just saying that's my reaction, and I wonder why it's wrong.

Imnimo3y ago

On page 18, they show Copilot produces the following code:

>function isEven(n) {

> return n % 2 === 0;

They then say, "Copilot’s Output, like Codex’s, is derived from existing code. Namely, sample code that appears in the online book Mastering JS, written by Valeri Karpov."

Surely everyone reading this has written that code verbatim at some point in their lives. How can they assert that this code is derived specifically from Mastering JS, or that Karpov has any copyright to that code?

williamcotton3y ago

There is no way in hell that isEven is covered by copyright.

"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

Think about how absurd this is. So if Microsoft was the first company to write and publish an isEven function then no one else can legally use it?
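To make the quoted point about limited expression concrete, here is a rough TypeScript sketch of the few idiomatic ways one might write this check. The function names are illustrative only and none of this comes from the complaint or from Copilot; it just shows that the realistic variants are tiny and agree with each other on integers.

```typescript
// Three common spellings of an evenness test for integers (names are made up for this sketch).
const isEvenModulo = (n: number): boolean => n % 2 === 0;
const isEvenBitwise = (n: number): boolean => (n & 1) === 0;
const isEvenNegated = (n: number): boolean => !(n % 2);

// Returns true if all three variants give the same answer for every integer in [from, to].
function variantsAgree(from: number, to: number): boolean {
  for (let n = from; n <= to; n++) {
    const expected = isEvenModulo(n);
    if (isEvenBitwise(n) !== expected || isEvenNegated(n) !== expected) {
      return false;
    }
  }
  return true;
}

console.log(variantsAgree(-1000, 1000));
```

Whether that narrowness is enough to trigger the doctrine quoted above is a legal question the code obviously can't answer.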

Phrodo_003y ago

> There is no way in hell that isEven is covered by copyright.

Hey, I said the same thing about APIs, but here we are.

Edit: Actually, the Supreme Court declined ruling whether APIs are copyrightable, but they did say that if they are, reusing them like google reused the java apis in android would fall under fair use. Given that lower courts did think that APIs should be copyrightable, we don't know if they are anymore.

1 more reply

kevin_thibedeau3y ago

There are software patents on bit twiddling operations that people do end up having to work around.

2 more replies

eurasiantiger3y ago

Does that mean any perfectly optimal function is copyright-free?

1 more reply

leepowers3y ago

It's possible the complaint is using a trivial example to illustrate the type of argument plaintiffs want to make during any trial. A 200-line example is too unwieldy for non-programmers to digest, especially given the formatting constraints of a legal brief.

Look at paragraphs 90 and 91 on page 27 of the complaint[1]:

"90. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times."

Does distributing licensed code without attribution on a mass scale count as fair use?

If Copilot is inadvertently providing a programmer with copyrighted code, is that programmer and/or their employer responsible for copyright infringement?

There's a lot of interesting legal complications I think the courts will want to adjudicate.

[1] https://githubcopilotlitigation.com/pdf/1-0-github_complaint...
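For anyone wondering what "a snippet longer than ~150 characters that matches training data" could mean operationally, here is a naive TypeScript sketch. It is an assumption-laden illustration, not GitHub's actual methodology (which the quote doesn't describe): `normalize` and `hasVerbatimOverlap` are hypothetical helpers, whitespace normalization is my own choice, and the 150 default simply mirrors the figure in the quoted paragraph.

```typescript
// Hypothetical helper: collapse runs of whitespace so formatting differences don't hide a match.
function normalize(code: string): string {
  return code.replace(/\s+/g, " ").trim();
}

// Returns true if any run of `minLength` characters from the suggestion appears
// verbatim (after whitespace normalization) in at least one corpus document.
function hasVerbatimOverlap(
  suggestion: string,
  corpus: string[],
  minLength: number = 150
): boolean {
  const normalizedSuggestion = normalize(suggestion);
  const normalizedDocs = corpus.map(normalize);
  for (let start = 0; start + minLength <= normalizedSuggestion.length; start++) {
    const chunk = normalizedSuggestion.slice(start, start + minLength);
    if (normalizedDocs.some((doc) => doc.includes(chunk))) {
      return true;
    }
  }
  return false;
}
```

This brute-force scan is fine for a toy example but would be far too slow against a real training set, where you'd want some kind of index. Note also that a two-line isEven is well under 150 characters, so it wouldn't even register under this kind of metric.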

schleck83y ago

> Surely everyone reading this has written that code verbatim at some point in their lives

Ironically their Twitter account uses a screenshot from a TV series as profile picture. I wonder how legal that is, even if meant as a joke.

https://twitter.com/saverlawfirm

Edit: It's been changed 2 minutes after I wrote this comment

0x_rs3y ago

>their

"Joined November 2022", following one account and no followers. It's generous to consider it a genuine account, no?

1 more reply

zeven73y ago

This comment is 1 minute old and I only see a plain black profile picture.

Or is your comment itself the joke?

1 more reply

hdjjhhvvhga3y ago

Is there a Wayback Machine for Twitter?

lelandfe3y ago

They determined the other `isEven()` function was cribbed from Eloquent Javascript because of matching comments. I wonder if the complaint just left off telltale comments from that Mastering JS one?

Imnimo3y ago

Yeah, the other one I found much more persuasive. The extra comments were unequivocally reproduced from the claimed source. (although that output was from Codex, rather than Copilot).

counttheforks3y ago

I wrote that exact function the other day, and I've never even heard of that book.

eddsh19943y ago

Yep, same. Not in JS, but in Haskell, for the Even Fib Project Euler problem. Something like a million people have submitted right answers for that problem and assuming half wrote their own filter rather than importing an isEven library then that's half a million people there.
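For reference, a sketch of roughly what such a solution looks like, written in TypeScript to match the rest of the thread rather than the Haskell mentioned above. `sumEvenFib` is an illustrative name, and the sequence starting at 1, 2 with a four million limit is how I remember the Project Euler problem being stated.

```typescript
// The same one-liner everyone ends up writing.
function isEven(n: number): boolean {
  return n % 2 === 0;
}

// Sum the even-valued Fibonacci terms (sequence starting 1, 2) that do not exceed `limit`.
function sumEvenFib(limit: number): number {
  let current = 1;
  let next = 2;
  let sum = 0;
  while (current <= limit) {
    if (isEven(current)) {
      sum += current;
    }
    [current, next] = [next, current + next];
  }
  return sum;
}

console.log(sumEvenFib(4_000_000)); // 4613732
```

The filter is the least interesting line in the program, which is sort of the point.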

1 more reply

moffkalast3y ago

I'd hire a legal team if I were you, the injunction is on the way. /s

0cf8612b2e1e3y ago

Should have used snake case. Would have avoided legal hot water and established precedent.

bogwog3y ago

That seems like a really bad choice of an example for this, but as I haven't read the document and don't have any other context beyond what you've posted here, I have to take your word for it that that's the purpose of this snippet.

However, if you are looking to understand the reasoning behind this lawsuit, there are lots of better examples online where Copilot blatantly ripped off open source code.

az2263y ago

Shows how hilariously dumb this lawsuit is. The plaintiff is a lawyer but somehow missed the merger doctrine class.

nikanj3y ago

This reminds me of the SCO vs Linux lawsuits.

janef04213y ago

That is likely only a brief example of the general principle underlying the lawsuit.

celestialcheese3y ago

Maybe I'm being too cynical, but this feels like it's more a law firm and individual looking to profit and make their mark in legal history rather than an aggrieved individual looking for justice.

Programmer/Lawyer Plaintiff + upstart SF Based Law Firm + novel technology = a good shot at a case that'll last a long time, and fertile ground to establish yourself as experts in what looks to be a heavily litigated area over the next decade+.

iudqnolq3y ago

One of the core principles of the American system of government is that we outsource enforcement to private parties. Instead of the public needing to fund enforcement with tax dollars private parties undertake risky litigation in exchange for the chance of a big payoff.

There is a reasonable argument that's a horrible system. But it doesn't make sense to criticize the plaintiff looking for a profit - the entire system has been set up such that that's what they're supposed to do. If you're angry about it lobby for either no rules or properly funded government enforcement of rules.

lovich3y ago

> But it doesn't make sense to criticize the plaintiff looking for a profit…

I don’t know man, I can simultaneously see the systemic issue that needs to be solved and also critique someone for succumbing to base needs like greed when they don’t have the need.

3 more replies

celestialcheese3y ago

That's entirely fair - and I'm not angry, just not convinced by their arguments, especially when the motive is likely not genuine.

As an aside - I'm almost positive MSFT/Github expected this and their legal teams have been prepping for this moment. Copyright Law and Fair Use in the US is so nuanced and vague that anything created involving prior art by big-pocket individuals or corporations will be litigated swiftly.

I expected one of these lawsuits to come first from Getty or one of the big money artist estates against OpenAI or Stability.ai, but Getty and OpenAI seem to be partnering instead of litigating.

onlycoffee3y ago

It's the two words, "government enforcement", that bother me. If your party is in control the words sound fine, otherwise, they sound ominous.

2 more replies

thaumasiotes3y ago

> If you're angry about it lobby for either no rules or properly funded government enforcement of rules.

No, there are plenty of other changes you might want to see.

For example, in the American system, judges are generally not allowed to be aware of anything not mentioned by a party to the case. There is no good reason for this.

1 more reply

cube003y ago

Sounds like healthcare

heavyset_go3y ago

This is a classic example of the ad hominem fallacy. Stating that "they are no angels" doesn't detract from whether they're right or capable of effecting positive legal change.

Frankly, I don't care if anyone makes a name for themselves for doing this. In fact, I applaud them and would happily give them recognition should they be successful.

Similarly, I'd hope that there are opportunities for profit in this space, given that I don't want cheap lawyers botching this case and setting terrible legal precedent for the rest of us. Microsoft has a billion dollar legal team and they will do everything they can to protect their bottom line.

squokko3y ago

Just like good people can try to do good things and end up screwing things up badly, bad people can do bad things that have positive effects.

efitz3y ago

I fail to see the positive effect here.

Just like Google’s noble but misguided attempt to make all the world’s books searchable a few years back, what we have here is IP law getting in the way of a societal goodness.

Copyright and patent are not natural; they’re granted by law “to promote progress in the useful arts”. At first glance here it appears that GitHub is promoting progress and the plaintiffs are just rent-seeking.

henryfjordan3y ago

It can be and is both what you describe and a necessary feature of our adversarial legal system.

Github can't really go to a court by themselves and ask "is this legal?". There is the concept of declaratory relief but you need to be at least threatened with a lawsuit before that's on the table.

So Github kinda just has to try releasing CoPilot and get sued to find out. The legal system is set up to reward the lawyer who will go to bat against them to find out if it is legal. The plaintiff (and maybe lawyer, depending on how the case is financed) takes the risk that they are wrong, just as Github had to.

It is set up this way to incentivize lawyers to protect everyone's rights.

dkjaudyeqooe3y ago

But who cares? Who else is willing to fund litigation on this important legal question? The real justice here is declarative and benefits everyone.

No matter who litigates and for what reasons it will be extremely valuable for good precedents to be set around the question of things like Copilot and DALL-E with respect to copyright and ownership. I'd rather have self interested lawyers dedicated to winning their case than self interested corporations fighting this out.

jedberg3y ago

As my lawyer friend told me, a class action lawsuit is a lawyer's startup. A lot of work for little pay with the chance of a huge payout.

Auracle3y ago

I brought a class action suit against Sharp and I was the class representative. They settled. The judge awarded me a whopping $1,000 from the settlement money. From the time I put into it, including 3 or 4 full days in NYC because my deposition coincided with a snowstorm, I didn’t exactly come out ahead financially.

Obviously this is different for the reasons you stated, but I didn’t want people to think bringing a class action lawsuit forward is a way to get rich. It’s a bit of a joke, really.

varispeed3y ago

> rather than an aggrieved individual looking for justice.

How can an aggrieved individual seek justice from a big multinational corporation? That's not possible unless that individual is a retired billionaire wanting to become a millionaire.

sam3453y ago

yes, of course that's what it is. plaintiffs if they win will get a few pennies, lawyers will get a lot.

grogenaut3y ago

I have a friend from high school who does class action lawsuits. He spends a very large amount of money funding his suits on things like expert witnesses among other things. Only 1 in 5 pays off, so it has to pay off well. His model is similar to venture capitalism. Most of these cases take 5-7 years to execute. So he basically takes out loans from another lawyer to fund them. His average pay for the last 10 years has been around $140k/year. Some years he makes nothing and pays out a lot, others he makes several million and pays back all the loans. Another way to think of it is like giving money to tax fraud whistleblowers.

Yes he does think of it somewhat like that, establishing himself in an area. However a lot of his work comes from finding people aggrieved by something not them finding him.
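To put rough numbers on the "1 in 5 pays off, so it has to pay off well" point, here is a minimal sketch of the arithmetic. It assumes every case costs the same to bring and every win pays the same, and it ignores loan interest and the 5-7 year wait, none of which the comment quantifies. The function names and dollar figures are made up for illustration.

```typescript
// Hypothetical model: every case costs the same to bring, and each win pays the same.

/** Payout per win, as a multiple of per-case cost, needed to break even. */
function breakEvenMultiple(wins: number, cases: number): number {
  if (wins <= 0 || wins > cases) {
    throw new RangeError("need 0 < wins <= cases");
  }
  return cases / wins;
}

/** Net result across a portfolio of cases (negative means a loss). */
function portfolioNet(
  cases: number,
  wins: number,
  costPerCase: number,
  payoutPerWin: number
): number {
  return wins * payoutPerWin - cases * costPerCase;
}

console.log(breakEvenMultiple(1, 5));            // 5
console.log(portfolioNet(5, 1, 100000, 500000)); // 0
console.log(portfolioNet(5, 1, 100000, 800000)); // 300000
```

Under these assumptions a single win has to return five times the cost of a case just to cover the four losses, before anyone is paid for their time.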

undoware3y ago

If it wasn't Butterick I wouldn't be interested.

But I write this to you in Hermes Maia

xchip3y ago

LOL we look like taxi drivers fighting Uber.

If Kasparov uses chess programs to be better at chess maybe we can use copilot to be better developers?

Also, anyone, either a person or a machine, is welcome to learn from the code I wrote. Actually, that is how I learnt to code, so why would I stop others from doing the same?

elefantastisch3y ago

Judging by the majority opinion in this thread, it seems pretty clear GitHub could have asked and gotten enough people to opt-in to have no problem training their model. They probably would have been thrilled to do it and proud of being included in the training data.

But the preference of the majority does not override the conditions placed by people who prefer not to participate.

jacooper3y ago

No human perfectly reproduces the learning material they used. If that were true, one might as well just hire engineers from Twitter and make a new platform from the code they remember!

blackbrokkoli3y ago

Well, we humans do it occasionally. You probably remember a few specific code snippets in your lang of choice because they kept annoying you/you love them/you wrote them a lot. So if I put you in exactly the right situation, you would indeed reproduce code verbatim.

So does Copilot.

I am not trying to insinuate that Copilot works like a human, but it is literally the same situation.

abouttyme3y ago

I suspect this will be the first of many lawsuits over training data sets. Just because it is obscured by artificial neural networks doesn't mean it's an original work that is not subject to copyright restrictions.

ketralnis3y ago

Yeah yeah my code produces the complete works of Mickey Mouse but it's okay because _algorithms_!

judge20203y ago

I don't know why we're treating it as anything less than a human brain. A human can replicate a painting from memory or a picture of Mickey Mouse and that would likely be copyright infringement, but they could also take a drawing of Mickey Mouse sitting on the beach and give him a bloody knife & some sunglasses and it'd likely be fair use of the original art.

The AI can copy things if it wants, but it can also modify things to the point of being fair use, and it can even create new works with so little of any particular work that it's effectively creativity on the same level of humans when they draw something that popped into their heads.

m00x3y ago

naillo3y ago

I'm kinda sceptical that this goes anywhere given that basically they say that whatever copilot outputs is your responsibility to vet that it doesn't break any copyright (obviously that goes against the promise of it and the PR but that's the small print that gets them out of trouble).

heavyset_go3y ago

Saying "it's your responsibility to not breach licenses or violate copyright" doesn't absolve your service from breaching licenses and violating copyright itself.

mdaEyebot3y ago

"It is the customer's responsibility to ensure that they only drink the water molecules which come out of their tap, and not the lead ones."

golemotron3y ago

Yet we all use web browsers that copy copyrighted text from buffer to buffer all the time. This doesn't even include all of the copying that ISPs perform.

It might be fair to say that the read performed in training has the same character since no human is involved.

The real copyright violation would be using a derived work.

2 more replies

shoshoshosho3y ago

You could argue that it’s the individual projects using copilot that are violating here, I guess? Like you can use curl or git to dump some AGPL code into your commercial closed software but no one would (hopefully) blame those tools.

So copilot is fine but anyone using it must abide by the collective set of licenses that it used to write code for you…?

BeefWellington3y ago

If a license requires attribution, and you reproduce the code without attribution using your editor plugin, it seems to me the infringement is on the editor plugin.

Note that even licenses like MIT ostensibly require attribution.

dmitrygr3y ago

So, if I made Napster 2.0 and said that it is your job to make sure that you do not download anything copyrighted, that would be ok?

heavyset_go3y ago

This is a bad analogy because P2P networks exist that are legal to operate, because Section 230 of the CDA prevents interactive computer services from being held responsible for user generated content.

What made Napster illegal is that the company did not create their network for fair use of content, but to explicitly violate copyright for profit.

Copilot is like Napster in this case, in that both services launder copyrighted data and distribute it to users for profit.

Copilot is not like other P2P networks that exist to share data that is either free to distribute or can be used under the fair use doctrine. Copilot explicitly takes copyrighted content and distributes it to users in violation of licenses, that's its explicit purpose.

It's entirely possible to make a Copilot-like product that was trained on data that doesn't have restrictive licensing in the same way it's entirely possible to create a P2P network for sharing files that you have the right to share legally.

brookst3y ago

The legal system takes intent into account.

So if you produce napster 2.0 to be the best music piracy tool, and you test it for piracy, and you promote it for piracy... you're going to have trouble.

If you produce napster 2.0 as a general purpose file sharing system, let's call it a torrent client, and you can claim no ill intent... you may have trouble but it's a lot more defensible in court.

I would find it a big stretch to say Github's intent here is to illegally distribute copyrighted code. No judgment on whether the class action has any merit, just saying I would be very surprised if discovery turns up lots of emails where Github execs are saying "this is great, it'll let people steal code."

1 more reply

ketralnis3y ago

I think you're looking for consistency that the legal system just doesn't provide. The music industry is more organised and litigious than the software industry and that gives them power that you and I don't have. If you called it "Napster 2.0" specifically you'd probably be prevented from shipping by a preliminary injunction. Is that fair or consistent? No. But it's the world we live in. Programmers want laws to be irrefutable and executable logic but they just aren't.

jasonlotito3y ago

Now, IANAL, but iirc, that is all 100% okay and legal. In fact, I can even download copyrighted music and movies without issue. So, I don't even need to make sure I don't download anything under copyright.

The issue isn't downloading copyrighted stuff.

Rather, it's making available and letting others download it. That was where you got in trouble.

1 more reply

donatj3y ago

Yep. That's exactly why Bittorrent clients can exist.

eurasiantiger3y ago

Isn’t that already how everything on the internet works?

1 more reply

charcircuit3y ago

Yes that would be okay. It would also be okay to create Internet 2.0.

nicolashahn3y ago

That's basically the situation for any torrent client

1 more reply

dmalik3y ago

You mean like every torrent client that currently exists?

stonemetal123y ago

If I remember correctly that only works if you can prove that your system has "substantial non infringing use".

iworshipfaangs2OP3y ago

It's also a class action,

> behalf of a proposed class of possibly millions of GitHub users...

The appendix includes the 11 licenses that the plaintiffs say GitHub Copilot violates: https://githubcopilotlitigation.com/pdf/1-1-github_complaint...

cmrdporcupine3y ago

If Microsoft is so confident in the legality and ethics of Copilot, and that it doesn't leak or steal proprietary IP... they should go train it on the MS Word and Windows and Excel source trees.

What's that? They don't want to do that? Why not?

blackbrokkoli3y ago

Did they make a statement that they did not want to do that?

Because if not I would offer the very mundane explanation that the Copilot team probably just couldn't be bothered hitting up the other software teams and jumping through 3,046 internal red tape compliance steps to make their product 0.001% better (I am pretty sure the code base of all of GH dwarfs MS code base quite a lot)

I can't believe I am actually defending fucking Microsoft, but just want to say there isn't a conspiracy everywhere...

az2263y ago

I have no doubt they will -- but the specific models will be used for Microsoft engineers. There will be a Copilot for Enterprise that trains on customers' private code.

jeffhwang3y ago

Wow, this is an interesting iteration in the ongoing divide between "East Coast code" vs. "West Coast code" as defined by Larry Lessig. For background, see https://lwn.net/Articles/588055/

IceWreck3y ago

I am not against this lawsuit but I'm against the implications of this because it can lead to disastrous laws.

A programmer can read available but not oss licensed code and learn from it. Thats fair use. If a machine does it, is it wrong ? What is the line between copying and machine learning ? Where does overfitting come in ?

Today they're filing a lawsuit against copilot.

Tomorrow it will be against stable diffusion or (dall-e, gpt-3 whatever)

And then eventually against Wine/Proton and emulators (are APIs copyrightable)

mkeeter3y ago

Wine literally bans contributions from anyone that has seen Microsoft Windows source code:

https://wiki.winehq.org/Developer_FAQ#Who_can.27t_contribute...

c0balt3y ago

Well, they are a special case here, since they don't solve a specific problem or build a program per se but instead (re)build a program after existing specs. Their explicit goal is to match the behaviour of another piece of software with a translation layer.

Forbidding people who have seen the "source" program is most likely meant to keep their version from sliding from "matching behaviour" to "behaving like" (as in, being the same code). This might also be intended as a safeguard so that well-intentioned developers don't accidentally break their own (most likely existing) NDAs.

sedatk3y ago

> A programmer can read available but not oss licensed code and learn from it

Actually, we were forbidden to look at open source code at Microsoft (circa 2009) because it might influence our coding and violate licenses.

EMIRELADERO3y ago

That was out of abundance of caution, not based on any legal precedent.

In fact, the little precedent that exists over learning from copyrightable code is in favor of it.

More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution. (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)

elil173y ago

That demonstrates that copyright laws are already stifling innovation.

5 more replies

kens3y ago

Way, way back in 1992, Unix System Laboratories sued BSDI for copyright infringement. Among other things, they claimed that since the BSD folks had seen the Unix source code, they were "mentally contaminated" and their code would be a copyright violation. This led to the BSD folks wearing "mentally contaminated" buttons for a while.

__alexs3y ago

Do the TypeScript team code with their eyes closed?

2 more replies

andrewmcwatters3y ago

GitHub Copilot has been proven to use code without license attribution. This doesn't need to be as controversial as it is today.

If you're using code and know that it will be output in some form, just stick a license attribution in the autocomplete.

In fact, did you know this is what Apple Books does by default? Say, for example, you copy and paste a code sample from The C Programming Language, 2nd Edition. What comes out? The code you copied and pasted, plus attribution.

swhalen3y ago

> A programmer can read available but not oss licensed code and learn from it. Thats fair use.

If a human programmer reads someone else's copyrighted code, OSS or otherwise, memorizes it and later reproduces it verbatim or nearly so, that is copyright infringement. If it wasn't, copyright would be meaningless.

The argument, so far as I understand it, is that Copilot is essentially a compressed copy of some or all of the repositories it was trained on. The idea that Copilot is "learning from" and transforming its training corpus seems, to me, like a fiction that has been created to excuse the copyright infringement. I guess we will have to see how it plays out in court.

As a non-lawyer it seems to me that stable diffusion is also on pretty shaky ground.

APIs are not copyrightable (in the US), so Wine is safe (in the US).

Iv3y ago

AI companies are running against the clock to normalize training against copyrighted data.

Let me tell you the story of Google Books, also known as "Authors Guild Inc. v. Google Inc"

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

In 2004, Google added copyrighted books to its Google Books search engine, which searches the text of millions of books and shows full-page results without any authors' authorization. Any sane lawyer of the time would have bet on this being illegal because, well, it most certainly was. And you may be shocked to learn that it is actually not.

In 2005 the Authors Guild sued for this pretty straightforward copyright violation.

Now an important part of the story: IT TOOK 10 YEARS FOR THE JUDGEMENT TO BE DECIDED (8 years + 2 years appeal) during which, well, tech continued its little stroll. Ten years is a lot in the web world, it is even more for ML.

The judgement decided Google's use of the books was fair use. Why? Not because of the law, silly. A common error we geeks make is to believe that the law is like code and that it is an invincible argument in court. No, the court was impressed by the array of people who were supporting Google, calling it an invaluable tool to find books, that actually caused many sales to increase, and therefore the harm the laws were trying to prevent was not happening while a lot of good came from it.

Now the second important part of the story: MOST OF THESE USEFUL USES HAPPENED AFTER THE LITIGATION STARTED. That's the kind of crazy world we are living in: the laws are badly designed and badly enforced, so the way to get around them is to disregard them for the greater good, and hope the tribunal won't be competent enough to be fast, but not so incompetent that it fails to understand the bigger picture.

Rants aside, I doubt training data use will be considered copyright infringement if the courts have a mindset similar to the one they had in 2005-2015. Copyright laws were designed to preserve the authors' right to profit from copies of their work, not to give them absolute control over every possible use of every copy ever made.

cromka3y ago

> A programmer can read available but not oss licensed code and learn from it. Thats fair use. If a machine does it

Quite sure the issue at hand is about the code being copied verbatim without the license terms, not "learning" from it.
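For what "copied verbatim" could mean in measurable terms, here is a rough sketch that finds the longest run of identical tokens shared by a suggestion and a known snippet. The tokenizer is deliberately crude, and nothing here reflects how Copilot or any real duplicate-detection filter works. Both function names are hypothetical, and a long shared run is at most a signal worth a closer look, not a legal test.

```typescript
// Crude tokenizer: identifiers, integer literals, or single punctuation characters.
function tokenize(code: string): string[] {
  return code.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) ?? [];
}

// Length of the longest run of consecutive tokens that appears in both inputs.
// Brute force for clarity; fine for snippets, too slow for whole repositories.
function longestSharedRun(a: string[], b: string[]): number {
  let best = 0;
  for (let i = 0; i < a.length; i++) {
    for (let j = 0; j < b.length; j++) {
      let k = 0;
      while (i + k < a.length && j + k < b.length && a[i + k] === b[j + k]) {
        k++;
      }
      if (k > best) best = k;
    }
  }
  return best;
}

const suggestion = "function isEven(n: number): boolean { return n % 2 === 0; }";
const known = "const isEven = (n: number): boolean => n % 2 === 0;";
console.log(longestSharedRun(tokenize(suggestion), tokenize(known)));
```

Two independently written `isEven` functions already share a fairly long run here, which is one reason short, obvious snippets make weak evidence of copying in either direction.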

bawolff3y ago

> A programmer can read available but not oss licensed code and learn from it. Thats fair use. If a machine does it, is it wrong ?

You can learn from it, but if you start copying snippets or base your code on it to such an extent that it's clear your work is based on it, things start to get risky.

For comparison, people have tried to get around copyright of photos by hiring an illustrator to "draw" the photo, which doesn't work legally. This situation seems similar.

michaelmrose3y ago

Why wouldn't drawing the photo be fair use? Can you cite a case?

1 more reply

TimTheTinker3y ago

At least in legal terms, the difference between humans and machines couldn't be more clear.

amelius3y ago

> If a machine does it, is it wrong ? What is the line between copying and machine learning ?

What is the difference between a neighbor watching you leave your home to visit the local grocery store and mass surveillance? Where do you draw the line?

It is pretty simple, actually.

kmeisthax3y ago

Wine/Proton are safe because there is controlling 9th Circuit and SCOTUS precedent in favor of reimplementation of APIs.

The reason why those wouldn't apply to Copilot is because they aren't separating out APIs from implementation and just implementing what they need for the goal of compatibility or "programmer convenience". AI takes the whole work and shreds it in a blender in the hopes of creating something new. The hope of the AI community is that the fair use argument is more like Authors Guild v. Google rather than Sony v. Connectix.

bogwog3y ago

> Today they're filing a lawsuit against copilot.

> Tomorrow it will be against stable diffusion or (dall-e, gpt-3 whatever)

> And then eventually against Wine/Proton and emulators (are APIs copyrightable)

Textbook definition of F.U.D.

laputan_machine3y ago

Genuinely one of the worst takes I've ever read. I'm not against the 'slippery slope' argument in principle, but this example is ridiculous.

mardifoufs3y ago

Slippery slope? Are you familiar with judicial precedent? Being bound to precedents is central to common law legal systems, so I don't think the GP's take was so outlandish. "Slippery slopes" and "whataboutism" might be thought-terminating buzzwords online, but not in front of a judge.

1 more reply

Barrin923y ago

>A programmer can read available but not oss licensed code and learn from it. Thats fair use.

No it isn't, at least not automatically, which is why license infringement exists at all. Having a brain doesn't change that and never has. If you reproduce someone's code you can be in hot water, and the same should hold for the operator of a machine.

It's also why the concept of a clean room implementation exists at all.

EMIRELADERO3y ago

I think the commenter you replied to was talking about using the functional, non-copyrightable elements of the copyrighted code. Clean-room is not even required by case law. There's precedent that explicitly calls it out as inefficient.

zbentley3y ago

> A programmer can read available but not oss licensed code and learn from it. Thats fair use. If a machine does it, is it wrong ?

My (extremely amateur) understanding is that what is meant by "learn from it" is one of the hinge points of the legal question.

If a programmer reads licensed code and reproduces it verbatim or near-verbatim in a project with a conflicting license, that becomes a legal problem in certain circumstances.

If a programmer reads the same code and gets an idea to implement something different, that's less troublesome (or at least, if it is troublesome it's in a different area; if the idea was related to a patentable process, then other questions arise, but I'm even less qualified to speak to that area of law).

There's nothing special about copy/paste buttons that make them the only way you can infringe copyright.

Fair use doesn't automatically kick in just because someone uses what they took/copied as part of a larger artifact; it's a really complicated legal line.

arpowers3y ago

In some ways all these AIs are plagiarizing... I think creators should opt-in to ai models, as no current license was developed with this in mind.

grayfaced3y ago

Maybe it's time for the Creative Commons licenses to address this. I'm curious if No-Derivative would already prohibit this? Does the ND language need tweaking? Or do they need a whole new clause?

Edit: I guess they do address it in their faq and I'd summarize it "Depends if copyright law applies and depends if it's considered derivative". https://creativecommons.org/faq/#artificial-intelligence-and...

1 more reply

belorn3y ago

It would be good to have a definitive and simple line for fair use that could be applied to all forms of copyright. Right now fair use is defined by four guidelines:

The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

The nature of the copyrighted work

The amount and substantiality of the portion used in relation to the copyrighted work as a whole

The effect of the use upon the potential market for or value of the copyrighted work.

A programmer who studied in school and learned to code did so clearly for an educational purpose. The nature of the work is primarily facts and ideas, while expression and fixation are generally not what the school is focusing on (obviously some copying of style and implementation could occur). The amount and substantiality of the original works is likely to be so minor as to be unrecognizable, and the effect of the use upon the potential market when students learn from existing works would be very hard to measure (if it could be detected).

When a machine does this, are we going to give the same answers? Its purpose is explicitly commercial. Machines operate on expression and fixation, and the operators can't extract the idea that a model should have learned in order to explain how a given output is generated. Machines make no distinction as to the amount and substantiality of the original works, with no ability to argue for how they intentionally limited their use of the original work. And finally, GitHub Copilot and other tools like it do not consider the potential market of the infringed work.

APIs are generally covered by the interoperability exception. I am unsure how that is related to copilot or dall-e (and the like). In the Oracle v. Google case the court also found that the API in question was neither an expression nor a fixation of an idea. A co-pilot that only generated header code could in theory be more likely to fall within fair use, but then the scope of the project would be tiny compared to what exists now.

whateveracct3y ago

> A programmer can read available but not oss licensed code and learn from it. Thats fair use. If a machine does it, is it wrong ?

Just because both activities are called "learning" does not mean they are the same thing. They are fundamentally, physically different activities.

chiefalchemist3y ago

Agreed. But it could go the other way as well. Let's say MS / GH wins and the decision establishes an even less healthy / profitable (?) outcome over the long term.

Remember when Napster was all the rage? And then Jobs and Apple stepped in and set an expectation for the value of a song (at 99 cents)? And that made music into the razor and the iPod the much more profitable blades. Sure it pushed back Napster but artists - as the creators of the goods - have yet to recover.

I'm not saying this is the same thing. It's not. Only noting that today's "win" is tomorrow's loss. This very well could be a case of be careful what you wish for.

bdcravens3y ago

In most copyright cases, exposure to the material in question is always discussed.

elcomet3y ago

This is why we can't have nice things. Copilot is the best thing that has happened in developer tools in a long time; it has increased my productivity a lot. Please don't ruin it.

theamk3y ago

Write a whole bunch of code and permit copilot learning on it! Then it would be great even without violating others' copyrights.

1 more reply

protomyth3y ago

I really feel that Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith[0] is going to have a big effect on this type of thing. They are basically relying on their AI magic to make it transformative. I'm starting to think the era of learning from material other people own without a license / permission is going to end quickly.

0) https://www.scotusblog.com/case-files/cases/andy-warhol-foun...

topher63453y ago

Is it not in the agency of the developer to hit the save button?

It seems like GitHub Copilot can spit out copyrighted works all day but the person running the text editor has to "choose" which Copilot output to actually save/commit/deploy.

Does it really matter that much "how" the text in your text editor gets there? You write it yourself or copy/paste it or have Copilot generate it. Ultimately the individual that "approved" it to be saved to the disk is the one violating the copyright, Copilot is just making a "suggestion".

nullc3y ago

I think if this is successful it will be very bad for the open world.

Large platforms like github will just stick blanket agreements into the TOS which grant them permission (and require you indemnify them for any third party code you submit). By doing so they'll gain a monopoly on comprehensively trained AI, and the open world that doesn't have the lever of a TOS will not at all be able to compete with that.

Copilot has seemed to have some outright copying problems, presumably because it's a bit over-fit. (perhaps to work at all it must be because it's just failing to generalize enough at the current state of development) --- but I'm doubtful that this litigation could distinguish the outright copying from training in a way that doesn't substantially infringe any copyright protected right (e.g. where the AI learns the 'ideas' rather than verbatim reproducing their exact expressions).

The same goes for many other initiatives around AI training material-- e.g. people not wanting their own pictures being used to train facial recognition. Litigating won't be able to stop it but it will be able to hand the few largest quasi-monopolists like facebook, google, and microsoft a near monopoly over new AI tools when they're the only ones that can overcome the defaults set by legislation or litigation.

It's particularly bad because the spectacular data requirements and training costs already create big centralization pressures in the control of the technology. We will not be better off if we amplify these pressures further with bad legal precedents.

az2263y ago

GitHub already has this in TOS -- that is the irony of the lawsuit, it is actually in GitHub's favor this happens. GitHub can in such a case jack up the price 10x as the sole provider.

bkuhn3y ago

In case folks here were curious, we at the Software Freedom Conservancy have asked the Plaintiffs to endorse the Principles of Community-Oriented GPL enforcement: https://sfconservancy.org/news/2022/nov/04/class-action-laws...

… & of course we again ask Microsoft's GitHub to start respecting FOSS licenses, cooperate with the community, & retract their incorrect claim that their behavior is “fair use”.

A few more links to our work on this issue:

https://sfconservancy.org/blog/2022/feb/03/github-copilot-co... https://sfconservancy.org/news/2022/feb/23/committee-ai-assi...

foooobaba3y ago

It seems like we should come to an agreement on what the licenses are intended for, given that they were created at a time before AI like this existed. If the authors did not intend their code to be used like this, should we not respect that? Also, does it make sense to create new licenses which explicitly state whether using the code for AI training is acceptable or not - or are our current licenses good enough?

solomatov3y ago

The most important part of this is not whether the lawsuit will be won or lost by one of the parties, but what is the legality of fair use in machine learning, and language models. There's a good chance that it gets to Supreme Court and there will be a defining precedent to be used by future entrepreneurs about what's possible and what's not.

P.S. I am not a lawyer.

warbler733y ago

It seems obvious that AI models are derivative works of the works they are trained on but it also seems obvious that it is totally legally untested whether they are derivative works in the formal legal sense of copyright law. So it should be a good case assuming we have wise and enlightened judges who understand all nuances and can guide us into the future.

buzzy_hacker3y ago

Copilot has always seemed like a blatant GPL violation to me.

puffoflogic3y ago

Code is not licensed to GitHub under the GPL. Your comment is word salad.

m00x3y ago

Care to explain in legal terms why this stance is qualified?

buzzy_hacker3y ago

You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

a) The work must carry prominent notices stating that you modified it, and giving a relevant date.

b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.

c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.

——

I don’t see how one could argue that training on GPL code is not “based on” GPL code.

foooobaba3y ago

If github or google indexes source code using a neural net to help you find it, given a query, is that also illegal? If you think of copilot as something that helps you find code you’re looking for, is it all that different, and if so, why?

In this case, wouldn’t the users of copilot be the ones responsible for any copyrighted code they may have accessed using copilot?

lbotos3y ago

The crux of the issue: Is the code that is being generated being used in a way that its license allows? That's it. I'm confident that this problem would go away if copilot said:

//below output code is MIT licensed (source: github/repo/blah)

And yes, the "users" are responsible, but it's possible that copilot could be implicated in a case depending on how its access is licensed.

Stable diffusion has this same problem btw, but in visual arts "fair use" is even murkier.

For code, if you could use the code and respect the license, why wouldn't you? Copilot takes away that opportunity and replaces it with "trust us".
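
A minimal sketch of the attribution idea above, assuming a hypothetical suggestion record that carries its source repository and license. The `AttributedSuggestion` type and `withAttribution` function are made up for illustration and say nothing about how Copilot actually works:

```typescript
// Hypothetical shape of a code suggestion that keeps its provenance.
interface AttributedSuggestion {
  code: string;
  license: string; // license name, e.g. "MIT" (assumed to be known to the tool)
  source: string;  // where the code came from, e.g. "github/repo/blah"
}

// Prepend a one-line comment naming the license and the source,
// in the same form as the comment proposed above.
function withAttribution(s: AttributedSuggestion): string {
  return `// below output code is ${s.license} licensed (source: ${s.source})\n${s.code}`;
}

const example = withAttribution({
  code: "function isEven(n: number): boolean {\n  return n % 2 === 0;\n}",
  license: "MIT",
  source: "github/repo/blah",
});
console.log(example);
```

With a header like this the user can at least look up the license and decide whether they are able to comply with it.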

foooobaba3y ago

This makes sense, it produces chunks not the whole source where a search engine would also give you the license.

leni5363y ago

Both services already accept DMCA notices to take content down.

foooobaba3y ago

True, that’s another good point.

hu33y ago

As a GitHub user, is there a way to support GitHub against this lawsuit?

Obviously not financially as Microsoft has basically YES amounts of money.

michaelmrose3y ago

If you had legal expertise and a strong opinion on the matter I suppose you could write a persuasive brief for the consideration of the court. If you have a strong opinion but aren't a legal eagle you could write to your legislators in support of legislation explicitly supporting this use case or organize the support of people more capable in that arena.

If you are opinionated but lazy, no judgement here as I sit here watching TV, you could add a notation at the top of your repos explicitly supporting the usage of your code in such tools as fair use.

Notably if your code is derivative of other works you have no power to grant permission for such use for code you don't own so best include some weasel words to that effect. Say.

I SUPPORT AND EXPLICITLY GRANT PERMISSION FOR THE USAGE OF THE BELOW CODE TO TRAIN ML SYSTEMS TO PRODUCE USEFUL HIGH QUALITY AUTOCOMPLETE FOR THE BETTERMENT AND UTILITY OF MY FELLOW PROGRAMMERS TO THE EXTENT ALLOWABLE BY LICENSE AND LAW. NOTHING ABOUT THIS GRANT SHALL BE CONSTRUED TO GRANT PERMISSION TO ANY CODE I DO NOT OWN THE RIGHTS TO NOR ENCOURAGE ANY INFRINGING USE OF SAID CODE.

Years from now when such cases are being heard and appealed ad nauseam a large portion of repos bearing such notices may persuade a judge that such use is a desired and normal use.

You could even make a GPLesque modification if you were so inclined where you said. SO LONG AS THE RESULTING TOOLING AND DATA IS MADE AVAILABLE TO ALL

Note not only am I not your lawyer, I am not a lawyer of any sort so if you think you'll end up in court best buy the time of an actual lawyer instead of a smart ass from the internet.

awestroke3y ago

If this leads anywhere I'll be pissed. I love CoPilot.

an1sotropy3y ago

copilot is great, and ignorance is bliss, isn't it

The situation that this lawsuit is trying to save you from is this: (1) copilot blurps out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns copyright on something Y that copilot was trained on, and then (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed on the licensing terms that C set for Y.

You are now in legal trouble, and copilot put you there, because it never warned you that X is part of Y, and that Y comes with such and such licensing terms.

Whether we like copilot or not, we should be grateful that this case is seeking to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but that doesn't make law.

awestroke3y ago

It's pretty obvious when it does emit copyrightable code, and you mostly have to really try to make that happen. Have you even used copilot yourself?

yamtaddle3y ago

I expect I'd love it but I've been holding off until I find out whether MS lets devs on their core products use it.

If not, it's a pretty clear sign they consider it radioactive.

still_grokking3y ago

I hope MS used a lot of AGPL code to train Copilot… This would be fun.

But no matter how this goes, in case training AI with copyrighted inputs is "fair use" that'll end up as the ultimate "copyright laundry machine" like this "joke" project here:

https://web.archive.org/web/20220104214929/https://fairuseif...

https://news.ycombinator.com/item?id=27796124 (302 points, 151 comments)

rafaelturk3y ago

Like everything legally related: This is not about open source fairness, protecting innovation, it's all about making money.

throwaway6753093y ago

Even if this succeeds, you've already lost.

1. The ability to be able to run and train these models is going to eventually be perfectly plausible on a home machine.

2. It's only a matter of time before models, e.g. a popular model scraped from all of the code on GitHub, is a publicly available torrent.

3. People will be able to just run it locally as an integrated plug-in in JetBrains or VS Code.

4. You'll never know if somebody has lifted their code in violation of a license any more than you would be able to tell if somebody used code from Stack Overflow without attribution in any commercial endeavor.

The End.

kevincox3y ago

Just because some people get away with copyright infringement doesn't mean that copyright infringement is now legal.

I don't think 1-3 matter at all. The point is that GitHub is selling a tool that can commit copyright infringement. This lawsuit is trying to get them to pay the consequences for the infringement that they have enabled.

falcolas3y ago

Crackpot Theory: Copilot (and by association many ML tools) is a form of probabilistic encryption. Once encoded, it's virtually impossible to pull the code (plaintext) directly out of the raw ML model (the cyphertext), yet when the proper key is input ('//sparse matrix transpose'), you get the relevant segment of the original function (the plaintext) back.

We've even seen this with stable diffusion image generation, where specific watermarks can be re-created (decrypted?) deterministically with the proper input.

az2263y ago

This is not crackpot -- this is literally how it works. Here's an example that points to this, https://arstechnica.com/information-technology/2022/09/bette...

Anybody looking at the source image and the generated result would say they are the same.

spir3y ago

The part of GitHub Copilot to which I object is that it's trained on private repos. Where does GitHub get off consuming explicitly private intellectual property for their own purposes?

garfieldnate3y ago

If GitHub ends up having to tweak their product to avoid ethical/legal concerns, I actually imagine it could still be pretty cool. Right now Copilot is a black box that spits out code with no attribution; what if they worked on instead making it a glass box, where it always brings up snippets of other projects along with their licensing info so that you can decide how to incorporate the ideas fairly yourself? Or they could still output the same code suggestions, but always include attribution and license data along with it. Making the product more transparent would probably make more people comfortable with using it, anyway.

Cloudef3y ago

Unless copilot spits out complete programs or libraries that are 1:1 to someone else's, who cares? Caring about random small code snippets is dumb.

bilsbie3y ago

Laws need to change to match technology.

Did you know that before airplanes were invented, common law said you owned the air above your land all the way to the heavens?

m00x3y ago

Can you explain what damages you incur from Copilot?

jacooper3y ago

People not following your license ? And not making their derived works under the same license like I require?

brookst3y ago

I wonder if the plaintiffs' code would stand up to scrutiny of whether any of it was copied, even unintentionally, from other code they saw in their years of learning to program? I know that I have more-or-less transcribed from Stack Overflow/etc, and I have a strong suspicion that I have probably produced code identical to snippets I've seen in the past.

zach_garwood3y ago

But have you done so on an industrial scale?

brookst3y ago

I'm just one person! Give me a team of 1000 and I'll get right on that.

layer83y ago

Copilot reminds me of the Borg: You will be assimilated. We will add your technological distinctiveness to our own. Resistance is futile.

omegacharlie3y ago

Think some of the negativity about Copilot may be the perception that if an individual or small startup attempted training an ML model from public source-code and commercialised a service from it they would be drowning in legal issues from big companies not happy with their code used in such a product.

In addition just because code is available publicly on GitHub does not necessarily mean it is permissively licensed to use elsewhere, even with attribution. Copyright holders not happy with their copyrighted works publicly accessible can use the DMCA to issue take-downs that GitHub does comply with but how that interacts with Copilot and any of its training data is a different question.

As much as the DMCA is bad law, it's rather funny seeing Microsoft be charged in this lawsuit with the lesser-known provision against 'removal of copyright management information'. Microsoft does have more resources to mount a defence, so it will probably end up different compared to a smaller player facing this action.

rolenthedeep3y ago

Consider each repo on github to be a movie. What copilot does is to search for sequences of frames from any movie which line up to create a new coherent movie.

Individually, each frame is protected by the copyright of the movie it belongs to. But what happens if you take a million frames from a million different movies and just arrange them in a new way?

That's the core question here. Is the new movie a new copyrightable work, or is it plagiarizing a million other works at once? Is it legal to use copyrighted works in this way?

The other question is if it is right to use copyrighted works this way. Is this within the spirit of open source software? Or is this just a bad corporation taking advantage of your good will?

I'm not sure where I stand on this, it's a complicated problem for sure. Definitely interested to see how this plays out in court.

az2263y ago

Fair use.

poulpy1233y ago

>By training their AI systems on public GitHub repositories (though based on their public statements, possibly much more) we contend that the defendants have violated the legal rights of a vast number of creators who posted code or other work under certain open-source licenses on GitHub.

I don't know about the US laws in copyright so I can't comment on the legal documents but this website is not complaining that copilot is reproducing copyrighted content but it was trained on copyrighted content. I don't see how you can forbid someone or something to read and learn from something that is public (once again producing is another problem)

throwaway6753093y ago

How much code is necessary to be considered a copyright infringement from an existing code base?

For example let's say I'll take a single frame of animation from a cartoon, The frame contains a mountain, house, and a couple characters although those characters are not integral to the actual cartoon maybe they're extras (villagers and not named characters something like Mickey Mouse for example)

I draw a picture of a lake with a cabin next to it, then start to draw a frontiersman but I trace one of his arms from a villager of that previous frame of animation... Number one am I in danger of copyright infringement (have I hit some arbitrary threshold), and number two: am I causing monetary losses for the cartoon?

jasonladuke03113y ago

Merits of the case aside, I'm befuddled that a company with a legal team like Microsoft approved this product. Is their assumption that this would bring in more revenue than potentially defending it in court? The math doesn't make sense to me.

RamblingCTO3y ago

lol @ "open-source software piracy"

If I'm being honest I'm a bit annoyed at this. What's the problem and what's the point of this?

opine-at-random3y ago

If you'd ever read even a single one of the licenses to the software I'm sure you use everyday, you'd understand. This is such an obvious and pathetic strawman.

I notice often on hackernews that people don't seem to understand anything about free or open-source software outside of the pragmatics of whether they can abuse the work for free.

RamblingCTO3y ago

You read a lot into my not so serious comment. Maybe internet comment sections aren't the right place for you.

But I'll bite: I know licensing, thank you. But what's copyrightable is not so easy. Licenses are not so easy. Copilot does not copy entire works and it's very questionable if a few lines of code are "piracy". It's a repeating discussion again and again, there's nothing novel about it except for the fact that a machine learns (and overfits for small portions of code). So please get off your high horse. I don't care for your fundamentalism.

bpodgursky3y ago

Lawyers want $$$$.

RamblingCTO3y ago

Yeah I guess so. This website reads like bullshit bingo from some weird twitter dude trying to sell you his newest product:

"AI needs to be fair & ethical for everyone. If it’s not, then it can never achieve its vaunted aims of elevating humanity. It will just become another way for the privileged few to profit from the work of the many."

Blah blah. Can we get back to the hacking on stuff mentality?

renewiltord3y ago

It doesn't make sense. If I make a piece of software that curls a random gist and then puts it into your editor am I infringing or are you infringing when you run it or are you infringing when you use that file and distribute it somewhere?

lbotos3y ago

> If I make a piece of software that curls a random gist and then puts it into your editor am I infringing

Depends on the license. If it's MIT and you serve the license, no, you are not infringing at all. A trimmed version of MIT for the relevant bits:

Permission is hereby granted [...] to any person obtaining a copy of this software [...] to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, [...] subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

> are you infringing when you run it

Depends on the license

> are you infringing when you use that file and distribute it somewhere

Depends on the license

----

When copilot gives you code without the license, you can't even know!

renewiltord3y ago

Well, `curl` will download a gist without checking its license. So curl is infringing?

mezbot3y ago

This issue seems to have an obvious solution that I fail to see anyone mention: Treat copilot simply as a tool, let it be trained on whatever without any consent requirements. However the outputs should be subject to copyright as with any other code produced by a human. Then on a case by case basis courts can decide if infringement has occurred. The idea of banning copilot or other AI models as a whole just seems like a collective case of sour grapes because innovation and automation is finally threatening some people who only expected these things to affect the working class

EMIRELADERO3y ago

I think it's a great time to explain why this won't hit AI art such as Stable Diffusion, even if GitHub loses this case.

The crux of the lawsuit's argument is that the AI unlawfully outputs copyrighted material. This is evident in many tests with many people here and on Twitter even getting verbatim comments out of it.

AI art, on the other hand, is not capable of outputting the images from its training set, as it's not a collage-maker, but an artificial brain with a paintbrush and virtual hand.

jrochkind13y ago

Eh... I don't know. It sounds to me like you are saying because the code example outputs exact lines, it's a copyright violation; but the image AI's necessarily don't output exact copies of even portions of pre-existing images, that's not how they work.

But I don't think copyright on visual images actually works like that, that it needs to be an exact copy to infringe.

If I draw my own pictures of Mickey Mouse and Goofy having a tea party, it's still a copyright infringement if it is substantially similar to copyrighted depictions of Mickey Mouse and Goofy. (subject to fair use defenses; I'm allowed to do what would otherwise have been a copyright infringement if it meets a fair use defense, which is also not cut and dried, but if it's, say, a parody it's likely to be fair use. There is probably a legal argument that Copilot is fair use.... the more money Github makes on it, the harder it is though, but making money off something is not relevant to whether it's a copyright violation in the first place, but is to fair use defense).

(yes, it might also be a trademark infringement; but there's a reason Disney is so concerned with copyright on Mickey expiring, and it's not that they think there's lots of money to be made on selling copies of the specific Steamboat Willie movie...)

> There is actually no percentage by which you must change an image to avoid copyright infringement. While some say that you have to change 10-30% of a copyrighted work to avoid infringement, that has been proven to be a myth. The standard is whether the artworks are “substantially similar,” or a “substantial part” has been changed, which of course is subjective.

https://www.epgdlaw.com/how-can-my-artwork-steer-clear-of-co...

I think Stable Diffusion etc are quite capable of creating art that is "substantially similar" to pre-existing art.

EMIRELADERO3y ago

I believe fair use is the way to go then. SD would definitely be so, in my opinion.

PuddleCheese3y ago

These models can actually output images that can be extremely close to the material present in training models:

- https://i.imgur.com/VikPFDT.png

I also don't know if I would anthropomorphize ML to that degree. It's a poor metaphor and isn't really analogous to a human brain, especially considering our current understanding, or lack thereof, of the brain, and even the limited insight we have into how some of these models work from the people who work on them.

kmnc3y ago

I don’t understand this argument… if image AI gets good enough then generating exact copies of its training model seems trivial.

az2263y ago

https://arstechnica.com/information-technology/2022/09/bette...

Want to say that again?

solomatov3y ago

IMO, the case is exactly the same for copilot and generative models for images. That's why it's so important to have some precedent as a guide for future products.

P.S. I am not a lawyer.

fancyfredbot3y ago

If a software developer learns how to code better by reading GPL software and then later uses the skills they developed to build closed source for profit software should they be sued?

thomastjeffery3y ago

If a software developer writes a program to remember a million lines of GPL code, then uses that dataset to "generate" some of that code, then they are essentially violating that license with extra steps.

The extra steps aren't enough to exonerate them. It's just a convoluted copy operation.

It's just like how a lossy encoding of a song is still - with respect to copyright - a copy of that song. The data is totally different, and some of the original is missing. It's still a derivative work. So is a remix. So is a reperformance.

buzzy_hacker3y ago

Copilot is not a person, it is a piece of software.

Phrodo_003y ago

Depends on how closely they reuse the code. Writing it verbatim or nearly? Yes.

jacooper3y ago

A human doesn't perfectly reproduce the same code he learned from.

throwaway6753093y ago

A person with eidetic memory absolutely could do so.

hjroberts3y ago

Whether it is legally wrong or not to scan OSS code (I think it is wrong), there has been a time-honored precedent for disallowing automated scanning:

  robots.txt

This is exactly what is needed for source code, and the default (no robots.txt) should be "disallow".

The fact that the Web has considered this moral issue should be a strong hint for the AI people not to take a purely legal stance but consider the OSS community that they are so heavily using.
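
A rough sketch of the "default disallow" rule proposed above, assuming a hypothetical robots.txt-style file at the root of a repository. The function name `mayScan` is made up, and the parser is deliberately simplified: it only understands a whole-repository `Allow: /` or `Disallow: /` rule under `User-agent: *`, which is far less than the real robots.txt convention covers.

```typescript
// Hypothetical: decide whether an automated scanner may read a repository,
// given the text of a robots.txt-style file (or undefined if there is none).
// Unlike the web convention, a missing file means "disallow".
function mayScan(policyFile: string | undefined): boolean {
  if (policyFile === undefined) {
    return false; // no file present: scanning is not permitted
  }
  let appliesToAll = false;
  for (const rawLine of policyFile.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop trailing comments
    if (line === "") continue;
    const sep = line.indexOf(":");
    if (sep === -1) continue; // not a "field: value" line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      appliesToAll = value === "*";
    } else if (appliesToAll && field === "disallow" && value === "/") {
      return false;
    } else if (appliesToAll && field === "allow" && value === "/") {
      return true;
    }
  }
  return false; // nothing was explicitly allowed
}

console.log(mayScan(undefined));                 // no file
console.log(mayScan("User-agent: *\nAllow: /")); // explicit opt-in
```

The only design point that matters here is the last `return false`: permission has to be stated, and silence counts as a refusal.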

atum473y ago

Forgive my ignorance, but who is going to benefit from this lawsuit? I have a lot of code on GitHub, can I, for instance, expect a check in the mail in case of a win?

gpm3y ago

(Not a lawyer, so this is really definitely absolutely not legal advice and if you're looking to profit you should speak to a lawyer... for instance the lawyers who just filed the lawsuit)

They're asking for two things, injunctive relief (ordering github/openai/microsoft to stop doing this) and damages.

I suppose the injunctive relief really benefits anyone who doesn't want AI models to exist, because that's what it's asking for.

The damages will go the members of the class certified for damages, with more going to the lead plaintiffs (those actually involved in the suit) and some going to the lawyers. They're asking for the following class definition for damages

> All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time during the Class Period.

datacruncher013y ago

I think the software is probably ok provided that the sources are credited (i.e., if Copilot copies code from, say, SDL, then the relevant code sections need to be correctly attributed and the mandatory license readme copied to the project, so all code follows the open source licenses used). That's literally the purpose of open source licenses. If Copilot can't be bothered to do that, then yeah it should be shut down.

cothrowaway883y ago

Made a throwaway since I guess this stance is controversial. I could not care less about how copilot was made and what kind of code it outputs. It's useful and was inevitable.

I'm 1000% on team open source and have had to refer to things like tldrlegal.com many times to make sure I get all my software licensing puzzle pieces right. Totally get the argument for why this litigation exists in the present.

Just saying in general my friends I hope you have an absolutely great day. Someone will be wrong on the internet tomorrow, no doubt about it. Worry about something productive instead.

This one has the feel of being nothing more than tilting at windmills in the long run.

0cf8612b2e1e3y ago

Is there any amount of public data/code/whatever I can make an offline backup of today in the event this gets pulled?

kyleee3y ago

That’s what I am wondering, as a contingency plan so at least a replica service can be created if copilot shuts down.

matthewwolfe3y ago

I will never understand why people push code to public repos and then complain when someone or something uses that code. Code that you want to keep private or make money off of should be private. Only publish stuff to the public that you want other people to see and learn from. All the complaints about attribution… who cares.

YoshiRulz3y ago

> All the complaints about attribution… who cares.

I may not care if some guy I've never met uses my niche library without attribution. (I do care, really.) But Microsoft certainly cares if you use their code without attribution, so why shouldn't I take the same belligerent, copyright-enforcing attitude towards them? That's the main reason why people are angry, because MS has "rules for thee but not for me" by virtue of being big enough to have ~~good~~effective lawyers and lobbyists.

pmarreck3y ago

This will fail. Copilot is too good, and only suggests snippets or small functions, not entire classes for example.

User233y ago

Copilot is clearly a derivative work. So is every other similar model. How is this even up for discussion?

stovenctl3y ago

The comparison I would draw is it's a statistics based search engine for code.

Sometimes the query is the first half of a small statement that we can fill in with common patterns. Useful, fair.

Sometimes the query is a signature like `fn fast_inv_sqrt` that copies someone's code and doesn't attribute it.

nuc1e0n3y ago

My own view is that it is not legal for humans to produce derivatives of copyrighted works currently. So therefore it is probably already not legal to train an artificial intelligence using copyrighted works in order to produce derivatives either.

jjgon17813y ago

I am surprised at the number of people who are in favor of Copilot being trained on copyrighted data.

scoot3y ago

The editorialized title isn't correct. The lawsuit is against GitHub for Copilot not against GitHub Copilot, which is not a "legal person".

A better shortening of the original title is simply "We’ve filed a lawsuit challenging GitHub Copilot"

reachableceo3y ago

Let me (start or join the call) for federal investigation and the filing of criminal complaints in all relevant locales.

Grand theft, interstate wire fraud and conspiracy for same.

This is a criminal matter as well as civil. Intentional and knowing violation of the law.

We must not let our work be taken!

gcau3y ago

As much as I love the little guy beating the big evil company, I hope the lawsuit doesn't cause anything to happen to copilot. Maybe some changes, like better protection against emitting 1:1 licensed code or opting out your code from training.

vlovich1233y ago

Can someone explain to me Microsoft’s decision here to use GPL code in the training set? It would seem like sticking to non-attribution / non-viral licenses would have kept them in the clear. Was that an insufficient size data set?

az2263y ago

It only trains on the GPL code, it doesn't reproduce entire code files verbatim. So it's fair use.

eurasiantiger3y ago

Maybe we just need to prompt it to include the proper licenses and attributions. /s

tmtvl3y ago

Eh, I don't mind Copilot being trained on my code as long as it and all projects made using it are licensed under the AGPL.

thesuperbigfrog3y ago

How original is the generated code?

Can the generated code be traced back to the code used for training and the original copyrights and licenses for that code?

If so, what attribution(s) and license(s) should apply to the generated code?

dmitrygr3y ago

They demonstrate generated code being identical to some training code.

avian3y ago

There were well-known examples of copilot reproducing exact code snippets well before this lawsuit (e.g. Quake's fast inverse square root function). Microsoft dealt with them by simply adding the offending function names to a blocklist.

In other words, if your open source project doesn't have such immediately recognizable code and didn't cause a shitstorm on Twitter, chances are copilot is still happily spewing out your exact code, sans the copyright and license info.
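
A rough sketch of what a name-based blocklist like the one described above could look like. The list contents and the substring matching rule are assumptions made for illustration, not GitHub's actual implementation:

```typescript
// Illustrative blocklist of well-known function names (assumed, not official).
const blockedNames: string[] = ["fast_inv_sqrt", "Q_rsqrt"];

// A suggestion is suppressed if it mentions any blocklisted name.
function isBlocked(suggestion: string, blocked: string[]): boolean {
  return blocked.some((name) => suggestion.indexOf(name) !== -1);
}

// The recognizable name is caught...
console.log(isBlocked("float Q_rsqrt(float number) {", blockedNames));
// ...but the same body under a different name is not.
console.log(isBlocked("float my_rsqrt(float x) {", blockedNames));
```

A filter of this kind matches identifiers and never looks at the body of the code, so a verbatim copy under any other name passes straight through.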

m00x3y ago

Just like developers have never copy-pasted code from stack overflow or Github :):):)

Swizec3y ago

How many ways are there to write many of the basic algorithms we all use though? Can I copyright "({ item }) => <li>{item.label}</li>"?

Because I sure have seen that exact code written, from scratch, in many many places.

I guess my question boils down to "What is the smallest copyrightable unit of code?". Because I'm certain suing a novelist for copyright infringement on a character that says "Hi, how are you?" would be considered absurd.

arpowers3y ago

The proper way to think about these LLMs is similar to plagiarism.

Seems to me the underlying data should be opt-in from creators and licenses should be developed that take AI into consideration.

Aeolun3y ago

I find this whole subject exhausting. The only reason I’m glad there is a lawsuit is that we can finally put this thing to rest when either party wins.

Yahivin3y ago

Copilot does include the licenses...

Start off a comment with // MIT license

Then watch parts of various software licenses come out including authors' names and copyrights!

marmada3y ago

All these people whining about copyright need to consider: is the issue Copilot, or is the issue copyright.

amelius3y ago

Can Copilot reproduce Numerical Recipes in C?

(asking because I know the authors were kinda famous for being very litigious).

HeavyStorm3y ago

"Angry people brandish their fists against the incoming revolution" is also a good title.

sensanaty3y ago

I personally hope they win, and win big. Anything that ruins Micro$oft's day is a boon to mine.

clusterhacks3y ago

Did Microsoft use the source code of Windows (in whole or in part) as training input to Copilot?

az2263y ago

Microsoft didn't do the training. OpenAI did. They only used public code.

machiste773y ago

bruh, come on! you're gonna ruin it for the rest of us

kgarten3y ago

on a tangent ... beautiful typography, I love Matthew Butterick's work on legible fonts and his guide to Practical Typography.

all the best with the lawsuit.

barelysapient3y ago

MSFT to $0 anyone?

i_like_apis3y ago

I love that this is going to lose.

SighMagi3y ago

I did not see that coming.

SurgeArrest3y ago

I hope this case will fail and establish a good precedent for all future AI litigation and maybe even prevent new cases. Your code is open source - irregardless of license, one might read it as a text book and then remember or even copy snippets and re-use this somewhere else unrelated to the original application. If you don't like this, don't make your code open source. This was happening and is happening independent of any license all over the world by the majority of developers. What Copilot and similar tools did was to make those snippets accessible for extrapolation in new applications.

If these folks win - we again throw progress under the bus.

jacooper3y ago

No thank you. I put a license to be followed, not to just be disregarded by an AI as "Learning material". No human perfectly reproduces their learning material no matter what, but Copilot does.

mcluck3y ago

You mean to tell me that no one has ever perfectly replicated an example that they read somewhere? There's only so many ways to write AABB collision, fibonacci, or any number of other common algorithms. I'm not saying there aren't things to consider but I'm sure I've perfectly replicated something I read somewhere whether I'm actively aware of it or not

IshKebab3y ago

So are you ok with it being illegal for humans to learn from copyrighted books unless they have a license that explicitly allows learning? That does not sound like a pleasant consequence.

6 more replies

throwaway6753093y ago

100% false. There are loads of historical cases of people with eidetic memories being able to reproduce things that they've seen with near-complete fidelity, and there's no reason to believe that a coder with such a memory would be any different.

Etheryte3y ago

> Your code is open source - irregardless of license, one might read it as a text book and then remember or even copy snippets and re-use this somewhere else unrelated to the original application.

Yes, but attribution should still be given. Just because you don't copy-paste someone else's creation doesn't mean you're licensed to use it.

shagie3y ago

Is it the role of the tool (in this case copilot) to include the license information? Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?

What if, instead of a tool, you had a random consultant do some work, and it was found out that he asked a ton of stuff on Stack Overflow and copied the CC-BY-SA 4.0 answers into his work? What if it was then found out that one of those answers was based on copying something from the Linux kernel? Who is responsible for doing the license check on the code before releasing the product?

1 more reply

humanwhosits3y ago

> irregardless of license

Hard no. Please stop using open source code if this is how you think of it.

Without licenses being respected, we don't get open source communities.

az2263y ago

Licenses be damned, copyright law sits above them -- and for now, it's hard to see how this isn't fair use. The only case might be an open source Copilot alternative and GitHub and OpenAI can take any such projects out of the training set.

vesinisa3y ago

Open source does not mean public domain. Open source specifically attaches limitations on how the code may be reused.

elcomet3y ago

There are no limitations on reading the code to learn from it.

1 more reply

simion3143y ago

> Your code is open source ....

So why can MS screw only with some licenses that you call "open source"? Your example with a human reading a book would also work with code-available licenses or decompiled binaries.

I would have been fine if the open source code had been used to create an open model, or if MS had put its ass on the line and also trained the model on all the GitHub code, since they claim there is no copyright issue.

solomatov3y ago

The problem is that copyright laws were introduced for a reason, and with a thinking similar to yours we might decide to get rid of copyright altogether, which I think is a bad idea.

P.S. I am not a lawyer.

tfsh3y ago

If organisations are going to ignore the licenses attached to my OSS and that's legitimised in the law, then that's a surefire way to irreparably damage the open source ecosystem.

ISL3y ago

Can anyone with Copilot access give a short summary of its response to the prompts:

  function force=Gmmr2Array(mass1, mass2)

and

  function [force, torque]=pointMatrixGravity(array1,array2)

I'd love to know if some of my GPL v3 code [1, 2] has landed in the training set

[1] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...

[2] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...

jmcphers3y ago

First prompt:

    var G = 6.67e-11;
    var force = G * mass1 * mass2 / distance * distance;
    return force;

Second prompt:

  % pointMatrixGravity - Calculates the gravitational force and torque on a
  % point mass due to a matrix of point masses.
  % 
  %   [force, torque]=pointMatrixGravity(array1,array2)
  % 
  %   Inputs:
  %       array1 - 3xN array of point masses
  %       array2 - 3xN array of point masses
  %   Outputs:
  %       force - 3x1 array of gravitational force
  %       torque - 3x1 array of gravitational torque
  % 
  %   Author: David J. R. MacKay,
  %   email:
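
One detail in the first completion above: in JavaScript-family languages `/` and `*` share the same precedence and associate left to right, so `G * mass1 * mass2 / distance * distance` is evaluated as `(G * mass1 * mass2 / distance) * distance`, which cancels the distance instead of dividing by its square. A minimal TypeScript sketch of the difference follows. Both function names are hypothetical and come from neither the linked repository nor Copilot:

```typescript
// Gravitational constant in SI units, rounded as in the completion above.
const G = 6.67e-11;

// Hypothetical name: reproduces the completion's expression verbatim.
function forceAsCompleted(mass1: number, mass2: number, distance: number): number {
  // Parsed as ((G * mass1 * mass2) / distance) * distance, so distance cancels out.
  return G * mass1 * mass2 / distance * distance;
}

// Hypothetical name: Newton's inverse-square law with explicit parentheses.
function forceInverseSquare(mass1: number, mass2: number, distance: number): number {
  return (G * mass1 * mass2) / (distance * distance);
}
```

With unit masses at a distance of 2, the first function returns `G` while the second returns `G / 4`.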

ridiculous_fish3y ago

FYI if you use the "Open GitHub Copilot" command in VSCode you will get up to 10 different outputs for the same prompt.

Interesting that my results were different than yours!

jm43y ago

I got something a bit different from the other person who responded to you.

  function force=Gmmr2Array(mass1, mass2) {
      var G = 6.67408e-11;
      var m1 = mass1;
      var m2 = mass2;
      var r = 1;
      var force = G * m1 * m2 / (r * r);
      return force;
  }

  function [force, torque]=pointMatrixGravity(array1,array2) {
      var G = 6.67408e-11;
      var m1 = array1(1);
      var m2 = array2(1);
      var r = norm(array1(2:4)-array2(2:4));
      var force = G * m1 * m2 / (r * r);
      var torque = cross(array1(2:4)-array2(2:4), force);
      return [force, torque];
  }
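
For comparison, the physics these completions gesture at can be written out for a single pair of point masses. In the completion, `force` is a scalar, so taking `cross(..., force)` does not yield a meaningful torque; the force needs to be a vector along the separation. The sketch below is TypeScript under stated assumptions: the `PointMass` layout, the choice to take torque about the origin, and every name in it are illustrative, not the conventions of the GPL code linked above.

```typescript
type Vec3 = [number, number, number];

// Hypothetical representation of a point mass: mass in kg, position in m.
interface PointMass {
  mass: number;
  position: Vec3;
}

// Gravitational constant in SI units, same rounding as the completion above.
const GRAV = 6.67408e-11;

function sub(a: Vec3, b: Vec3): Vec3 {
  return [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
}

function cross(a: Vec3, b: Vec3): Vec3 {
  return [
    a[1] * b[2] - a[2] * b[1],
    a[2] * b[0] - a[0] * b[2],
    a[0] * b[1] - a[1] * b[0],
  ];
}

// Force on p1 due to p2, and the torque of that force about the origin.
function pointPairGravity(p1: PointMass, p2: PointMass): { force: Vec3; torque: Vec3 } {
  const r = sub(p2.position, p1.position); // points from p1 toward p2
  const d = Math.hypot(r[0], r[1], r[2]);
  if (d === 0) {
    throw new Error("point masses coincide; force is undefined");
  }
  // F = G m1 m2 / d^2 along r/d, i.e. (G m1 m2 / d^3) * r
  const scale = (GRAV * p1.mass * p2.mass) / (d * d * d);
  const force: Vec3 = [scale * r[0], scale * r[1], scale * r[2]];
  const torque = cross(p1.position, force); // tau = r1 x F
  return { force, torque };
}
```

Summing this over every pair drawn from two arrays of point masses would give a net force and torque, which is presumably what a function with the `pointMatrixGravity` signature is meant to return.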

ridiculous_fish3y ago

For Gmmr2Array: https://gist.github.com/ridiculousfish/9a25f5f778d98ecd81099...

For pointMatrixGravity: https://gist.github.com/ridiculousfish/af05137a4090e92de3a97...

m00x3y ago

The only people who gain out of class action lawsuits are the lawyers.

This person (a lawyer) saw an opportunity to make money and jumped on it like a hungry tiger on fresh meat.

tasuki3y ago

I have quite a bit of respect for Matthew Butterick. I don't think he's just a lawyer looking to earn a quick buck. He cares about software and wants to make the world a better place.

> But neither Matthew Butterick nor anyone at the Joseph Saveri Law Firm is your lawyer

This is curious. None of them are my lawyers, but surely at least some of them are someone's lawyers? Isn't it wrong to put such a blanket disclaimer on a website which might well be read by their clients?

alpaca1283y ago

So he gets to make money with his profession while defending OSS licenses? I don't see the big problem.

alsodumb3y ago

This. I've seen so many class action lawsuits where at the end of the day the highest gain per capita always ends up going to the lawyers. Fuck this guy and everyone trying to make money from this.

Entinel3y ago

I don't have a comment on this personally but I want to throw this out there because every time I see people criticizing Copilot or Dall-E someone always says "BUT IT'S FAIR USE!" Those people don't seem to grasp that "Fair Use" is a defense. The burden is not on me to prove what you are doing is not fair use; the burden is on you to prove what you are doing is fair use.

1 more reply

VoodooJuJu3y ago

As celestialcheese says [1], it seems like a manufactured case for the purpose of furthering someone's legal career rather than seeking remittance for any violations made by Copilot.

But I like to put on my conspiracy hat from time to time, and right now is one such time, so let's begin...

Though the motivations behind this case are uncertain, what is certain is that this case will establish a precedent. As we know, precedents are very important for any further rulings on cases of a similar nature.

Could it be the case that Microsoft has a hand in this, in trying to preempt a precedent that favors Copilot in any further litigation against it?

Wouldn't put it past a company like Microsoft.

Just a wild thought I had.

[1] https://news.ycombinator.com/item?id=33457826

1 more reply

bugfix-663y ago

Ask HN: I want to modify the BSD 2-Clause Open Source License to explicitly prohibit the use of the licensed software in training systems like Microsoft's Copilot (and use during inference). How should the third clause be worded?

  The No-AI 3-Clause Open Source Software License

  Copyright (C) <YEAR> <COPYRIGHT HOLDER>

  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in
     the documentation and/or other materials provided with the
     distribution.

  3. Use in source or binary forms for the construction or operation
     of predictive software generation systems is prohibited.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

https://bugfix-66.com/f0bb8770d4b89844d51588f57089ae5233bf67...

an1sotropy3y ago

IANAL, and I'm no fan of copilot, but I wonder if this kind of clause (your #3) is going to fly: you're preemptively prohibiting certain kinds of reading of the code (when code is read by the ML model in training). But is that something a license can actually do?

The legal footing that copyright gives you, on which licensing rests, certainly empowers you to limit things about how others may redistribute your work (and things derived from it), but does it empower you to limit how others may read your work? As a ridiculous example, I don't think it would be enforceable to have a license say "this code can't be used by left-handed people", since that's not what copyright is about, right?

bugfix-663y ago

The license conditionally permits (i.e., controls) "redistribution and use in source and binary forms".

I think we can constrain use with the third clause.

My question is, how should we word that clause?

2 more replies

CrazyStat3y ago

The legal theory for copilot is that training an ML model is fair use, not that the license allows it. If it is fair use then you can't prohibit it by license, no matter what you put in your license.

kochb3y ago

For this clause to have any positive effect, you need to 1) be willing to pursue legal action against violators and 2) actually notice that the clause has been violated.

Such language must be carefully written. What is the definition of “construction” and “operation” in a legal context? What is a “predictive software generation system”? That’s a very specific use case, you sure you covered everything you want to prohibit?

You’ve inserted your clause in such a way that this dependency cannot be used in any way to build anything similar to a “predictive software generation system”, even with attribution, as it would fail clause 3.

You have to consider that novel licenses make it difficult for any party that respects licenses to use your code. It is difficult to make one-off exceptions, especially when the text is not legally sound. So adoption of your project will be harmed.

So if you are serious about this license, you need a lawyer.

m00x3y ago

Get a lawyer since this is nonsense.

bugfix-663y ago

It's literally the standard BSD 2-Clause License, word for word, with an additional third clause:

  3. Use in source or binary forms for the construction or operation
     of predictive software generation systems is prohibited.

Hardly nonsense, but obviously you aren't equipped to judge. More about the BSD licenses:

https://en.m.wikipedia.org/wiki/BSD_licenses

2 more replies

tptacek3y ago

Is it? A similarly casual clause in the OCB license prevented OCB from being used by the military for many years (granted, it prevented OCB from being used almost everywhere else, too).

I have no idea if this license language works or doesn't, but this is hardly the least productive subthread on this story. It's concrete and specific, and we can learn stuff from it.

1 more reply

az2263y ago

Just don't upload your code to GitHub. Don't make it open source. Share via newsletters.

ilc3y ago

If I read this right, I can't use auto-complete. No thanks.

tedunangst3y ago

Yeah, lol. New rule: code may be used for autocomplete, but only by a pushdown automaton.

60secs3y ago

This is why we can't have nice dystopias.

1 more reply
