“AI” is just fancy speak for “complex math program”. If I make a program that’s simply given an arbitrary input and then, through math operations, outputs Microsoft’s copyrighted code, am I in the clear just because it’s “AI”? I think they would sue the heck out of me if I did that, and I believe the opposite should be true as well.
I’m sure my own open source code is in that thing. I did not see any attributions, thus they break the fundamentals of open source.
In the spirit of Rick Sanchez: it’s just compression with extra steps.
function isPrime(n: number): boolean {
  for (let i = 2; i < n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return n > 1;
}
function isEven(n: number): boolean {
  return n % 2 === 0;
}
These are clearly not covered by copyright in the first place. This case is really quite pathetic.
Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others. The DMCA claim in particular strikes me as strong-ish at first glance, though.
Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
Do you want to be vulnerable to copyright litigation for code you write? Can you afford to respond to every lawsuit filed by a disgruntled wingbat or a large corp wanting to shut down an open source or competing project?
There's a fairly simple technical fix for codex/copilot anyway; stick a search engine on the back end and index the training data and don't output things found in the search engine.
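A minimal sketch of that idea, assuming the training data can be indexed as fixed-length character windows. The names `WINDOW`, `buildIndex`, `reproducesTrainingData`, and `filterSuggestions` are made up for illustration, and nothing here reflects how Codex or Copilot is actually built:

```typescript
// Hypothetical sketch: suppress suggestions that reproduce long runs of indexed training code.
// WINDOW is an assumed threshold, not a value taken from any real product.
const WINDOW = 40;

// Collapse whitespace so trivial reformatting does not hide a match.
function normalize(code: string): string {
  return code.replace(/\s+/g, " ").trim();
}

// Index every WINDOW-character substring of the normalized training documents.
function buildIndex(trainingDocs: string[]): Set<string> {
  const index = new Set<string>();
  for (const doc of trainingDocs) {
    const text = normalize(doc);
    for (let i = 0; i + WINDOW <= text.length; i++) {
      index.add(text.slice(i, i + WINDOW));
    }
  }
  return index;
}

// True if any WINDOW-character run of the suggestion appears in the index.
function reproducesTrainingData(suggestion: string, index: Set<string>): boolean {
  const text = normalize(suggestion);
  for (let i = 0; i + WINDOW <= text.length; i++) {
    if (index.has(text.slice(i, i + WINDOW))) return true;
  }
  return false;
}

// Keep only suggestions that contain no indexed run.
function filterSuggestions(suggestions: string[], index: Set<string>): string[] {
  return suggestions.filter((s) => !reproducesTrainingData(s, index));
}
```

A real index over a corpus that size would presumably store hashes rather than raw substrings, and exact matching like this would miss copies with renamed variables, so this probably only catches the verbatim case.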
If they just stick to using permissive-licensed source code then I'm not sure what the actual 'harm' is with co-pilot.
If they auto-generate an acknowledgement file for all source repos used in co-pilot, and then asked clients of co-pilot to ship that file with their product, would that be enough? Call it "The Extended Github Co-Pilot Derivative Use License" or something.
Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand. Comparing it to traditional computation is a trap, same as treating it like a human mind. They're very different, under the surface.
IMO, if this is a data problem then we should treat it like one. Simple fix - find a legal basis for which licenses are permissive enough to allow for ML training, and train your models on that. The problem here isn't developers crying out in fear of being replaced by robots, it's more that the code that it is reproducing is not licensed for reproduction (and the AI doesn't know that). People who can prove that proprietary code made it into Copilot deserve a settlement. Schlubs like me who upload my dotfiles under BSD don't fall under the same umbrella, at least the way I see it.
Using Copilot is a bit like using a shotgun: it can be very illegal depending on what you shoot at. Creating and distributing the Copilot app is like creating and selling a shotgun.
It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
And that is why this lawsuit is dead on arrival.
Abstraction-Filtration-Comparison
The AFC test is a three-step process for determining substantial similarity of the non-literal elements of a computer program. The process requires the court to first identify the increasing levels of abstraction of the program. Then, at each level of abstraction, material that is not protectable by copyright is identified and filtered out from further examination. The final step is to compare the defendant's program to the plaintiff's, looking only at the copyright-protected material as identified in the previous two steps, and determine whether the plaintiff's work was copied. In addition, the court will assess the relative significance of any copied material with respect to the entire program.
Abstraction
The purpose of the abstraction step is to identify which aspects of the program constitute its expression and which are the ideas. By what is commonly referred to as the idea/expression dichotomy, copyright law protects an author's expression, but not the idea behind that expression. In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program. The abstractions test was first developed by the Second Circuit for use in literary works, but in the AFC test, they outline how it might be applied to computer programs. The court identifies possible levels of abstraction that can be defined. In increasing order of abstraction, these are: individual instructions, groups of instructions organized into a "hierarchy of modules", the functions of the lowest-level modules, the functions of the higher-level modules, and the "ultimate function" of the code.
Filtration
The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.
The court explains that elements dictated by efficiency are removed from consideration based on the merger doctrine which states that a form of expression that is incidental to the idea cannot be protected by copyright. In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright.
Eliminating elements dictated by external factors is an application of the scènes à faire doctrine to computer programs. The doctrine holds that elements necessary for, or standard to, expression in some particular theme cannot be protected by copyright. Elements dictated by external factors may include hardware specifications, interoperability and compatibility requirements, design standards, demands of the market being served, and standard programming techniques.
Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis.
Comparison
The final step of the AFC test is to consider the elements of the program identified in the first step and remaining after the second step, and for each of these compare the defendant's work with the plaintiff's to determine if the one is a copy of the other. In addition, the court will look at the importance of the copied portion with respect to the entire program.
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
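To make the three steps concrete, here is a toy model of the test as a data pipeline. Every type, field, and function name is hypothetical, the abstraction step is assumed to have been done by hand (it is the input), and a court obviously does not evaluate boolean flags:

```typescript
// Toy model of abstraction-filtration-comparison; names are made up for illustration.
// Abstraction: each element is tagged with one of the levels listed in the quoted text.
type AbstractionLevel =
  | "instruction"
  | "moduleHierarchy"
  | "lowLevelModuleFunction"
  | "highLevelModuleFunction"
  | "ultimateFunction";

interface ProgramElement {
  level: AbstractionLevel;
  description: string;
  dictatedByEfficiency: boolean;      // merger doctrine
  dictatedByExternalFactors: boolean; // scenes a faire
  inPublicDomain: boolean;
}

// Filtration: drop every element that falls under one of the three exclusions.
function filtration(elements: ProgramElement[]): ProgramElement[] {
  return elements.filter(
    (e) => !e.dictatedByEfficiency && !e.dictatedByExternalFactors && !e.inPublicDomain
  );
}

// Comparison: protectable plaintiff elements that also appear in the defendant's program.
function comparison(plaintiff: ProgramElement[], defendant: ProgramElement[]): ProgramElement[] {
  return filtration(plaintiff).filter((p) =>
    defendant.some((d) => d.level === p.level && d.description === p.description)
  );
}
```

The point of the sketch is only the ordering: whatever gets filtered out in the second step never reaches the comparison, no matter how literally it was copied. Deciding which flags are true for a given element is the actual legal question.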
Personally I think this has the potential to blow up in everyone's faces.
*Jesus Christ*, I hope I live long enough to see copyright die. Here we are at the cusp of a new paradigm of commanding computers to do stuff for us, right at the beginning of the first AI development which actually impresses me.
And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.
I am also deeply disappointed in HackerNews; where is that deep hatred of patent trolls and smug satisfaction whenever something gets cracked or pirated now?
The situation with Microsoft and Copilot is the exact opposite. Here, Microsoft is misusing its acquisition of GitHub to repackage the work of individual free and open source contributors into a proprietary product in violation of the authors' software licenses. These licenses do not even require Microsoft to pay. They only require attribution and redistribution under a compatible license. Supporting Microsoft's misuse of GitHub is an anti-populist stance that puts the interests of the corporation over the interests of the individuals.
I can't decide if people just hate Microsoft enough that a future where you must pay to include an iseven function in your code is a price worth paying to give them a bloody nose, or there are just a large contingent of users making millions off their GPL code who are put out.
More seriously yes, copilot damages copyright (or is perceived to) and that is a good outcome irrespective of the actor. I will never see eye to eye with people defending the existing legal framework.
2) So far, these tools are "better search" schemes, not actual intelligence. Sure, many find them very useful. But given this, the (voluntary or involuntary) providers of data ought to get credit for, and benefit from, this phenomenon, along with the tool creators. Especially given that the current situation is Microsoft/OpenAI selling to commercial software developers who sell to the general public.
The entire response to this suit on this site is mind-blowing to me. Everyone is up in arms that someone trained an AI model that could potentially spit out tiny, twisted fragments of public, open-source code. This response is nothing but selfish behavior that runs counter to the core principles of open source development and the free software movement.
Why can't they train on the code they own, such as Windows sources for example?
Or even better, why can't they release CoPilot itself under an open source license that is compatible with the licenses of code they would like to train on?
Also, I don't think anyone cares about the monetary aspects. The idea behind the GPL style license is to make sure that code remains free, regardless of what or who uses it. Freedom in this context refers to the ability to study the code, modify the code, and distribute any modifications. Without the GPL the code can be used in a proprietary product which strips those rights away from users of the product.
Copyleft uses copyright laws to attempt to guarantee freedom for users. This is the inverse of what normal copyright does, which is allow a single entity to sit on the ideas and not allow others to benefit from them.
If we can just strip copyleft licenses from projects, we are giving up those guarantees that GPL code will remain free for all users.
The GPL is trying to do its job here, not slow down progress. Progress would be everyone benefiting from the technology behind CoPilot, rather than just Microsoft sitting on the project and selling it as a service.
I just hope Microsoft AutoPlagiarist is not the Final Solution to Free Software they have been seeking since before the millennium's turn.
Seems to me this discussion is likely to pivot on a fulcrum located between "old enough to remember Microsoft before Bill Gates began spending his ill-gotten gains on philanthropy" and "young enough to see Microsoft primarily as the Xbox people".
1) If Github Copilot were free software licensed under the GPL, I'd be all for it. Microsoft is using others' collective hard labour to benefit itself.
2) It's Microsoft. The king of dark patterns, monopolization, and the enemy of software freedom. They can't have their cake and eat it too.
My code is 100% in Github Copilot. Is there any way to publicly say that I'm against the lawsuit even if they pretend to represent me?
Is it patent trolling when you are defending your future labor from being made obsolete by megacorps and singularitarians using your past labor without permission?
No, it's justice.
Copyright working in a supported/non hated way: You develop a package to do X by cribbing off someone else's package X. They sue you for stealing their work, not to make money off you. Situation at hand is case 2, hence the lack of interest in financial gain.
Why is this case 2, when it does not always reproduce the copyrighted works exactly? Situation: You realise that rather than cribbing off of one person's package X, you can crib off two other package X's and mix/average their contents. Scale this to hundreds of packages.
Eventually, ML should avoid this by developing to work from first principles, writing in its own style, with public code used only for validation of its ability to understand and write code.
I actually agree. However this is not what's happening here.
This is ridiculous; we created AIs that can program themselves and people are worried about incidental copyright infringement.
They did not actually calculate damages in terms of lost movie tickets or estimated versus actual sales of game copies. When it came to pre-releases where such a product wouldn't have been sold legally in the first place, they simply added a multiplier to indicate that the copyright owner wouldn't have been willing to sell.
For software code, another practice I have read about is to use the man-hours that rewriting the copyrighted code would cost. Using such calculations they would likely estimate the man-hours based on the number of lines of code and multiply that by the average salary of a programmer.
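As rough arithmetic, that kind of estimate might look like the sketch below. The function name, the field names, and every number in the example are invented for illustration; none of them come from a real case or statute:

```typescript
// Hypothetical inputs for a rewrite-cost estimate; none of these figures come from a real case.
interface RewriteEstimateInput {
  linesOfCode: number;  // size of the copied code
  linesPerHour: number; // assumed productivity of one programmer
  hourlyRate: number;   // assumed average cost of a programmer per hour
}

// Estimated cost = (lines / lines per hour) * hourly rate.
function estimateRewriteCost(input: RewriteEstimateInput): number {
  if (input.linesPerHour <= 0) {
    throw new Error("linesPerHour must be positive");
  }
  const manHours = input.linesOfCode / input.linesPerHour;
  return manHours * input.hourlyRate;
}

// Example with made-up numbers: 5,000 lines at 10 lines/hour and $60/hour.
const exampleCost = estimateRewriteCost({ linesOfCode: 5000, linesPerHour: 10, hourlyRate: 60 });
console.log(exampleCost); // 30000
```

The result is extremely sensitive to the assumed lines-per-hour figure, which is presumably where most of the arguing would happen.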
Sometimes damages are statutory, i.e. they have a fixed dollar amount written right into the law. This lawsuit references one such law: https://www.law.cornell.edu/uscode/text/17/1203
If you have co-pilot trained on my code base (which was private), that then reproduces near replicas of my code, and then they sell it for $5/year...
Well, I'm eligible for damages.
If someone wants to use it commercially without complying with the GPL, I have no problem with allowing that, for a price.
Either use the code freely and openly, or pay me so you can make money on my code.
Copilot could conceivably allow someone to use my code commercially (and in a closed manner) without negotiating with me, the copyright holder.
The value of copyleft licenses, for me, was that we were fighting back against the notion of copyright. That you couldn't sell me a product that I wasn't allowed to modify and share my modifications back with others. The right to modify and redistribute transitively through the software license gave a "virality" to software freedom.
If training a NN against a GPL licensed code "launders" away the copyleft license, isn't that a good thing for software freedom? If you can launder away a copyleft license, why couldn't you launder away a proprietary license? If training a NN is fair use, couldn't we bring proprietary software into the commons using this?
It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft. Tools like copilot seem to be an exceptionally powerful tool (perhaps more powerful than the GPL) for liberating software.
What am I missing?
The point is that copyleft source code cannot be used to improve proprietary software. That limitation is enforced with copyright.
Proprietary software is closed source. You can't train your NN on it, because you can't read it in the first place.
If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain. The entire purpose of copyleft is to compel that person to "pay it forward", by publishing their code as copyleft. This is why Stallman is a proponent of copyright law. Without copyright, there is no copyleft.
(1) The problem with copilot is that when it blurps out code X that is arguably not under fair use (given how large and non-transformed the code segment is), copilot users have no idea who owns copyright on X, and thus they are in a legal minefield because they have no idea what the terms of licensing X are.
Copilot creates legal risk regardless of whether the licensing terms of X are copyleft or not. Many permissive licenses (MIT, BSD, etc) still require attribution (identifying who owns copyright on X), and copilot screws you out of doing that too.
(2) Whatever legal power copyleft licenses have, it is ultimately derived from copyright law, and people who take FOSS seriously know that. The point of "copyleft" licenses is to use the power of copyright law to implement "share and share alike" in an enforceable way. When your WiFi router includes info about the GPL code it uses, that's the legal power of copyright at work. The point of copyleft licenses is not to create a free-for-all by "liberating" code.
Some source code might be published but not open source licensed. At least some such code has been taken with complete disregard of their licenses and/or other legal protections, and it's impossible to find and properly map out any similar violations for the purposes of a legal response.
Whether this was the original motivation depends on whom you are asking.
You may disagree, but the "Free Software" movement (RMS and the people who agree with him) essentially wants everything to be copyleft. The "Open Source" movement is probably more aligned with your views.
I find the pattern matching and repetitive code generation really helpful. And the library autocomplete on steroids, too.
Meh. Tricky subject.
It's not just functions either, one of the most common things that it helps me with daily is simple stuff like this:
Typing
  const x = {
    a: 'one',
    b: 'two',
    ...
  }
And later I'll be typing
  y = [
    a['one'],
    b[' <-- it auto-completes the rest here
  ]
It's really amazing the amount of busy-work typing in programming that a smart pattern matching algo could help with.
That's where the line is for it to be suspect IMO.
Literally 10x faster development.
Case in point: had an unexpected project and no time to complete it. Within an hour Copilot helped me:
* Write a couple of tricky matplotlib plots
* Do some extensive analysis with Pandas
* Write a couple of SQL queries
* Write a Flask back-end and deploy it
* Write a bit of a front-end
* This all with extra comments, links to documentation and pretty reasonable style
I have experience with all of the above mentioned but the speed increase was considerable.
This would be a good day's work without Copilot and there would be less commenting and hackier code.
Before Copilot I would be cursing a lot more reading various docs...
The key thing that Copilot does is reduce latency for your thoughts-action-results loop.
Does open source really suffer if fewer people read documentation directly? Would you really be less likely to create an open source library if you knew someone can now use your library at 10x speed?
The inference ability has crossed uncanny valley so many times.
I find myself wondering whether there is a speech recognition component at times.
When teaching a lecture I will start saying something and write a prompt at the same time and the sentence produced by Copilot will be spot on what I've just said.
Ideally there would be an open source version of Copilot that respects everyone's wishes. I fear that is impossible.
So, why should an AI be treated different here? I don't understand the argument for this.
I actually see quite some danger in this line of thinking, that there are different copyright rules for an AI compared to a human intelligence. Once you allow for such arbitrary distinction, it will get restricted more and more, much more than humans are, and that will just arbitrarily restrict the usefulness of AI, and effectively be a net negative for the whole humanity.
I think we must really fight against such undertaking, and better educate people on how Copilot actually works, such that no such misunderstanding arises.
However, is it reasonable to write an AI system that monitors the time and location of all license plates seen around town, puts them into a database, and then that same officer can simply put in the suspect's license plate instead of actually following them around? Maybe, maybe not, that's not my point here. But the creation of that functionality can easily lead to its abuse.
Is this exactly the same case as Copilot? Of course not, these are two wildly different systems. But I think it's an interesting parallel to consider when discussing the point of "it's okay when a human does it" because humans and algorithms operate at two very different levels of scale. The potential for abuse of the latter being far higher and far easier than something a human has to do manually.
Because the AI is not a human and only humans have rights, including the right to learn.
I am not sure how anyone can root for AI after seeing those kinds of outputs. It's like high-school level plagiarism.
I've noticed this a lot and it's quite funny seeing what the actual filename of the document was. Does this just get included as metadata by default when you export to PDF?
[0] https://githubcopilotlitigation.com/pdf/1-0-github_complaint...
Specifically, sections D.4 to D.7 grant Github the right to "to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."
It also says they can't sell the code, which CoPilot is doing.
Also, in a very high number of cases it isn't the author who uploads.
Repeating your line of argumentation (which occurs in every CoPilot thread) does not make it true.
e.g. I can clone the GNU codebase and publish it to GitHub. Clearly I don't own the code and do not have any rights to grant GitHub a license.
This sounds unenforceable in the general case. How could github know whether someone pushes their own code or not? Is it a license violation to push someone's FOSS code to github because the author didn't sign up with GH?
This isn't exactly the same thing, but it seems to me that three of the biggest differences are:
1. Stack Overflow code is posted for people to use it (fair enough, but they do have a license that requires attribution anyway, so that's not an escape)
2. Scale (true; but is it a fundamental difference?)
3. People are paying attention in this case. Nobody is scanning my old code, or yours, but if they did, would they have a case?
I dunno. I'm more sympathetic to visual artists who have their work slurped up to be recapitulated as someone else's work via text to image models. Code, especially if it is posted publicly, doesn't feel like it needs to be guarded. I'm not saying this is correct, just saying that's my reaction, and I wonder why it's wrong.
>function isEven(n) {
> return n % 2 === 0;
>}
They then say, "Copilot’s Output, like Codex’s, is derived from existing code. Namely, sample code that appears in the online book Mastering JS, written by Valeri Karpov."
Surely everyone reading this has written that code verbatim at some point in their lives. How can they assert that this code is derived specifically from Mastering JS, or that Karpov has any copyright to that code?
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
Think about how absurd this is. So if Microsoft was the first company to write and publish an isEven function then no one else can legally use it?
Look at paragraphs 90 and 91 on page 27 of the complaint[1]:
"90. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times."
Does distributing licensed code without attribution on a mass scale count as fair use?
If Copilot is inadvertently providing a programmer with copyrighted code, is that programmer and/or their employer responsible for copyright infringement?
There's a lot of interesting legal complications I think the courts will want to adjudicate.
[1] https://githubcopilotlitigation.com/pdf/1-0-github_complaint...
Ironically their Twitter account uses a screenshot from a TV series as profile picture. I wonder how legal that is, even if meant as a joke.
https://twitter.com/saverlawfirm
Edit: It's been changed 2 minutes after I wrote this comment
However, if you are looking to understand the reasoning behind this lawsuit, there are lots of better examples online where Copilot blatantly ripped off open source code.
Programmer/Lawyer Plaintiff + upstart SF Based Law Firm + novel technology = a good shot at a case that'll last a long time, and fertile ground to establish yourself as experts in what looks to be a heavily litigated area over the next decade+.
There is a reasonable argument that's a horrible system. But it doesn't make sense to criticize the plaintiff looking for a profit - the entire system has been set up such that that's what they're supposed to do. If you're angry about it lobby for either no rules or properly funded government enforcement of rules.
Frankly, I don't care if anyone makes a name for themselves for doing this. In fact, I applaud them and would happily give them recognition should they be successful.
Similarly, I'd hope that there are opportunities for profit in this space, given that I don't want cheap lawyers botching this case and setting terrible legal precedent for the rest of us. Microsoft has a billion dollar legal team and they will do everything they can to protect their bottom line.
Github can't really go to a court by themselves and ask "is this legal?". There is the concept of declaratory relief but you need to be at least threatened with a lawsuit before that's on the table.
So Github kinda just has to try releasing CoPilot and get sued to find out. The legal system is set up to reward the lawyer who will go to bat against them to find out if it is legal. The plaintiff (and maybe the lawyer, depending on how the case is financed) takes the risk that they are wrong, just as Github had to.
It is set up this way to incentivize lawyers to protect everyone's rights.
No matter who litigates and for what reasons it will be extremely valuable for good precedents to be set around the question of things like Copilot and DALL-E with respect to copyright and ownership. I'd rather have self interested lawyers dedicated to winning their case than self interested corporations fighting this out.
Obviously this is different for the reasons you stated, but I didn’t want people to think bringing a class action lawsuit forward is a way to get rich. It’s a bit of a joke, really.
How can an aggrieved individual seek justice from a big multinational corporation? That's not possible unless that individual is a retired billionaire wanting to become a millionaire.
Yes he does think of it somewhat like that, establishing himself in an area. However a lot of his work comes from finding people aggrieved by something not them finding him.
But I write this to you in Hermes Maia
If Kasparov uses chess programs to be better at chess maybe we can use copilot to be better developers?
Also, anyone, either a person or a machine, is welcome to learn from the code I wrote. Actually, that is how I learnt to code, so why would I stop others from doing the same?
But the preference of the majority does not override the conditions placed by people who prefer not to participate.
So copilot is fine but anyone using it must abide by the collective set of licenses that it used to write code for you…?
Note that even licenses like MIT ostensibly require attribution.
> behalf of a proposed class of possibly millions of GitHub users...
The appendix includes the 11 licenses that the plaintiffs say GitHub Copilot violates: https://githubcopilotlitigation.com/pdf/1-1-github_complaint...
What's that? They don't want to do that? Why not?
Because if not I would offer the very mundane explanation that the Copilot team probably just couldn't be bothered hitting up the other software teams and jumping through 3,046 internal red tape compliance steps to make their product 0.001% better (I am pretty sure the code base of all of GH dwarfs MS code base quite a lot)
I can't believe I am actually defending fucking Microsoft, but just want to say there isn't a conspiracy everywhere...
A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong? What is the line between copying and machine learning? Where does overfitting come in?
Today they're filing a lawsuit against copilot.
Tomorrow it will be against stable diffusion or (dall-e, gpt-3 whatever)
And then eventually against Wine/Proton and emulators (are APIs copyrightable)
https://wiki.winehq.org/Developer_FAQ#Who_can.27t_contribute...
Actually, we were forbidden to look at open source code at Microsoft (circa 2009) because it might influence our coding and violate licenses.
If you're using code and know that it will be output in some form, just stick a license attribution in the autocomplete.
In fact, did you know this is what Apple Books does by default? Say, for example, you copy and paste a code sample from The C Programming Language. 2nd Edition. What comes out? The code you copy and pasted, plus attribution.
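A minimal sketch of what attaching attribution to a suggestion could look like, assuming the tool knows where a snippet came from. The `SnippetSource` shape and `withAttribution` are hypothetical; no real Copilot or Apple Books API is implied:

```typescript
// Hypothetical shape for provenance metadata; no real Copilot API is implied.
interface SnippetSource {
  author: string;
  repository: string;
  license: string; // e.g. an SPDX identifier such as "MIT"
}

// Prefix a suggested snippet with a comment naming where it came from.
function withAttribution(snippet: string, source: SnippetSource): string {
  const header = `// Source: ${source.repository} by ${source.author} (license: ${source.license})`;
  return `${header}\n${snippet}`;
}
```

The formatting is the easy part. The hard part, which this sketch assumes away, is knowing the source of a generated snippet in the first place.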
If a human programmer reads someone else's copyrighted code, OSS or otherwise, memorizes it and later reproduces it verbatim or nearly so, that is copyright infringement. If it wasn't, copyright would be meaningless.
The argument, so far as I understand it, is that Copilot is essentially a compressed copy of some or all of the repositories it was trained on. The idea that Copilot is "learning from" and transforming its training corpus seems, to me, like a fiction that has been created to excuse the copyright infringement. I guess we will have to see how it plays out in court.
As a non-lawyer it seems to me that stable diffusion is also on pretty shaky ground.
APIs are not copyrightable (in the US), so Wine is safe (in the US).
Let me tell you the story of Google Books, also known as "Authors Guild Inc. v. Google Inc"
https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
In 2004, Google added copyrighted books to its Google Books search engine, which searches the text of millions of books and shows full-page results without any authorization from the authors. Any sane lawyer of the time would have bet on this being illegal because, well, it most certainly was. And you may be shocked to learn that it actually is not.
In 2005 the Authors Guild sued for this pretty straightforward copyright violation.
Now an important part of the story: IT TOOK 10 YEARS FOR THE JUDGEMENT TO BE DECIDED (8 years + 2 years appeal) during which, well, tech continued its little stroll. Ten years is a lot in the web world; it is even more for ML.
The judgement decided Google's use of the books was fair use. Why? Not because of the law, silly. A common error we geeks make is to believe that the law is like code and that it is an invincible argument in court. No, the court was impressed by the array of people who were supporting Google, calling it an invaluable tool to find books that actually caused many sales to increase, and therefore the harm the laws were trying to prevent was not happening while a lot of good came from it.
Now the second important part of the story: MOST OF THESE USEFUL USES HAPPENED AFTER THE LITIGATION STARTED. That's the kind of crazy world we are living in: the laws are badly designed and badly enforced, so the way to get around them is to disregard them for the greater good, and hope the tribunal is not competent enough to be fast, yet not so incompetent that it fails to understand the bigger picture.
Rants aside, I doubt training data use will be considered copyright infringement if the courts have a mindset similar to the one they had in 2005-2015. Copyright laws were designed to preserve the authors' right to profit from copies of their work, not to give them absolute control over every possible use of every copy ever made.
Quite sure the issue at hand is about the code being copied verbatim without the license terms, not "learning" from it.
You can learn from it, but if you start copying snippets or base your code on it to such an extent that it's clear your work is based on it, things start to get risky.
For comparison, people have tried to get around copyright of photos by hiring an illustrator to "draw" the photo, which doesn't work legally. This situation seems similar.
What is the difference between a neighbor watching you leave your home to visit the local grocery store and mass surveillance? Where do you draw the line?
It is pretty simple, actually.
The reason why those wouldn't apply to Copilot is because they aren't separating out APIs from implementation and just implementing what they need for the goal of compatibility or "programmer convenience". AI takes the whole work and shreds it in a blender in the hopes of creating something new. The hope of the AI community is that the fair use argument is more like Authors Guild v. Google rather than Sony v. Connectix.
> Tomorrow it will be against stable diffusion or (dall-e, gpt-3 whatever)
> And then eventually against Wine/Proton and emulators (are APIs copyrightable)
Textbook definition of F.U.D.
No it isn't, at least not automatically, which is why infringement of licenses exists at all. The fact that you have a brain doesn't change that and never has. If you reproduce someone's code you can be in hot water, and that should also be the case for the operator of a machine.
It's also why the concept of a clean room implementation exists at all.
My (extremely amateur) understanding is that what is meant by "learn from it" is one of the hinge points of the legal question.
If a programmer reads licensed code and reproduces it verbatim or near-verbatim in a project with a conflicting license, that becomes a legal problem in certain circumstances.
If a programmer reads the same code and gets an idea to implement something different, that's less troublesome (or at least, if it is troublesome it's in a different area; if the idea was related to a patentable process, then other questions arise, but I'm even less qualified to speak to that area of law).
There's nothing special about copy/paste buttons that make them the only way you can infringe copyright.
Fair use doesn't automatically kick in just because someone uses what they took/copied as part of a larger artifact; it's a really complicated legal line.
The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
The nature of the copyrighted work
The amount and substantiality of the portion used in relation to the copyrighted work as a whole
The effect of the use upon the potential market for or value of the copyrighted work.
A programmer who studied in school and learned to code did so clearly for an educational purpose. The nature of the work is primarily facts and ideas, while expression and fixation are generally not what the school is focusing on (obviously some copying of style and implementation could occur). The amount and substantiality of the original works is likely to be so minor as to be unrecognizable, and the effect of the use upon the potential market when students learn from existing works would be very hard to measure (if it could be detected at all).
When a machine does this, are we going to give the same answers? Its purpose is explicitly commercial. Machines operate on expression and fixation, and the operators can't extract the idea that a model should have learned in order to explain how a given output is generated. Machines make no distinction regarding the amount and substantiality of the original works, with no ability to argue for how they intentionally limited their use of the original work. And finally, GitHub Copilot and other tools like it do not consider the potential market of the infringed work.
APIs are generally covered by the interoperability exception. I am unsure how that is related to Copilot or DALL-E (and the like). In the Oracle v. Google case the court also found that the API in question was neither an expression nor a fixation of an idea. A Copilot that only generated header code could in theory be more likely to fall within fair use, but then the scope of the project would be tiny compared to what exists now.
Just because both activities are called "learning" does not mean they are the same thing. They are fundamentally, physically different activities.
Remember when Napster was all the rage, and then Jobs and Apple stepped in and set an expectation for the value of a song (at 99 cents)? That made music into the razor and the iPod into the much more profitable blades. Sure, it pushed back Napster, but artists, as the creators of the goods, have yet to recover.
I'm not saying this is the same thing. It's not. Only noting that today's "win" is tomorrow's loss. This very well could be a case of be careful what you wish for.
0) https://www.scotusblog.com/case-files/cases/andy-warhol-foun...
It seems like GitHub Copilot can spit out copyrighted works all day but the person running the text editor has to "choose" which Copilot output to actually save/commit/deploy.
Does it really matter that much "how" the text in your text editor gets there? You write it yourself or copy/paste it or have Copilot generate it. Ultimately the individual that "approved" it to be saved to the disk is the one violating the copyright, Copilot is just making a "suggestion".
Large platforms like github will just stick blanket agreements into the TOS which grant them permission (and require you indemnify them for any third party code you submit). By doing so they'll gain a monopoly on comprehensively trained AI, and the open world that doesn't have the lever of a TOS will not at all be able to compete with that.
Copilot has seemed to have some outright copying problems, presumably because it's a bit over-fit (perhaps to work at all it must be, because it's just failing to generalize enough at the current state of development), but I'm doubtful that this litigation could distinguish the outright copying from training in a way that doesn't substantially infringe any copyright-protected right (e.g. where the AI learns the 'ideas' rather than verbatim reproducing their exact expressions).
The same goes for many other initiatives around AI training material-- e.g. people not wanting their own pictures being used to train facial recognition. Litigating won't be able to stop it, but it will be able to hand the few largest quasi-monopolists like Facebook, Google, and Microsoft a near monopoly over new AI tools when they're the only ones that can overcome the defaults set by legislation or litigation.
It's particularly bad because the spectacular data requirements and training costs already create big centralization pressures in the control of the technology. We will not be better off if we amplify these pressures further with bad legal precedents.
… & of course we again ask Microsoft's GitHub to start respecting FOSS licenses, cooperate with the community, & retract their incorrect claim that their behavior is “fair use”.
A few more links to our work on this issue:
https://sfconservancy.org/blog/2022/feb/03/github-copilot-co... https://sfconservancy.org/news/2022/feb/23/committee-ai-assi...
P.S. I am not a lawyer.
In this case, wouldn’t the users of copilot be the ones responsible for any copyrighted code they may have accessed using copilot?
//below output code is MIT licensed (source: github/repo/blah)
And yes, the "users" are responsible, but it's possible that Copilot could be implicated in a case depending on how its access is licensed.
Stable diffusion has this same problem btw, but in visual arts "fair use" is even murkier.
For code, if you could use the code and respect the license, why wouldn't you? Copilot takes away that opportunity and replaces it with "trust us".
Obviously not financially as Microsoft has basically YES amounts of money.
If you are opinionated but lazy, no judgement here as I sit here watching TV, you could add a notation at the top of your repos explicitly supporting the usage of your code in such tools as fair use.
Notably, if your code is derivative of other works, you have no power to grant permission for such use of code you don't own, so best include some weasel words to that effect. Say:
I SUPPORT AND EXPLICITLY GRANT PERMISSION FOR THE USAGE OF THE BELOW CODE TO TRAIN ML SYSTEMS TO PRODUCE USEFUL HIGH QUALITY AUTOCOMPLETE FOR THE BETTERMENT AND UTILITY OF MY FELLOW PROGRAMMERS TO THE EXTENT ALLOWABLE BY LICENSE AND LAW. NOTHING ABOUT THIS GRANT SHALL BE CONSTRUED TO GRANT PERMISSION TO ANY CODE I DO NOT OWN THE RIGHTS TO NOR ENCOURAGE ANY INFRINGING USE OF SAID CODE.
Years from now when such cases are being heard and appealed ad nauseam a large portion of repos bearing such notices may persuade a judge that such use is a desired and normal use.
You could even make a GPLesque modification if you were so inclined, adding: SO LONG AS THE RESULTING TOOLING AND DATA IS MADE AVAILABLE TO ALL
Note not only am I not your lawyer, I am not a lawyer of any sort so if you think you'll end up in court best buy the time of an actual lawyer instead of a smart ass from the internet.
The situation that this lawsuit is trying to save you from is this: (1) copilot blurps out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns copyright on something Y that copilot was trained on, and then (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed on the licensing terms that C set for Y.
You are now in legal trouble, and Copilot put you there, because it never warned you that X is part of Y, and that Y comes with such and such licensing terms.
Whether we like Copilot or not, we should be grateful that this case is seeking to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but that doesn't make law.
If not, it's a pretty clear sign they consider it radioactive.
But no matter how this goes, if training AI with copyrighted inputs is ruled "fair use", it will end up as the ultimate "copyright laundry machine", like this "joke" project here:
https://web.archive.org/web/20220104214929/https://fairuseif...
https://news.ycombinator.com/item?id=27796124 (302 points, 151 comments)
1. The ability to be able to run and train these models is going to eventually be perfectly plausible on a home machine.
2. It's only a matter of time before a model, e.g. a popular model scraped from all of the code on GitHub, is a publicly available torrent.
3. People will be able to just run it locally as an integrated plug-in in JetBrains or VS Code.
4. You'll never know if somebody has lifted your code in violation of a license, any more than you would be able to tell if somebody used code from Stack Overflow without attribution in any commercial endeavor.
The End.
I don't think 1-3 matter at all. The point is that GitHub is selling a tool that can commit copyright infringement. This lawsuit is trying to get them to pay the consequences for the infringement that they have enabled.
We've even seen this with stable diffusion image generation, where specific watermarks can be re-created (decrypted?) deterministically with the proper input.
Anybody looking at the source image and the generated result would say they are the same.
Did you know that before airplanes were invented, common law said you owned the air above your land all the way to the heavens?
In addition just because code is available publicly on GitHub does not necessarily mean it is permissively licensed to use elsewhere, even with attribution. Copyright holders not happy with their copyrighted works publicly accessible can use the DMCA to issue take-downs that GitHub does comply with but how that interacts with Copilot and any of its training data is a different question.
As much as the DMCA is bad law, it is rather funny to see Microsoft charged in this lawsuit under the lesser-known provision against 'removal of copyright management information'. Microsoft does have more resources to mount a defence, so it will probably end up differently than it would for a smaller player facing this action.
Individually, each frame is protected by the copyright of the movie it belongs to. But what happens if you take a million frames from a million different movies and just arrange them in a new way?
That's the core question here. Is the new movie a new copyrightable work, or is it plagiarizing a million other works at once? Is it legal to use copyrighted works in this way?
The other question is if it is right to use copyrighted works this way. Is this within the spirit of open source software? Or is this just a bad corporation taking advantage of your good will?
I'm not sure where I stand on this, it's a complicated problem for sure. Definitely interested to see how this plays out in court.
I don't know about US copyright law so I can't comment on the legal documents, but this website is not complaining that Copilot is reproducing copyrighted content, only that it was trained on copyrighted content. I don't see how you can forbid someone or something from reading and learning from something that is public (once again, reproducing is another problem).
For example, let's say I take a single frame of animation from a cartoon. The frame contains a mountain, a house, and a couple of characters, although those characters are not integral to the actual cartoon; maybe they're extras (villagers, not named characters like Mickey Mouse, for example).
I draw a picture of a lake with a cabin next to it, then start to draw a frontiersman, but I trace one of his arms from a villager in that frame of animation... Number one: am I in danger of copyright infringement (have I hit some arbitrary threshold)? And number two: am I causing monetary losses for the cartoon?
If I'm being honest I'm a bit annoyed at this. What's the problem and what's the point of this?
I notice often on hackernews that people don't seem to understand anything about free or open-source software outside of the pragmatics of whether they can abuse the work for free.
Depends on the license. If it's MIT and you serve the license, no, you are not infringing at all. A trimmed version of MIT for the relevant bits:
Permission is hereby granted [...] to any person obtaining a copy of this software [...] to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, [...] subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
> are you infringing when you run it
Depends on the license
> are you infringing when you use that file and distribute it somewhere
Depends on the license
----
When copilot gives you code without the license, you can't even know!
The crux of the lawsuit's argument is that the AI unlawfully outputs copyrighted material. This is evident in many tests with many people here and on Twitter even getting verbatim comments out of it.
AI art, on the other hand, is not capable of outputting the images from its training set, as it's not a collage-maker, but an artificial brain with a paintbrush and virtual hand.
But I don't think copyright on visual images actually works like that, that it needs to be an exact copy to infringe.
If I draw my own pictures of Mickey Mouse and Goofy having a tea party, it's still a copyright infringement if it is substantially similar to copyrighted depictions of Mickey Mouse and Goofy. (Subject to fair use defenses: I'm allowed to do what would otherwise have been a copyright infringement if it meets a fair use defense, which is also not cut and dried, but if it's, say, a parody it's likely to be fair use. There is probably a legal argument that Copilot is fair use... the more money GitHub makes on it, the harder that is, though. Making money off something is not relevant to whether it's a copyright violation in the first place, but it is relevant to a fair use defense.)
(Yes, it might also be a trademark infringement; but there's a reason Disney is so concerned with the copyright on Mickey expiring, and it's not that they think there's lots of money to be made selling copies of the specific Steamboat Willie movie...)
> There is actually no percentage by which you must change an image to avoid copyright infringement. While some say that you have to change 10-30% of a copyrighted work to avoid infringement, that has been proven to be a myth. The standard is whether the artworks are “substantially similar,” or a “substantial part” has been changed, which of course is subjective.
https://www.epgdlaw.com/how-can-my-artwork-steer-clear-of-co...
I think Stable Diffusion etc are quite capable of creating art that is "substantially similar" to pre-existing art.
- https://i.imgur.com/VikPFDT.png
I also don't know if I would anthropomorphize ML to that degree. It's a poor metaphor and isn't really analogous to a human brain, especially considering our current understanding, or lack thereof, of the brain, and even the limited insight we have into how some of these models work from the people who work on them.
Want to say that again?
P.S. I am not a lawyer.
robots.txt
This is exactly what is needed for source code, and the default (no robots.txt) should be "disallow". The fact that the Web has considered this moral issue should be a strong hint for the AI people not to take a purely legal stance but to consider the OSS community that they are so heavily using.
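To make the "default is disallow" idea concrete, here is a rough sketch of what such a check could look like. Everything about the format is an assumption made up for illustration: the `Training-agent`, `Allow` and `Disallow` directive names and the `mayTrainOn` helper are hypothetical and not part of robots.txt or of any existing standard. It is also simpler than real robots.txt parsing (each `Training-agent` line starts a new group, and matching is plain longest-prefix).

```typescript
// Hypothetical sketch of a robots.txt-style opt-in policy for ML training.
// The directive names below are assumptions, not an existing standard.

type Rule = { allow: boolean; prefix: string };

// Collect the Allow/Disallow rules that apply to the given agent (or to "*").
function parsePolicy(text: string, agent: string): Rule[] {
  const rules: Rule[] = [];
  let applies = false;
  for (const raw of text.split("\n")) {
    const line = raw.split("#")[0].trim(); // drop comments and whitespace
    if (line === "") continue;
    const idx = line.indexOf(":");
    if (idx < 0) continue; // ignore malformed lines
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "training-agent") {
      applies = value === "*" || value.toLowerCase() === agent.toLowerCase();
    } else if (applies && (key === "allow" || key === "disallow")) {
      rules.push({ allow: key === "allow", prefix: value });
    }
  }
  return rules;
}

// Default is "disallow": a missing policy file, or no matching rule,
// means the path may not be used for training.
function mayTrainOn(policyText: string | null, agent: string, path: string): boolean {
  if (policyText === null) return false;
  let best: Rule | null = null;
  for (const rule of parsePolicy(policyText, agent)) {
    const matches = path.slice(0, rule.prefix.length) === rule.prefix;
    // The longest matching prefix wins; on a tie the earlier rule is kept.
    if (matches && (best === null || rule.prefix.length > best.prefix.length)) {
      best = rule;
    }
  }
  return best !== null && best.allow;
}
```

The key design choice is the last line: anything not explicitly allowed is refused, which inverts the usual robots.txt default.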
I'm 1000% on team open source and have had to refer to things like tldrlegal.com many times to make sure I get all my software licensing puzzle pieces right. Totally get the argument for why this litigation exists in the present.
Just saying in general my friends I hope you have an absolutely great day. Someone will be wrong on the internet tomorrow, no doubt about it. Worry about something productive instead.
This one has the feel of being nothing more than tilting at windmills in the long run.
Sometimes the query is the first half of a small statement that we can fill in with common patterns. Useful, fair.
Sometimes the query is a signature like `fn fast_inv_sqrt` that copies someone's code and doesn't attribute it.
A better shortening of the original title is simply "We’ve filed a lawsuit challenging GitHub Copilot"
Grand theft, interstate wire fraud and conspiracy for same.
This is a criminal matter as well as civil. Intentional and knowing violation of the law.
We must not let our work be taken!
Can the generated code be traced back to the code used for training and the original copyrights and licenses for that code?
If so, what attribution(s) and license(s) should apply to the generated code?
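For verbatim or near-verbatim output, tracing is at least plausible in principle. Below is a minimal sketch, assuming the operator keeps an index of the training files along with their licenses. It only catches exact matches on whitespace-normalized runs of lines, so it says nothing about output that was paraphrased or merely "learned from". The `Source` type, the `WINDOW` size and all function names are hypothetical, and this is not a description of how Copilot actually works.

```typescript
// Hypothetical sketch: trace generated code back to indexed sources by exact
// matches on normalized windows of consecutive lines.

type Source = { repo: string; license: string; code: string };
type Index = { [window: string]: Source[] };

const WINDOW = 3; // assumed window size; a real system would need to tune this

// Normalize a line so whitespace differences do not hide a match.
function normalize(line: string): string {
  return line.trim().replace(/\s+/g, " ");
}

// Produce every run of WINDOW consecutive non-empty normalized lines.
function windows(code: string): string[] {
  const lines = code.split("\n").map(normalize).filter((l) => l !== "");
  const out: string[] = [];
  for (let i = 0; i + WINDOW <= lines.length; i++) {
    out.push(lines.slice(i, i + WINDOW).join("\n"));
  }
  return out;
}

// Look up a window without touching inherited object properties.
function lookup(index: Index, w: string): Source[] {
  return Object.prototype.hasOwnProperty.call(index, w) ? index[w] : [];
}

// Map each window to the sources that contain it.
function buildIndex(sources: Source[]): Index {
  const index: Index = {};
  for (const src of sources) {
    for (const w of windows(src.code)) {
      const hits = lookup(index, w);
      if (hits.indexOf(src) < 0) hits.push(src);
      index[w] = hits;
    }
  }
  return index;
}

// Return the distinct sources sharing at least one window with the generated code.
function traceSources(generated: string, index: Index): Source[] {
  const found: Source[] = [];
  for (const w of windows(generated)) {
    for (const src of lookup(index, w)) {
      if (found.indexOf(src) < 0) found.push(src);
    }
  }
  return found;
}

// Format an attribution comment in the style suggested earlier in the thread.
function attribution(src: Source): string {
  return `// below output code is ${src.license} licensed (source: ${src.repo})`;
}
```

Which license should then apply when several sources with conflicting licenses match the same output is exactly the open question; the sketch only surfaces the candidates.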
Seems to me the underlying data should be opt-in from creators, and licenses should be developed that take AI into consideration.
Start off a comment with // MIT license
Then watch parts of various software licenses come out including authors' names and copyrights!
(asking because I know the authors were kinda famous for being very litigious).
all the best with the lawsuit.
If these folks win - we again throw progress under the bus.
function force=Gmmr2Array(mass1, mass2)
and function [force, torque]=pointMatrixGravity(array1,array2)?
I'd love to know if some of my GPL v3 code [1, 2] has landed in the training set.
[1] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...
[2] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...
This person (a lawyer) saw an opportunity to make money and jumped on it like a hungry tiger on fresh meat.
But I like to put on my conspiracy hat from time to time, and right now is one such time, so let's begin...
Though the motivations behind this case are uncertain, what is certain is that this case will establish a precedent. As we know, precedents are very important for any further rulings on cases of a similar nature.
Could it be the case that Microsoft has a hand in this, in trying to preempt a precedent that favors Copilot in any further litigation against it?
Wouldn't put it past a company like Microsoft.
Just a wild thought I had.
The No-AI 3-Clause Open Source Software License
Copyright (C) <YEAR> <COPYRIGHT HOLDER>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Use in source or binary forms for the construction or operation
of predictive software generation systems is prohibited.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
https://bugfix-66.com/f0bb8770d4b89844d51588f57089ae5233bf67...