(And charging for a product which adds value to your developer experience and needs money to be run is not a bad thing)
Uploading it to Github does not transfer ownership or imply allowances for any use. If you upload it without a license it is a copyright violation to copy the code. Even with an MIT license it is a copyright violation to copy the code without attribution.
> I don't see the point of being pissy about Github using it, I'm saying this as someone who's written quite a lot of MIT code.
People are probably angry because this is yet another case of a big multinational corporation abusing unclear or difficult to enforce legislation for profit.
it's worse than that: it's Microsoft trying to completely undermine the concept of open source
meanwhile: they're unaffected as their high-value proprietary code remains private and doesn't train the model
If you upload code to GitHub, you grant them (and every GitHub user) a license to do exactly what Copilot does.
This ToS change happened in 2017, and I actually had to get approval from all contributors of my projects to accept the changed ToS: https://github.com/justjanne/QuasselDroid-ng/issues/5
What GitHub’s doing is shady, but it’s been obvious it was going to happen for years.
Probably a copyright violation. There are surely circumstances in which copying a small portion would either fall under fair use, or for other reasons not constitute a violation. The question then is whether or not Copilot is causing a violation. I don't think it's as clear cut as most commenters are making out.
All in all though, it's probably going to take a few court cases to figure out. In the meantime, I'd expect most companies to steer clear of Copilot.
Pretty much anyone can scrape GitHub and train their model.
What exactly the legal implications of this are has yet to be tested.
Pretty much every model is susceptible to some sort of model inversion or set inclusion attack.
By their own admission, Copilot sometimes outputs PII that was part of the training code, as well as code snippets verbatim. Even if it's rare (iirc around 0.1%), it's still a huge legal liability for anyone who uses the tool, especially since it's unclear how these inclusions are distributed and what triggers them. For example, it could be that a particular coding style, a particular way of using Copilot, or working on a specific subset of problems increases the likelihood of this occurring.
ML is too new to have been tested in court, and this has ramifications beyond just licensing. For example, if you use PII to train a model and then receive a GDPR deletion request, do you need to throw away and retrain your model?
I don't think people should be angry; however, I also think that this needs to be tested in court, and multiple times, before it can be "safe to use".
But I also don’t think that the ML model is necessarily a derivative work.
For example, if you use copyleft material to construct a CS course, someone would be hard pressed to argue that the course now needs to be released freely, let alone that anything the students write after attending the course would be a derivative work too.
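The "set inclusion" (membership-inference) attacks mentioned above often boil down to a loss-threshold test: a model tends to assign suspiciously low loss to examples it was trained on. A toy sketch of the idea, with the model stubbed out and every name and number purely hypothetical:

```python
# Toy loss-threshold membership-inference test (all names/numbers hypothetical).
# Idea: a model usually assigns lower loss to samples it memorized during
# training, so an unusually low loss hints the sample was in the training set.

def model_loss(sample: str) -> float:
    # Stand-in for a real model's per-sample loss; faked with a tiny
    # "memorized set" so the example stays self-contained and runnable.
    memorized = {"int x = y + z;", "secret_api_key = 'abc123'"}
    return 0.01 if sample in memorized else 2.5

def likely_in_training_set(sample: str, threshold: float = 0.1) -> bool:
    # Flag the sample as a probable training-set member if its loss
    # falls below the chosen threshold.
    return model_loss(sample) < threshold

print(likely_in_training_set("secret_api_key = 'abc123'"))  # True
print(likely_in_training_set("total += item.price"))        # False
```

Real attacks calibrate the threshold against reference models rather than hard-coding it, but the shape of the test is the same.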
That was my take originally, but apparently this is not as cut and dry as you may think:
https://www.technollama.co.uk/is-githubs-copilot-potentially...
I believe that gives me the right to be mad and to demand they fix their violations, one way or another.
"The above copyright notice and this permission notice shall be included in all COPIES OR SUBSTANTIAL PORTIONS of the Software."
Reusing a snippet doesn't require reproducing the MIT license. People who publish MIT software know they're giving their code out with basically no strings attached.
However, GitHub should be careful with the GPL variety.
You would sue.
And then GitHub would argue that their algorithms did not spit out the code verbatim by copying, but rather generated code that looked exactly like the other code based on learning from millions of codebases.
And then there would be lots of lawyers.
And then a judge would have to decide.
We shall see, by Googling some of the code it spits out.
FWIW GPT-3 doesn't really tend to spit out verbatim reproductions of copyrighted books.
Innovation should push boundaries.
The Apache 2 license allows for commercial use, but has implications for the way you can enforce your software patents. It also requires distributing the license file along with your application.
Complaining that companies use the software you told the world was free to use without restriction is dumb. However, not everyone gives away their software for free without restrictions. The fact that Github isn't respecting those licenses is a much bigger problem.
The tool autocompleting some random guy's personal information because he uploaded his blog to Github is highly problematic. The idea of using permissively licensed code to train an AI is not bad, but some human with knowledge of software licenses would need to pre-select those projects.
If all code came from one of those "do whatever the fuck you want" licenses, then there wouldn't be a problem. I'd consider it to be a great product and have no issue paying a fee. There's a huge market for a Copilot product, but this iteration just.. isn't it.
The GPL is completely compatible with commercial use. You just need to share modifications to the source with anyone you share the binary with. Many tech companies make extensive use of GPL software, and since they are not providing binaries to their end users they don't even have to share their changes to the source.
Even the AGPL, which does require you to share the source with users, still completely allows commercial use (though not compatible with as many business models).
But what's bothering me about this is that it's not a small company doing this. It's a company that's got crazy amounts of cash, who has been trying to trade on a "we're nice now and we love open source" image in the last few years, now taking all the open source code and balling it up in a closed-source app they will charge us for.
I'd be fine if I got to use it for free, extend it to whatever editing platform I like through its open API, and it was a part of an open project.
But right now it looks like they'll charge, and that bugs me.
I think this situation is slightly more complex but that sentiment is at the heart of a lot of pushback against things like this.
$BigCorp: "I want to use Dev's code for commercial purposes, as he has explicitly granted me the right to do so."
Dev: "Wait, no not like that."
As much as I am a proponent of permissive licenses (my favorite is the wtfpl), you have to pick your license wisely especially if you're going to be picky about usage (Be it by $BigCorp, government agencies, or other companies that you might not be fond of).
If you really want "full control" over your code you have to make it proprietary.
The MIT license doesn't require attribution for small snippets, only for full copies or substantial portions.
Github/Microsoft is going to take your code, and then cut off your access to it. This is what the GPL was designed to fight, so they're going to try it this way instead.
Those who do not learn history yadda yadda.
If so, then what about private repositories with a permissive license that haven't been made public for whatever reason?
What about projects whose dependencies have permissive licenses but whose main repo doesn't? Can GitHub just go "oops!"?
I think the point that so much confusion exists regarding their product & possible violation of users' trust is a valid reason to be pissy.
But we didn't.
Where is this MIT-licensed code of yours? Because it definitely is not on your GitHub.
And also recently we saw GPT, which generates articles, and waifulabs, which generates ... waifus... To be honest I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.
I'm really waiting for this to blow up from the open source license angle. Freely combining code with different licenses is a hellish undertaking on its own. But even just re-using some, say, GPL code (even staying under the same license) without proper attribution is Forbidden with a capital F.
More like a defect of the approach: behavior like that is well known(1) to be basically guaranteed to happen with GPT-3 and similar models.
(1): By people involved in the respective science categories (Representation Learning/Deep Learning, NLP, etc.).
It's an interesting question.
1) When a human being reads code or a CS text book, we think of them extracting general principles from the code and so not having to repeat that particular code again. In contrast, what GPT-3 and Copilot seem to do is just extract sequences of little snippets, something that apparently requires them to regurgitate the text they've been trained on. That seems rather permanently dependent on the training corpus.
2) Human beings have a natural urge, a natural ethos, to help people learn. It's understandable. The thing is, when suddenly you're not talking about people but machines, the reason for this urge easily vanishes. Even if GitHub were extracting knowledge from the code, I wouldn't have a reason to help them do so, since that knowledge would be entirely their private property. They expect to charge people whatever they judge the going rate would be - why should anyone be helping them without similar compensation? That this is being done by "OpenAI", a company which went from open-nonprofit to closed-for-profit in a matter of a few years, should accent this point. We're nowhere near a system that could digest all the knowledge of humankind. But if we got there, one might argue the result should belong to humankind rather than to one genius entrepreneur. And having the result belong to one genius entrepreneur has some clear downsides.
TL;DR: The AI doesn't know it can't just copy-paste (from perfect memory), and as such it has learned to sometimes just copy-paste things.
The GPT model doesn't: "learn to understand the code and reproduce code based on that knowledge".
What it learns is a bit of understanding, but it is more akin to recombining and tweaking verbatim text snippets it has seen before, without really understanding them or the concept of "not just copy/pasting code" (though it does know which patterns "fit together").
This means that the model will, "if it fits", potentially copy/paste code "from memory" instead of writing new code which just happens to be the same or similar. It's like a person with perfect memory sometimes copy-pasting code they have seen before while pretending they wrote it based on their "knowledge". Except worse, as it will also copy semantically irrelevant comments or sensitive information (if not filtered out before training).
I.e. there is a difference between "having a different kind of understanding" and "vastly missing understanding but compensating it by copying remembered code snippets from memory".
Theoretically it might be possible to create a GPT model which is forced to (somewhat) understand programming without memorizing text snippets, but practically I think we are still far away from this, as it's really hard to tell whether a model has memorized copyright-protected code.
A similar product, TabNine, has been around for years. It does essentially the exact same thing as Copilot, it’s trained on essentially the same dataset, and it gets mentioned in just about every thread on here that talks about AI code generation. (It’s a really cool product btw and I’ve been using and loving it for years). According to their website they have over 1M active users.
Why is this suddenly a huge big deal and why is everyone suddenly freaking out about Copilot? Is it because it’s GitHub and Microsoft and OpenAI behind Copilot vs some small startup you’ve never heard of? Is it just that the people freaking out weren’t paying attention and didn’t realize this service already existed?
Also, tabnine has a smaller scope; you type "var " and it suggests a variable name and possibly the rest of the line, like autocomplete has been doing for decades. Perfectly normal.
My understanding of copilot is that you can type "// here's a high-level description of my problem" and it'll fill out entire functions, dozens of lines. The scope is much grander.
I don’t see how? The question is about the ethics of building such a tool, not whether anyone is forced to use it.
I think some are also beginning to feel an Amazonification happening. We built all the stuff and made it free, but now a company is going to own it and profit off of it.
Edit: If we want to prevent this, we need a new license that states our code may not be included in deep learning training sets.
Edit 2: if private repository code is in this training set, it may be possible to leak details of private company infrastructure. Models can leak training data.
Otherwise, I'm honestly trying to have a conversation on this to understand the objections, because I haven't made up my mind but struggle to see the problem. So please consider the following:
if the code was not encumbered by restrictions I don't see an obvious problem with this. Using code or data or anything like that in the public commons for a meta analysis doesn't strike me as wrong, even if the people doing it make money off of that analysis.
If I scraped GitHub code and then wrote a book about common coding patterns & practices I don't think that would be wrong.
I used the Brown corpus and multiple other written-word corpuses (corpora?) along with WordNet and other sources to write my thesis in Computational Linguistics on Word Sense Disambiguation, later applying it in my job, which earns me money. Is this wrong?
Public datasets have been used extensively for ML already. I don't see this as much different.
It did. It's spitting out the AGPL in empty files, and AGPL'd code isn't free for commercial use. It requires people who use it to make changes available under the same license.
However, the gray area is that the massive dataset of which it is a part will spit out new code that has, in some way big or small, been influenced by the AGPL code, which... well, I don't think that sort of use was anticipated by the terms of the AGPL. I can see reasonable arguments in both directions. Personally though, I would favor an interpretation that limits GitHub's use for commercial purposes, if not for strict licensing restrictions then at least for the spirit of these licenses.
In truth, I would very much have liked GitHub to have gone out big & loud with an aggressive awareness campaign asking repo owners to opt in to the use of their code for this. Again, for purely open source licenses I don't think that would be required, but I still think it would have been the right thing to do. And it certainly would have been less damaging to their reputation, and less likely to make project maintainers hesitant to trust GitHub with their code in the future.
I don't think this will be a tipping point by itself, but if this behavioral pattern continues I could imagine devs big & small shifting to hosted or on-prem instances of things like GitLab.
If a whole project was copied verbatim and the license violated I think everyone would agree that was wrong. So then is copying the same quantity of code across 1000 projects wrong?
Is setting up a process and a system that does that systematically, at scale with intent and then commercialises the result wrong?
There's entire websites dedicated to GPL violations; people do care.
What if that private key you accidentally committed, pushed, removed and pushed last week to your private repo is now showing up in everybody's Copilot?
If they're not going to act accordingly, there's no reason someone couldn't roll their own GitLab instance, or a competitor with more respect couldn't enter the marketplace.
> ...
> It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.
Huh? Over the last few days that I've watched this "copilot" story unfold on various news aggregator sites, I've first seen people point out copyright and other issues with it, then the fast inverse square root tweet happened, and then more articles and tweets like this one and the discussion that we are currently having. But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it. Did I miss something?
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
That would be exciting tech for me.
What you just saw 3 days ago was a hype-driven unveiling of a cherry-picked contraption by GitHub, OpenAI and Microsoft. Open source became the loser once again, got taken advantage of by this clever trick, and the result will soon become a paid service. (With lots of code that is under copyright of various authors.)
Anyone who critiqued the announcement three days ago was drowned out, downvoted and stamped on by the fanatics.
I wanted to see those who had access to it (Not GitHub or Microsoft fans) to demystify and VERIFY the claims rather than blindly trust it. Those suspicions by the skeptics were right, and lots of questions still remain unanswered.
Well done for re-centralising everything to GitHub. Again.
Time to move on to the carbon age I suppose.
I do pity the poor algorithm that has to parse sense into my coding idiosyncrasies.
* a person who follows popular trends, rather than
* a person who finds/dissects clever/unique solutions to add to their tool belt
Honestly, I'm pretty sure ML hates my guts. Anything I've ever used involving it ends up burying my voice and slowly trying to etch away the parts of me that aren't normal enough.
If not, will anybody quietly slip something like this into Copilot's training data?
But, in that case, I think that the accusations leveled at GitHub are not right.
I think the idea is nice and that it is a fair use of open source code. Anyone is free to download free software and do something similar, and that is nice.
I just find the product itself stupid, and it is up to users to be smart enough not to use it, knowing that there is a risk of being sued for involuntarily violating copyright. And GitHub might be at risk if it is a paid service, as companies could sue them back by claiming that they expected the code generated by GitHub to be safe for commercial use.
Also, I would think that GH would have been abusive if they had used private repo code to train their model without permission.
This means that if Copilot does not attribute code when it copies and modifies it, then it is violating most open source licenses. Full stop.
So, if you just use Copilot to generate random things, you are OK. But if you use the generated code for anything (distribution, selling, possibly even plain usage), then you are violating the licenses in the same way as if you had taken the parts of code to reuse yourself.
So users of Copilot should either avoid that or be very careful to check every line produced (which is almost impossible).
Also, by itself, copying one or two lines of code can hardly be restricted by copyright. But, as we saw, Copilot can spit out big, full blocks of code from existing projects.
Microsoft of course will implement compliance standards as necessary (they genuinely do not want to break the law), but what does this mean for smaller companies and individuals training models?
Additionally, "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy.""
The above isn't to say I agree with this but just to highlight the dangers of outsourcing and the cloud.
> "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy."
this is definitely not the case for 100% of the rest of the world
They will do whatever they want with your code.
MS didn't change a bit.
especially: Conclusion and Next Steps.
This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.
The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.
This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
So their defense along the lines of "oh it's fine, it very rarely emits verbatim things" is bullshit anyway. That's an answer to the wrong question, at least given that the answer goes in this direction (if there were tons of verbatim recitation, they obviously would not try to wave the problem away like that). We cannot conclude anything from verbatim output being rare, despite them stating it as if it were a quite central and strong argument.
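The "duplication search" GitHub describes could in principle be a hashed n-gram lookup against an index of the training set. A toy sketch; the window size, tokenization, and corpus here are all illustrative assumptions, and a real system would work at vastly larger scale:

```python
# Toy sketch of training-set overlap detection via n-gram lookup.
# Window size, whitespace tokenization, and the tiny "corpus" are all
# illustrative assumptions, not Copilot's actual mechanism.

N = 5  # n-gram window, in tokens

def ngrams(tokens, n=N):
    # All contiguous n-token windows of the input, as a set for fast lookup.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Pretend this is the indexed training corpus (one famous snippet's tokens).
training_corpus = "float q_rsqrt ( float number ) { long i ; float x2 , y ;".split()
training_index = ngrams(training_corpus)

def overlaps_training_set(suggestion: str) -> bool:
    # A suggestion "quotes" the training set if any of its n-grams match.
    return bool(ngrams(suggestion.split()) & training_index)

print(overlaps_training_set("float q_rsqrt ( float number ) {"))  # True
print(overlaps_training_set("def add ( a , b ) : return a + b"))  # False
```

With an index like this, the UI could flag a suggestion and point at the matching source file, which is roughly what their "tell you where it's quoted from" proposal amounts to.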
"""
4. License Grant to Us
We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
...
"""
Note that the relevant detail is that this applies to public repositories not covered under some free/libre license. I also assume this excludes private repos, which might have more restrictive terms of use. GitHub has a section on that; I just haven't read it in detail, so maybe the above covers private repos as well.
[0] https://docs.github.com/en/github/site-policy/github-terms-o...
> We are obsessed with shiny without considering that it might be sharp.
If the creators interests are not clearly expressed anymore with a license, we need updates to the license texts.
Let's look at MIT:
____________________
"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. [...]
____________________
From the license text alone, it would not be clear to me, why anyone could claim that the OpenAI codex or the Github Copilot would require attribution to any of the used MIT source code to generate the AI model. The AI model is simply not a copy of the source or of a portion thereof. It is essentially a mathematical / statistical analysis of it.
Now what about any generated new source? How similar does it need to be to any source to be a copy? At what size of the generated code it qualifies to be a copy instead of a snippet of industry best practice?
Where does the responsibility for attribution lie? Should we treat the AI code generation models like a copy & paste program? Usually you cannot really say where the copy came from 100% - how do you know what factors influenced it?
Let's handle the simplest case first: Copilot can and does regurgitate large pieces of its training dataset verbatim. This is a well-known and trivially demonstrable property of all ML models in this family. Would such exact copy fall under the license of the code being copied? This of course needs to be tested in courts, but my gut says "yes". The problem now is, if you're using Copilot, you may end up with such copied code in your codebase without ever knowing, and this might open you to liability.
It's not that crazy.
Maybe it's my information bubble, but I don't see anyone cheering. Currently Copilot is churning out rather bad code. I definitely would not use it. And my prediction is that it will go like Tesla's autopilot for years.
(Also, GPT-3 wasn't trained on nearly as much writing as that. Even if you ignore lost writing, GPT-3 was trained on a small subset of the 'net.)
A lot of people dislike them and minimize their use.
More importantly, we are seeing a bait-and-switch. People agreed on GitHub storing, showing and indexing their code and issues, not using the code for Copilot, regardless of what the fine print in the usage agreement says.
Maybe people should be mad about what Facebook or Google do but that stuff doesn't involve taking stuff outside their terms of use.
Maybe Github could try attaching a "we can relicense all your code whenever we want" condition to their hosting but they'd lose all their business.
...what?
Information that is aggregated and organized for easy retrieval is worth more than the sum of individual bits of information. I thought that was common sense.
We might as well complain that billionaire supermarket chains are pocketing all the profit while not growing a single potato by themselves.
Are you making a claim that Netflix shouldn't be required to pay for individual movies because they sell a collection of movies?
So it won't copy-paste your code. It has just read code from open sources and learned from it, similar to what humans do. So I don't see any problem with this.
Second, we can't ignore that if someone deliberately tries to make it spit out copyrighted code, the chances are going to be much greater.
Why would anyone? Plausible deniability: "I didn't copy this GPL procedure, the copilot gave it to me!"
That means that every week, there will be 1000 verbatim copy-pastes of code by Copilot. Then multiply that by a year or more as Copilot gets older.
0.1% may not seem like a lot, but at the scale of Internet companies, it always is.
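A back-of-the-envelope version of that scale argument; the user and usage figures below are made up purely for illustration, only the ~0.1% rate comes from GitHub's own analysis:

```python
# Rough scale arithmetic: even a 0.1% verbatim-copy rate adds up.
# The user and per-user numbers are illustrative assumptions, not data.

verbatim_rate = 0.001           # ~0.1% of suggestions, per GitHub's analysis
users = 100_000                 # hypothetical active Copilot users
suggestions_per_user_week = 10  # hypothetical accepted suggestions/user/week

copies_per_week = verbatim_rate * users * suggestions_per_user_week
print(int(copies_per_week))  # 1000 verbatim copies per week
```

Change the assumptions however you like; with millions of users the count only grows.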
You might want to check out this video...
Original code in somebody's GitHub repo:
int x = y + z;
Copilot code: int Eisaa7ha = Wu8iazo7 + Roh0Eesh;
Not copy-pasted! Uniquely generated! Never before seen! It's a NET POSITIVE FOR EVERYBODY.
Copyright cuts both ways. Free Software and Open Source Software exist in the context of, and because of, copyright laws. This means that a person or a company using output from Copilot may be engaging in copyright infringement. In other words, Copilot is enabling software piracy.
I might be sympathetic to it, and even consider it mostly positive, but then if companies can use my code ignoring the license, I want to be able to Torrent their products in peace too.
You just did.