(And charging for a product which adds value to your developer experience and needs money to be run is not a bad thing)
Uploading it to Github does not transfer ownership or imply allowances for any use. If you upload it without a license it is a copyright violation to copy the code. Even with an MIT license it is a copyright violation to copy the code without attribution.
> I don't see the point of being pissy about Github using it, I'm saying this as someone who's written quite a lot of MIT code.
People are probably angry because this is yet another case of a big multinational corporation abusing unclear or difficult to enforce legislation for profit.
it's worse than that: it's Microsoft trying to completely undermine the concept of open source
meanwhile: they're unaffected as their high-value proprietary code remains private and doesn't train the model
If you upload code to GitHub, you grant them (and every GitHub user) a license to do exactly what Copilot does.
This ToS change happened in 2017, and I actually had to get approval from all contributors of my projects to accept the changed ToS: https://github.com/justjanne/QuasselDroid-ng/issues/5
What GitHub’s doing is shady, but it’s been obvious it was going to happen for years.
Probably a copyright violation. There are surely circumstances in which copying a small portion would either fall under fair use, or for other reasons not constitute a violation. The question then is whether or not Copilot is causing a violation. I don't think it's as clear cut as most commenters are making out.
All in all though, it's probably going to take a few court cases to figure out. In the meantime, I'd expect most companies to steer clear of Copilot.
Pretty much anyone can scrape GitHub and train their model.
What exactly the legal implications of this are has yet to be tested.
Pretty much every model is susceptible to some sort of model inversion or set inclusion attack.
By their own admission, Copilot sometimes outputs PII that was part of the training code, as well as code snippets verbatim. Even if it's rare (iirc around 0.1%), it's still a huge legal liability for anyone who uses the tool, especially since it's unclear how these inclusions are distributed and what triggers them. For example, it could be that a particular coding style, a particular way of using Copilot, or working on a specific subset of problems increases the likelihood of this occurring.
ML is too new to have been tested in court, and this has ramifications beyond just licensing. For example, if you use PII to train a model and then receive a GDPR deletion request, do you need to throw away and retrain your model?
I don't think people should be angry; however, I also think that this needs to be tested in court, and multiple times, before it can be "safe to use".
But I also don’t think that the ML model is necessarily a derivative work.
For example, if you use copyleft material to construct a CS course, someone would be hard pressed to argue that the course now needs to be released freely, let alone that anything the students write after attending the course would be a derivative work too.
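The "set inclusion" (membership-inference) attacks mentioned above often boil down to a loss-threshold test: a model tends to assign suspiciously low loss to examples it was trained on. A toy sketch of the idea, with the model stubbed out and every name and number purely hypothetical:

```python
# Toy loss-threshold membership-inference test (all names/numbers hypothetical).
# Idea: a model usually assigns lower loss to samples it memorized during
# training, so an unusually low loss hints the sample was in the training set.

def model_loss(sample: str) -> float:
    # Stand-in for a real model's per-sample loss; faked with a tiny
    # "memorized set" so the example stays self-contained and runnable.
    memorized = {"int x = y + z;", "secret_api_key = 'abc123'"}
    return 0.01 if sample in memorized else 2.5

def likely_in_training_set(sample: str, threshold: float = 0.1) -> bool:
    # Flag the sample as a probable training-set member if its loss
    # falls below the chosen threshold.
    return model_loss(sample) < threshold

print(likely_in_training_set("secret_api_key = 'abc123'"))  # True
print(likely_in_training_set("total += item.price"))        # False
```

Real attacks calibrate the threshold against reference models rather than hard-coding it, but the shape of the test is the same.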
That was my take originally, but apparently this is not as cut and dry as you may think:
https://www.technollama.co.uk/is-githubs-copilot-potentially...
I believe that gives me the right to be mad and to demand they fix their violations, one way or another.
"The above copyright notice and this permission notice shall be included in all COPIES OR SUBSTANTIAL PORTIONS of the Software."
Reusing a snippet doesn't require reproducing the MIT license. People who publish MIT software know they're giving their code out with basically no strings attached.
However, GitHub should be careful with the GPL variety.
You would sue.
And then GitHub would argue that their algorithms did not spit out the code verbatim by copying, but rather generated code that looked exactly like the other code based on learning from millions of codebases.
And then there would be lots of lawyers.
And then a judge would have to decide.
We shall see, by Googling some of the code it spits out.
FWIW GPT-3 doesn't really tend to spit out verbatim reproductions of copyrighted books.
Innovation should push boundaries.
The Apache 2 license allows for commercial use, but has implications for the way you can enforce your software patents. It also requires distributing the license file along with your application.
Complaining that companies use the software you told the world was free to use without restriction is dumb. However, not everyone gives away their software for free without restrictions. The fact that Github isn't respecting those licenses is a much bigger problem.
The tool autocompleting some random guy's personal information because he uploaded his blog to Github is highly problematic. The idea of using permissively licensed code to train an AI is not bad, but some human with knowledge of software licenses would need to pre-select those projects.
If all code came from one of those "do whatever the fuck you want" licenses, then there wouldn't be a problem. I'd consider it to be a great product and have no issue paying a fee. There's a huge market for a Copilot product, but this iteration just.. isn't it.
The GPL is completely compatible with commercial use. You just need to share modifications to the source with anyone you share the binary with. Many tech companies make extensive use of GPL software, and since they are not providing binaries to their end users they don't even have to share their changes to the source.
Even the AGPL, which does require you to share the source with users, still completely allows commercial use (though not compatible with as many business models).
But what's bothering me about this is that it's not a small company doing this. It's a company that's got crazy amounts of cash, who has been trying to trade on a "we're nice now and we love open source" image in the last few years, now taking all the open source code and balling it up in a closed-source app they will charge us for.
I'd be fine if I got to use it for free, extend it to whatever editing platform I like through its open API, and it was a part of an open project.
But right now it looks like they'll charge, and that bugs me.
I think this situation is slightly more complex but that sentiment is at the heart of a lot of pushback against things like this.
$BigCorp: "I want to use Dev's code for commercial purposes, as he has explicitly granted me the right to do so."
Dev: "Wait, no not like that."
As much as I am a proponent of permissive licenses (my favorite is the wtfpl), you have to pick your license wisely especially if you're going to be picky about usage (Be it by $BigCorp, government agencies, or other companies that you might not be fond of).
If you really want "full control" over your code you have to make it proprietary.
The MIT license doesn't require attribution for small snippets, only for full copies or substantial portions.
Github/Microsoft is going to take your code, and then cut off your access to it. This is what the GPL was designed to fight, so they're going to try it this way instead.
Those who do not learn history yadda yadda.
If so, then what about private repositories with a permissive license that haven't been made public for whatever reason?
What about projects whose dependencies have permissive licenses but whose main repo doesn't? Can GitHub just go "oops!"?
I think the point that so much confusion exists regarding their product & possible violation of users' trust is a valid reason to be pissy.
But we didn't.
Where is this MIT-licensed code of yours? Because it definitely is not on your GitHub.
And also recently we saw GPT, which generates articles, and waifulabs, which generates ... waifus... To be honest I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.
I'm really waiting for this to blow up from the open source license angle. Freely combining code with different licenses is a hellish undertaking on its own. But even just re-using some, say, GPL code (even staying under the same license) without proper attribution is Forbidden with a capital F.
More like a defect of the approach: behavior like that is well known(1) to be basically guaranteed to happen with GPT-3 and similar models.
(1): By people involved in the respective science categories (Representation Learning/Deep Learning, NLP, etc.).
It's an interesting question.
1) When a human being reads code or a CS text book, we think of them extracting general principles from the code and so not having to repeat that particular code again. In contrast, what GPT-3 and Copilot seem to do is just extract sequences of little snippets, something that apparently requires them to regurgitate the text they've been trained on. That seems rather permanently dependent on the training corpus.
2) Human beings have a natural urge, a natural ethos, to help people learn. It's understandable. The thing is, when suddenly you're not talking about people but machines, the reason for this urge easily vanishes. Even if GitHub were extracting knowledge from the code, I wouldn't have a reason to help them do so, since that knowledge would be entirely their private property. They expect to charge people whatever they judge the going rate would be - why should anyone be helping them without similar compensation? That this is being done by "OpenAI", a company which went from open-nonprofit to closed-for-profit in a matter of a few years, should accent this point. We're nowhere near a system that could digest all the knowledge of humankind. But if we got there, one might argue the result should belong to humankind rather than to one genius entrepreneur. And having the result belong to one genius entrepreneur has some clear downsides.
TL;DR: The AI doesn't know it can't just copy-paste (from perfect memory), and as such it has learned to sometimes just copy-paste things.
The GPT model doesn't: "learn to understand the code and reproduce code based on that knowledge".
What it learns is a bit of understanding, but it is more akin to recombining and tweaking verbatim text snippets it has seen before, without really understanding them or the concept of "not just copy/pasting code" (though it does know which patterns "fit together").
This means that the model will, "if it fits", potentially copy/paste code "from memory" instead of writing new code which just happens to be the same or similar. It's like a person with perfect memory sometimes copy-pasting code they have seen before while pretending they wrote it based on their "knowledge". Except worse, as it will also copy semantically irrelevant comments or sensitive information (if not filtered out before training).
I.e. there is a difference between "having a different kind of understanding" and "vastly missing understanding but compensating it by copying remembered code snippets from memory".
Theoretically it might be possible to create a GPT model which is forced to (somewhat) understand programming without memorizing text snippets, but practically I think we are still far away from this, as it's really hard to tell whether a model has memorized copyright-protected code.
A similar product, TabNine, has been around for years. It does essentially the exact same thing as Copilot, it’s trained on essentially the same dataset, and it gets mentioned in just about every thread on here that talks about AI code generation. (It’s a really cool product btw and I’ve been using and loving it for years). According to their website they have over 1M active users.
Why is this suddenly a huge big deal and why is everyone suddenly freaking out about Copilot? Is it because it’s GitHub and Microsoft and OpenAI behind Copilot vs some small startup you’ve never heard of? Is it just that the people freaking out weren’t paying attention and didn’t realize this service already existed?
Also, tabnine has a smaller scope; you type "var " and it suggests a variable name and possibly the rest of the line, like autocomplete has been doing for decades. Perfectly normal.
My understanding of copilot is that you can type "// here's a high-level description of my problem" and it'll fill out entire functions, dozens of lines. The scope is much grander.
I don’t see how? The question is about the ethics of building such a tool, not whether anyone is forced to use it.
I think some are also beginning to feel an Amazonification happening. We built all the stuff and made it free, but now a company is going to own it and profit off of it.
Edit: If we want to prevent this, we need a new license that states our code may not be included in deep learning training sets.
Edit 2: if private repository code is in this training set, it may be possible to leak details of private company infrastructure. Models can leak training data.
Otherwise, I'm honestly trying to have a conversation on this to understand the objections, because I haven't made up my mind but struggle to see the problem. So please consider the following:
if the code was not encumbered by restrictions I don't see an obvious problem with this. Using code or data or anything like that in the public commons for a meta analysis doesn't strike me as wrong, even if the people doing it make money off of that analysis.
If I scraped GitHub code and then wrote a book about common coding patterns & practices I don't think that would be wrong.
I used the Brown corpus and multiple other written-word corpuses (corpora?) along with WordNet and other sources to write my thesis in Computational Linguistics on Word Sense Disambiguation, later applying it in my job, which earns me money. Is this wrong?
Public datasets have been used extensively for ML already. I don't see this as much different.
It did. It's spitting out the AGPL in empty files, and AGPL'd code isn't free for commercial use. It requires people who use it to make changes available under the same license.
However, the gray area is that the massive dataset of which it is a part will spit out new code that has, in some way big or small, been influenced by the AGPL code, which... well, I don't think that sort of use was anticipated by the terms of the AGPL. I can see reasonable arguments in both directions. Personally though, I would favor an interpretation that limits GitHub's use for commercial purposes, if not for strict licensing restrictions then at least for the spirit of these licenses.
In truth, I would very much have liked GitHub to have gone out big & loud with an aggressive awareness campaign asking repo owners to opt in to the use of their code for this. Again, for purely open source licenses I don't think that would be required, but I still think it would have been the right thing to do. And it certainly would have been less damaging to their reputation, and less likely to make project maintainers hesitant to trust GitHub with their code in the future.
I don't think this will be a tipping point by itself, but if this behavioral pattern continues I could imagine devs big & small shifting to hosted or on-prem instances of things like GitLab.
If a whole project was copied verbatim and the license violated I think everyone would agree that was wrong. So then is copying the same quantity of code across 1000 projects wrong?
Is setting up a process and a system that does that systematically, at scale with intent and then commercialises the result wrong?
There's entire websites dedicated to GPL violations; people do care.
What if that private key you accidentally committed, pushed, removed and pushed last week to your private repo is now showing up in everybody's Copilot?
If they're not going to act accordingly, there's no reason someone couldn't roll their own GitLab instance, or a competitor with more respect couldn't enter the marketplace.
> ...
> It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.
Huh? Over the last few days that I've watched this "copilot" story unfold on various news aggregator sites, I've first seen people point out copyright and other issues with it, then the fast inverse square root tweet happened, and then more articles and tweets like this one and the discussion that we are currently having. But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it. Did I miss something?
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
That would be exciting tech for me.
What you just saw 3 days ago was a hype-driven unveiling of a cherry-picked contraption by GitHub, OpenAI and Microsoft. Open source became the loser once again, got taken advantage of by this clever trick, and the result will soon become a paid service. (With lots of code that is under copyright of various authors.)
Anyone who critiqued the announcement three days ago was drowned out, downvoted and stamped on by the fanatics.
I wanted to see those who had access to it (Not GitHub or Microsoft fans) to demystify and VERIFY the claims rather than blindly trust it. Those suspicions by the skeptics were right, and lots of questions still remain unanswered.
Well done for re-centralising everything to GitHub. Again.
Time to move on to the carbon age I suppose.
I do pity the poor algorithm that has to parse sense into my coding idiosyncrasies.
* a person who follows popular trends, rather than
* a person who finds/dissects clever/unique solutions to add to their tool belt
Honestly, I'm pretty sure ML hates my guts. Anything I've ever used involving it ends up burying my voice and slowly trying to etch away the parts of me that aren't normal enough.
If not, will anybody quietly slip something like this into Copilot's training data?
But, in that case, I think that the accusations leveled at GitHub are not right.
I think the idea is nice and that it is a fair use of open source code. Anyone is free to download free software and do something similar, and that is nice.
I just find the product itself stupid, and it is up to users to be smart enough not to use it, knowing that there is a risk of being sued for involuntarily violating copyright. And GitHub might be at risk if it is a paid service, as companies could sue them back by claiming that they expected the code generated by GitHub to be safe for commercial use.
Also, I would think that GH would have been abusive if they had used private repo code to train their model without permission.
This means that if Copilot does not attribute code when it copies and modifies it, then it is violating most open source licenses. Full stop.
So, if you just use Copilot to generate random things, you are OK. But if you use the generated code for anything (distribution, selling, possibly even plain usage), then you are violating the licenses in the same way as if you had taken the parts of code to reuse yourself.
So users of Copilot should either avoid that or be very careful to check every line produced (which is almost impossible).
Also, by itself, copying one or two lines of code can hardly be restricted by copyright. But, as we saw, Copilot can spit out big, full blocks of code from existing projects.
Microsoft of course will implement compliance standards as necessary (they genuinely do not want to break the law), but what does this mean for smaller companies and individuals training models?
Additionally, "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy.""
The above isn't to say I agree with this but just to highlight the dangers of outsourcing and the cloud.
> "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy."
this is definitely not the case for 100% of the rest of the world
They will do whatever they want with your code.
MS didn't change a bit.
especially: Conclusion and Next Steps.
This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.
The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.
This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
So their defense along the lines of "oh it's fine, it very rarely emits verbatim things" is bullshit anyway. That's an answer to the wrong question, at least given that the answer goes in this direction (if there were tons of verbatim recitation, they obviously would not try to wave the problem away like that). We cannot conclude anything from verbatim output being rare, despite them stating it as if it were a quite central and strong argument.
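The "duplication search" GitHub describes could in principle be a hashed n-gram lookup against an index of the training set. A toy sketch; the window size, tokenization, and corpus here are all illustrative assumptions, and a real system would work at vastly larger scale:

```python
# Toy sketch of training-set overlap detection via n-gram lookup.
# Window size, whitespace tokenization, and the tiny "corpus" are all
# illustrative assumptions, not Copilot's actual mechanism.

N = 5  # n-gram window, in tokens

def ngrams(tokens, n=N):
    # All contiguous n-token windows of the input, as a set for fast lookup.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Pretend this is the indexed training corpus (one famous snippet's tokens).
training_corpus = "float q_rsqrt ( float number ) { long i ; float x2 , y ;".split()
training_index = ngrams(training_corpus)

def overlaps_training_set(suggestion: str) -> bool:
    # A suggestion "quotes" the training set if any of its n-grams match.
    return bool(ngrams(suggestion.split()) & training_index)

print(overlaps_training_set("float q_rsqrt ( float number ) {"))  # True
print(overlaps_training_set("def add ( a , b ) : return a + b"))  # False
```

With an index like this, the UI could flag a suggestion and point at the matching source file, which is roughly what their "tell you where it's quoted from" proposal amounts to.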
"""
4. License Grant to Us
We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
...
"""
Note that the relevant detail is that this applies to public repositories not covered under some free/libre license. I also assume this excludes private repos, which might have more restrictive terms of use. GitHub has a section on that; I just haven't read it in detail, so maybe the above covers private repos as well.
[0] https://docs.github.com/en/github/site-policy/github-terms-o...
> We are obsessed with shiny without considering that it might be sharp.
If the creators interests are not clearly expressed anymore with a license, we need updates to the license texts.
Let's look at MIT:
____________________
"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. [...]
____________________
From the license text alone, it would not be clear to me, why anyone could claim that the OpenAI codex or the Github Copilot would require attribution to any of the used MIT source code to generate the AI model. The AI model is simply not a copy of the source or of a portion thereof. It is essentially a mathematical / statistical analysis of it.
Now what about any generated new source? How similar does it need to be to any source to be a copy? At what size of the generated code it qualifies to be a copy instead of a snippet of industry best practice?
Where does the responsibility for attribution lie? Should we treat the AI code generation models like a copy & paste program? Usually you cannot really say where the copy came from 100% - how do you know what factors influenced it?
Let's handle the simplest case first: Copilot can and does regurgitate large pieces of its training dataset verbatim. This is a well-known and trivially demonstrable property of all ML models in this family. Would such exact copy fall under the license of the code being copied? This of course needs to be tested in courts, but my gut says "yes". The problem now is, if you're using Copilot, you may end up with such copied code in your codebase without ever knowing, and this might open you to liability.
It's not that crazy.
Maybe it's my information bubble, but I don't see anyone cheering. Currently Copilot is churning out rather bad code. I definitely would not use it. And my prediction is that it will go like Tesla's autopilot for years.
(Also, GPT-3 wasn't trained on nearly as much writing as that. Even if you ignore lost writing, GPT-3 was trained on a small subset of the 'net.)
A lot of people dislike them and minimize their use.
More importantly, we are seeing a bait-and-switch. People agreed on GitHub storing, showing and indexing their code and issues, not using the code for Copilot, regardless of what the fine print in the usage agreement says.
Maybe people should be mad about what Facebook or Google do but that stuff doesn't involve taking stuff outside their terms of use.
Maybe Github could try attaching a "we can relicense all your code whenever we want" condition to their hosting but they'd lose all their business.
...what?
Information that is aggregated and organized for easy retrieval is worth more than the sum of individual bits of information. I thought that was common sense.
We might as well complain that billionaire supermarket chains are pocketing all the profit while not growing a single potato by themselves.
Are you making a claim that Netflix shouldn't be required to pay for individual movies because they sell a collection of movies?
So it won't copy-paste your code. It has just read code from open sources and learned from it, similar to what humans do. So I don't see any problem with this.
Second, we can't ignore that if someone deliberately tries to make it spit out copyrighted code, the chances are going to be much greater.
Why would anyone? Plausible deniability: "I didn't copy this GPL procedure, the copilot gave it to me!"
That means that every week, there will be 1000 verbatim copy-pastes of code by Copilot. Then multiply that by a year or more as Copilot gets older.
0.1% may not seem like a lot, but at the scale of Internet companies, it always is.
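A back-of-the-envelope version of that scale argument; the user and usage figures below are made up purely for illustration, only the ~0.1% rate comes from GitHub's own analysis:

```python
# Rough scale arithmetic: even a 0.1% verbatim-copy rate adds up.
# The user and per-user numbers are illustrative assumptions, not data.

verbatim_rate = 0.001           # ~0.1% of suggestions, per GitHub's analysis
users = 100_000                 # hypothetical active Copilot users
suggestions_per_user_week = 10  # hypothetical accepted suggestions/user/week

copies_per_week = verbatim_rate * users * suggestions_per_user_week
print(int(copies_per_week))  # 1000 verbatim copies per week
```

Change the assumptions however you like; with millions of users the count only grows.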
You might want to check out this video...
Original code in somebody's GitHub repo:
int x = y + z;
Copilot code: int Eisaa7ha = Wu8iazo7 + Roh0Eesh;
Not copy-pasted! Uniquely generated! Never before seen! It's a NET POSITIVE FOR EVERYBODY.
Copyright cuts both ways. Free Software and Open Source Software exist in the context of, and because of, copyright laws. This means that a person or a company using output from Copilot may be engaging in copyright infringement. In other words, Copilot is enabling software piracy.
I might be sympathetic to it, and even consider it mostly positive, but then if companies can use my code ignoring the license, I want to be able to Torrent their products in peace too.
You just did.