No, it's not in a "unique and dominant position". Open source code is freely available online, and it's almost trivial to build a bot to scrape OS code from anywhere on the web (GitHub included).
The comparison to the Google Books antitrust case falls down completely: Google had a dominant position because it had the resources to scan all books. Anyone can build a collection of almost all open source code.
Further to that, all these models (GPT and image generation) are trained on scraped data. Trying to suggest that only GitHub/Microsoft could do it defeats the purpose of trying to establish what the legal rights are over training models with scraped data.
We need test cases and precedent, but trying to use this as one is not going to work.
Edit:
It took me 15 seconds to find that there is a Google BigQuery dataset of open source code from GitHub: https://cloud.google.com/blog/topics/public-datasets/github-...
and that's been further curated on Hugging Face: https://huggingface.co/datasets/codeparrot/github-code
GitHub / Microsoft do not have a monopoly on this data.
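To illustrate how little stands between anyone and that data, here is a minimal Python sketch that streams the Hugging Face dataset and keeps only permissively licensed files. The `load_dataset` call follows the dataset card as I remember it and may need extra arguments on newer versions of the `datasets` library. The `license` field name and the lowercase license tags are assumptions to check against the dataset card, and `permissive_only` is a hypothetical helper, not part of any library.

```python
from typing import Dict, Iterable, Iterator, Set

# Assumed license tags; verify the exact strings on the dataset card.
PERMISSIVE: Set[str] = {"mit", "apache-2.0", "bsd-3-clause"}


def permissive_only(
    records: Iterable[Dict], allowed: Set[str] = PERMISSIVE
) -> Iterator[Dict]:
    """Yield only the records whose 'license' field is in `allowed`.

    The comparison is case-insensitive; records without a license are dropped.
    """
    for rec in records:
        if str(rec.get("license", "")).lower() in allowed:
            yield rec


def stream_github_code():
    """Open the dataset in streaming mode (needs network and `datasets`).

    The import is kept local so the filter above works without the library.
    """
    from datasets import load_dataset

    return load_dataset("codeparrot/github-code", streaming=True, split="train")


if __name__ == "__main__":
    # Print the repo and path of the first few permissively licensed files.
    for i, rec in enumerate(permissive_only(stream_github_code())):
        print(rec.get("repo_name"), rec.get("path"), rec.get("license"))
        if i >= 4:
            break
```

Streaming avoids downloading the whole dataset up front, which is the point: no special relationship with GitHub is needed to get started.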
I thought Google had a dominant position because they signed an exclusive deal with the authors guild that explicitly gave them a dominant position.
Anyone else could set up a project to go round libraries and scan books. Google has put more money into it than other organisations, but The Internet Archive has about 20 million scans (https://archive.org/details/texts).
That does put Microsoft in the unique position to have direct unfettered access to any and all open source code on GitHub without restrictions. Unless you or I get the same kind of direct access without rate limiting and antibot protection, then they do dominate and have an advantage over everyone else.
git clone
git remote set-url origin …
It’s much harder to copy Google’s index.
Has anyone actually tried? I've cloned lots of repos and have never been throttled. I'd go so far as to say the author of that post has never even tried it.
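For what it's worth, a bulk clone is not much code. Below is a minimal sketch, assuming you already have a list of repository URLs (the example URL is a placeholder, and `clone_command` / `clone_all` are hypothetical helper names). It uses shallow clones to keep bandwidth down; whether GitHub throttles at large scale is exactly the open question, so treat this as an experiment, not proof.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def clone_command(repo_url: str, dest: str, shallow: bool = True) -> List[str]:
    """Build the git command line for cloning `repo_url` into `dest`."""
    cmd = ["git", "clone", "--quiet"]
    if shallow:
        # --depth 1 fetches only the latest commit, not the full history.
        cmd += ["--depth", "1"]
    cmd += [repo_url, dest]
    return cmd


def clone_all(repo_urls: Iterable[str], workdir: str) -> List[str]:
    """Clone each URL into a numbered subdirectory; return URLs that failed."""
    failed = []
    Path(workdir).mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(repo_urls):
        dest = str(Path(workdir) / f"repo_{i}")
        result = subprocess.run(clone_command(url, dest))
        if result.returncode != 0:
            failed.append(url)
    return failed


if __name__ == "__main__":
    # Placeholder URL; substitute real repositories to try it.
    print(clone_all(["https://github.com/example/repo.git"], "clones"))
```

Counting the failures over a few thousand repos would settle the rate-limiting claim better than either of us guessing.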
Absolutely wrong. GitHub is doing way more than just hosting code. It hosts bug trackers, CI and much more. For most FOSS projects it's the ONLY place where you can go and submit a bug report.
It's not just a repository, it's a communication tool, and it refuses to interoperate with other platforms.
This is a monopoly, just like NPM and LinkedIn. Microsoft never changes.
Seems like a logistical nightmare to me. Git repos interact spectacularly poorly with web scraping in general.
I imagine you meant "precedent".
Copying a file is not the same thing as "scanning" a book. To scan you first need to get your hands on the book (the download part) and then use industrial scanners to scan it. So the apples-to-apples comparison here is scanning <-> training & scanned collection of books <-> trained model, and finally the portals to the loot: Google Books <~> GitHub+VSC.
Not everyone has the resources to actually process -- that is, train the 'model' -- using the publicly available 'data'. Most also don't own the GitHub and VSC platforms to field their model. In fact, is anyone other than Microsoft in a position to scrape OSS, train a coding AI, and then include that tool in dominant software development platforms?
However, that part of the argument feels like the less interesting CoPilot legal argument. The interesting one is: what's the license for use of the code it spits out? Any time CoPilot spits out a piece of code that a) exists verbatim on GitHub and b) is nontrivial enough to be copyrightable, then what happens? Just because it was chewed through the machine doesn't magically wipe the original GPL/MIT/BSD license it had on GitHub. CoPilot doesn't represent a "clean room".
Large companies tend to be extremely skittish about devs using IP they don't have rights to. I lived under a rule of "No open source licensed thing, at all, anywhere" for years in the early 2000s. Later, the rules were relaxed and obviously everyone uses MIT/BSD type stuff in commercial products these days, but management is still nervous about things like Stack Overflow answer code being copied verbatim (still verboten). So how can - if I understand things correctly - CoPilot be allowed or encouraged at such places now? Wouldn't exactly the same worry about nontrivial Stack Overflow snippets apply to CoPilot-produced code?
The author seems very confused and is mostly talking about copyright claims and then bizarrely starts talking about antitrust litigation.
There is a subtle difference here. Microsoft isn’t just producing code based on GitHub data. They are producing a tool that lets others generate code based on GitHub data. I do think consideration of the source data creators' intent is important, and there is a case that CoPilot hasn’t done that. But if Shutterstock wants to use any images _that they have been given license for and treat creators fairly for_ to build a tool that lets others generate images, they should be allowed.
Also, the op argues only MS has access to train based on all of GitHub. Others might run into rate limiting etc. However we know Amazon and others do have similar models. This would indicate MS may have a competitive edge but not a full market lockout.
Grand Upright Music, Ltd. v. Warner Bros. Records, Inc.
https://en.wikipedia.org/wiki/Grand_Upright_Music,_Ltd._v._W....
Looks like it's in the EU as well.
edit: Hm, Pelham v Hütter C-476/17 might offer some grace for mashups under the quotation exemption at least. Though I wouldn't rely on that.
The conclusion essentially boils down to "remixing is not fair use". Today's hip-hop is a direct result of that decision because sampling became prohibitively expensive.
Now they may have found a way. And that I think is the potential anti-trust issue here.
What is one of the main obstacles to Microsoft's monopoly dominance in the software sphere? The Linux kernel; it's everywhere. And it's under the GPL, a license explicitly resistant to "Embrace, extend, and extinguish" (old school Gates/Ballmer MS). Microsoft right now is not emphasizing an anti-Linux, anti-GPL focus, but it clearly has in the past and it (and others) could definitely do so again in the future.
Systems like CoPilot have the potential to be for the GPL (or other copyleft type licenses) what cryptocurrency 'mixers' or 'tumblers' are to money laundering laws. A potential to be an automated way to pull pieces of IP out of those licenses and into other codebases without respecting the obligations that go with it.
A lot of the dialog on here and other threads on this forum in the past shows me that understanding of copyleft licenses among the open source and developer community is really low right now. This is the license that the Linux kernel is licensed under, it is extremely important. There should be better recognition of the rights and responsibilities afforded by it.
The GPL was explicitly formulated as a way to protect portions of the hobbyist and free software community from potentially predatory commercial interests. Remember it's always possible to attempt to negotiate a commercial non-copyleft license with an entity that has released its source under the GPL. But if you don't, you have to respect its distribution requirements. It's fine to be personally opposed to using the GPL for your own work, but it is important to understand the obligations that come with it. And that includes systems that harvest data from it automatically.
Spoiler alert: Google was copying books in a manner considered fair use, consistent with Sony v Universal. I’m not sure why this author thinks this is irrelevant. The Federal court system surely won’t!
Displaying book excerpts also:
- Leaves the attribution and copyright intact.
- Is not intended to use excerpts verbatim or slightly modified, unless quoting them with attribution.
- May increase the sales of the book.
I agree with the OP of the submission that this case is entirely irrelevant for the CoPilot situation.
In late 2013, after the class action status was challenged, the District Court granted summary judgement in favor of Google, dismissing the lawsuit and affirming the Google Books project met all legal requirements for fair use. The Second Circuit Court of Appeal upheld the District Court's summary judgement in October 2015, ruling Google's "project provides a public service without violating intellectual property law." The U.S. Supreme Court subsequently denied a petition to hear the case.
https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
There actually is a convenient archive for accessing GitHub-hosted code in bulk. All GitHub source code is available for bulk analysis in Google BigQuery.
https://cloud.google.com/blog/topics/public-datasets/github-...
I still don't support GitHub training Copilot on other people's code without permission, but this particular part of OP's argument is incorrect.
I suspect many others who publish there feel the same way.
Software authors are not upset about the mere reuse of their code, it's the violations of such license terms that are problematic. If attribution is required, but neglected or impossible, that's typically known as "plagiarism", you know.
Second, the only aspects of code that need to follow the license are the parts of the code that are covered by copyright. That excludes anything that is functional. Since optimizations are functional and not expressive in nature, an optimized sorting algorithm, for example, would not be covered. What would be covered is how that algorithm is organized… the API, file structure, class names, i.e., the arbitrary parts of code that everyone argues about.
> didn't foresee this use
So you really didn’t want any use. You just wanted the use you found acceptable? So you didn’t really want it to be “open”.
Code I don't want others to use, I don't publish.
The concerns over copyrighted material ingested and exposed through AI systems are the same as for copyrighted material ingested and displayed by our web 2.0 search engines.
So, Microsoft GitHub Copilot also indexes publicly accessible content but emits that content differently; however, it does not exercise exclusive rights over what it indexes or control access to that content.
The Google Books and Authors Guild axis would have given exclusive monopoly access, distribution, and pricing of the largest collection of digital books in the world. So I don’t believe the comparison between the Google Books project and CoPilot is valid, because we have already accepted the concept of indexing and clipping content on the public internet.
Legal processes are generally slow.
I personally think Copilot is training on all the code. It's not verifiable so I go with the worst case scenario. But it shouldn't be a problem if you don't publish code that's licensed.
Lines of code shouldn't even be copyrightable. But that's a whole other discussion.
It’s not antitrust because GitHub isn’t a monopoly. And copilot only scanned public repos, so anyone could train, if they like.
Also this isn’t like the Google Books case because Google made the books available, violating copyright. GitHub has not made the code available. So these cases aren’t similar and aren’t antitrust.
Although comically, by using GitHub I grant them copyright to publish my public repo, so I suppose they could republish my repo in other ways without any additional permission. It would be interesting if their license allows them to rebundle and publish my repos in a book or something.
CoPilot reads and rearranges the IP that was created by millions of people who were working very hard and did not anticipate a code laundering machine when they wrote the code and the licenses.
When you publish something for others to view (text, images, code, whatever), others are allowed to view it. You can't anticipate how others view it, with their eyes or with screenreaders to assist. You can't stop them from reading it, thinking about it, discussing it with their friends, taking notes, summarizing it. You can't stop people from learning from your published content or recognizing patterns between it and other similar things.
Sorry, but you can't create a license that says "I will allow you to view this but you cannot learn from it. If you learn from it, you need to pay me."
The words that seem to fit best are transforming and adapting. In order to adapt something, one has to first learn from the original in order to produce the derivative work. This is, however, covered by copyright, since transforming and adapting are still considered a form of copying even if all people did was learn and produce something unique but similar to the original.
The license can say that "I will allow you to view this but you cannot create a derivative work from it".
Furthermore, while lots of hard work was put into the code that CoPilot used, that hard work was specifically donated with the intent that the code be reused. The only hard requirement being that the code remain free. The thing people are angry about with CoPilot is that it's a hosted OpenAI product with no freely-available model weights, and that generated code might be regurgitated from training data in some cases[1]. If CoPilot was actually open AI, nobody would be suing over it.
[0] In Sony v. Connectix, it was found that Connectix actually tried clean-room, black-box analysis of the PlayStation ROM, but abandoned it in favor of disassembling the whole thing. Connectix was still ruled non-infringing.
[1] Most egregiously, the comment "evil floating point bit level hacking" will make it spit out Quake III source. Microsoft worked around this by explicitly banning that particular phrase, which is just stupid.
Class structure, file structure, APIs…
Comparing this to Google Books is silly. Google stole copyrighted books. Copilot uses freely shared open source code. No copyright issue.
The article claims "Open source code on GitHub might be thought of as 'open and freely accessible' but it is not." Lol what? The MIT and Apache licenses explicitly allow reuse. Copilot can absolutely use open source data.
This is typical hype and FUD. No evidence Copilot even used all of GitHub's data or violated any licenses. Baseless speculation.
There's no real antitrust argument here. Nothing to see, move along. yawn
> No evidence Copilot [...] violated any licenses
Both of these allow redistribution _if you include the license_. Copilot doesn't include any licenses in the code it distributes. You can argue whether that's fair use or not, but you can't argue that it doesn't respect the license.