Why are there no antitrust claims vs. GitHub Copilot, when there is a precedent?

(thehftguy.com)

108 points · hippo77 · 3y ago · 123 comments

samwillis3y ago

> Microsoft GitHub is the largest collection of open source code in the world. Microsoft GitHub is in a unique and dominant position to host and access and distribute most of the open-source code in the world

No, it's not in a "unique and dominant position". Open source code is freely available online, it's almost trivial to build a bot to scrape OS code from anywhere on the web (GitHub included).

The comparison to the Google Books antitrust falls down completely, Google had a dominant position because it had the resources to scan all books. Anyone can build a collection of almost all open source code.

Further to that, all these models (GPT and Image generation) are trained on scraped data, trying to suggest that only GitHub/Microsoft could do it defeats the purpose of trying to establish what the legal rights are over training models with scraped data.

We need test cases and precedent, but trying to use this as one is not going to work.

Edit:

It took me 15 seconds to find that there is a Google BigQuery dataset of open source code from GitHub: https://cloud.google.com/blog/topics/public-datasets/github-...

and that's been further curated on Hugging Face: https://huggingface.co/datasets/codeparrot/github-code

GitHub / Microsoft do not have a monopoly on this data.

rjmunro3y ago

> Google had a dominant position because it had the resources to scan all books.

I thought Google had a dominant position because they signed an exclusive deal with the authors guild that explicitly gave them a dominant position.

Anyone else could set up a project to go round libraries and scan books. Google has put more money into it than other organisations, but The Internet Archive has about 20 million scans (https://archive.org/details/texts).

mschild3y ago

There certainly are other spaces where open source code is hosted and available, but the default for most is GitHub. I think it's in a similar position to Google 10 years ago. Sure there are other search engines, but Google is by and large the standard one.

That does put Microsoft in the unique position to have direct unfettered access to any and all open source code on GitHub without restrictions. Unless you or I get the same kind of direct access without rate limiting and antibot protection, then they do dominate and have an advantage over everyone else.

reissbaker3y ago

Not sure if you posted before the edits, but I'm pretty convinced by them, seeing as how there are multiple alternatives with the same data.

scarface743y ago

It’s really not that hard to

git clone

git remote set-url origin…

It’s much harder to copy Google’s index.

ChatGTP3y ago

You think it's practical to do this with almost all the public repos on Github?

jackdaniel3y ago

This is addressed in the same paragraph - you can't scan/download "whole" github because you'll be throttled.

neximo643y ago

Are you actually throttled if you try to git clone, or is that just the theory? Or is the assumption that it uses API calls to scrape through GitHub?

Has anyone actually tried? I've cloned lots of repos and have never been throttled. I'd go so far as to say the author of that post has never even tried it.

jackdaniel3y ago

I'm not arguing for or against whether they are in the dominant position; what I'm doing is pointing out that the grandparent quoted part of the text (and argues against it) without quoting the justification the author provided that is directly relevant to what they say.

> There’s an important notion to address here. Open source code on GitHub might be thought of as “open and freely accessible” but it is not. It’s possible for any person to access and download one single repo from GitHub. It’s not possible for a person to download all repos from Github or a percentage of all repos, they will hit limitations and restrictions when trying to download too many repos. (Unless there’s some special archives or mechanisms I am not aware of).

logifail3y ago

> Has anyone actually tried, because i've cloned lots of repos and have never been throttled

(Full disclosure: I have some pretty serious data hoarding issues)

When someone says "I've cloned lots of repos and have never been throttled" I'm afraid I immediately start wondering whether "lots" means multiple GB or multiple TB ... or more!

quickthrower23y ago

21 TB of data; they might rate limit you! It might be possible via proxies, but only for public repos.

williamcotton3y ago

There’s no need to crawl for your own dataset:

https://pile.eleuther.ai/

hanselot3y ago

@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

So if I understand this correctly, the Pile is for code from 2020 backwards? If I wanted anything released in the past 3 years, say something in the SOTA AI space (where a month is a lifetime), I would need the scraper again?

I don't follow how this can compare to direct, live, unrestricted access. I suppose this is just my own hatred of Microsoft shining through. Of course we should accept the status quo, because how dare we suggest Microsoft could operate in a manner that is anti-competitive.

For anyone else trying to catch up, just rent a datacenter, write a crawler, deal with all the intricacies of keeping it in sync in real-time. This sounds trivial, simple even.

I wonder why nobody is doing it? Perhaps everyone doesn't have access to petabytes of storage space, unlimited bandwidth, unlimited proxy-jumps etc.

So the alternative is to buy github?

goodpoint3y ago

> No, it's not in a "unique and dominant position". Open source code is freely available online, it's almost trivial to build a bot to scrape OS code from anywhere on the web (GitHub included).

Absolutely wrong. GitHub is doing way more than just hosting code. It hosts bug trackers, CI and much more. For most FOSS projects it's the ONLY place where you can go and submit a bug report.

It's not just a repository, it's a communication tool, and it refuses to interoperate with other platforms.

This is a monopoly, just like NPM and LinkedIn. Microsoft never changes.

bilqis3y ago

Github also has access to private repositories.

samwillis3y ago

They don't use private repositories to train Copilot.

zelphirkalt3y ago

Maybe not yet. All just a change of their terms away. Oh you don't like it? We will give you 2 weeks to migrate. Perhaps you want this other more expensive subscription?

Just like with other code they should not be using as they do, they would probably run another "ask questions later" approach.

bilqis3y ago

They say they don’t

marginalia_nu3y ago

> It's almost trivial to build a bot to scrape OS code from anywhere on the web.

Seems like a logistical nightmare to me. Git repos interact spectacularly poorly with web scraping in general.

mewpmewp23y ago

I would've said you should download only archives, but really I think commits are also very important data since that shows the actual changes in the code which would be very useful to train AI to suggest changes to the code.

marginalia_nu3y ago

There are valid non-evil reasons for git hosts to want to throttle and put up obstacles toward scraping as well, both via crawlers or 'git clone' or whatever. These are very expensive operations.

flockonus3y ago

It appears to be the exact opposite to me: `git clone --depth 1 ...` will give you code that you know exactly how to parse, vs. web pages that have all sorts of semantic issues.
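
A minimal sketch of the shallow-clone approach, in Python. The example URL, the destination layout, and the one-second pause are illustrative assumptions, not anything stated in the thread; whether a host throttles this at scale is exactly what is being disputed here.

```python
import subprocess
import time
from pathlib import Path


def shallow_clone_command(repo_url: str, dest: Path) -> list[str]:
    """Build a `git clone --depth 1` command (latest snapshot only, no history)."""
    return ["git", "clone", "--depth", "1", repo_url, str(dest)]


def clone_all(repo_urls, dest_root: Path, delay_seconds: float = 1.0, run=subprocess.run):
    """Shallow-clone each URL into dest_root/<repo name>, pausing between clones.

    `run` is injectable so the loop can be exercised without touching the network.
    """
    for url in repo_urls:
        # Hypothetical naming scheme: last path segment without a ".git" suffix.
        name = url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")
        run(shallow_clone_command(url, dest_root / name), check=True)
        time.sleep(delay_seconds)  # polite pause; the right value is an assumption
```

Note that `--depth 1` drops the commit history, so this only fits the case where the current snapshot of the code is enough. Repositories with the same name under different owners would collide in this naming scheme.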

marginalia_nu3y ago

Git clone is a very expensive operation. Git hosts will generally try to prohibit mass git cloning for this reason.

moneywoes3y ago

How so? Can’t someone just download the zip file and make a queue of downloads or does GitHub rate limit?

toastal3y ago

Microsoft GitHub has access to all the commits you force-pushed away and every branch you deleted. With no transparency and closed source code, we have no reason to believe any of it is actually gone.

cassianoleal3y ago

> We need test cases and president

I imagine you meant "precedent".

eternalban3y ago

> The comparison to the Google Books antitrust falls down completely, Google had a dominant position because it had the resources to scan all books. Anyone can build a collection of almost all open source code.

Copying a file is not the same thing as "scanning" a book. To scan you first need to get your hands on the book (the download part) and then use industrial scanners to scan them. So the apples-to-apples comparison here is scanning <-> training & scanned collection of books <-> trained model, and finally the portals to the loot: Google Books <~> Github+VSC.

Not everyone has the resources to actually process -- that is train the 'model' -- using the publicly available 'data'. Most also don't own the Github and VSC platforms to field their model. In fact, is anyone other than Microsoft in a position to scrape OSS, train a coding AI, and then include that tool in dominant software development platforms?

alkonaut3y ago

That someone reads my code I expect. That someone reads my code and uses it to train a machine they make money off I didn't expect, but I also can't say I object.

However, that part of the argument feels like the less interesting CoPilot legal argument. The interesting one is: what's the license for use of the code it spits out? Any time CoPilot spits out a nontrivial piece of code that a) exists verbatim on Github and b) is nontrivial enough to be copyrightable, then what happens? Just because it was chewed through the machine doesn't magically wipe the original GPL/MIT/BSD license it had on GitHub. CoPilot doesn't represent a "clean room".

Large companies tend to be extremely skittish about devs using IP they don't have rights to. I lived under a rule of "no open source licensed thing, at all, anywhere" for years in the early 2000s. Later, the rules were relaxed, and obviously everyone uses MIT/BSD type stuff in commercial products these days, but management is still nervous about things like Stack Overflow answer code being copied verbatim (still verboten). So how can CoPilot, if I understand things correctly, be allowed or encouraged at such places now? Wouldn't exactly the same worry about nontrivial Stack Overflow snippets apply to CoPilot-produced code?

EamonnMR3y ago

And if indeed it's treated as clean room, does open source need to just pack it in? Are all of our licenses rendered unenforceable?

alkonaut3y ago

It feels like there is zero chance it could be used as some sort of blanket copyright cleaner. If it could, I'll make my own "model" (OK, a 2-line Python script) that produces royalty-free bestseller novels if you just prompt it with the title. (Its training is extremely simple: it just responds with the content of the book file with the same filename!) The fact that we don't quite understand the black box in an LLM, and that the novels are chopped into tokens, doesn't mean that if they are strung back together they aren't still the same paragraph.
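
A sketch of the tongue-in-cheek "model" described in this comment, in Python. The function name, the `.txt` naming scheme, and the made-up one-line "book" are assumptions for illustration; the point is only that returning stored text verbatim is not generation in any meaningful sense.

```python
import tempfile
from pathlib import Path


def generate_novel(title: str, library: Path) -> str:
    """'Generate' a novel by returning the stored file with the same name."""
    return (library / f"{title}.txt").read_text(encoding="utf-8")


# Tiny demonstration with a made-up one-line "book" in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp)
    (library / "Example Title.txt").write_text(
        "Chapter 1. It was a dark night.", encoding="utf-8"
    )
    output = generate_novel("Example Title", library)
```

The output is byte-for-byte the "training data", which is why routing text through a program does not by itself change what the text is.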

vinaypai3y ago

Contrary to what the author of this article seems to believe, antitrust law isn't a general purpose law for going after companies you don't like.

The author seems very confused and is mostly talking about copyright claim and then bizarrely starts talking about antitrust litigation.

shireboy3y ago

> “If GitHub can produce code by training an AI on all code it is hosting, Youtube could produce videos and music by training an AI on all content it is hosting, the Writer Guilds could produce books by training an AI on all books it owns the rights for, Shutterstock could produce more stock images by training an AI on all stock images it is hosting.”

There is a subtle difference here. Microsoft isn’t just producing code based on GitHub data. They are producing a tool that lets others generate code based on GitHub data. I do think consideration of the source data creators' intent is important, and there is a case that CoPilot hasn’t done that. But if Shutterstock wants to use any images _that they have been given license for and treat creators fairly for_ to build a tool that lets others generate images, they should be allowed.

Also, the OP argues only MS has access to train based on all of GitHub. Others might run into rate limiting etc. However, we know Amazon and others do have similar models. This would indicate MS may have a competitive edge but not a full market lockout.

shanebellone3y ago

I believe the closest precedent is:

Grand Upright Music, Ltd. v. Warner Bros. Records, Inc.

https://en.wikipedia.org/wiki/Grand_Upright_Music,_Ltd._v._W....

FeepingCreature3y ago

Wow, that's horrible. I didn't know that sampling and mashups required an explicit license.

Looks like it's in the EU as well.

edit: Hm, Pelham v Hütter C-476/17 might offer some grace for mashups under the quotation exemption at least. Though I wouldn't rely on that.

shanebellone3y ago

"Wow, that's horrible. I didn't know that sampling and mashups required an explicit license."

The conclusion essentially boils down to "remixing is not fair use". Today's hip-hop is a direct result of that decision because sampling became prohibitively expensive.

williamcotton3y ago

It’s not legally fair use but no musicians consider it stealing.

Remember, the session players who wrote those hooks and grooves were not given a copyright. They got paid a flat fee. They could not care less if their drum beat, bass line or horn part was reused in a creative new way. The lawyers of the copyright holders sure do care, though!

lexandstuff3y ago

The verdict was reached in December 1991. A lot of sampling has gone on since then.

cmrdporcupine3y ago

Historically, Microsoft -- itself an entity plagued with anti-trust sentiment in the past -- slagged the GPL in public for years, but was unable to do anything about its ascent and propagation.

Now they may have found a way. And that I think is the potential anti-trust issue here.

What is one of the main obstacles to Microsoft's monopoly dominance in the software sphere? The Linux kernel, it's everywhere. And it's under the GPL, a license explicitly resistant to "Embrace, extend, and extinguish" (old school Gates/Ballmer MS). Microsoft right now is not emphasizing an anti-Linux, anti-GPL focus, but it clearly has in the past and it (and others) could definitely do so again in the future.

Systems like CoPilot have the potential to be for the GPL (or other copyleft type licenses) what cryptocurrency 'mixers' or 'tumblers' are to money laundering laws. A potential to be an automated way to pull pieces of IP out of those licenses and into other codebases without respecting the obligations that go with it.

A lot of the dialog on here and other threads on this forum in the past shows me that understanding of copyleft licenses among the open source and developer community is really low right now. This is the license that the Linux kernel is licensed under, it is extremely important. There should be better recognition of the rights and responsibilities afforded by it.

The GPL was explicitly formulated as a way to protect portions of the hobbyist and free software community from potentially predatory commercial interests. Remember it's always possible to attempt to negotiate a commercial non-copyleft license with an entity that has released its source under the GPL. But if you don't, you have to respect its distribution requirements. It's fine to be personally opposed to using the GPL for your own work, but it is important to understand the obligations that come with it. And that includes systems that harvest data from it automatically.

williamcotton3y ago

> If you’re curious, the case continued in different directions and reached important decisions regarding books, I won’t get into them because they are not relevant to GitHub. (Google being allowed or not allowed to show a one page preview of a book, to a user who was looking for a quote from a book, is not directly applicable to the concerns surrounding GitHub and GitHub Copilot)

Spoiler alert: Google was copying books in a manner considered fair use, consistent with Sony v Universal. I’m not sure why this author thinks this is irrelevant. The Federal court system surely won’t!

htpltr3y ago

Considered fair use? Google settled with the Authors Guild.

Displaying book excerpts also:

- Leaves the attribution and copyright intact.

- Is not intended to use excerpts verbatim or slightly modified, unless quoting them with attribution.

- May increase the sales of the book.

I agree with the OP of the submission that this case is entirely irrelevant for the CoPilot situation.

williamcotton3y ago

As this very article we are discussing notes, Google was not allowed to settle precisely because of this monopoly position. It went to trial:

In late 2013, after the class action status was challenged, the District Court granted summary judgement in favor of Google, dismissing the lawsuit and affirming the Google Books project met all legal requirements for fair use. The Second Circuit Court of Appeal upheld the District Court's summary judgement in October 2015, ruling Google's "project provides a public service without violating intellectual property law." The U.S. Supreme Court subsequently denied a petition to hear the case.

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

mtlynch3y ago

>Open source code on GitHub might be thought of as “open and freely accessible” but it is not. It’s possible for any person to access and download one single repo from GitHub. It’s not possible for a person to download all repos from Github or a percentage of all repos, they will hit limitations and restrictions when trying to download too many repos. (Unless there’s some special archives or mechanisms I am not aware of).

There actually is a convenient archive for accessing GitHub-hosted code in bulk. All GitHub source code is available for bulk analysis in Google BigQuery.

https://cloud.google.com/blog/topics/public-datasets/github-...

I still don't support GitHub training Copilot on other people's code without permission, but this particular part of OP's argument is incorrect.
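
A rough sketch of what querying that archive could look like, written as a Python helper that only builds the SQL text. The table and column names (`files`, `contents`, `licenses` under `bigquery-public-data.github_repos`) and the license identifiers are assumptions from memory of the public dataset and should be checked against its schema before use; no BigQuery client call is made here.

```python
# Assumed schema of the public `bigquery-public-data.github_repos` dataset:
#   files(repo_name, path, id), contents(id, content), licenses(repo_name, license)
KNOWN_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}  # assumed identifier style


def build_query(license_id: str, path_suffix: str, limit: int = 100) -> str:
    """Return SQL selecting file contents for one license and one file suffix."""
    if license_id not in KNOWN_LICENSES:
        raise ValueError(f"unexpected license id: {license_id!r}")
    # Restrict the suffix to simple extensions such as ".py" so it is safe to inline.
    if not (path_suffix.startswith(".") and path_suffix[1:].isalnum()):
        raise ValueError(f"unexpected path suffix: {path_suffix!r}")
    return (
        "SELECT f.repo_name, f.path, c.content\n"
        "FROM `bigquery-public-data.github_repos.files` AS f\n"
        "JOIN `bigquery-public-data.github_repos.contents` AS c ON f.id = c.id\n"
        "JOIN `bigquery-public-data.github_repos.licenses` AS l\n"
        "  ON f.repo_name = l.repo_name\n"
        f"WHERE l.license = '{license_id}'\n"
        f"  AND ENDS_WITH(f.path, '{path_suffix}')\n"
        f"LIMIT {int(limit)}"
    )


print(build_query("mit", ".py", limit=10))
```

Filtering on the license column is one way a downstream user could restrict a training set to licenses they are comfortable with, which is the part of the debate the bulk archive does not settle by itself.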

h2odragon3y ago

When I published stuff to GitHub, it had open licenses: I wanted anyone and everyone to make whatever use of it they could. I didn't foresee this use, and I'm not fond of Microsoft (to say the least); but it certainly falls into the area of things I explicitly allowed when publishing.

I suspect many others who publish there feel the same way.

NoZebra120vClip3y ago

Perhaps your license permits CoPilot reuse, but that is not every F/OSS license. There are some which require attribution of the original authors. There are some which require the distributor to make available any source code, and any modifications made to the software.

Software authors are not upset about the mere reuse of their code, it's the violations of such license terms that are problematic. If attribution is required, but neglected or impossible, that's typically known as "plagiarism", you know.

williamcotton3y ago

First, if the copying is found to be fair use (which is very likely), then attribution or other requirements of a copyright license will not be required.

Second, the only aspects of code that need to follow the license are the parts of the code that are covered by copyright. That excludes anything that is functional. Since optimizations are functional and not expressive in nature, an optimized sorting algorithm, for example, would not be covered. What would be covered is how that algorithm is organized… the API, file structure, class names, i.e., the arbitrary parts of code that everyone argues about.

belorn3y ago

If copying by AI is generally found to be fair use then we will see this in music, porn, advertisement, in politically associated situations, and in other situations where authors have a history of disagreeing with how their work gets used. Unstable diffusion is an ongoing test of how far fair use may be applied.

I find it very likely that copyright law will be changed if training on copyrighted material becomes universally allowed under fair use. The alternative is that training on software code is allowed, but training on images/videos/music is not, which I do not find likely.

> Second, the only aspects of code that need to follow the license are the parts of the code that are covered by copyright

The legal system doesn't generally work that way. The question judges tend to look at is whether the accused party can reasonably be said to have copied someone else's work without permission. We can look at either the Napster or The Pirate Bay court cases and see how little weight judges tend to give arguments that rely exclusively on a technical detail (a torrent file is not the same as a movie!).

belorn3y ago

Every creator and author has their own idea of what they want and do not want. A license is rarely ever comprehensive enough to cover all of it, and as time goes on those ideas can also change with the author.

toastal3y ago

Best thing you can do now is start porting projects over elsewhere leaving history behind on that platform pointing users to the new home/mirrors. You could also consider a different license. I hope the FSF comes up with an exception like they did in the A of AGPL about how LLMs can use the data (or require the data to be open, etc.).

marginalia_nu3y ago

The problem is when someone uploads something to GitHub that they have a license to share (e.g. via the GPL) but are not the copyright owner of.

jackdaniel3y ago

Well, I certainly had some expectations that are covered in the license, i.e. that derivative work is subject to some constraints and that copyrights are not removed from the code.

scarface743y ago

> i wanted anyone and everyone to make whatever use of it they could.

> didn't foresee this use

So you really didn’t want any use. You just wanted the use you found acceptable? So you didn’t really want it to be “open”

h2odragon3y ago

I didn't foresee it, I do not object to it, and probably would not have had I known beforehand.

Code I don't want others to use I don't publish.

scarface743y ago

I only work on open source code that I am either getting paid for, or that I got paid for in the past, genericized, and put through my employer’s very straightforward open source process.

By default the license we use is MIT. If I ever did for some reason choose to open source my own work, it would be a similar license.

I don’t like the idea of claiming something is “open” and then placing restrictions on it.

elzbardico3y ago

Because open source developers, whether individuals or companies, can't possibly entertain the idea of the legal expenses involved in fighting a behemoth like Microsoft.

gumballindie3y ago

No, but we can take the code offline and call it a day. Once we are “freed up” we won't be able to support the software that feeds AI models.

pelasaco3y ago

From the same author, this article looks more interesting https://thehftguy.com/2021/08/30/french-appeal-court-affirms...

dagaci3y ago

Microsoft GitHub Co-pilot could be viewed entirely as just a more sophisticated indexer and search interface, an indexer of freely available source code on the public internet.

The concerns over copyrighted material ingested and exposed through AI systems are the same as for copyrighted material ingested and displayed by our web 2.0 search engines.

So, Microsoft GitHub Co-pilot also indexes publicly accessible content but emits that content differently; however, it does not exercise exclusive rights over what it indexes or control access to that content.

The Google Books and Authors Guild axis would have given exclusive monopoly access, distribution, and pricing of the largest collection of digital books in the world – so I don’t believe the comparison between the Google Books project and CoPilot is valid, because we have already accepted the concept of indexing and clipping content on the public internet.

supriyo-biswas3y ago

There kinda is, just not an antitrust claim.

[1] https://githubcopilotlitigation.com/case-updates.html

pelasaco3y ago

Last update Nov. 2022... is the system too slow, or didn't they get the news that they were expecting?

supriyo-biswas3y ago

> MARCH 10, 2023
> Plaintiffs filed oppositions to these motions to dismiss.

Legal processes are generally slow.

dncornholio3y ago

If you don't want your code to be public, licence it... You can put a licence.txt in your code, but people will ignore it. If you really don't want your code to be in public, don't publish it at all.

I personally think Copilot is training on all the code. It's not verifiable so I go with the worst case scenario. But it shouldn't be a problem if you don't publish code that's licensed.

Lines of code shouldn't even be copyrightable. But that's a whole other discussion.

ncphil3y ago

Because they'll (Microsoft and Github) ultimately be crushed by copyright infringement claims (even the most liberal oss licenses require attribution, which no one seems to be accommodating)? Why bother mounting an expensive antitrust campaign when The Mother of All DMCA takedowns is on its way?

ok1234563y ago

I'd rather regulators go after drug companies, medical insurance and hospital systems with antitrust lawsuits for their cartel than an emerging field that provides "nice to have" toys at this point.

lvl1023y ago

The agencies know fighting MSFT is a costly undertaking. They learned that in the 90s/00s. They will only do it if there’s enough public support. The threshold is incredibly high for MSFT.

jdavis7033y ago

Has anyone just requested access to GitHub’s training data? In other words, maybe GitHub will send a drive with all the data.

bionhoward3y ago

One great solution for fairness and progress would be for GitHub to host bulk downloads or a vector db as a service

kohlerm3y ago

Someone is waiting for them to become dominant; then they will sue and try to "cash in".

prepend3y ago

Obligatory, “Was this article written by chatgpt?”

It’s not antitrust because GitHub isn’t a monopoly. And copilot only scanned public repos, so anyone could train, if they like.

Also this isn’t like the Google Books case because Google made the books available, violating copyright. GitHub has not made the code available. So these cases aren’t similar and aren’t antitrust.

Although comically, by using GitHub I grant them copyright to publish my public repo so I suppose they could republish my repo in other ways without any additional permission. It would be interesting if their license allows them to rebundle and publish my repos in a book or something.

htpltr3y ago

Antitrust is one thing, but by cleanroom implementation standards (one team reads the source and writes a spec, another team writes the code) CoPilot is illegal to begin with.

CoPilot reads and rearranges the IP that was created by millions of people who were working very hard and did not anticipate a code laundering machine when they wrote the code and the licenses.

unreal373y ago

That's quite an extreme set of statements, and I very much doubt what you consider "illegal" is actually illegal.

When you publish something for others to view (text, images, code, whatever), others are allowed to view it. You can't anticipate how others view it, with their eyes or with screenreaders to assist. You can't stop them from reading it, thinking about it, discussing it with their friends, taking notes, summarizing it. You can't stop people from learning from your published content or recognizing patterns between it and other similar things.

Sorry, but you can't create a license that says "I will allow you to view this but you cannot learn from it. If you learn from it, you need to pay me."

belorn3y ago

Learning is very different from copying. I can take a movie and convert it to different formats and resolutions. I can use AI algorithms to remove rough edges, and even add color to images which were taken in black and white. None of that would be covered by using the word learning, even if the program takes the movie as input, learns from it, and outputs a work which is completely different from the original.

The words that seem to fit best are transforming and adapting. In order to adapt something, one has to first learn from the original in order to produce the derivative work. This is however covered by copyright, since the transforming and adapting is still considered a form of copying even if all people did was learn and produce something unique but similar to the original.

The license can say that "I will allow you to view this but you cannot create a derivative work from it".

mrtranscendence3y ago

This isn’t about a person learning, however. This is about developing an algorithm through the inclusion of GPL licensed code, that might — and has — verbatim emitted that code. Those seem materially different to me.

williamcotton3y ago

You can, without attribution, verbatim copy the parts of GPL code that are not covered by copyright, such as anything purely functional, like an optimized sorting algorithm.

The art in GPL code is in the arbitrary decisions made about how to structure that code… the class structure and not the algorithms.

You cannot copyright an algorithm and for very good reason. Think if Microsoft had the assumed powers granted by the GPL!

kmeisthax3y ago

Clean room is not the actual requirement for avoiding copyright infringement in reverse engineering. There have been several notable cases in which clean room practices were either not followed or outright disregarded, but the resulting product was considered to be non-infringing anyway[0].

Furthermore, while lots of hard work was put into the code that CoPilot used, that hard work was specifically donated with the intent that the code be reused. The only hard requirement being that the code remain free. The thing people are angry about with CoPilot is that it's a hosted OpenAI product with no freely-available model weights, and that generated code might be regurgitated from training data in some cases[1]. If CoPilot was actually open AI, nobody would be suing over it.

[0] In Sony v. Connectix, it was found that Connectix actually tried clean-room, black-box analysis of the PlayStation ROM, but abandoned it in favor of disassembling the whole thing. Connectix was still ruled non-infringing.

[1] Most egregiously, the comment "evil floating point bit level hacking" will make it spit out Quake III source. Microsoft worked around this by explicitly banning that particular phrase, which is just stupid.
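
For context on footnote [1]: that comment is associated with the well-known fast inverse square root routine. Below is a rough Python rendering of the technique, not the Quake III source itself. It assumes 32-bit IEEE-754 floats and the widely cited constant `0x5F3759DF`; the function name is made up for this sketch.

```python
import struct


def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via a float/int bit reinterpretation plus one Newton step."""
    if x <= 0:
        raise ValueError("x must be positive")
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Shift and subtract from the magic constant to get a first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits back as a float.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


print(fast_inverse_sqrt(4.0))  # close to 0.5, not exact
```

The technique itself is a handful of arithmetic steps, which is part of why the thread argues over whether regurgitating the original text of such a routine, comments and all, is a copyright question rather than a question about the algorithm.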

williamcotton3y ago

Clean room implementations are there to make sure that none of the arbitrary, artistically expressive parts of the code are inadvertently copied.

Class structure, file structure, APIs…

amoss3y ago

Clean implementation is an approach to guarantee a lack of pollution. It is not the minimum level necessary to avoid it.

19h3y ago

What a ridiculous article. Copilot does not violate antitrust law. GitHub is not a monopoly just because open source devs choose to host there. Devs are free to use GitLab or whatever.

Comparing this to Google Books is silly. Google stole copyrighted books. Copilot uses freely shared open source code. No copyright issue.

The article claims "Open source code on GitHub might be thought of as 'open and freely accessible' but it is not." Lol what? The MIT and Apache licenses explicitly allow reuse. Copilot can absolutely use open source data.

This is typical hype and FUD. No evidence Copilot even used all of GitHub's data or violated any licenses. Baseless speculation.

There's no real antitrust argument here. Nothing to see, move along. yawn

tpxl3y ago

> The MIT and Apache licenses explicitly allow reuse

> No evidence Copilot [...] violated any licenses

Both of these allow redistribution _if you include the license_. Copilot doesn't include any licenses in the code it distributes. You can argue whether that's fair use or not, but you can't argue that it doesn't respect the license.

19h3y ago

You can configure Copilot to not return code that appears verbatim in public repositories. In that case it at least won't produce code you could legitimately argue would be covered by any individual's specific license.

Liquid_Fire3y ago

But it might well give you the exact same code with a variable name changed (for example), which would be unlikely to hold up in court if a human had done it to bypass the license.

justinclift3y ago

With the AGPL3 license requiring derivatives operating over a network to disclose their source code (my rough summarisation), does that mean GitHub Copilot should be publishing its source code publicly somewhere?

reissbaker3y ago

Not sure if you posted before the edits, but I'm pretty convinced by them, seeing as how there are multiple alternatives with the same data.

scarface743y ago

it’s really not that hard to

git clone

git set origin…

It’s much harder to copy Google’s index.

ChatGTP3y ago

You think it's practical to do this with almost all the public repos on Github?

jackdaniel3y ago

This is addressed in the same paragraph - you can't scan/download "whole" github because you'll be throttled.

neximo643y ago

Are you actually throttled if you try to git clone or is that what the theory is, or is the assumption that it uses API calls to scrape through github?

Has anyone actually tried? I've cloned lots of repos and have never been throttled. I'd go so far as to say the author of that post has never even tried it.

jackdaniel3y ago

logifail3y ago

> Has anyone actually tried, because i've cloned lots of repos and have never been throttled

(Full disclosure: I have some pretty serious data hoarding issues)

When someone says "I've cloned lots of repos and have never been throttled" I'm afraid I immediately start wondering whether "lots" means multiple GB or multiple TB ... or more!

quickthrower23y ago

21Tb of data, they might rate limit you! But might be possible via proxies. But only public repos.
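A back-of-the-envelope sketch of what that volume means in raw transfer time, taking the 21 figure as terabytes (an assumption; the comment says "21Tb") and using hypothetical sustained link speeds with no throttling at all:

```python
def transfer_days(size_tb: float, link_gbit_s: float) -> float:
    """Days needed to move size_tb terabytes over a link_gbit_s Gbit/s link."""
    bits = size_tb * 1e12 * 8              # decimal terabytes -> bits
    seconds = bits / (link_gbit_s * 1e9)   # Gbit/s -> bits per second
    return seconds / 86_400                # seconds -> days


# Hypothetical link speeds; 21 TB is the figure quoted in the comment above.
for gbit in (0.1, 1.0, 10.0):
    print(f"{gbit:>5} Gbit/s: {transfer_days(21, gbit):.1f} days")
```

At a sustained 1 Gbit/s that works out to a bit under two days, so bandwidth alone is less of an obstacle than any rate limiting would be.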

williamcotton3y ago

There’s no need to crawl for your own dataset:

https://pile.eleuther.ai/

hanselot3y ago

For anyone else trying to catch up, just rent a datacenter, write a crawler, deal with all the intricacies of keeping it in sync in real-time. This sounds trivial, simple even.

I wonder why nobody is doing it? Perhaps everyone doesn't have access to petabytes of storage space, unlimited bandwidth, unlimited proxy-jumps etc.

So the alternative is to buy github?

goodpoint3y ago

> No, it's not in a "unique and dominant position". Open source code is freely available online, it's almost trivial to build a bot to scrape OS code from anywhere on the web (GitHub included).

Absolutely wrong. GitHub is doing way more than just hosting code. It hosts bugtrackers, CI and much more. For most FOSS projects it's the ONLY place where you can go and submit a bug report.

It's not just a repository, it's a communication tool, and it refuses to interoperate with other platforms.

This is monopoly, just like NPM and Linkedin. Microsoft never changes.

bilqis3y ago

Github also has access to private repositories.

samwillis3y ago

They don't use private repositories to train Copilot.

zelphirkalt3y ago

Maybe not yet. All just a change of their terms away. Oh you don't like it? We will give you 2 weeks to migrate. Perhaps you want this other more expensive subscription?

Just like with other code they should not be using as they do, they would probably run another "ask questions later" approach.

bilqis3y ago

They say they don’t

marginalia_nu3y ago

> It's almost trivial to build a bot to scrape OS code from anywhere on the web.

Seems like a logistical nightmare to me. Git repos interact spectacularly poorly with web scraping in general.

mewpmewp23y ago

marginalia_nu3y ago

There are valid non-evil reasons for git hosts to want to throttle and put up obstacles toward scraping as well, both via crawlers or 'git clone' or whatever. These are very expensive operations.

flockonus3y ago

It appears to be the exact opposite to me: `git clone --depth 1 ...` will give you code that you know exactly how to parse, vs. webpages that have all sorts of semantic issues.
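For what it's worth, a minimal Python sketch of that approach might look like the following. The helper names (`shallow_clone_cmd`, `clone_all`), the one-second delay and the destination directory are made up for illustration; it only assumes `git` is on the PATH, and it says nothing about what rate limits a given host actually applies.

```python
import subprocess
import time
from pathlib import Path


def shallow_clone_cmd(repo_url: str, dest_root: str) -> list[str]:
    """Build the argv for a depth-1 clone of repo_url under dest_root."""
    # Name the target directory after the last URL segment, minus ".git".
    name = repo_url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")
    return ["git", "clone", "--depth", "1", repo_url, str(Path(dest_root) / name)]


def clone_all(repo_urls, dest_root, delay_s=1.0, run=subprocess.run, sleep=time.sleep):
    """Clone each URL in turn, pausing delay_s between clones.

    Returns the URLs whose clone command exited with a non-zero status.
    `run` and `sleep` are injectable so the loop can be exercised offline.
    """
    failed = []
    for i, url in enumerate(repo_urls):
        if i:
            sleep(delay_s)  # hypothetical politeness delay, not a known limit
        result = run(shallow_clone_cmd(url, dest_root))
        if result.returncode != 0:
            failed.append(url)
    return failed
```

Calling `clone_all(["https://github.com/git/git.git"], "mirror")` would shell out to `git clone --depth 1` once per URL and hand back whatever failed, which is where any throttling would show up.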

marginalia_nu3y ago

Git clone is a very expensive operation. Git hosts generally will try to prohibit mass cloning for this reason.

moneywoes3y ago

How so? Can’t someone just download the zip file and make a queue of downloads or does GitHub rate limit?

toastal3y ago

cassianoleal3y ago

> We need test cases and president

I imagine you meant "precedent".

eternalban3y ago

alkonaut3y ago

That someone reads my code I expect. That someone reads my code and uses it to train a machine they make money off I didn't expect, but I also can't say I object.

EamonnMR3y ago

And if indeed it's treated as clean room, does open source need to just pack it in? Are all of our licenses rendered unenforceable?

alkonaut3y ago

vinaypai3y ago

Contrary to what the author of this article seems to believe, antitrust law isn't a general purpose law for going after companies you don't like.

The author seems very confused and is mostly talking about copyright claims and then bizarrely starts talking about antitrust litigation.

shireboy3y ago

shanebellone3y ago

I believe the closest precedent is:

Grand Upright Music, Ltd. v. Warner Bros. Records, Inc.

https://en.wikipedia.org/wiki/Grand_Upright_Music,_Ltd._v._W....

FeepingCreature3y ago

Wow, that's horrible. I didn't know that sampling and mashups required an explicit license.

Looks like it's in the EU as well.

edit: Hm, Pelham v Hütter C-476/17 might offer some grace for mashups under the quotation exemption at least. Though I wouldn't rely on that.

shanebellone3y ago

"Wow, that's horrible. I didn't know that sampling and mashups required an explicit license."

The conclusion essentially boils down to "remixing is not fair use". Today's hip-hop is a direct result of that decision because sampling became prohibitively expensive.

williamcotton3y ago

It’s not legally fair use but no musicians consider it stealing.

lexandstuff3y ago

The verdict was reached in December 1991. A lot of sampling has gone on since then.

cmrdporcupine3y ago

Historically, Microsoft -- itself an entity plagued with anti-trust sentiment in the past -- slagged the GPL in public for years, but was unable to do anything about its ascent and propagation.

Now they may have found a way. And that I think is the potential anti-trust issue here.

williamcotton3y ago

htpltr3y ago

Considered fair use? Google settled with the Authors Guild.

Displaying book excerpts also:

- Leaves the attribution and copyright intact.

- Is not intended to use excerpts verbatim or slightly modified, unless quoting them with attribution.

- May increase the sales of the book.

I agree with the OP of the submission that this case is entirely irrelevant for the CoPilot situation.

williamcotton3y ago

As this very article we are discussing notes, Google was not allowed to settle precisely because of this monopoly position. It went to trial:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

mtlynch3y ago

There actually is a convenient archive for accessing GitHub-hosted code in bulk. All GitHub source code is available for bulk analysis in Google BigQuery.

https://cloud.google.com/blog/topics/public-datasets/github-...

I still don't support GitHub training Copilot on other people's code without permission, but this particular part of OP's argument is incorrect.

h2odragon3y ago

I suspect many others who publish there feel the same way.

NoZebra120vClip3y ago

williamcotton3y ago

First, if the copying is found to be fair use (which is very likely), then attribution or other requirements of a copyright license will not be required.

belorn3y ago

> Second, the only aspects of code that needs to follow the license are the parts of the code that are covered by copyright

belorn3y ago

toastal3y ago

marginalia_nu3y ago

Problem is when someone uploads something to github they have a license to share (eg via GPL), but are not the copyright owner of.

jackdaniel3y ago

Well, I certainly had some expectations that are covered in the license, i.e. that derivative work is subject to some constraints and that copyright notices are not removed from the code.

scarface743y ago

> i wanted anyone and everyone to make whatever use of it they could.

> didn't foresee this use

So you really didn’t want any use. You just wanted the use you found acceptable? So you didn’t really want it to be “open”

h2odragon3y ago

I didn't foresee it, i do not object to it, and probably would not have had i known beforehand.

Code i don't want others to use I don't publish.

scarface743y ago

I only work on open source code that I either am getting paid for or that I have gotten paid for in the past, genericized, and taken through my employer’s very straightforward open source process.

By default the license we use is MIT. If I ever did for some reason choose to open source my own work, it would be a similar license.

I don’t like the idea of claiming something is “open” and then placing restrictions on it.

elzbardico3y ago

Because open source developers, whether individuals or companies, can't even entertain the idea of the legal expenses involved in fighting a behemoth like Microsoft.

gumballindie3y ago

No, but we can take the code offline and call it a day. Once we are “freed up” we won't be able to support the software that feeds AI models.

pelasaco3y ago

From the same author, this article looks more interesting https://thehftguy.com/2021/08/30/french-appeal-court-affirms...

dagaci3y ago

Microsoft GitHub Co-pilot could be viewed entirely as just a more sophisticated indexer and search interface, an indexer of freely available source code on the public internet.

The concerns over copyrighted material ingested and exposed through AI systems are the same as for copyrighted material ingested and displayed by our web 2.0 search engines.

supriyo-biswas3y ago

There kinda is, just not an antitrust claim.

[1] https://githubcopilotlitigation.com/case-updates.html

pelasaco3y ago

last update nov. 2022.. is the system too slow or didn't they get the news that they were expecting?

supriyo-biswas3y ago

> MARCH 10, 2023

> Plaintiffs filed oppositions to these motions to dismiss.

Legal processes are generally slow.

dncornholio3y ago

If you don't want your code to be public, licence it... You can put a licence.txt in your code, but people will ignore it. If you really don't want your code to be in public, don't publish it at all.

I personally think Copilot is training on all the code. It's not verifiable so I go with the worst case scenario. But it shouldn't be a problem if you don't publish code that's licensed.

Lines of code shouldn't even be copyrightable. But that's a whole other discussion.

ncphil3y ago

ok1234563y ago

I'd rather regulators go after drug companies, medical insurance and hospital systems with antitrust lawsuits for their cartel than an emerging field that provides "nice to have" toys at this point.

lvl1023y ago

The agencies know fighting MSFT is a costly undertaking. They learned that in the '90s and '00s. They will only do it if there’s enough public support. The threshold is incredibly high for MSFT.

jdavis7033y ago

Has anyone just requested access to GitHub’s training data? In other words, maybe GitHub will send a drive with all the data.

bionhoward3y ago

One great solution for fairness and progress would be for GitHub to host bulk downloads or a vector db as a service

kohlerm3y ago

Someone is waiting for them to become dominant, then they will sue and try to "cash in"

prepend3y ago

Obligatory, “Was this article written by chatgpt?”

It’s not antitrust because GitHub isn’t a monopoly. And copilot only scanned public repos, so anyone could train, if they like.

htpltr3y ago

Antitrust is one thing, but by cleanroom implementation standards (one team reads the source and writes a spec, another team writes the code) CoPilot is illegal to begin with.

CoPilot reads and rearranges the IP that was created by millions of people who were working very hard and did not anticipate a code laundering machine when they wrote the code and the licenses.

unreal373y ago

That's quite an extreme set of statements, and I very much doubt what you consider "illegal" is actually illegal.

Sorry, but you can't create a license that says "I will allow you to view this but you cannot learn from it. If you learn from it, you need to pay me."

belorn3y ago

The license can say that "I will allow you to view this but you cannot create a derivative work from it".

mrtranscendence3y ago

williamcotton3y ago

You can, without attribution, verbatim copy the parts of GPL code that are not covered by copyright, such as anything purely functional, like an optimized sorting algorithm.

The art in GPL code is in the arbitrary decisions made about how to structure that code… the class structure and not the algorithms.

You cannot copyright an algorithm and for very good reason. Think if Microsoft had the assumed powers granted by the GPL!
