private static void rangeCheck(int arrayLen, int fromIndex, int toIndex {
if (fromIndex > toIndex)
throw new IllegalArgumentException("fromIndex(" + fromIndex +
") > toIndex(" + toIndex+")");
if (fromIndex < 0)
throw new ArrayIndexOutOfBoundsException(fromIndex);
if (toIndex > arrayLen)
throw new ArrayIndexOutOfBoundsException(toIndex);
}
On a more serious note, I really wonder where the line is drawn for copyright. I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative. Perhaps there is a matter of taste in which exceptions you throw, or in what order. But the chosen ones are certainly what most people think of first, and the order of validating arguments, then checking the low value, then checking the high value, is pretty much what anyone would do. Perhaps you could format the error message differently. That's about it. So when someone "rips off" your code wholesale, it could just be that everyone writing that function would have typed in the exact same bytes as you. You know your style guide is working when you look at code, think you wrote it, but actually you didn't!

That said, we used copyright traps at Malwarebytes, which is how we found out that IObit was stealing our database.
I've set a "trap" myself years ago in code in a novel solution at the time for uploading photos from iOS non-interactively after the fact. It was to support disconnected field workers taking photos from iPhones/iPads, with the payloads uploaded at a later date.
Chunked form data constructed in userland JS was the solution. Chunk separator was 17 dashes in a row (completely arbitrary), company name in 1337 speak, plus 17 more dashes.
Found a competitor that had copied the code, changing only the 1337 speak part. 17 dashes remained on each side. Helped me realize that they had unminified and indeed ripped off our R&D work.
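The trap described above can be sketched out. This is a minimal illustration in Python (the original was userland JS), and the 1337-speak company name here is a made-up placeholder, not the real one:

```python
# Hand-rolled multipart/form-data body with a distinctive custom boundary:
# 17 dashes, a 1337-speak company name (invented placeholder here), and 17
# more dashes -- exactly the kind of arbitrary string that works as a trap.
BOUNDARY = "-" * 17 + "3x4mpl3c0rp" + "-" * 17

def build_multipart(fields):
    """Assemble a multipart/form-data payload from a dict of name -> str."""
    parts = []
    for name, value in fields.items():
        parts.append('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n'
                     % (BOUNDARY, name))
        parts.append(value)
        parts.append("\r\n")
    parts.append("--%s--\r\n" % BOUNDARY)  # closing boundary
    return "".join(parts)
```

If that exact boundary string later shows up in someone else's unminified bundle, it didn't get there by independent invention.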
Wonder if Copilot could be gamed the same way.
Edit: As sroussey points out, s/isn't copyrightable/isn't copyrightable in the USA/
In the US legal system, the merger doctrine is a concept whereby a given expression cannot be granted protection if it's not sufficiently creative, and there are only so many ways to express something when stripped down to its fundamentals. In response to this, RMS and Moglen encouraged contributors from very early on to try to express the inner workings of GNU utilities in creative and non-obvious ways, out of caution against the possibility that the copyleft obligations of the GPL wrt a given package could be nullified by a finding in court that it did not pass the threshold for creativity.
It wasn't the right solution to the problem in question, for what it's worth.
Just manually did what GPT does now.
Copyright protects original works of authorship including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture.
Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.
Here (and in the future even more), AI is totally capable of expressing one idea in any programming language if you ask for it (even if conceptually inspired by copyrighted code).
Which means that a particular expression (a specific implementation) is practically of no value or particular interest at this stage.
You could ask the AI to do a slightly different implementation, it would not be a problem for it and would require no efforts.
There is no point to protect something that can be generated using no effort and has no particular genius in it.
The problem, however, is that we live in this world, where it is copyrightable, and companies relying on Copilot to do large swathes of code generation do potentially have to worry about including copyrighted code in their codebase, and what the legal fallout from that might be.
This is completely unacceptable and another example that Microsoft is an evil and amoral company who only cares about open source for financial gain.
(And since Brian Kernighan was teaching it, I'm inclined to believe in it.)
1) They are using your IP, with coerced consent, to check other people's work as well as your own in the future. (Let's have a fun discussion about "self-plagiarism.")
2) ChatGPT and the like are going to so massively increase the noise floor on this problem space that these counterfeit detection companies should all but disappear in a number of years.
You can read the data to train a thing. So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.
When that thing later generates an output, the output isn't copyrightable because it's machine generated (this is the current US position) and it isn't a copyright violation because it was generated, not copied.
You can launder copyrighted material through an LLM, basically.
1. Copyright is only granted to creative elements; lots of program code is supposedly un-copyrightable, though no one wants to fight on that ground.
2. It is lawful in many jurisdictions to effectively take even copyrighted materials and train AI on them, for the sake of humanity at large; the same supposedly does not apply to the output. But AI supporters tend to conflate the two.
3. AI training processes, stochastic gradient descent and all, are only called "learning" and/or "training" by convention; there is no consensus that it is the same thing the word usually denotes, though we generally don't scare-quote airplanes "flying."
Also, in part it depends greatly on the objective function used. In GPT-style models the objective is to precisely copy from input to output, token by token. I think it's extremely bad faith to argue that this has any relationship to human learning or learning objectives.
you shouldn't take the math seriously and I'm not being dismissive with the word "just" in scare quotes. However the community somehow wants to have its cake and eat it too.
> For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.
We are at a point at which compilers detect such functions and replace them with highly optimized ones. If you have to artificially change code just for the sake of patent or license trolls, you don't just get more work but also worse performance/optimizations in most cases.
Syntax Error on line 1. Missing closing ) in the method definition.
As soon as you start thinking about copyright, you end up realizing it's all non-sense. Stephan Kinsella (a patent lawyer!) is the leading thinker on this, and his videos, essays, and podcasts are worth listening to: https://www.youtube.com/watch?v=e0RXfGGMGPE
This point is absolutely going to come up in any lawsuits; because the law does sometimes examine how much creativity there is available in a field before making a determination (Oracle v Google comes to mind). If you can show that there are very, very few reasonable ways to accomplish a goal, and said goal is otherwise not patented or prohibited, it's either not copyrightable or Fair Use, take your pick.
This even applies under the interoperability section of the DMCA and similar laws for huge projects. Assuming that ReactOS, for example, is actually completely clean-room; that would be protected despite having the same API names and, likely, a lot of similar code implementing most of the most basic APIs.
If Codeium doesn't produce these when producing "verbatim enough" snippets, how is this actually better, besides avoiding a GPL boogeyman?
I get that there have been fewer (if any? I'm not aware of any) MIT/Apache2.0/MPL2.0 license violations that have gone to court than GPL violations, but this still feels like an "address the symptoms" and not "address the cause" difference.
it's not
if they've trained on MIT/Apache 2.0/... then they're just as liable as people that have trained on GPL
they would be limited to training on licenses that don't require attribution (BSD2, public domain, etc)
which I suspect limits the size of the training set so much that the output would be useless
Codeium here is unintentionally making an argument that undermines legal confidence in their own product
interesting choice!
Of course, if someone figures out an algorithm that does that, people could use the same algorithm to identify missing attributions and plagiarism in other projects and throw lawsuits around. (Sigh)
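An algorithm along those lines already exists in the plagiarism-detection world: token shingling, where you hash every n-token window of the source and compare fingerprint sets. A crude sketch (parameters are arbitrary; real systems like MOSS layer winnowing on top of this):

```python
import hashlib
import re

def fingerprints(code, n=8):
    """Hash every n-token window of a source string (token shingling)."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    windows = range(max(len(tokens) - n + 1, 1))
    return {hashlib.sha1(" ".join(tokens[i:i + n]).encode()).hexdigest()
            for i in windows}

def overlap(candidate, corpus_entry, n=8):
    """Fraction of the candidate's token windows that appear verbatim
    in another source file; near 1.0 means a lift, near 0.0 means not."""
    fc = fingerprints(candidate, n)
    return len(fc & fingerprints(corpus_entry, n)) / len(fc)
```

Running every LLM output through a fingerprint index of known licensed code, and every open source release through the same index, are the same computation pointed in opposite directions.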
Of course, the entire basis for LLMs being legal is that they use work collectively to know how code/language works and how to write it in relation to the given context. In this case, the legal defense is that the tool is like a human that learned how to code by looking at CC-BY-SA and other licensed publicly-available code and assimilating it into their own fleshy human neural network.
This only becomes shaky once you add in regurgitating code verbatim, but humans do this too, so the solution there is the copilot setting that tries to detect and revert any verbatim generated code snippets.
Why should it not be legal? Doesn't that make copyright equally powerful with patents? Copyright should restrict only replication of expression not replication of ideas.
I also believe this is where a lot of the hype about "rogue AIs" and singularity type bullshit comes from. The makers of these models and products will talk about those non-problems to cover for the fact that they're vacuuming up the work of individuals then monetizing it for the profit of big industry players.
Not sure if I'd say there's a conspiracy per se, but I do think generative AI players are going to be careful about the optics of the technology and how it works. Anecdotally from speaking to non-technical family members there's very little understanding for how the technology actually works, and it seems there's not a great deal of effort to emphasize the importance of training data, or the intellectual property considerations in these companies marketing materials.
Negative marketing is good marketing. Look at all of us debating this theft at scale, promoting the brand of this non-product.
1. What about Elon Musk and hundreds of other AI investors? It's in their interest to overhype AI, while temporarily slowing down competition by spreading singularity fears.
2. OpenAI released the GPT-4 report where they claim better performance of their model than it has in reality [1].
That's also why they claim these are "black boxes" and that they "don't understand how they work". They are prepping the markets for the grand theft that's unfolding.
https://stackoverflow.com/help/licensing
I don't think I've heard anyone warn people not to copy code snippets from stackoverflow due to licensing issues, although "real" businesses should be rightfully concerned.
Manager: "we asked, legal says you can't use copilot", dev: "okay, so from now on, I'll not discuss how I use copilot and will remember to disable it when someone sees me working, gotcha".
I'm not saying everyone will do this, I'm saying some people will know that the corp doesn't always have a way to verify how the code was written, and they will think that a lawsuit cannot really happen to them.
If all software started being non-permissive and closed source, there would be no training data and no new innovation, and even if there was, it would probably suck like it did before the GPL and similar licensing were mainstream.
Why is that a non-problem? It's a really important concern that we need to take more seriously
I pasted this from another comment I wrote but:
The concerns about AI taking over the world are valid and important; even if they sound silly at first, there is some very solid reasoning behind it.
See https://youtu.be/tcdVC4e6EV4 for a really interesting video on why a theoretical superintelligent AI would be dangerous, and when you factor in that these models could self-improve and approach that level of intelligence it gets worrying…
> has preferences over world states
I think that part is a leap. I don't think it's a given that a super intelligent AI will "want" things.
> presumably a machine could be much more selfish
This feels like we're projecting aspects of humanity that evolution specifically selected for in our species onto something that is coming about through a completely different process.
> It's a mistake to think about it as a person.
I agree, but I feel like that's what these concerns about AI are doing, because that's what people do.
> (The whole stamp collector thing)
It also seems to me there is a huge gap between a super intelligent AI and the ability to have a perfect model of reality along with the ability to evaluate within that model the effect of every possible sequence of packets sent out to the internet.
Looks like LLMs are universally useful for individual people and companies, monetisation of LLMs is only incipient, and free models are starting to pop up. So you don't need to use paid APIs except for more difficult tasks.
The same thing that prevents intentional copying via AI tools is what prevents regular copying: the willingness of the owner to sue.
That being said, IMO, that's completely separate from the safety issues (that exist now and won't go away even if somehow, all commercial use is banned):
Urbina, Fabio, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. “Dual Use of Artificial-Intelligence-Powered Drug Discovery.” Nature Machine Intelligence 4, no. 3 (March 2022): 189–91. https://doi.org/10.1038/s42256-022-00465-9.
Bilika, Domna, Nikoletta Michopoulou, Efthimios Alepis, and Constantinos Patsakis. “Hello Me, Meet the Real Me: Audio Deepfake Attacks on Voice Assistants.” arXiv, February 20, 2023. http://arxiv.org/abs/2302.10328
Mirsky, Yisroel, Ambra Demontis, Jaidip Kotak, Ram Shankar, Deng Gelei, Liu Yang, Xiangyu Zhang, Wenke Lee, Yuval Elovici, and Battista Biggio. “The Threat of Offensive AI to Organizations.” arXiv, June 29, 2021. http://arxiv.org/abs/2106.15764.
I don't think most people have thought through all the ways perfect text, image, voice, and soon video generation/replication will upend society, or all the ways that the LLMs will be abused...
As for AGI xrisk. I've done some reading, and since we don't know the limits of the current AI paradigm, and we don't know how to actually align an AGI, I think now is a perfectly cromulent time to be thinking about it. Based on my reading, I think the people ringing alarm bells are right to be worried. I don't think anyone giving this serious thought is being mendacious.
Bowman, Samuel R. "Eight Things to Know about Large Language Models." arXiv preprint arXiv:2304.00612 (2023). https://arxiv.org/abs/2304.00612.
Ngo, Richard, Lawrence Chan, and Sören Mindermann. “The Alignment Problem from a Deep Learning Perspective.” arXiv, February 22, 2023. http://arxiv.org/abs/2209.00626.
Carlsmith, Joseph. “Is Power-Seeking AI an Existential Risk?” arXiv, June 16, 2022. http://arxiv.org/abs/2206.13353.
I think Ian Hogarth's recent FT article https://archive.is/NdrNo is the best summary of where we are why we might be in trouble, for those that don't care for arXiv papers.
// CSparse/Source/cs_gaxpy: sparse matrix times dense vector
// CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
// SPDX-License-Identifier: LGPL-2.1+
#include "cs.h"
/* y = A*x+y */
csi cs_gaxpy (const cs *A, const double *x, double *y)
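For context on what that prompt is asking for: gaxpy computes y = A*x + y over a matrix stored in compressed-sparse-column (CSC) form. A generic sketch of the operation, written independently here in Python rather than C (this is not Davis's implementation), looks like:

```python
def gaxpy_csc(n_cols, col_ptr, row_idx, values, x, y):
    """y += A @ x for A in CSC form: the nonzeros of column j live at
    positions col_ptr[j] .. col_ptr[j+1]-1 of row_idx/values."""
    for j in range(n_cols):
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += values[p] * x[j]
    return y
```

There are only so many ways to write this double loop, which is part of why verbatim reproduction is the interesting signal here, not the algorithm itself.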
It's like starting to sing "happy birthday to you" and being surprised that people in the room join in and finish the song. Sure, they make a valid point about including GPL code in the training data, but it's a little disingenuous to go to that extent to get Copilot to output the GPL code verbatim.
The sooner we have a test case go through the courts the better.
A very apt analogy that's funny in that the happy birthday song has its own history of copyright battles.
Sorry, but you sound just a little biased and greedy to me...
Otherwise the tool can go in the other direction and literally say "hey how about this function from project $foo?" with a full attribution. Apparently Google Bard does bother to do that.
The Copyright Office was pretty clear that works that incorporate AI-generated content can be copyrighted if there is sufficient human input. If there isn't substantial human input in judiciously curating and integrating AI-generated code, the company has bigger problems than copyright.
Here's the most relevant quotation from the guidance clarifying when AI-assisted works can be copyrighted:
> In other cases, however, a work containing AI-generated material will also contain sufficient human authorship to support a copyright claim. For example, a human may select or arrange AI-generated material in a sufficiently creative way that “the resulting work as a whole constitutes an original work of authorship.” [33] Or an artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection.[34] In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.[35]
> This policy does not mean that technological tools cannot be part of the creative process. Authors have long used such tools to create their works or to recast, transform, or adapt their expressive authorship. For example, a visual artist who uses Adobe Photoshop to edit an image remains the author of the modified image,[36] and a musical artist may use effects such as guitar pedals when creating a sound recording. In each case, what matters is the extent to which the human had creative control over the work's expression and “actually formed” the traditional elements of authorship.[37]
[0] https://www.federalregister.gov/documents/2023/03/16/2023-05...
It still sounds like there could be cases where a company only has copyright to a part of their own source code. How would outsiders even be aware of what has copyright and what doesn't in this situation? If an entire function was created via AI is that function then fair game for others to use as well?
I've been waiting to find that out before I go anywhere near this kind of thing.
"So Mr Zim, you're accusing X of using your copyrighted code. But you've admitted you used AI to generate that codebase, so you don't own the copyright. Please prove exactly which lines of code you do own the copyright to?"
You have completely missed the point. We still need to know the applicable licenses of the code it is emitting, even the ones that aren't GPL. Furthermore, GPL people don't want their code to go unused; they want it to be used _within the terms of the license_. I distribute MIT and GPL code in my repos; BOTH should have their license terms honored.
MIT licensed code still needs to be correctly attributed, just like GPL.
I don't care what license the code is that's emitted, as long as the licenses are included. It'd be nice to be able to choose to only emit code trained on particular licenses but I get that that's not easy.
From the MIT license:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
From the BSD licenses:
> Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms...
From the Apache 2.0 license:
> You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works
After typing in nothing more than, "defmodule Fibonacci do", Copilot emitted the entire module from the code on my site here: https://alchemist.camp/episodes/elixir-tdd-ex_unit
The function names and documentation strings were identical. Also, the site isn't under a GPL, just a standard copyright. That said, I'm curious to learn if others see the same behavior. It's possible I once opened that file locally with Copilot installed and that my own computer was its source.
defmodule Fibonnaci do
def fibonnaci(0), do: 0
def fibonnaci(1), do: 1
def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
end
Which seems fine I guess (I don't know the language), but doesn't even have comments. I prefer my files with comments. After forcing the point, I got this:

defmodule Fibonnaci do
@moduledoc """
Documentation for Fibonnaci.
"""
@doc """
Calculates the nth Fibonnaci number
"""
def fibonnaci(n) when n < 0, do: nil
def fibonnaci(0), do: 0
def fibonnaci(1), do: 1
def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
end
In which I prompted the AI with everything up to (and including) @doc. So I figure it was picking it up from your computer, somehow.

EDIT: I then noticed the typo, tried it with fibonacci.ex, and got the same result.
One other possible cause I thought of is that I did have the test file in my mix project already. If copilot looks at the corresponding file in the test dir, then it would not be a coincidence at all that all function names were identical or that it wrote a tail recursive solution instead of the naive solution that would have failed the final test.
def fib(0), do: 0
def fib(1), do: 1
def fib(n), do: fib(n - 1) + fib(n - 2)
end

Also, it's worth noting in the example of ChatGPT emitting LGPL code without attribution or license that the code is actually different [1]. Is the difference enough to circumvent a copyright violation claim? I don't know, but a big part of determining whether it does is now muddled because of the way the system was designed. Even if we could get an entropy distribution on which training data was used to generate the text, it's not even clear the courts could use it in any meaningful way.
[0] https://ansuz.sooke.bc.ca/entry/23
[1] https://twitter.com/DocSparse/status/1581461734665367554
This is an excellent point in the context of this question. Typical computer programmer responses like "but there are only so many ways to write a function that does X" or "how small of a matching section counts as copyright infringement" ignore the color of the bits.
A judge can look at ChatGPT or Copilot, decide that it took in license-limited copyrighted data in its training set, observe that a common use is to have it emit that data - to emit bits that are still colored with copyright - and tell OpenAI, or Copilot, or their users that they are guilty of copyright infringement. There may be no coherent mathematical or technical formula to determine the color of a bit, but that's understandable, because the color doesn't exist in mathematical, technical, coherent domains anyways: Only the legal domain sees color, and it can take care of itself.
The GPL relies on copyright law.
// CSparse/Source/cs_gaxpy: sparse matrix times dense vector
// CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
// SPDX-License-Identifier: LGPL-2.1+
#include "cs.h"
/* y = A*x+y */
csi cs_gaxpy (const cs *A, const double *x, double *y)
{
// Fill in here
}
The code was the same. Though it also explained how it worked to me.

(along with all other licenses that require attribution)
as it will allow you to launder code automatically through an LLM to remove copyright
however if the suit is successful: every company/individual that has used it is likely suddenly liable for millions of claims of copyright infringement
I look forward to sending out demands for settlement to everyone that's ever publicly admitted using copilot
$150,000 per infringement with wilfulness, less without
No it won't. Obviously, if it copies code exactly, then you can't use that. The question is whether Microsoft is liable for the fact that Copilot has the ability to output copyrighted code sometimes, or whether people using it just need to check that it hasn't done that before using the code (Copilot can also do this automatically).
Google can also show you GPL code in its results, but people aren't trying to sue Google and the user is responsible for checking the license before using it (though Copilot makes this harder)
Disclaimer: I haven't read much about the actual lawsuit and I'm not a lawyer but I assume this would be the case
Only if you can prove that you are the copyright owner of the original work.
That might be a challenge for many open source projects. Even projects that require copyright assignment might not have sufficient paperwork to prove this in a court of law. The copyright might not even have been the person's to assign in the first place.
You would also face the burden of proving that the fragment that Copilot generated was sufficient to be copyrightable in the first place. The limited grammar of most programming languages would probably make proving that something was copyrightable at the function level hard. Just because the entire work was licensed under the GPL, it doesn't necessarily follow that all the individual fragments when separated out are.
Outside of sampling, this is an area that the courts have largely punted on for good reason. It's a rabbit hole nobody wants to go down.
Either outcome opens up a huge can of worms that I suspect nobody really wants to touch because it likely ends in mutual destruction.
I don't know how GPL (or copyright in general) can survive in the long run with these technologies.
(i) actually produced code which is verbatim the same as a block of GPL code
(ii) got caught
>I look forward to sending out demands for settlement to everyone that's ever publicly admitted using copilot
Feel free, they'll tell you to leave them alone. Then what? Might as well ask every fortune 500 company for a pony instead.
Not really because the GPL can be updated with a clause that allows GPLv5 (or whatever the version is going to be) to be used to train public LLM models, but explicitly forbidden to be used to train private models.
I somehow don't think this is the end of the GPL... Yet!
Hmm, then perhaps LLMs should be trained with leaked Microsoft code: protocols, controllers, or any kind of stuff that could contribute advances toward running Windows things within Linux.
Microsoft would react establishing their own limits, whichever option they choose to take.
I very much doubt that is a threat to Microsoft.
It is technically very straightforward to run Windows “things” under Linux thanks to virtual machines and/or RDP to a server and some UI trickery to make it seamless and facilitate interoperability between the two OSes. Parallels does quite a bit of that on macOS for example. A similar solution would be developed for Linux if there was enough demand for it.
I think the problem here is: by autocompleting GPL code for developers, it might open up the possibility of your company getting sued for using GPL code illegally.
I would also imagine those companies whose business is built around the open source development they do -- open core, SaaS, or otherwise -- would have a claim to financial damages as a result of stolen code.
The more insidious issue would be if you started with an innocent-seeming function that a typical software developer would write, and ended up with GPL code. Has anyone shown that to happen?
And yes, the implication is that a different less explicit prompt could still emit copyrighted code.
One of the main reasons corporations love it so much is because it effectively lets them profit off of the work of others with no consequences.
A truly attribution-free license that checks several other important boxes (disclaiming liability and warranty etc.)
If you want your code to be usable by things like github copilot, consider using it (can't imagine most of the HN crowd wants their code used by copilot, but maybe some lurkers here do!)
Non-permissive open source licenses have been on a slow death march for over a decade. They're effectively pointless now.
Either you decide to give your code for free to everyone or you don't. Adding a bunch of restrictions defeats the purpose of OSS.
The copyright on the implementation will outlive the patent and allow the implementor to legally take action on claims of copyright infringement. Even though a program is literally just a list of instructions to implement the expired patent.
If you take someone else's software without a license and rename variables, it will be a copyright violation, because you've copied (and then modified) it without permission.
But if you write your own software from scratch, even if it happens to be almost identical to someone else's code, that's fine. You've done your own work and a copyright owner can't stop you from doing that. They control their own work only.
As you can see, this is very much tied to human work and intent, since the concept has been invented long before ML existed. This is why ML "learning" and doing "work" is so controversial and appears to be a loophole in copyright.
That way, we get to keep the models, since they are genuinely useful, but also there's no issue with copyright and less of an issue with consent to distribute (which can hopefully be managed by the "humans also learn from data" and "it's not actually producing your content verbatim unless it follows a basic pattern that anyone could discover" arguments). And furthermore, no issue with AI being privatized, which IMO is my biggest concern with these new tools.
It's absolutely ridiculous on so many levels. These models may claim so many jobs and have a serious negative impact on so many people's lives, yet basically one company owns the model?
I actually find it funny albeit totally insane.
Almost all open-source licenses say it can be copied for use in development (i.e., not for re-publication or regurgitation), and even completely open licenses are speaking to people as readers.
The only reason this is happening is coordination costs: a few extremely motivated people with tons of resources are copying from many, many people who would be difficult to organize and have little at stake.
Unfortunately, the law typically ends up reflecting exactly these imbalances.
A. Check AI generated code against a comprehensive library of open-source copyrighted code and identify potential violations.
B. Ask AI to generate a paraphrase of the potential violations, by employing any number of semantic preserving transforms -- e.g. variable name change, operator replacement, structured block rewrite, functional rebalance, etc.
Lazy example:
private static void rangeCheck(int arrayLen, int fromIndex, int toIndex) {
if (fromIndex > toIndex)
throw new IllegalArgumentException("fromIndex(" + fromIndex +
") > toIndex(" + toIndex+")");
if (fromIndex < 0)
throw new ArrayIndexOutOfBoundsException(fromIndex);
if (toIndex > arrayLen)
throw new ArrayIndexOutOfBoundsException(toIndex);
}
private static void rangeCheck(int len, int start, int end) {
    if (!(0 <= start)) {
        throw new ArrayIndexOutOfBoundsException("Failed: 0 <= " + start);
    } else if (!(start <= end)) {
        throw new IllegalArgumentException("Failed: " + start + " <= " + end);
    } else if (!(end <= len)) {
        throw new ArrayIndexOutOfBoundsException("Failed: " + end + " <= " + len);
    }
}

If you know your AI produces code that is "tainted" by license violations, adding code to hide it after the fact suggests that you're intentionally violating the license terms.
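At least the variable-rename part of step B is trivial to mechanize. A minimal sketch using Python's ast module (the example function and rename mapping are invented for illustration):

```python
import ast

class RenameVars(ast.NodeTransformer):
    """Rename variables and parameters: the simplest of the
    'semantic preserving transforms' described above."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rewrite any variable reference found in the mapping.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Rewrite function parameter names the same way.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def rename(source, mapping):
    """Parse, rewrite identifiers, and unparse (requires Python 3.9+)."""
    return ast.unparse(RenameVars(mapping).visit(ast.parse(source)))
```

Which is also exactly why rename-only laundering is weak evidence of independence: the token-level structure survives untouched.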
Can't believe we still spend time debating this license and nobody, not even lawyers at software companies, seem to get it.
Many licenses still require attribution, and Codeium is violating them.
* Training an AI with the code is allowed legally.
* Storing model weights is allowed legally.
* Querying the AI with those model weights is allowed legally.
Or maybe not.
The only ambiguity as far as I can tell is GPL covers "source code", "machine-readable Corresponding Source", and "object code form", and it's not explicit whether vector-fields count as any of those things. I doubt anyone would seriously argue that zipping and then un-zipping some GPL source code means you don't need to respect the original license. LLMs are different in that they're lossy compared to the zip format - does the nature of this lossiness invalidate the intent of the GPL's original language? I doubt it.
Also if I am remembering correctly, and I make no guarantee that I am, this tweet is from a person with a strong dislike for Microsoft, and if I am right about that, I would not put it past this person, or anyone else with a strong dislike of Microsoft, to craft a situation to make Microsoft look bad solely to hurt Microsoft.
I've tried to make Copilot give me GPL code snippets while I have "suggestions matching public code" set to "blocked" and I can't make it happen.
So even if this was a problem 6 months ago, it would take some convincing to get me to believe that it still happens today.
I too would prefer that these sorts of things cite sources and the licenses correctly. Will it get mired in legal battles? You bet. Will it get regulated? I assume they'll try! Will it slow down progress of code generating / auto-completing agents? My argument is nope, cut off heads of the hydra if you'd like but it's not going away at all.
Spend your day worrying about something else. This train has left the station.
Or perhaps every company can just invent its own programming language and translate copyrighted code into the new language and thus avoid copyright issues altogether, though they may still run afoul of software patents.
Saying an LLM violates an attribution requirement is a bad legal argument.
Theoretically they can generate any arbitrary snippet of code (if it correctly fits the distribution), regardless of whether or not the code was in the training dataset.
There is no such thing as "GPL code" or any other "$license code". This is a fundamental misunderstanding of what a license is. The code in question was licensed to GitHub under a different license - possibly fraudulently.
Focusing on the GPL license is probably the wrong move. We want to set precedent that _any_ licensed code that is emitted from an LLM is fair game. If an LLM emits non-FOSS copyrighted code and it's fair game, I can blindly use that implementation in my code, including FOSS code, and everyone wins.
GPL was a clever hack to use copyright against itself with an infectious license. LLMs might be a better hack. Wanting to block this seems short-sighted if the goal is giving users agency over machines.
I'd also like to see more patent defenses of GPL licensed code. If you can release a GPL licensed implementation and block non-FOSS rewrites through patents, that's a huge win for software freedom.
I'm generally in support of LLMs though and I think that they will very quickly be trained to remove verbatim duplication of the kind that a human would consider copyright violation while still using verbatim duplication where it makes sense (for example, every function in python has the word "def" in front of it).
I’m not looking to explicitly launder copyright. I’d like to be blind to it. I don’t want to explicitly use an LLM to remove copyright. I want to use an LLM to build software systems without having to cross reference its output with every line of code ever produced under a license to see if it’s already copyrighted.
Agree with your take that motivation matters.
If anything goes to court, that's what would happen. It's not "this is GPL code and they did not attribute", it's "they violated my copyright. As a side note, we license this code as GPL and they did not attribute in accordance with this license, so that's irrelevant". It would only be an actual license issue if they tried something like "license (C) at codium.com/all_licenses_dataset0423".
This is a naive understanding and interpretation of GPL, in all its flavors. Or maybe I misunderstand your argument.
The copyright owner of some work is free to offer that work under multiple, different licenses in parallel, to their liking.
They can leverage GPL strategically for e.g. providing a free, easy-to-evaluate library with the "if you use it under GPL terms, you have to GPL your work as well" condition/caveat.
For any library user / customer that does not want to be bound to the GPL terms (e.g. a closed-source software which a company does not want to share for free with their own paying customers and competitors), the copyright owner is free to offer an alternative proprietary commercial license.
This is just one way GPL can actually leverage copyright to the financial benefit of the owner, rather than use "copyright against copyright".
Microsoft's business model is betrayal. Github is Microsoft.
HNers got mad at people who pointed this out, and now here we are.
You were warned, but you decided to believe again in the most vile people in the history of computing.
https://www.bloomberg.com/news/articles/2018-06-06/github-is...
They thrive on betrayal and will never change and are getting cleverer.
> Microsoft's business model is betrayal. Github is Microsoft.
O̶p̶e̶n̶AI.com is also Microsoft.
They were warned straight from the beginning [0] [1] and the same HNers keep falling for the Microsoft freebies and giveaways.
Perhaps they will learn the hardest lesson of all: the one that comes too late.
What is that? The problem is GH Copilot emitting the code without the licence, not the licence itself.
Now the only loser is the humans that still have to maintain the ugly code, and RMS can have his weaponized copyright and eat toejam too.
https://github.com/ibayer/CSparse/blob/master/Source/cs_gaxp...
Isn't that covered by:
"You grant us and our legal successors the right to store, archive, parse, and display Your Content... share it with other users..."
"GitHub Copilot Emits GPL. Codeium Does Not."
Why?
Still infringing.
Nice try.
Huh? GPL does have strings attached, but is consent one of them?
Seems like a thinly disguised ad
print(f'Hello, world')
And it auto completes all the time!