private static void rangeCheck(int arrayLen, int fromIndex, int toIndex {
if (fromIndex > toIndex)
throw new IllegalArgumentException("fromIndex(" + fromIndex +
") > toIndex(" + toIndex+")");
if (fromIndex < 0)
throw new ArrayIndexOutOfBoundsException(fromIndex);
if (toIndex > arrayLen)
throw new ArrayIndexOutOfBoundsException(toIndex);
}
On a more serious note, I really wonder where the line is drawn for copyright. I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative. Perhaps there is a matter of taste in which exceptions you throw, or in what order. But the chosen ones are certainly what most people think of first, and the order of validating arguments, then checking the low value, then checking the high value, is pretty much what anyone would do. Perhaps you could format the error message differently. That's about it. So when someone "rips off" your code wholesale, it could just be that everyone writing that function would have typed in the exact same bytes as you. You know your style guide is working when you look at code, think you wrote it, but actually you didn't!

That said, we used copyright traps at Malwarebytes, which is how we found out that IObit was stealing our database.
I've set a "trap" myself years ago in code in a novel solution at the time for uploading photos from iOS non-interactively after the fact. It was to support disconnected field workers taking photos from iPhones/iPads, with the payloads uploaded at a later date.
Chunked form data constructed in userland JS was the solution. Chunk separator was 17 dashes in a row (completely arbitrary), company name in 1337 speak, plus 17 more dashes.
Found a competitor that had copied the code, changing only the 1337 speak part. 17 dashes remained on each side. Helped me realize that they had unminified and indeed ripped off our R&D work.
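The trap described above can be sketched out. This is a minimal illustration in Python (the original was userland JS), and the 1337-speak company name here is a made-up placeholder, not the real one:

```python
# Hand-rolled multipart/form-data body with a distinctive custom boundary:
# 17 dashes, a 1337-speak company name (invented placeholder here), and 17
# more dashes -- exactly the kind of arbitrary string that works as a trap.
BOUNDARY = "-" * 17 + "3x4mpl3c0rp" + "-" * 17

def build_multipart(fields):
    """Assemble a multipart/form-data payload from a dict of name -> str."""
    parts = []
    for name, value in fields.items():
        parts.append('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n'
                     % (BOUNDARY, name))
        parts.append(value)
        parts.append("\r\n")
    parts.append("--%s--\r\n" % BOUNDARY)  # closing boundary
    return "".join(parts)
```

If that exact boundary string later shows up in someone else's unminified bundle, it didn't get there by independent invention.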
Wonder if Copilot could be gamed the same way.
Edit: As sroussey points out, s/isn't copyrightable/isn't copyrightable in the USA/
In the US legal system, the merger doctrine is a concept whereby a given expression cannot be granted protection if it's not sufficiently creative, and there are only so many ways to express something when stripped down to its fundamentals. In response to this, RMS and Moglen encouraged contributors from very early on to try to express the inner workings of GNU utilities in creative and non-obvious ways, out of caution against the possibility that the copyleft obligations of the GPL wrt a given package could be nullified by a finding in court that it did not pass the threshold for creativity.
It wasn't the right solution to the problem in question, for what it's worth.
Just manually did what GPT does now.
Copyright protects original works of authorship including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture.
Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.
Here (and in the future even more), AI is totally capable of expressing one idea in any programming language if you ask for it (even if conceptually inspired by copyrighted code).
Which means that a particular expression (a specific implementation) is practically of no value or particular interest at this stage.
You could ask the AI to do a slightly different implementation, it would not be a problem for it and would require no efforts.
There is no point to protect something that can be generated using no effort and has no particular genius in it.
The problem, however, is that we live in this world, where it is copyrightable, and companies relying on Copilot to do large swathes of code generation do potentially have to worry about including copyrighted code in their codebase, and what the legal fallout from that might be.
This is completely unacceptable and another example that Microsoft is an evil and amoral company who only cares about open source for financial gain.
(And since Brian Kernighan was teaching it, I'm inclined to believe in it.)
1) They are using your IP, with coerced consent, to check other people's work as well as your own in the future. (Let's have a fun discussion about "self-plagiarism.")
2) ChatGPT and the like are going to so massively increase the noise floor on this problem space that these counterfeit detection companies should all but disappear in a number of years.
You can read the data to train a thing. So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.
When that thing later generates an output, the output isn't copyrightable because it's machine generated (this is the current US position) and it isn't a copyright violation because it was generated, not copied.
You can launder copyrighted material through an LLM, basically.
1. Copyright is only granted to creative elements; lots of program code is supposedly un-copyrightable, though no one wants to fight on that ground.
2. It is lawful in many jurisdictions to effectively take even copyrighted materials and train AI on them, for the sake of humanity at large; the same supposedly does not apply to the output. But AI supporters tend to conflate the two.
3. AI training processes, stochastic gradient descent and all, are only called "learning" and/or "training" by convention; there is no consensus that it is the same thing the word usually denotes, though we generally don't scare-quote airplanes "flying."
Also, in part it depends greatly on the objective function used. In GPT-style models the objective is to precisely copy from input to output, token by token. I think it's extremely bad faith to argue that this has any relationship to human learning or learning objectives.
you shouldn't take the math seriously and I'm not being dismissive with the word "just" in scare quotes. However the community somehow wants to have its cake and eat it too.
> For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.
We are at a point at which compilers detect such functions and replace them with highly optimized ones. If you have to artificially change code just for the sake of patent or license trolls, you don't just get more work but also worse performance/optimizations in most cases.
Syntax Error on line 1. Missing closing ) in the method definition.
As soon as you start thinking about copyright, you end up realizing it's all non-sense. Stephan Kinsella (a patent lawyer!) is the leading thinker on this, and his videos, essays, and podcasts are worth listening to: https://www.youtube.com/watch?v=e0RXfGGMGPE
This point is absolutely going to come up in any lawsuits; because the law does sometimes examine how much creativity there is available in a field before making a determination (Oracle v Google comes to mind). If you can show that there are very, very few reasonable ways to accomplish a goal, and said goal is otherwise not patented or prohibited, it's either not copyrightable or Fair Use, take your pick.
This even applies under the interoperability section of the DMCA and similar laws for huge projects. Assuming that ReactOS, for example, is actually completely clean-room; that would be protected despite having the same API names and, likely, a lot of similar code implementing most of the most basic APIs.
If Codeium doesn't produce these when producing "verbatim enough" snippets, how is this actually better, besides avoiding a GPL boogeyman?
I get that there have been fewer (if any? I'm not aware of any) MIT/Apache2.0/MPL2.0 license violations that have gone to court than GPL violations, but this still feels like an "address the symptoms" and not "address the cause" difference.
it's not
if they've trained on MIT/Apache 2.0/... then they're just as liable as people that have trained on GPL
they would be limited to training on licenses that don't require attribution (BSD2, public domain, etc)
which I suspect limits the size of the training set so much that the output would be useless
Codeium here is unintentionally making an argument that undermines legal confidence in their own product
interesting choice!
Of course, if someone figures out an algorithm that does that, people could use the same algorithm to identify missing attributions and plagiarism in other projects and throw lawsuits around. (Sigh)
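An algorithm along those lines already exists in the plagiarism-detection world: token shingling, where you hash every n-token window of the source and compare fingerprint sets. A crude sketch (parameters are arbitrary; real systems like MOSS layer winnowing on top of this):

```python
import hashlib
import re

def fingerprints(code, n=8):
    """Hash every n-token window of a source string (token shingling)."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    windows = range(max(len(tokens) - n + 1, 1))
    return {hashlib.sha1(" ".join(tokens[i:i + n]).encode()).hexdigest()
            for i in windows}

def overlap(candidate, corpus_entry, n=8):
    """Fraction of the candidate's token windows that appear verbatim
    in another source file; near 1.0 means a lift, near 0.0 means not."""
    fc = fingerprints(candidate, n)
    return len(fc & fingerprints(corpus_entry, n)) / len(fc)
```

Running every LLM output through a fingerprint index of known licensed code, and every open source release through the same index, are the same computation pointed in opposite directions.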
Of course, the entire basis for LLMs being legal is that they use work collectively to know how code/language works and how to write it in relation to the given context. In this case, the legal defense is that the tool is like a human that learned how to code by looking at CC-BY-SA and other licensed publicly-available code and assimilating it into their own fleshy human neural network.
This only becomes shaky once you add in regurgitating code verbatim, but humans do this too, so the solution there is the copilot setting that tries to detect and revert any verbatim generated code snippets.
Why should it not be legal? Doesn't that make copyright equally powerful with patents? Copyright should restrict only replication of expression not replication of ideas.
I also believe this is where a lot of the hype about "rogue AIs" and singularity type bullshit comes from. The makers of these models and products will talk about those non-problems to cover for the fact that they're vacuuming up the work of individuals then monetizing it for the profit of big industry players.
Not sure if I'd say there's a conspiracy per se, but I do think generative AI players are going to be careful about the optics of the technology and how it works. Anecdotally from speaking to non-technical family members there's very little understanding for how the technology actually works, and it seems there's not a great deal of effort to emphasize the importance of training data, or the intellectual property considerations in these companies marketing materials.
Negative marketing is good marketing. Look at all of us debating this theft at scale, promoting the brand of this non-product.
1. What about Elon Musk and hundreds of other AI investors? It's in their interest to overhype AI, while temporarily slowing down competition by spreading singularity fears.
2. OpenAI released the GPT-4 report where they claim better performance of their model than it has in reality [1].
That's also why they claim these are "black boxes" and that they "don't understand how they work". They are prepping the markets for the grand theft that's unfolding.
https://stackoverflow.com/help/licensing
I don't think I've heard anyone warn people not to copy code snippets from stackoverflow due to licensing issues, although "real" businesses should be rightfully concerned.
Manager: "we asked, legal says you can't use copilot", dev: "okay, so from now on, I'll not discuss how I use copilot and will remember to disable it when someone sees me working, gotcha".
I'm not saying everyone will do this, I'm saying some people will know that the corp doesn't always have a way to verify how the code was written, and they will think that a lawsuit cannot really happen to them.
If all software started being non-permissive and closed source, there would be no training data and no new innovation, and even if there was, it would probably suck like it did before the GPL and similar licensing were mainstream.
Why is that a non-problem? It's a really important concern that we need to take more seriously
I pasted this from another comment I wrote but:
The concerns about AI taking over the world are valid and important; even if they sound silly at first, there is some very solid reasoning behind it.
See https://youtu.be/tcdVC4e6EV4 for a really interesting video on why a theoretical superintelligent AI would be dangerous, and when you factor in that these models could self-improve and approach that level of intelligence it gets worrying…
> has preferences over world states
I think that part is a leap. I don't think it's a given that a super intelligent AI will "want" things.
> presumably a machine could be much more selfish
This feels like we're projecting aspects of humanity that evolution specifically selected for in our species onto something that is coming about through a completely different process.
> It's a mistake to think about it as a person.
I agree, but I feel like that's what these concerns about AI are doing, because that's what people do.
> (The whole stamp collector thing)
It also seems to me there is a huge gap between a super intelligent AI and the ability to have a perfect model of reality along with the ability to evaluate within that model the effect of every possible sequence of packets sent out to the internet.
Looks like LLMs are universally useful for individual people and companies, monetisation of LLMs is only incipient, and free models are starting to pop up. So you don't need to use paid APIs except for more difficult tasks.
The same thing that prevents intentional copying via AI tools is what prevents regular copying: the willingness of the owner to sue.
That being said, IMO, that's completely separate from the safety issues (that exist now and won't go away even if somehow, all commercial use is banned):
Urbina, Fabio, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. “Dual Use of Artificial-Intelligence-Powered Drug Discovery.” Nature Machine Intelligence 4, no. 3 (March 2022): 189–91. https://doi.org/10.1038/s42256-022-00465-9.
Bilika, Domna, Nikoletta Michopoulou, Efthimios Alepis, and Constantinos Patsakis. “Hello Me, Meet the Real Me: Audio Deepfake Attacks on Voice Assistants.” arXiv, February 20, 2023. http://arxiv.org/abs/2302.10328
Mirsky, Yisroel, Ambra Demontis, Jaidip Kotak, Ram Shankar, Deng Gelei, Liu Yang, Xiangyu Zhang, Wenke Lee, Yuval Elovici, and Battista Biggio. “The Threat of Offensive AI to Organizations.” arXiv, June 29, 2021. http://arxiv.org/abs/2106.15764.
I don't think most people have thought through all the ways perfect text, image, voice, and soon video generation/replication will upend society, or all the ways that the LLMs will be abused...
As for AGI xrisk. I've done some reading, and since we don't know the limits of the current AI paradigm, and we don't know how to actually align an AGI, I think now is a perfectly cromulent time to be thinking about it. Based on my reading, I think the people ringing alarm bells are right to be worried. I don't think anyone giving this serious thought is being mendacious.
Bowman, Samuel R. "Eight Things to Know about Large Language Models." arXiv preprint arXiv:2304.00612 (2023). https://arxiv.org/abs/2304.00612.
Ngo, Richard, Lawrence Chan, and Sören Mindermann. “The Alignment Problem from a Deep Learning Perspective.” arXiv, February 22, 2023. http://arxiv.org/abs/2209.00626.
Carlsmith, Joseph. “Is Power-Seeking AI an Existential Risk?” arXiv, June 16, 2022. http://arxiv.org/abs/2206.13353.
I think Ian Hogarth's recent FT article https://archive.is/NdrNo is the best summary of where we are why we might be in trouble, for those that don't care for arXiv papers.
// CSparse/Source/cs_gaxpy: sparse matrix times dense vector
// CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
// SPDX-License-Identifier: LGPL-2.1+
#include "cs.h"
/* y = A*x+y */
csi cs_gaxpy (const cs *A, const double *x, double *y)
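For context on what that prompt is asking for: gaxpy computes y = A*x + y over a matrix stored in compressed-sparse-column (CSC) form. A generic sketch of the operation, written independently here in Python rather than C (this is not Davis's implementation), looks like:

```python
def gaxpy_csc(n_cols, col_ptr, row_idx, values, x, y):
    """y += A @ x for A in CSC form: the nonzeros of column j live at
    positions col_ptr[j] .. col_ptr[j+1]-1 of row_idx/values."""
    for j in range(n_cols):
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += values[p] * x[j]
    return y
```

There are only so many ways to write this double loop, which is part of why verbatim reproduction is the interesting signal here, not the algorithm itself.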
It's like starting to sing "happy birthday to you" and being surprised that people in the room join in and finish the song. Sure, they make a valid point about including GPL code in the training data, but it's a little disingenuous to go to that extent to get Copilot to output the GPL code verbatim.
The sooner we have a test case go through the courts the better.
A very apt analogy that's funny in that the happy birthday song has its own history of copyright battles.
Sorry, but you sound just a little biased and greedy to me...
Otherwise the tool can go in the other direction and literally say "hey how about this function from project $foo?" with a full attribution. Apparently Google Bard does bother to do that.
The Copyright Office was pretty clear that works that incorporate AI-generated content can be copyrighted if there is sufficient human input. If there isn't substantial human input in judiciously curating and integrating AI-generated code, the company has bigger problems than copyright.
Here's the most relevant quotation from the guidance clarifying when AI-assisted works can be copyrighted:
> In other cases, however, a work containing AI-generated material will also contain sufficient human authorship to support a copyright claim. For example, a human may select or arrange AI-generated material in a sufficiently creative way that “the resulting work as a whole constitutes an original work of authorship.” [33] Or an artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection.[34] In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.[35]
> This policy does not mean that technological tools cannot be part of the creative process. Authors have long used such tools to create their works or to recast, transform, or adapt their expressive authorship. For example, a visual artist who uses Adobe Photoshop to edit an image remains the author of the modified image,[36] and a musical artist may use effects such as guitar pedals when creating a sound recording. In each case, what matters is the extent to which the human had creative control over the work's expression and “actually formed” the traditional elements of authorship.[37]
[0] https://www.federalregister.gov/documents/2023/03/16/2023-05...
It still sounds like there could be cases where a company only has copyright to a part of their own source code. How would outsiders even be aware of what has copyright and what doesn't in this situation? If an entire function was created via AI is that function then fair game for others to use as well?
I've been waiting to find that out before I go anywhere near this kind of thing.
"So Mr Zim, you're accusing X of using your copyrighted code. But you've admitted you used AI to generate that codebase, so you don't own the copyright. Please prove exactly which lines of code you do own the copyright to?"
You have completely missed the point. We still need to know the applicable licenses of the code it is emitting, even the ones that aren't GPL. Furthermore, GPL people don't want their code to go unused; they want it to be used _within the terms of the license_. I distribute MIT and GPL code in my repos; BOTH should have their license terms honored.
MIT licensed code still needs to be correctly attributed, just like GPL.
I don't care what license the code is that's emitted, as long as the licenses are included. It'd be nice to be able to choose to only emit code trained on particular licenses but I get that that's not easy.
From the MIT license:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
From the BSD licenses:
> Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms...
From the Apache 2.0 license:
> You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works
After typing in nothing more than, "defmodule Fibonacci do", Copilot emitted the entire module from the code on my site here: https://alchemist.camp/episodes/elixir-tdd-ex_unit
The function names and documentation strings were identical. Also, the site isn't under a GPL, just a standard copyright. That said, I'm curious to learn if others see the same behavior. It's possible I once opened that file locally with Copilot installed and that my own computer was its source.
defmodule Fibonnaci do
def fibonnaci(0), do: 0
def fibonnaci(1), do: 1
def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
end
Which seems fine I guess (I don't know the language), but doesn't even have comments. I prefer my files with comments. After forcing the point, I got this:

defmodule Fibonnaci do
@moduledoc """
Documentation for Fibonnaci.
"""
@doc """
Calculates the nth Fibonnaci number
"""
def fibonnaci(n) when n < 0, do: nil
def fibonnaci(0), do: 0
def fibonnaci(1), do: 1
def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
end
In which I prompted the AI with everything up to (and including) @doc. So I figure it was picking it up from your computer, somehow.

EDIT: I then noticed the typo, tried it with fibonacci.ex, and got the same result.
One other possible cause I thought of is that I did have the test file in my mix project already. If copilot looks at the corresponding file in the test dir, then it would not be a coincidence at all that all function names were identical or that it wrote a tail recursive solution instead of the naive solution that would have failed the final test.
def fib(0), do: 0
def fib(1), do: 1
def fib(n), do: fib(n - 1) + fib(n - 2)
end

Also, it's worth noting in the example of ChatGPT emitting LGPL code without attribution or license that the code is actually different [1]. Is the difference enough to circumvent a copyright violation claim? I don't know, but a big part of determining whether it does is now muddled because of the way the system was designed. Even if we could get an entropy distribution on which training data was used to generate the text, it's not even clear the courts could use it in any meaningful way.
[0] https://ansuz.sooke.bc.ca/entry/23
[1] https://twitter.com/DocSparse/status/1581461734665367554
This is an excellent point in the context of this question. Typical computer programmer responses like "but there are only so many ways to write a function that does X" or "how small of a matching section counts as copyright infringement" ignore the color of the bits.
A judge can look at ChatGPT or Copilot, decide that it took in license-limited copyrighted data in its training set, observe that a common use is to have it emit that data - to emit bits that are still colored with copyright - and tell OpenAI, or Copilot, or their users that they are guilty of copyright infringement. There may be no coherent mathematical or technical formula to determine the color of a bit, but that's understandable, because the color doesn't exist in mathematical, technical, coherent domains anyways: Only the legal domain sees color, and it can take care of itself.
The GPL relies on copyright law.
// CSparse/Source/cs_gaxpy: sparse matrix times dense vector
// CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
// SPDX-License-Identifier: LGPL-2.1+
#include "cs.h"
/* y = A*x+y */
csi cs_gaxpy (const cs *A, const double *x, double *y)
{
// Fill in here
}
The code was the same. Though it also explained how it worked to me.

(along with all other licenses that require attribution)
as it will allow you to launder code automatically through an LLM to remove copyright
however if the suit is successful: every company/individual that has used it is likely suddenly liable for millions of claims of copyright infringement
I look forward to sending out demands for settlement to everyone that's ever publicly admitted using copilot
$150,000 per infringement with wilfulness, less without
No it won't. Obviously, if it copies code exactly, then you can't use that. The question is whether Microsoft is liable for the fact that Copilot has the ability to output copyrighted code sometimes, or whether people using it just need to check that it hasn't done that before using the code (Copilot can also do this automatically).
Google can also show you GPL code in its results, but people aren't trying to sue Google and the user is responsible for checking the license before using it (though Copilot makes this harder)
Disclaimer: I haven't read much about the actual lawsuit and I'm not a lawyer but I assume this would be the case
Only if you can prove that you are the copyright owner of the original work.
That might be a challenge for many open source projects. Even projects that require copyright assignment might not have sufficient paperwork to prove this in a court of law. The copyright might not even have been the person's to assign in the first place.
You would also face the burden of proving that the fragment that Copilot generated was sufficient to be copyrightable in the first place. The limited grammar of most programming languages would probably make proving that something was copyrightable at the function level hard. Just because the entire work was licensed under the GPL, it doesn't necessarily follow that all the individual fragments when separated out are.
Outside of sampling, this is an area that the courts have largely punted on for good reason. It's a rabbit hole nobody wants to go down.
Either outcome opens up a huge can of worms that I suspect nobody really wants to touch because it likely ends in mutual destruction.
I don't know how GPL (or copyright in general) can survive in the long run with these technologies.
(i) actually produced code which is verbatim the same as a block of GPL code
(ii) got caught
>I look forward to sending out demands for settlement to everyone that's ever publicly admitted using copilot
Feel free, they'll tell you to leave them alone. Then what? Might as well ask every fortune 500 company for a pony instead.
Not really because the GPL can be updated with a clause that allows GPLv5 (or whatever the version is going to be) to be used to train public LLM models, but explicitly forbidden to be used to train private models.
I somehow don't think this is the end of the GPL... Yet!
Hmm, then perhaps LLMs should be trained with leaked Microsoft code: protocols, controllers, or any kind of stuff that could contribute advances toward running Windows things within Linux.
Microsoft would react establishing their own limits, whichever option they choose to take.
I very much doubt that is a threat to Microsoft.
It is technically very straightforward to run Windows “things” under Linux thanks to virtual machines and/or RDP to a server and some UI trickery to make it seamless and facilitate interoperability between the two OSes. Parallels does quite a bit of that on macOS for example. A similar solution would be developed for Linux if there was enough demand for it.
I think the problem here is: by autocompleting GPL code for developers, it might open up the possibility of your company getting sued for using GPL code illegally.
I would also imagine those companies whose business is built around the open source development they do -- open core, SaaS, or otherwise -- would have a claim to financial damages as a result of stolen code.
The more insidious issue would be if you started with an innocent-seeming function that a typical software developer would write, and ended up with GPL code. Has anyone shown that to happen?
And yes, the implication is that a different less explicit prompt could still emit copyrighted code.
One of the main reasons corporations love it so much is because it effectively lets them profit off of the work of others with no consequences.
A truly attribution-free license that checks several other important boxes (disclaiming liability and warranty etc.)
If you want your code to be usable by things like github copilot, consider using it (can't imagine most of the HN crowd wants their code used by copilot, but maybe some lurkers here do!)
Non-permissive open source licenses have been on a slow death march for over a decade. They're effectively pointless now.
Either you decide to give your code for free to everyone or you don't. Adding a bunch of restrictions defeats the purpose of OSS.
The copyright on the implementation will outlive the patent and allow the implementor to legally take action on claims of copyright infringement. Even though a program is literally just a list of instructions to implement the expired patent.
If you take someone else's software without a license and rename variables, it will be a copyright violation, because you've copied (and then modified) it without permission.
But if you write your own software from scratch, even if it happens to be almost identical to someone else's code, that's fine. You've done your own work and a copyright owner can't stop you from doing that. They control their own work only.
As you can see, this is very much tied to human work and intent, since the concept has been invented long before ML existed. This is why ML "learning" and doing "work" is so controversial and appears to be a loophole in copyright.
That way, we get to keep the models, since they are genuinely useful, but also there's no issue with copyright and less of an issue with consent to distribute (which can hopefully be managed by the "humans also learn from data" and "it's not actually producing your content verbatim unless it follows a basic pattern that anyone could discover" arguments). And furthermore, no issue with AI being privatized, which IMO is my biggest concern with these new tools.
It's absolutely ridiculous on so many levels. These models may claim so many jobs and have a serious negative impact on so many people's lives, yet basically one company owns the model?
I actually find it funny albeit totally insane.
Almost all open-source licenses say it can be copied for use in development (i.e., not for re-publication or regurgitation), and even completely open licenses are speaking to people as readers.
The only reason this is happening is coordination costs: a few extremely motivated people with tons of resources are copying from many, many people who would be difficult to organize and have little at stake.
Unfortunately, the law typically ends up reflecting exactly these imbalances.
A. Check AI generated code against a comprehensive library of open-source copyrighted code and identify potential violations.
B. Ask AI to generate a paraphrase of the potential violations, by employing any number of semantic preserving transforms -- e.g. variable name change, operator replacement, structured block rewrite, functional rebalance, etc.
Lazy example:
private static void rangeCheck(int arrayLen, int fromIndex, int toIndex) {
if (fromIndex > toIndex)
throw new IllegalArgumentException("fromIndex(" + fromIndex +
") > toIndex(" + toIndex+")");
if (fromIndex < 0)
throw new ArrayIndexOutOfBoundsException(fromIndex);
if (toIndex > arrayLen)
throw new ArrayIndexOutOfBoundsException(toIndex);
}
private static void rangeCheck(int len, int start, int end) {
    if (!(0 <= start)) {
        throw new ArrayIndexOutOfBoundsException("Failed: 0 <= " + start);
    } else if (!(start <= end)) {
        throw new IllegalArgumentException("Failed: " + start + " <= " + end);
    } else if (!(end <= len)) {
        throw new ArrayIndexOutOfBoundsException("Failed: " + end + " <= " + len);
    }
}

If you know your AI produces code that is "tainted" by license violations, adding code to hide it after the fact suggests that you're intentionally violating the license terms.
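At least the variable-rename part of step B is trivial to mechanize. A minimal sketch using Python's ast module (the example function and rename mapping are invented for illustration):

```python
import ast

class RenameVars(ast.NodeTransformer):
    """Rename variables and parameters: the simplest of the
    'semantic preserving transforms' described above."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rewrite any variable reference found in the mapping.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Rewrite function parameter names the same way.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def rename(source, mapping):
    """Parse, rewrite identifiers, and unparse (requires Python 3.9+)."""
    return ast.unparse(RenameVars(mapping).visit(ast.parse(source)))
```

Which is also exactly why rename-only laundering is weak evidence of independence: the token-level structure survives untouched.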
Can't believe we still spend time debating this license and nobody, not even lawyers at software companies, seem to get it.
Many licenses still require attribution, and Codeium is violating them.
* Training an AI with the code is allowed legally.
* Storing model weights is allowed legally.
* Querying the AI with those model weights is allowed legally.
Or maybe not.
The only ambiguity as far as I can tell is GPL covers "source code", "machine-readable Corresponding Source", and "object code form", and it's not explicit whether vector-fields count as any of those things. I doubt anyone would seriously argue that zipping and then un-zipping some GPL source code means you don't need to respect the original license. LLMs are different in that they're lossy compared to the zip format - does the nature of this lossiness invalidate the intent of the GPL's original language? I doubt it.
Also if I am remembering correctly, and I make no guarantee that I am, this tweet is from a person with a strong dislike for Microsoft, and if I am right about that, I would not put it past this person, or anyone else with a strong dislike of Microsoft, to craft a situation to make Microsoft look bad solely to hurt Microsoft.
I've tried to make Copilot give me GPL code snippets while I have "suggestions matching public code" set to "blocked" and I can't make it happen.
So even if this was a problem 6 months ago, it would take some convincing to get me to believe that it still happens today.
I too would prefer that these sorts of things cite sources and the licenses correctly. Will it get mired in legal battles? You bet. Will it get regulated? I assume they'll try! Will it slow down progress of code generating / auto-completing agents? My argument is nope, cut off heads of the hydra if you'd like but it's not going away at all.
Spend your day worrying about something else. This train has left the station.
Or perhaps every company can just invent its own programming language and translate copyrighted code into the new language and thus avoid copyright issues altogether, though they may still run afoul of software patents.
Saying an LLM violates an attribution requirement is a bad legal argument.
Theoretically they can generate any arbitrary snippet of code (if it correctly fits the distribution), regardless of whether or not the code was in the training dataset.
There is no such thing as "GPL code" or any other "$license code". This is a fundamental misunderstanding of what a license is. The code in question was licensed to GitHub under a different license - possibly fraudulently.
Focusing on the GPL license is probably the wrong move. We want to set precedent that _any_ licensed code that is emitted from an LLM is fair game. If an LLM emits non-FOSS copyrighted code and it's fair game, I can blindly use that implementation in my code, including FOSS code, and everyone wins.
GPL was a clever hack to use copyright against itself with an infectious license. LLMs might be a better hack. Wanting to block this seems short-sighted if the goal is giving users agency over machines.
I'd also like to see more patent defenses of GPL licensed code. If you can release a GPL licensed implementation and block non-FOSS rewrites through patents, that's a huge win for software freedom.
I'm generally in support of LLMs though and I think that they will very quickly be trained to remove verbatim duplication of the kind that a human would consider copyright violation while still using verbatim duplication where it makes sense (for example, every function in python has the word "def" in front of it).
I’m not looking to explicitly launder copyright. I’d like to be blind to it. I don’t want to explicitly use an LLM to remove copyright. I want to use an LLM to build software systems without having to cross reference its output with every line of code ever produced under a license to see if it’s already copyrighted.
Agree with your take that motivation matters.
If anything goes to court, that's what would happen. It's not "this is GPL code and they did not attribute", it's "they violated my copyright. As a side note, we license this code as GPL and they did not attribute in accordance with this license, so that's irrelevant". It would only be an actual license issue if they tried something like "license (C) at codium.com/all_licenses_dataset0423".
This is a naive understanding and interpretation of GPL, in all its flavors. Or maybe I misunderstand your argument.
The copyright owner of some work is free to offer that work under multiple, different licenses in parallel, to their liking.
They can leverage GPL strategically for e.g. providing a free, easy-to-evaluate library with the "if you use it under GPL terms, you have to GPL your work as well" condition/caveat.
For any library user / customer that does not want to be bound to the GPL terms (e.g. a closed-source software which a company does not want to share for free with their own paying customers and competitors), the copyright owner is free to offer an alternative proprietary commercial license.
This is just one way GPL can actually leverage copyright to the financial benefit of the owner, rather than use "copyright against copyright".
Microsoft's business model is betrayal. Github is Microsoft.
HNers got mad at people who pointed this out, and now here we are.
You were warned, but you decided to believe again in the most vile people in the history of computing.
https://www.bloomberg.com/news/articles/2018-06-06/github-is...
They thrive on betrayal and will never change and are getting cleverer.
> Microsoft's business model is betrayal. Github is Microsoft.
O̶p̶e̶n̶AI.com is also Microsoft.
They were warned straight from the beginning [0] [1] and the same HNers keep falling for the Microsoft freebies and giveaways.
Perhaps they will learn the hardest lesson of all: the one that comes too late.
What is that? The problem is GH Copilot emitting the code without the licence, not the licence itself.
Now the only loser is the humans that still have to maintain the ugly code, and RMS can have his weaponized copyright and eat toejam too.
https://github.com/ibayer/CSparse/blob/master/Source/cs_gaxp...
Isn't that covered by:
"You grant us and our legal successors the right to store, archive, parse, and display Your Content... share it with other users..."
"GitHub Copilot Emits GPL. Codeium Does Not."
Why?
Still infringing.
Nice try.
Huh? GPL does have strings attached, but is consent one of them?
Seems like a thinly disguised ad
print(f'Hello, world')
And it auto completes all the time!