undefined | Better HN

0 pointsdrvortex3y ago0 comments

Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.

And that is why this lawsuit is dead on arrival.

0 comments

klabb33y ago

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.

The AI models of today are very simple to explain: its a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it didn't have used the training data.

The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.

nickelpro3y ago

It's not something to dismiss but it is something that has already been addressed. Authors Guild v Google. Google Books is built upon scanning millions of books from libraries without first gaining permission from copyright holders, this was found to not be a violation of copyright.

Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.

TAForObvReasons3y ago

At the time the suit was launched, Google search would only display snippet views. The very nature presents the attribution to the user, enabling them to separately obtain a license for the content.

This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.

The fact that they are producing the entire relevant snippet, without attribution and in a way that does not necessitate referencing the source corpus, suggests the transgression is different. It is further amplified by the fact that the output itself is typically integrated in other copyrighted works.

1 more reply

klabb33y ago

> Authors Guild v Google. Google Books is built upon scanning millions of books from libraries

I agree it's relevant precedent, but not exactly the same. Libraries are a public good and more importantly Google books references the original works. In short, I don't think that's the final word in all seemingly related cases.

> More specifically, a computer consuming a copyright work is not a violation of copyright.

I don't agree with this way of describing technology, as if humans weren't responsible for operating and designing the technology. Law is concerned with humans and their actions. If you create an autonomous scraper that takes copyrighted works and distributes them, you are (morally) responsible for the act of distributing them, even if you didn't "handle" them or even see them yourself.

Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.

1 more reply

forgotpwd163y ago

>Authors Guild v Google

At which case Google paid some hundred million $ to companies and authors, created a registry collecting revenues and giving to rightsholders, provided opt-out to already scanned books, etc. Hey, doesn't sound that bad for same thing to happen with Copilot.

yenwodyah3y ago

But Copilot has been shown to distribute (parts of) the copyrighted works used to create it. That’s the difference.

1 more reply

xtracto3y ago

Say you publish a song and copyright it. Then I record it and save it in a .xz format. It's not an MP3, it is not an audio file. Say I split it into N several chunks and I share it with N different people. Or with the same people, but I share it at N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.

Am I violating your copyright? Are you entitled to do that?

To make it funnier: Say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I illegally violating your copyrights by sharing that?

[1] https://github.com/philipl/pifs

Aeolun3y ago

What you are actually giving people is a set of chords that happen to show up in your song, the machine can suggest an appropriate next chord.

It’s also smart enough to rebuild your song from the chords _if you ask it to_.

varajelle3y ago

I take your code and I compress it in a tar.gz file. Il call that file "the model". Then I ask an algorithm (Gzip) to infer some code using "the model". The algorithm (gzip) just learned how to code by reading your code. It just happened to have it memorized in its model.

1 more reply

BizarroLand3y ago

With the exception that there are infinite types of chords in this case, and even though many musicians follow familiar chord structures the underlying melodies and rhythms are unique enough for any familiar person to be able to differentiate "Red Hot Chill Peppers" from "All-American Rejects", and now there is a system where All-American Rejects hit a few buttons and a song is generated (using audio samples of "Under the Bridge") that sounds like "Under the Bridge pt 2, All-American Rejects Boogaloo".

That's why it's actionable and why there is meat on the bone for this case. The real issue is going to be if they can convince a jury that this software is just stealing code and whether its wrong if a robot does it.

1 more reply

2muchcoffeeman3y ago

I was thinking of something similar as a counter argument and lo and behold, it’s a real thing maths has solved with a real implementation.

obiefernandez3y ago

This analogy is flawed

andrewmcwatters3y ago

This is demonstrably false. It is a system outputting character-for-character repository code.[1]

[1]: https://news.ycombinator.com/item?id=33457517

adriand3y ago

If I use Photoshop to create an image that is identical to a registered trademark, is the rights violation my fault or Adobe’s fault?

xigoi3y ago

Photoshop can't produce copyrighted images on its own.

1 more reply

kyruzic3y ago

No because that's not a trademark violation in anyway. Using GPL code in a non GPL project is a violation of copyright law though.

Aeolun3y ago

Ok, cool. Presumably that is because it’s smart enough to know that there is only one (public) solution to the constraints you set (like asking it to reproduce licensed code).

Now, while you may be able to get it to reproduce one function. One file, and definitely the whole repository seems extremely unlikely.

pmarreck3y ago

It can be modified to not do that (example: mutating the code to a "synonym" that is functionally but not visually identical).

It can also be modified to be opt-in-only (only peoples' code that they permit to be learned on, can use the product)

falcolas3y ago

Perhaps you are right, and it could be so modified.

Could be, but isn’t. And that matters.

ImPostingOnHN3y ago

plagiarism with some words swapped is still plagiarism

Cort3z3y ago

Just to be clear; I cannot prove that they have used my code, but for the sake of argument, lets assume so.

They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.

Filligree3y ago

But there is no equivalent of "unzipping" for Copilot.

This is a generative neural network. It doesn't contain a copy of your code; it contains weightings that were slightly adjusted by your code. Getting it to output a literal copy is only possible in two cases:

- If your code solves a problem that can only be solved in a single way, for a given coding style / quality level. The AI will usually produce the same result, given the same input, and it's going to be an attempt at a solution. This isn't copyright violation.

- If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?

account423y ago

There is no guarantee that a ML network only produces the input data under those two conditions. But even for

> If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?

Replication is not a violation if the terms of the license are followed. Many open source projects are replicated hundreds of times with no license violation - that doesn't mean that you can now ignore the license.

But even if they did violate the license, that doesn't give you the right to do it too. There is no requirement to enforce copyright consistently - see e.g. mods for games which are more often than not redistributing copyrighted content and derivatives of it but usually don't run into trouble because they benefit the copyright owner. But try to make your own game based on that same content and the original publisher will not handle it in the same way as those mods. Same for OSS licenses: The original author does not lose any rights to sue you if they have ignored technical license violations by others when those uses are acceptable to the original author.

heavyset_go3y ago

Neutral nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.

You can easily see this happen, the regurgitation of training data, in an over fitted neural net.

CuriouslyC3y ago

This is not necessarily true, the function space defined by the hidden layers might not contain an exact duplicate of the original training input for all (or even most) of the training inputs. Things that are very well represented in the training data probably have a point in the function space that is "lossy compression" level close to the original training image though, not so much in terms of fidelity as in changes to minor details.

heavyset_go3y ago

When I say encoded or compressed, I do not mean verbatim copies. That can happen, but I wouldn't say it's likely for every piece of training data Copilot was trained on.

Pieces of that data are encoded/compressed/transformed, and given the right incantation, a neutral net can put them together to produce a piece of code that is substantially the same as the code it was trained on. Obviously not for every piece of code it was trained on, but there's enough to see this effect in action.

naikrovek3y ago

> which Copilot doesn't have a license to distribute

when you upload code to a public repository on github.com, you necessarily grant GitHub the right to host that code and serve it to other users. the methods used for serving are not specified. This is above and beyond the license specified by the license you choose for your own code.

you also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.

eropple3y ago

Host that code. Serve that code to other users. It does not grant the right to create derivative works of that code outside the purview of the code's license. That would be a non-starter in practice; see every repository with GPL code not written by the repository creator.

Whether the results of these programs is somehow Not A Derivative Work is the question at hand here, not "sharing". I think (and I hope) that the answer to that question won't go the way the AI folks want it to go; the amount of circumlocution needed to excuse that the not actually thinking and perceiving program is deriving data changes from its copyright-protected inputs is a tell that the folks pushing it know it's silly.

2 more replies

vharuck3y ago

The relevant part of GitHub's terms of service:

"4. License Grant to Us

We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program."

https://docs.github.com/en/site-policy/github-terms/github-t...

I don't think these terms allow using content for Copilot.

heavyset_go3y ago

It's served under the terms of my licenses when viewed on GitHub. Both attribution and licenses are shared.

This is like saying GitHub is free to do whatever they want with copyrighted code that's uploaded to their servers, even use it for profit while violating its licenses. According to this logic, Microsoft can distribute software products based on GPL code to users without making the source available to them in violation of the terms of the GPL. Given that Linux is hosted on GitHub, this logic would say that Microsoft is free to base their next version of Windows on Linux without adhering to the GPL and making their source code available to users, which is clearly a violation of the GPL. Copilot doing the same is no different.

LtWorf3y ago

Then github should make sure that people only upload stuff they are copyright owner of… which it has never done, warned about or tried to enforce.

vkou3y ago

> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.

So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?

Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.

The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.

I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.

civilized3y ago

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.

If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.

moralestapia3y ago

Whatever you say man :^)

https://twitter.com/docsparse/status/1581461734665367554

NicoleJO3y ago

You're wrong. See exposed code. https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...

lamontcg3y ago

> but snippets of it cannot

Yeah they can, and the whole functions that Copilot spits out are quite obviously covered by copyright.

> especially when they are used in a different context.

That doesn't matter.

ouid3y ago

it is essentially a weighted sum of your code and other copyright holders code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.

tevon3y ago

I agree.

If I read JRR Tolkien and then go and write a fantasy novel following a unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.

LtWorf3y ago

Games aren't even allowed to use the word "hobbit" without paying royalties. I'm sure you completely ignore what you're talking about.

Filligree3y ago

Hmm. Are you sure that's true?

j / k navigate · click thread line to collapse

0 comments

klabb33y ago

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

nickelpro3y ago

Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.

TAForObvReasons3y ago

At the time the suit was launched, Google search would only display snippet views. The very nature presents the attribution to the user, enabling them to separately obtain a license for the content.

This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.

1 more reply

klabb33y ago

> Authors Guild v Google. Google Books is built upon scanning millions of books from libraries

> More specifically, a computer consuming a copyright work is not a violation of copyright.

Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.

1 more reply

forgotpwd163y ago

>Authors Guild v Google

yenwodyah3y ago

But Copilot has been shown to distribute (parts of) the copyrighted works used to create it. That’s the difference.

1 more reply

xtracto3y ago

Am I violating your copyright? Are you entitled to do that?

[1] https://github.com/philipl/pifs

Aeolun3y ago

What you are actually giving people is a set of chords that happen to show up in your song, the machine can suggest an appropriate next chord.

It’s also smart enough to rebuild your song from the chords _if you ask it to_.

varajelle3y ago

1 more reply

BizarroLand3y ago

1 more reply

2muchcoffeeman3y ago

I was thinking of something similar as a counter argument and lo and behold, it’s a real thing maths has solved with a real implementation.

obiefernandez3y ago

This analogy is flawed

andrewmcwatters3y ago

This is demonstrably false. It is a system outputting character-for-character repository code.[1]

[1]: https://news.ycombinator.com/item?id=33457517

adriand3y ago

If I use Photoshop to create an image that is identical to a registered trademark, is the rights violation my fault or Adobe’s fault?

xigoi3y ago

Photoshop can't produce copyrighted images on its own.

1 more reply

kyruzic3y ago

No because that's not a trademark violation in anyway. Using GPL code in a non GPL project is a violation of copyright law though.

Aeolun3y ago

Ok, cool. Presumably that is because it’s smart enough to know that there is only one (public) solution to the constraints you set (like asking it to reproduce licensed code).

Now, while you may be able to get it to reproduce one function. One file, and definitely the whole repository seems extremely unlikely.

pmarreck3y ago

It can be modified to not do that (example: mutating the code to a "synonym" that is functionally but not visually identical).

It can also be modified to be opt-in-only (only peoples' code that they permit to be learned on, can use the product)

falcolas3y ago

Perhaps you are right, and it could be so modified.

Could be, but isn’t. And that matters.

ImPostingOnHN3y ago

plagiarism with some words swapped is still plagiarism

Cort3z3y ago

Just to be clear; I cannot prove that they have used my code, but for the sake of argument, lets assume so.

Filligree3y ago

But there is no equivalent of "unzipping" for Copilot.

account423y ago

There is no guarantee that a ML network only produces the input data under those two conditions. But even for

heavyset_go3y ago

You can easily see this happen, the regurgitation of training data, in an over fitted neural net.

CuriouslyC3y ago

heavyset_go3y ago

When I say encoded or compressed, I do not mean verbatim copies. That can happen, but I wouldn't say it's likely for every piece of training data Copilot was trained on.

naikrovek3y ago

> which Copilot doesn't have a license to distribute

you also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.

eropple3y ago

2 more replies

vharuck3y ago

The relevant part of GitHub's terms of service:

"4. License Grant to Us

https://docs.github.com/en/site-policy/github-terms/github-t...

I don't think these terms allow using content for Copilot.

heavyset_go3y ago

It's served under the terms of my licenses when viewed on GitHub. Both attribution and licenses are shared.

LtWorf3y ago

Then github should make sure that people only upload stuff they are copyright owner of… which it has never done, warned about or tried to enforce.

vkou3y ago

So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?

I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.

civilized3y ago

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

moralestapia3y ago

Whatever you say man :^)

https://twitter.com/docsparse/status/1581461734665367554

NicoleJO3y ago

You're wrong. See exposed code. https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...

lamontcg3y ago

> but snippets of it cannot

Yeah they can, and the whole functions that Copilot spits out are quite obviously covered by copyright.

> especially when they are used in a different context.

That doesn't matter.

ouid3y ago

it is essentially a weighted sum of your code and other copyright holders code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.

tevon3y ago

I agree.

LtWorf3y ago

Games aren't even allowed to use the word "hobbit" without paying royalties. I'm sure you completely ignore what you're talking about.

Filligree3y ago

Hmm. Are you sure that's true?

j / k navigate · click thread line to collapse