
spullara · 2mo ago

If the actual text of the code isn't the same or obviously derivative, copyright doesn't apply at all.

sigseg1v · 2mo ago

What does derivative mean here? Because IMO it means that the existing work was used as input. So if you used an LLM and it was trained on the existing work, that's a derivative work. If you rot13 encode something as input, so you can't personally read it, and then a device decides to apply rot13 to it again and output it, that's a derivative work.
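For readers unfamiliar with rot13, here is a minimal Python sketch of the round trip described above. The sample string and variable names are made up for illustration; the only library call is the standard `codecs` module's `rot_13` text transform. It shows that applying rot13 twice reproduces the input byte for byte, which is why the intermediate scrambling does not change what comes out at the end.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate each ASCII letter 13 places; other characters pass through."""
    return codecs.encode(text, "rot_13")


# Hypothetical stand-in for "the existing work".
original = "def detect(data): return 'utf-8'"

scrambled = rot13(original)   # unreadable at a glance
restored = rot13(scrambled)   # rot13 is its own inverse

print(scrambled)
print(restored == original)
```

Whether a less exact transformation still counts as derivative is the legal question the rest of the thread argues about; the sketch only illustrates the mechanical part.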

spullara (OP) · 2mo ago

For it to be creatively derivative, you would need to copy the structure, logic, organization, and sequence of operations, not just reimplement the functionality. It is pretty clear that wasn't done in this case.

cubefox · 2mo ago

It's not clear at all.

ghostpepper · 2mo ago

As a cynical person I assume all the frontier LLMs were trained on datasets that include every open source project, but as a thought experiment, if an LLM were trained on a dataset that included every open source project _except_ chardet, do you think said LLM would still be able to easily implement something very similar?

spullara (OP) · 2mo ago

There is no doubt in my mind that it could still do it.

nicole_express · 2mo ago

Of course, the problem with this interpretation is that all modern LLMs are derivatives of huge amounts of text under completely different licenses, including "All rights reserved", and therefore cannot be used for any purpose.

I'm not sure how you square the circle of "it's alright to use the LLM to write code, unless the code is a rewrite of an open source project to change its license".

JoshTriplett · 2mo ago

> Of course, the problem with this interpretation is that all modern LLMs are derivatives of huge amounts of text under completely different licenses, including "All rights reserved", and therefore cannot be used for any purpose.

> I'm not sure how you square the circle of "it's alright to use the LLM to write code

You seem like you're on the cusp of stating the obvious correct conclusion: it isn't.

satvikpendem · 2mo ago

> Because IMO it means that the existing work was used as input

That's your opinion (since you said "IMO"), not the actual legal definition.

bmcahren · 2mo ago

LLMs do not encode nor encrypt their training data. The fact they can recite training data is a defect not a default. You can understand this more simply by calculating the model size as an inverse of a fantasy compression algorithm that is 50% better than SOTA. You'll find you'd still be missing 80-90% of the training data even if it were as much of a stochastic parrot as you may be implying. The outputs of AI are not derivative just because they saw training data including the original library.
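The back-of-the-envelope calculation suggested above can be sketched as follows. Every number here is a hypothetical placeholder chosen only to show the shape of the arithmetic (model size, corpus size, and the "SOTA" compression ratio are all assumptions, not measurements), and "50% better than SOTA" is read as multiplying the ratio by 1.5.

```python
def recoverable_fraction(model_bytes: float, corpus_bytes: float,
                         compression_ratio: float) -> float:
    """Upper bound on the share of the corpus the weights could hold
    if they were nothing but a compressed archive of the training data."""
    return min(1.0, model_bytes * compression_ratio / corpus_bytes)


# All values below are illustrative assumptions.
MODEL_BYTES = 140e9          # e.g. 70B parameters at 2 bytes each
CORPUS_BYTES = 60e12         # e.g. 15T tokens at roughly 4 bytes each
SOTA_RATIO = 8.0             # assumed best-case text compression ratio
FANTASY_RATIO = SOTA_RATIO * 1.5  # "50% better than SOTA"

fraction = recoverable_fraction(MODEL_BYTES, CORPUS_BYTES, FANTASY_RATIO)
print(f"at most {fraction:.1%} of the corpus could be stored")
```

With these particular placeholders the bound lands well under the 10-20% the comment implies; different assumed sizes move the figure, but the model would need to be within a small multiple of the corpus size before full storage became possible.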

Then onto prompting: 'He fed only the API and (his) test suite to Claude'

This is Google v Oracle all over again - are APIs copyrightable?

thunderfork · 2mo ago

I find the "compression" argument not very strong, both because copyright still applies to (very) lossy codecs (e.g. your 16kbps Opus file of Thriller infringes, even if the original 192kHz/32-bit WAV file was 12,000kbps), and because copyright still applies to transformed derivative works (a tiny MIDI file of Thriller might still be enough for Jackson's label to get you).
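As a quick check on the bitrates quoted above, here is a small Python sketch of the uncompressed PCM arithmetic. The helper name is made up, and two-channel (stereo) audio is an assumption the comment does not state.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int,
                     channels: int = 2) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


source_kbps = pcm_bitrate_kbps(192_000, 32)  # stereo is assumed
opus_kbps = 16
ratio = source_kbps / opus_kbps

print(source_kbps)  # 12288.0, close to the quoted 12,000kbps
print(ratio)        # 768.0
```

So the lossy file in the example keeps roughly one bit in 768 of the source, yet is still recognizably the same recording.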

satvikpendem · 2mo ago

> This is Google v Oracle all over again - are APIs copyrightable?

Yes, this is the best way to ask the question. If I take a public-facing API and reimplement everything, whether by human or machine, that should be sufficient. After all, that's what Google did, and it's not like their engineers never read a single line of the Java source code. Even in "clean room" implementations, a human might still have remembered or recalled a previous implementation of some function they had encountered before.

azakai · 2mo ago

> LLMs do not encode nor encrypt their training data. The fact they can recite training data is a defect not a default.

About this specific point, it is unclear how much of a defect memorization actually is - there are also reasons to see it as necessary for effective learning. This link explains it well:

https://infinitefaculty.substack.com/p/memorization-vs-gener...

tw1984 · 2mo ago

> This is Google v Oracle all over again - are APIs copyrightable?

No, it is completely different.

Claude was trained on chardet, so anything built by Claude would fail the clean-room reimplementation test.

LegionMammal978 · 2mo ago

"The clean-room reimplementation test" isn't a legal standard; it's a particular strategy used by would-be defendants to clearly meet the standard of "is the new work free of copyrightable expression from the original work".

wizzwizz4 · 2mo ago

See also: https://monolith.sourceforge.net/, which seeks to ask the question:

> But how far away from direct and explicit representations do we have to go before copyright no longer applies?

yorwba · 2mo ago

Copyright protects even very abstract aspects of human creative expression, not just the specific form in which it is originally expressed. If you translate a book into another language, or turn it into a silent movie, none of the actual text may survive, but the story itself remains covered by the original copyright.

So when you clone the behavior of a program like chardet without referencing the original source code except by executing it to make sure your clone produces exactly the same output, you may still be infringing its copyright if that output reflects creative choices made in the design of chardet that aren't fully determined by the functional purpose of the program.

NSUserDefaults · 2mo ago

If you pirate a movie and reencode it, does that apply as well? You can still watch the movie and it is “obviously” the same movie, even though the bytes are completely different. Here you can use the program and it is, to the user, also the same.
