locknitpicker 13d ago

> (...) I'm also remembering how GitHub used "all open repositories" to train their first Copilot without telling anyone.

This is a silly opinion to hold, isn't it? I mean, you release projects under a license with the express purpose of freely distributing your code among anyone in the world that may have any interest whatsoever, and even allow them to share it with anyone they see fit. But you are somehow outraged if people actually use said code?

Please make it make sense.

lelanthran 13d ago

> This is a silly opinion to hold, isn't it? I mean, you release projects under a license with the express purpose of freely distributing your code among anyone in the world that may have any interest whatsoever, and even allow them to share it with anyone they see fit. But you are somehow outraged if people actually use said code?

You're making things up: the outrage is not that people used it, it's that the licence requires attribution at least, and opening the derivative product at worst. Token providers that trained on open source did neither.

> Please make it make sense.

I am skeptical that you didn't know the reason for the outrage because it's been repeated in every single thread where this was discussed.

I myself repeated it multiple times each time this feigned confusion you display appears.

Like I am doing now, yet again.

chasd00 12d ago

idk, all the code i've seen produced by an llm doesn't appear to be derived from anything. Also, the source code they were trained on does not exist in the model, it's impossible for the llm to return a code snippet from some other code base. The code snippet doesn't exist in the model in the first place. I guess another way to put it is show your code in the output of an llm that isn't being attributed correctly.

mhitza 12d ago

At least on GitHub there was a special flag to exclude code that matches publicly available source code. Thus the chance is higher than 0. Which matches my experience last year when multiple Copilot chats got redacted for that reason.

lelanthran 12d ago

> Also, the source code they were trained on does not exist in the model, it's impossible for the llm to return a code snippet from some other code base.

So? Just because a piece of output data is encrypted or compressed and does not resemble the input, does not mean that the process did not take the input.

We have decades of law that regards zipped files as infringement, lossy compression (MP3s) as infringement, etc.

> guess another way to put it is show your code in the output of an llm that isn't being attributed correctly.

Well, a better way of putting it is answering the question "Would that model have existed had none of the code used as input existed?"

IOW, can that model be generated or created without first having all that copyrighted code used as input?

dylan604 13d ago

Because there's no way the code is distributed properly according to any of the OSS licenses. In fact, it claims authorship with nonsense bylines saying the LLM wrote it.

rpdillon 13d ago

The key issue is whether the training is considered to be fair use, but this can only be determined in court. We have some preliminary indications that it definitely can be, but also may not be, depending on four factors, predominantly the first and fourth (how transformative the use is, and how it affects the market for the original works).

National Law Review covered some of those nuances last year: https://natlawreview.com/article/federal-courts-issue-first-...

US Copyright Office has a substantial document discussing each of the four factors, and making it clear this is an unanswered question, and details of the particular case will decide which way courts go. It is a prepublication version, and it's over 100 pages, but it covers the issues well, citing arguments on all sides.

https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

account42 8d ago

No, the key issue is whether the training is socially acceptable and sustainable. Court decisions based on pre-existing laws are only a small part of that discussion.

locknitpicker (OP) 13d ago

> Because there's no way the code is distributed properly according to any of the OSS licenses.

What are you talking about? There is no distribution, only read access.

evanelias 12d ago

Reading means downloading. Downloading is equivalent to making a copy. To make a copy of a copyrighted work, you need a license, unless your activity is fair use. Licenses have terms and conditions that must be followed, such as retaining attribution in all derivative works.

That said, FOSS licenses are non-exclusive. Regarding the original upthread topic of GitHub's Copilot training, iirc GitHub's terms and conditions involve granting them a license in order to host your code. Depending on what else is in those terms, they may have had the ability to use all hosted code for LLM training through that license, instead of the FOSS licensing on any given Open Source repo. But that would only apply to GitHub/Microsoft, not third party scrapers.

CyberDildonics 12d ago

What is the difference?
