
0 points · talldayo · 1y ago · 0 comments

I will steelman the idea that a tokenizer and weights are all you need for the "source" of an LLM. They are components that can be modified and redistributed and that, when put together, reproduce the full experience intended.

If we insist upon the release of training data with Open models, you might as well kiss the idea of usable Open LLMs goodbye. Most of the content in training datasets like The Pile is not licensed for redistribution in any way, shape, or form. It would jeopardize projects that do use transparent training data while not offering anything of value to the community compared to the training code. Republishing all training data is an absolute trap.


enriquto · 1y ago

> Most of the content in training datasets like The Pile is not licensed for redistribution in any way, shape, or form.

But distributing the weights is a "form" of distribution. You can recover many items of the dataset (most easily, the outliers) by using the weights.

Just because they are encoded in a way that is not readily accessible does not mean that you are not distributing them.

It's scary to think that "training" is becoming a thinly veiled way to strip copyright from works.

talldayo (OP) · 1y ago

The weights are a transformed, lossy, and incomplete permutation of the training material. You cannot reliably recover most of the dataset, which is what stops a model from being an outright replacement for the works it's trained on.

> does not mean that you are not distributing them.

Except you literally aren't distributing them. It's like accusing me of pirating a movie because I sent a screenshot or a scene description to my friend.

> It's scary to think that "training" is becoming a thinly veiled way to strip copyright from works.

This is the way it's been for years. Google was granted Fair Use for redistributing incomplete parts of copyrighted text materials verbatim, since its application is transformative: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

Or Corellium, who won their case to use copyrighted Apple code in novel and transformative ways: https://www.forbes.com/sites/thomasbrewster/2023/12/14/apple...
