For an LLM, that’s not the training data; it’s the model itself. You don’t change an LLM by going back to the training data, editing it, and re-running the training. You update the model itself with more training data.
You can’t even use the training code and the original training data to reproduce an existing model. Much of the training process is non-deterministic, so you’ll get a different result each time anyway.
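A toy sketch of why re-running training doesn’t reproduce a model. This isn’t how LLMs are trained, just a minimal stand-in: two runs of the same gradient-descent loop on the same data, differing only in their random initialization, end at different weights. (Real training has many more sources of non-determinism: data shuffling, dropout, parallelism, floating-point behavior on GPUs.)

```python
import random

def train_tiny_model(data, epochs=5, lr=0.1):
    """Toy SGD fitting y = w * x. The random initialization
    (like real training's many non-deterministic steps) means
    two runs rarely produce bit-identical weights."""
    w = random.uniform(-1.0, 1.0)  # non-deterministic starting point
    for _ in range(epochs):
        for x, y in data:
            # gradient of (w*x - y)^2 with respect to w
            w -= lr * 2 * (w * x - y) * x
    return w

# Identical data and code for both runs; the true weight is 2.0.
data = [(x / 10, 2 * x / 10) for x in range(1, 11)]
run_a = train_tiny_model(data)
run_b = train_tiny_model(data)
print(run_a, run_b)  # two different numbers, both near 2.0
```

Both runs converge toward the same neighborhood, but the exact parameters differ, and with billions of parameters and far noisier training, the gap between "same neighborhood" and "same model" is enormous.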
Another complication is that the object code for normal software is a clear derivative work of the source code: it’s a direct translation from one form to another. That isn’t the case with LLMs and their training data. The model learns from the data, but it isn’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data; it learns from it, it isn’t a copy of it. This is also a large part of why distributing the training data is often infeasible: the model’s creator may not hold a license to redistribute it.
Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.
I think we need new terminology for open AI models. We can’t simply reuse what works for human-editable code, because a model is a fundamentally different kind of thing, with different technical and legal constraints.