candiddevmike 1y ago

None of Meta's models are "open source" in the FOSS sense, even the latest Llama 3.1. The license is restrictive. And no one has bothered to release their training data either.

This post is an ad and trying to paint these things as something they aren't.

JumpCrisscross 1y ago

> no one has bothered to release their training data

If the FOSS community sets this as the benchmark for open source in respect of AI, they're going to lose control of the term. In most jurisdictions it would be illegal for the likes of Meta to release training data.

mesebrec 1y ago

Regardless of the training data, the license even heavily restricts how you can use the model.

Please read through their "acceptable use" policy before you decide whether this is really in line with open source.

JumpCrisscross 1y ago

> Please read through their "acceptable use" policy before you decide whether this is really in line with open source

I'm not taking a specific position on this license. I haven't read it closely. My broad point is simply that open source AI, as a term, cannot practically require the training data be made available.

guitarlimeo 1y ago

> In most jurisdictions it would be illegal for the likes of Meta to release training data.

How come releasing an LLM trained on that data is not illegal then? I think it should be.

exe34 1y ago

the training data is the source.

JimDabell 1y ago

I don’t think it’s that simple. The source is “the preferred form of the work for making modifications to it” (to use the GPL’s wording).

For an LLM, that’s not the training data. That’s the model itself. You don’t make changes to an LLM by going back to the training data and making changes to it, then re-running the training. You update the model itself with more training data.

You can’t even use the training code and original training data to reproduce the existing model. A lot of it is non-deterministic, so you’ll get different results each time anyway.

Another complication is that the object code for normal software is a clear derivative work of the source code. It’s a direct translation from one form to another. This isn’t the case with LLMs and their training data. The models learn from it, but they aren’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data. It learns from it, it isn’t a copy of it. This is mostly the reason why distributing training data is infeasible – the model’s creator may not have the license to do so.

Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.

I think new terminology is needed for open AI models. We can’t simply re-use what works for human-editable code because it’s a fundamentally different type of thing with different technical and legal constraints.

JumpCrisscross 1y ago

> the training data is the source

Sure. But that's not going to be released. The term open source AI cannot be expected to cover it because it's not practical.

sangnoir 1y ago

We've had a similar debate before, but the last time it was about whether Linux device drivers based on non-public datasheets under NDA were actually open source. This debate occurred again over drivers that interact with binary blobs.

I disagree with the purists - if you can legally change the source or weights - even without having access to the data used by the upstream authors - it's open enough for me. YMMV.

root_axis 1y ago

No. It's an asset used in the training process; the source code can process arbitrary training data.

wrs 1y ago

I don’t think even that is true. I conjecture that Facebook couldn’t reproduce the model weights if they started over with the same training data, because I doubt such a huge training run is a reproducible deterministic process. I don’t think anyone has “the” source.
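
One concrete reason a large training run can fail to be bit-for-bit reproducible is that floating-point addition is not associative, so the order in which partial results get combined changes the answer. The snippet below is only a minimal sketch of that effect in plain Python; the helper name `sum_in_order` is made up for illustration, and it says nothing about how any particular model was actually trained.

```python
def sum_in_order(values):
    """Add the values left to right, the way a sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two different orders.
# 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost in the first ordering.
big_first = sum_in_order([1e16, 1.0, -1e16])
cancel_first = sum_in_order([1e16, -1e16, 1.0])

print(big_first)     # 0.0
print(cancel_first)  # 1.0

# The classic small-scale version of the same effect.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

In a parallel reduction the combination order can depend on scheduling, so the same inputs may produce slightly different sums from run to run, and those differences can compound over many update steps.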

blackeyeblitzar 1y ago

AI2 has released training data in their OLMo model: https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
