This post is an ad and trying to paint these things as something they aren't.
If the FOSS community sets this as the benchmark for open source in respect of AI, they're going to lose control of the term. In most jurisdictions it would be illegal for the likes of Meta to release training data.
Please read through their "acceptable use" policy before you decide whether this is really in line with open source.
I'm not taking a specific posiion on this license. I haven't read it closely. My broad point is simply that open source AI, as a term, cannot practically require the training data be made available.
How come releasing an LLM trained on that data is not illegal then? I think it should be.
For an LLM, that’s not the training data. That’s the model itself. You don’t make changes to an LLM by going back to the training data and making changes to it, then re-running the training. You update the model itself with more training data.
You can’t even use the training code and original training data to reproduce the existing model. A lot of it is non-deterministic, so you’ll get different results each time anyway.
Another complication is that the object code for normal software is a clear derivative work of the source code. It’s a direct translation from one form to another. This isn’t the case with LLMs and their training data. The models learn from it, but they aren’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data. It learns from it, it isn’t a copy of it. This is mostly the reason why distributing training data is infeasible – the model’s creator may not have the license to do so.
Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.
I think new terminology is needed for open AI models. We can’t simply re-use what works for human-editable code because it’s a fundamentally different type of thing with different technical and legal constraints.
Sure. But that's not going to be released. The term open source AI cannot be expected to cover it because it's not practical.
I disagree with the purists - if you can legally change the source or weights - even without having access to the data used by the upstream authors - it's open enough for me. YMMV.