For an LLM, that’s not the training data; it’s the model itself. You don’t change an LLM by going back to the training data, editing it, and re-running the training. You update the model itself with more training data.
You can’t even use the training code and the original training data to reproduce an existing model. Much of the training process is non-deterministic, so you’ll get a different result each time anyway.
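A toy sketch of why re-running training doesn’t reproduce a model. This isn’t how LLMs are trained, just a minimal stand-in: two runs of the same gradient-descent loop on the same data, differing only in their random initialization, end at different weights. (Real training has many more sources of non-determinism: data shuffling, dropout, parallelism, floating-point behavior on GPUs.)

```python
import random

def train_tiny_model(data, epochs=5, lr=0.1):
    """Toy SGD fitting y = w * x. The random initialization
    (like real training's many non-deterministic steps) means
    two runs rarely produce bit-identical weights."""
    w = random.uniform(-1.0, 1.0)  # non-deterministic starting point
    for _ in range(epochs):
        for x, y in data:
            # gradient of (w*x - y)^2 with respect to w
            w -= lr * 2 * (w * x - y) * x
    return w

# Identical data and code for both runs; the true weight is 2.0.
data = [(x / 10, 2 * x / 10) for x in range(1, 11)]
run_a = train_tiny_model(data)
run_b = train_tiny_model(data)
print(run_a, run_b)  # two different numbers, both near 2.0
```

Both runs converge toward the same neighborhood, but the exact parameters differ, and with billions of parameters and far noisier training, the gap between "same neighborhood" and "same model" is enormous.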
Another complication is that the object code for normal software is a clear derivative work of the source code: it’s a direct translation from one form to another. That isn’t the case with LLMs and their training data. The model learns from the data, but it isn’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data; it learns from it, it isn’t a copy of it. This is also a large part of why distributing the training data is often infeasible: the model’s creator may not hold a license to redistribute it.
Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.
I think we need new terminology for open AI models. We can’t simply reuse what works for human-editable code, because a model is a fundamentally different kind of thing, with different technical and legal constraints.