Trying to draw an equivalency between code and weights is [edited for temperament, I guess] not right. They are built from the source material supplied to an algorithm. Weights are data, not code.
Otherwise, everyone on the internet would be an author, and would have a say in the licensing of the weights.
By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense. Analogies will always fail, which is why "preferred form for making modifications" is the yardstick we use, not vague attempts at drawing analogies between completely different development processes.
> They are built from the source material supplied to an algorithm. Weights are data, not code.
As Lispers know well, code is data and data is code. You can't draw a line in the sand and definitively say that on this side of the line is just code and on that side is just data.
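To make the Lisp point concrete, here's a minimal sketch in Python (hypothetical, not from any real library): a "program" that is literally a nested list of data, executed by a tiny interpreter. Whether `program` is code or data depends entirely on whether something chooses to run it.

```python
# A tiny expression "program" represented purely as nested Python data.
def evaluate(expr):
    """Interpret a nested-list expression of the form [op, arg, arg, ...]."""
    if isinstance(expr, (int, float)):
        return expr  # literals evaluate to themselves
    op, *args = expr
    vals = [evaluate(a) for a in args]
    if op == "+":
        return sum(vals)
    if op == "*":
        result = 1
        for v in vals:
            result *= v
        return result
    raise ValueError(f"unknown op: {op}")

program = ["+", 1, ["*", 2, 3]]  # just a list of data...
print(evaluate(program))         # ...until an interpreter runs it -> 7
```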
In terms of how they behave, weights function as code that is executed by an interpreter that we call an inference engine.
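A rough sketch of that framing (plain Python, purely illustrative): the "engine" below is a generic loop that will execute whatever weights it's handed, and the behavior comes entirely from the numbers themselves.

```python
# Minimal "inference engine": a generic interpreter loop. The weights
# are pure data, yet they fully determine what the engine computes.
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def run(weights, x):
    """The same interpreter executes any set of weights."""
    for layer in weights:
        x = relu(matvec(layer, x))
    return x

# Two different "programs" for the same engine:
identityish = [[[1.0, 0.0], [0.0, 1.0]]]  # passes input through
swap        = [[[0.0, 1.0], [1.0, 0.0]]]  # swaps the two inputs
print(run(identityish, [2.0, 3.0]))  # [2.0, 3.0]
print(run(swap,        [2.0, 3.0]))  # [3.0, 2.0]
```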
I'm not comfortable with calling the resulting weights "open source", since people can't look at a set of weights and understand all of the components in the same way as actual source code. It's more like "freeware". You might be able to disassemble it with some work, but otherwise it's an incomprehensible thing you can run and have for free. I think it would be more appropriate to co-opt the term "open source" for weights generated from freely available material, because then there is no confusion whether the "source" is open.
And this is what I think everyone is actually dancing around: I suspect the insistence on publishing the training data has very little to do with a sense of purity around the definition of Open Source and everything to do with frustrations about copyright and intellectual property.
For that same reason, we won't see open source models by this definition any time soon: the legal questions around data usage are profoundly unsettled, and no company can afford to publicize the complete set of data it trained on until those questions are settled.
My personal ethic says that intellectual property is a cancer that sacrifices knowledge and curiosity on the altar of profit, so I'm not overly concerned about forcing companies to reveal where they got the data. If they're releasing the resulting weights under a free license (which, notably, Llama isn't) then that's good enough for me.
> By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense.
To me, the weights map to assembly, and the training data plus models map to source code plus compilers. Sure, you can hand me assembly, and with it I may be able to execute the model/program, but having it does not mean I can stare at it and learn from it, or modify it with a reasonable understanding of what's going to change.
I have to add that the situation feels even worse than assembly, because assembly, whether hand-coded or mangled by an optimizing compiler, still does something very specific and deterministic. The weights of a model, by contrast, make it feel like programming without booleans: you get seemingly random numbers, and you check inequalities against them to arrive at a binary decision.
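That "inequalities over random-looking numbers" point can be shown in a few lines (the weights here are made up for illustration; they're not from any real model). Nothing about the numbers tells you *why* the decision comes out the way it does:

```python
# A binary decision made not with booleans but with arbitrary-looking
# weights and a single inequality at the end. Weights are invented.
weights = [0.731, -1.284, 0.056]
bias = -0.412

def decide(features):
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score > 0.0  # the only boolean in the whole computation

print(decide([1.0, 0.2, 3.0]))  # True
print(decide([0.0, 1.0, 0.0]))  # False
```

Staring at `0.731` or `-1.284` tells you nothing about the decision boundary, which is the poster's point about weights being opaque in a way assembly isn't.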
In contrast, the weights are the preferred form for modification, even for the company that built it. They only very rarely start a brand new training run from scratch, and when they do so it's not to modify the existing work, it's to make a brand new work that builds on what they learned from the previous model.
If the company makes the form of the work that they themselves use as the primary artifact freely available, I'm not sure why we wouldn't call the work open.