Can't believe it's the second time I end up with the very same argument about what open source is today on HN.
Similar to you publish the source for Oracle (the database), but nobody can build a binary from it because it needs magic compliers or test suites that aren't open source?
Heck when the browser was open-sourced, there was an explicit test where the source was given to some dude who didn't work for Netscape to verify that he could actually make a working binary. It's a scene in the movie "Code Rush".
It’s like if I said I open-sourced the Matrix trilogy and only gave you the DVD image and the source to the DVD decoder.
(Edit: Sorry, I replied to the wrong comment. I’m talking primarily about the typical sort of release we see, not this one which is a lot closer to actually open.)
What's missing?
I can simplify the task, can you convincingly explain how the same model can be produced from this dataset? We can start simple, how you can possibly get the same weights after the first single iteration? I.e. the same as original model got. Pay attention to randomness, data selection, initial model state.
Ok, if you can't do that. Can you explain in believable way how to prove that given model was trained on give dataset? I'm not asking you for actually doing all these things, that could be expensive, only to explain how it can be done.
Strict 'open source' includes not only open weights, open data. It also includes the word "reproducible". It's not "reproduced", only "reproducible". And even this is not the case here.
As others have mentioned, reproducible builds can be quite difficult to achieve even with regular software.
Compiler versions, build system versions, system library versions, time stamps, file paths, and more often contribute to getting non-identical yet functionally equivalent binaries, but the software is still open source.
It's an unreasonable request with ordinary code, even more with ML where very few ones have access to the necessary hardware, and where in practice, it is not deterministic.
_Always_, with the right options. And that's the key point. If distributed code is different it means it may be infected or altered in other way. In other words it cannot be trusted.
The same with models, if they are not reproducible or verifiable they cannot be trusted. Trust is the main feature of open source. Calling black box with attached data 'open source', even 'the first' is a bit of a stretch. It's not reproducible and not verifiable. And it's definitely not the first model with open data.
To be correct you should add 'untrusted' if you want to call this thing 'open source'. Like with Meta's models who knows what it holds.
PS: finally I'm negative, fanboys don't like it ;-)
Yes, there is a random element when "producing the binary" but that doesn't mean it isn't open source.