
diggan 1y ago

That's... Not how open source works? The "binary" (model weights) is open source and the "software" (training scripts + data used for training) is open source, this release is a real open source release. Independent reproduction is not needed to call something open source.

Can't believe it's the second time I end up with the very same argument about what open source is today on HN.


dboreham 1y ago

But wouldn't failure to achieve independent reproduction falsify the open claim?

Similar to if you publish the source for Oracle (the database), but nobody can build a binary from it because it needs magic compilers or test suites that aren't open source?

Heck when the browser was open-sourced, there was an explicit test where the source was given to some dude who didn't work for Netscape to verify that he could actually make a working binary. It's a scene in the movie "Code Rush".

wrs 1y ago

The interesting part of the product we’re talking about (that is, the equivalent of the executable binary of an ordinary software product) is the weights. The “source” is not sufficient to “recompile” the product (i.e., recreate the weights). Therefore, while the source you got is open, you didn’t get all the source to the thing that was supposedly “open source”.

It’s like if I said I open-sourced the Matrix trilogy and only gave you the DVD image and the source to the DVD decoder.

(Edit: Sorry, I replied to the wrong comment. I’m talking primarily about the typical sort of release we see, not this one which is a lot closer to actually open.)

littlestymaar 1y ago

> The “source” is not sufficient to “recompile” the product (i.e., recreate the weights). Therefore, while the source you got is open, you didn’t get all the source to the thing that was supposedly “open source”.

What's missing?

wrs 1y ago

Well, I’m not experienced in training full-sized LLMs, and it’s conceivable that in this particular case the training process is simple enough that nothing is missing. That would be a rarity, though. But see my edit above — I’m not actually reacting to this release when I say that.


bubaumba 1y ago

You are missing key points here. "Reproduce" means produce the same thing, not just train a similar model.

I can simplify the task: can you convincingly explain how the same model can be produced from this dataset? We can start simple: how could you possibly get the same weights after the first single iteration, i.e. the same ones the original model got? Pay attention to randomness, data selection, and initial model state.

OK, if you can't do that, can you explain in a believable way how to prove that a given model was trained on a given dataset? I'm not asking you to actually do all these things, which could be expensive, only to explain how it could be done.

Strict 'open source' includes not only open weights, open data. It also includes the word "reproducible". It's not "reproduced", only "reproducible". And even this is not the case here.
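To make the three factors named above concrete (randomness, data selection, initial model state), here is a toy sketch. It is not the training code of the release under discussion; `train_one_step`, the one-weight model, and the dataset are all hypothetical. It only shows that when every random choice flows from one pinned seed, a single step is repeatable, and that changing the seed changes the result. Real large-scale training has further sources of nondeterminism that this does not model.

```python
import random


def train_one_step(seed, data, lr=0.1):
    """Run one SGD step on a one-weight linear model y = w * x (toy example)."""
    rng = random.Random(seed)      # one seed drives every random choice below
    w = rng.uniform(-1.0, 1.0)     # initial model state
    x, y = rng.choice(data)        # data selection for this step
    grad = 2.0 * (w * x - y) * x   # derivative of the squared error (w*x - y)**2
    return w - lr * grad           # updated weight


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Same seed, same data: the updated weight is identical.
same_a = train_one_step(seed=0, data=data)
same_b = train_one_step(seed=0, data=data)

# Different seeds: the updated weights diverge.
results = {train_one_step(seed=s, data=data) for s in range(20)}
```

Under these assumptions, getting "the same weights after the first iteration" comes down to publishing the seed along with the data and code, at least for a single-threaded CPU toy like this one.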

Sayrus 1y ago

Reproducible builds are not a requirement for open source software, why is it one for open source models?

wrs 1y ago

I would say that functionally reproducible builds are sort of inherent in the concept of “source”. When builds are “not reproducible” that typically just means they’re not bit-for-bit identical, not that they don’t produce the same output for a given input.


Zamiel_Snawley 1y ago

If they provide the training code, and data set, how is that not enough to reproduce functionally equivalent weights? I don’t have any experience in the AI field, what else would they need to provide/define?

As others have mentioned, reproducible builds can be quite difficult to achieve even with regular software.

Compiler versions, build system versions, system library versions, time stamps, file paths, and more often contribute to getting non-identical yet functionally equivalent binaries, but the software is still open source.
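As a small illustration of the timestamp point, the sketch below packs the same files into a zip archive twice with different entry timestamps, then twice with a pinned timestamp. The helper names (`build_archive`, `digest`, `contents`) and the file set are made up for the example; a zip archive merely stands in for a compiled binary here.

```python
import hashlib
import io
import zipfile


def build_archive(files, timestamp):
    """Pack files into an in-memory zip, stamping every entry with `timestamp`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name in sorted(files):  # fixed order, so ordering is not a variable
            info = zipfile.ZipInfo(name, date_time=timestamp)
            zf.writestr(info, files[name])
    return buf.getvalue()


def digest(blob):
    """SHA-256 of the raw archive bytes."""
    return hashlib.sha256(blob).hexdigest()


def contents(blob):
    """Map of entry name to entry bytes, ignoring archive metadata."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


files = {"main.py": b"print('hello')\n", "lib.py": b"X = 1\n"}

# Two "builds" on different days: different bytes, same contents.
monday = build_archive(files, (2024, 1, 1, 12, 0, 0))
tuesday = build_archive(files, (2024, 1, 2, 12, 0, 0))

# Two "builds" with a pinned timestamp: byte-identical.
pinned_a = build_archive(files, (1980, 1, 1, 0, 0, 0))
pinned_b = build_archive(files, (1980, 1, 1, 0, 0, 0))
```

The first pair is the "non-identical yet functionally equivalent" case: the hashes differ while the unpacked contents match. The second pair shows the kind of pinning that reproducible-build setups rely on.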

worewood 1y ago

How often do people expect to compile open-source code and get _exactly_ the same binary as the distributed one? I've seen this kind of restriction only on decompilation projects e.g. the SM64 decompilation -- where they deliberately compare the hashes of original vs. compiled binaries, as a way to verify the decompilation is correct.

It's an unreasonable request with ordinary code, and even more so with ML, where very few people have access to the necessary hardware and where, in practice, training is not deterministic.
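One commonly cited reason for that nondeterminism is that floating-point addition is not associative, so a parallel reduction that accumulates values in a different order can return a slightly different result. The sketch below shows only the arithmetic effect on a CPU; it assumes nothing about any particular framework or GPU, and `sequential_sum` is a hypothetical helper written as an explicit loop so that the accumulation order is exactly what the code says.

```python
def sequential_sum(values):
    """Add values strictly left to right, with no compensation tricks."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two groupings: the results differ in the last bit.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# Same three numbers, two orders: the small term is lost in one of them.
lost = sequential_sum([1e16, 1.0, -1e16])   # 1.0 is absorbed by 1e16 first
kept = sequential_sum([1e16, -1e16, 1.0])   # the large terms cancel first
```

Here `left` and `right` are not equal, `lost` is 0.0 and `kept` is 1.0. Differences of this kind are tiny per operation, but across a long training run they are enough to keep two runs from being bit-identical.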

bubaumba 1y ago

> How often do people expect to compile open-source code and get _exactly_ the same binary

_Always_, with the right options. And that's the key point. If the distributed code is different, it means it may be infected or altered in some other way. In other words, it cannot be trusted.

The same goes for models: if they are not reproducible or verifiable, they cannot be trusted. Trust is the main feature of open source. Calling a black box with attached data 'open source', even 'the first', is a bit of a stretch. It's not reproducible and not verifiable. And it's definitely not the first model with open data.

To be correct you should add 'untrusted' if you want to call this thing 'open source'. As with Meta's models, who knows what it holds.

PS: finally I'm negative, fanboys don't like it ;-)

e12e 1y ago

I expect that if I compile your 3d renderer, and feed it the same scene file you did - I get the same image?


bavell 1y ago

I think you are erroneously conflating open source with deterministic builds.

Yes, there is a random element when "producing the binary" but that doesn't mean it isn't open source.
