undefined | Better HN

0 pointswrs1y ago0 comments

Calling that “open source” renders the word “source” meaningless. By your definition, I can release a binary executable freely and call it “open source” because you can modify it to do whatever you want.

Model weights are like a binary that nobody has the source for. We need another term.

0 comments

sailingparrot1y ago

No it’s not the same as releasing a binary, feels like we can’t get out of the pedantics. I can in theory modify a binary to do whatever I want. In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute.

Here, modifying that model is not harder that doing regular ML, and I can redistribute.

Meta doesn’t have access to some magic higher level abstraction for that model that would make working with it easier that they did not release.

The sources in ML are the architecture the training and inference code and a paper describing the training procedure. It’s all there.

the8thbit1y ago

"In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute."

It depends on the binary and the license the binary is released under. If the binary is released to the public domain, for example, you are free to make whatever modifications you wish. And there are plenty of licenses like this, that allow closed source software to be used as the user wishes. That doesn't make it open source.

Likewise, there are plenty of closed source projects who's binaries we can poke and prod with much higher understanding of what our changes are actually doing than we're able to get when we poke and prod LLMs. If you want to make a Pokemon Red/Blue or Minecraft mod you have a lot of tools at your disposal.

A project that only exists as a binary which the copyright holder has relinquished rights to, or has released under some similar permissive closed source license, but people have poked around enough to figure out how to modify certain parts of the binary with some degree of predictability is a more apt analogy. Especially if the original author has lost the source code, as there is no source code the speak of when discussing these models.

I would not call that binary "open source", because the source would, in fact, not be open.

wrsOP1y ago

Can you change the tokenizer? No, because all you have is the weights trained with the current tokenizer. Therefore, by any normal definition, you don’t have the source. You have a giant black box of numbers with no ability to reproduce it.

sailingparrot1y ago

> Can you change the tokenizer?

Yes.

You can change it however you like, then look at the paper [1] under section 3.2. to know which hyperparameters were used during training and finetune the model to work with your new tokenizer using e.g. FineWeb [2] dataset.

You'll need to do only a fraction of the training you would have needed to do if you were to start a training from scratch for your tokenizer of choice. The weights released by Meta give you a massive head start and cost saving.

The fact that it's not trivial to do and out of reach of most consumer is not a matter of openness. That's just how ML is today.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...

[2]: https://huggingface.co/datasets/HuggingFaceFW/fineweb

1 more reply

jononor1y ago

You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest. One also has to come up with suite le datasets, from scratch. Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

sailingparrot1y ago

> You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest.

Just like open source?

> Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

The entire point of having the pre-trained weight released is to *not* have to do this. You just need to finetune, which can be done with very little data, depending on the task, and many open source toolkits, that work with those weights, exist to make this trivial.

1 more reply

j / k navigate · click thread line to collapse

0 comments

sailingparrot1y ago

Here, modifying that model is not harder that doing regular ML, and I can redistribute.

Meta doesn’t have access to some magic higher level abstraction for that model that would make working with it easier that they did not release.

The sources in ML are the architecture the training and inference code and a paper describing the training procedure. It’s all there.

the8thbit1y ago

"In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute."

I would not call that binary "open source", because the source would, in fact, not be open.

wrsOP1y ago

sailingparrot1y ago

> Can you change the tokenizer?

Yes.

The fact that it's not trivial to do and out of reach of most consumer is not a matter of openness. That's just how ML is today.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...

[2]: https://huggingface.co/datasets/HuggingFaceFW/fineweb

1 more reply

jononor1y ago

sailingparrot1y ago

> You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest.

Just like open source?

> Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

1 more reply

j / k navigate · click thread line to collapse