The weights are, for all practical purposes, source code in their own right. The GPL defines "source code" as "the preferred form of the work for making modifications to it". Almost no one would be capable of reproducing them even if given the source + data. At the same time, the weights are exactly what you need for the one type of modification that's within reach of most people: fine-tuning. That they didn't release the surrounding code that produced this "source" isn't much different from a company releasing a library but not their whole software stack.
I'd argue that "source" vs "weights" is a dangerous distraction from the far more insidious word in "open source" when used to refer to the Llama license: "open".
The Llama 3.1 license [0] specifically forbids its use by very large organizations, by militaries, and by nuclear industries. It also contains a long list of forbidden use cases. This specific list sounds very reasonable to me on its face, but having a list of specific groups of people or fields of endeavor who are banned from participating runs counter to the spirit of open source and opens up the possibility that new "open" licenses come out with different lists of forbidden uses that sound less reasonable.
To be clear, I'm totally fine with them having those terms in their license, but I'm uncomfortable with setting the precedent of embracing the word "open" for it.
Llama is "nearly-open source". That's good enough for me to be able to use it for what I want, but the word "open" is the one that should be called out. "Source" is fine.
[0] https://github.com/meta-llama/llama-models/blob/main/models/...
Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable. That this is all anyone except the model vendor is able to do doesn't merit calling the models "open source", much like no one would call binary-only software "open source" just because reverse engineering is a thing.
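To make the analogy concrete, here's a minimal sketch of "taking a hex editor to the executable": patching a byte pattern in a compiled artifact without any source code. The file name and byte values are made up for illustration; real trainers do the same thing against live process memory rather than a file on disk.

```python
# Hypothetical sketch: patch a binary artifact in place, the way a hex
# editor or game trainer would, with no source code involved at all.
from pathlib import Path

def patch_binary(path, old: bytes, new: bytes) -> bool:
    """Replace the first occurrence of `old` with `new` in a binary file."""
    assert len(old) == len(new), "patch must preserve the file layout"
    data = Path(path).read_bytes()
    offset = data.find(old)
    if offset == -1:
        return False
    Path(path).write_bytes(data[:offset] + new + data[offset + len(old):])
    return True

# Demo on a stand-in "binary": we can flip a conditional jump (JE -> JMP)
# and change its behavior, yet we still have nothing to read or recompile.
Path("demo.bin").write_bytes(b"\x90\x90\x74\x05\x90")   # 0x74 = JE
patch_binary("demo.bin", b"\x74\x05", b"\xeb\x05")       # 0xEB = JMP
print(Path("demo.bin").read_bytes().hex())               # -> 9090eb0590
```

You can modify the artifact this way, sometimes quite effectively, but nobody would call the patched program "open source" on that basis.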
No, the weights are just artifacts. The source is the dataset and the training code (and possibly the training parameters). This isn't fundamentally different from running an advanced solver for a year to find a way to make your program 100 bytes smaller so it can fit on a Tamagotchi. The resulting binary is magic, and can't be reproduced without spending $$$$ on compute for the solver, but it is not open source. The source code is the bit that (produced the original binary that) went into the optimizer.
Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.
--
[0] - https://en.wikipedia.org/wiki/DLL_injection
[1] - https://en.wikipedia.org/wiki/Trainer_(games) - a type of program popular some 20 years ago, used to cheat at, or mod, single-player games by keeping track of and directly modifying the memory of the game process. Could be as simple as continuously resetting the ammo counter, or as complex as injecting assembly to add new UI elements.
No, because fine-tuning is basically just a continuation of the same process that the original creators used to produce the weights in the first place, in the same way that modifying source code directly is in traditional open source. You pick up where they left off with new data and train it a little bit (or a lot!) more to adapt it to your use case.
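A toy illustration of that claim (a one-parameter linear model, not an LLM, and all numbers here are made up): "fine-tuning" literally resumes the same gradient-descent loop that produced the released weights, just on new data.

```python
# Toy model y = w*x trained by gradient descent on mean squared error.
# "Pretraining" and "fine-tuning" are the *same* loop; the only
# difference is the starting weight and the dataset.
import numpy as np

def train(w, xs, ys, lr=0.1, steps=200):
    """Run gradient descent on MSE for a one-parameter model y = w*x."""
    for _ in range(steps):
        grad = np.mean(2 * xs * (w * xs - ys))  # d/dw of mean((w*x - y)^2)
        w -= lr * grad
    return w

# "Pretraining": the original authors fit w on their data, ship only w.
xs_pre, ys_pre = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
w_released = train(0.0, xs_pre, ys_pre)       # converges near w = 2

# "Fine-tuning": we start from the released w and keep training on ours.
xs_ft, ys_ft = np.array([1.0, 2.0]), np.array([3.0, 6.0])
w_tuned = train(w_released, xs_ft, ys_ft)     # drifts toward w = 3
print(round(float(w_released), 2), round(float(w_tuned), 2))  # -> 2.0 3.0
```

The fine-tuner never needed the pretraining dataset, only the weights, which is exactly the GPL's "preferred form for making modifications" argument in miniature.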
The weights themselves are the computer program. There exists no corresponding source code. The code you're asking for corresponds not to the source code of a traditional program but to the programmers themselves and the processes used to write the code. Demanding the source code and data that produced the weights is equivalent to demanding a detailed engineering log documenting the process of building the library before you'll accept it as open source.
Just because you can't read it doesn't make it not source code. Once you have the weights, you are perfectly capable of modifying them following essentially the same processes the original authors did, which are well known and well documented in plenty of places with or without the actual source code that implements that process.
> Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.
I agree wholeheartedly, but not because of "source". The sleight of hand is getting people to focus on that instead of the really problematic word.
No, because most video games aren't licensed in a way that makes that explicitly authorized, nor is modding the preferred form of the work for making modifications. The video game has source code that would be more useful, the model does not have source code that would be more useful than the weights.
I suppose language changes. I just prefer it changes towards being more precise, not less.
Not sure if the code is required under an open source license, but it's the same issue.
---
IMO, source is source and can be used for other datasets. Dataset isn't available, bring your own.
In this case, the source is there. The output is there, and not technically required. What isn't available is the ability to confirm the output comes from that source. That's not required under open source though.
What's disingenuous is the output being called 'open source'.
When you require the same thing in software, namely the whole stack to run the software in question to be open source, we don't call the license open source.
Hell, in case of the models, "the whole stack to run the software" already is open source. Literally everything except the actual sources - the datasets and the build scripts (code doing the training) - is available openly. This is almost a literal inverse of "open source", thus shouldn't be called "open source".
If some "open source" model releases just the weights and the training methods but no dataset, it's like a repo that shipped an executable file along with a detailed design doc. Where is the source code? "Build it yourself," apparently.
NOTE: I understand the difficulty of open-sourcing datasets. I'm just saying that the term "open source" is getting diluted.