The source of a language model is the text it was trained on. Llama models are not open source, contrary to Meta's claims; they are open weight.
15T tokens, 45 terabytes. Seems fairly open source to me.
Aside from licensed content, the fact that content creators don't like redistribution means a lawful model could probably only use Project Gutenberg's collection and permissively licensed code. Anything else, including Wikipedia, usually comes with licensing requirements the model might violate.
Regardless, it fits the compute used and the claim that they trained on public web data, and it was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.
Source is the input to some built artifact. It is the source of that artifact. As in: where the artifact comes from. Textual input is absolutely the source of the ML model. The way you are using "source" is analogous to the source of the compiler in traditional programming.
An asset is an artifact used as input that is reproduced verbatim in the output. For example, a logo baked into an application to be rendered in the UI. Compiling the program doesn't make a new logo; it just copies the asset into the built artifact.
Imagine if the source code were in a programming language whose basic syntax and semantics were known to no one but the original developers.
Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.