[1]: https://github.com/meta-llama/llama3/blob/main/llama/model.p...
Is it `assert(0 <= (1 < ndim))` or `assert((0 <= 1) < ndim)`, or something even stranger like `assert(0 <= 1) < ndim`?
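For what it's worth, Python's grammar settles this: comparison operators chain, so the two parenthesized readings are not equivalent to each other or to the chained form. A quick sketch (the `ndim` value is just illustrative):

```python
ndim = 3  # illustrative value

# Python parses `0 <= 1 < ndim` as a *chained comparison*:
# it means (0 <= 1) and (1 < ndim), with the middle operand
# evaluated only once.
assert (0 <= 1 < ndim) == ((0 <= 1) and (1 < ndim))

# The explicitly parenthesized readings mean different things:
print((0 <= 1) < ndim)  # compares the boolean True (== 1) with ndim
print(0 <= (1 < ndim))  # compares 0 with a boolean; always True
```

So `assert(0 <= 1 < ndim)` is really `assert (0 <= 1) and (1 < ndim)`, with the parentheses around the whole expression being redundant.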
Now, that simplicity can be deceiving - there is a lot of conceptual interconnectedness within these models. They've been put together "just so," if you will.
If you look at the source code to nanoGPT and compare it to Llama3, the most remarkable thing (when you look past the superficial name changes) is just how similar they are.
If I recall correctly the primary differences are:
- The MLP: Llama3 uses SwiGLU vs the more "traditional" `x = x + proj(gelu(expand(x)))` in GPT2
- The token encoders, which is arguably external to the model
- Attention: Llama3 uses Grouped Query Attention, vs full Multi-Head Attention in GPT2
- Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
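The MLP difference is easy to see side by side. A minimal NumPy sketch of the two variants (weight names are illustrative, not the actual parameter names from either codebase, and the residual add is omitted):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU / "swish": x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gpt2_mlp(x, w_expand, w_proj):
    # GPT-2 style: project up, nonlinearity, project back down
    return gelu(x @ w_expand) @ w_proj

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Llama-style SwiGLU: a gated unit, SiLU on the gate branch,
    # elementwise product with a parallel "up" projection
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, h = 8, 32
x = rng.standard_normal((4, d))
y1 = gpt2_mlp(x, rng.standard_normal((d, h)), rng.standard_normal((h, d)))
y2 = swiglu_mlp(x, rng.standard_normal((d, h)),
                rng.standard_normal((d, h)), rng.standard_normal((h, d)))
assert y1.shape == y2.shape == (4, d)
```

Note that SwiGLU carries three weight matrices where the GPT-2 MLP has two, which is why Llama-family models shrink the hidden dimension to keep parameter counts comparable.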
They were published more than five years apart. On the one hand, progress has been breathtaking, truly astounding. On the other hand, it's almost exactly the same model. Goes to show just how much is in the training data.
So "simple" is a fuzzy term here, but yes, the entropic complexity is in the data, not the algorithms.
Related to the so-called "Bitter lesson".
Edit: the sister comment pointed out what I failed to express: RLHF and training are also algorithms, and their applications and implementations are probably much more complex than the code that evaluates a given prompt.
So basically, "models" (trained NNs) are also an example for the equivalence of code and data.
Fixed data used by code (the trained model) is code in itself, even when it is not directly written by humans or in a human-readable language.
Edit edit: don't forget to count the imported maths code :) but I assume this is not relevant to the "it's just matrix multiplications" overall argument
But in a sense, the 300 lines of Llama code are essentially just lines of math. And reading through any math proof will show you that any particular line can hide large amounts of complexity.
This can be true with code with more tedious operations, but those lines are a smaller fraction of the overall code base by definition.
Even the "tedious" parts of the llama code can hide large complexity. Setting a learning rate with a schedule might require reading a paper or two for your particular architecture.
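For example, the warmup-plus-cosine-decay schedule common to these training recipes looks trivial once written down, but the choice of every constant traces back to the papers. A sketch (all the numbers here are illustrative placeholders, not any model's actual hyperparameters):

```python
import math

def lr_schedule(step, max_lr=3e-4, min_lr=3e-5,
                warmup_steps=2000, total_steps=100_000):
    """Linear warmup followed by cosine decay to a floor."""
    if step < warmup_steps:
        # ramp linearly from ~0 up to max_lr over the warmup
        return max_lr * (step + 1) / warmup_steps
    # cosine decay from max_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

One line in the training loop, but the warmup length, peak, floor, and decay horizon each interact with batch size and architecture in ways the code itself never explains.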
But yes, once you parse all the math and the theory, the lines are kinda simple matmul and forward lol.
I tried to make sense of it but couldn't.
The crux of their behavior comes from their learned weights which are gigabytes and can cost millions to obtain via training.
[1] https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...
[2] https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...
> All models support sequence length up to 8192 tokens, but we pre-allocate the cache according to max_seq_len and max_batch_size values. So set those according to your hardware.
[0] https://github.com/meta-llama/llama3/tree/14aab0428d3ec3a959...
[1]: https://github.com/dfdx/fabrique/blob/main/fabrique/llama/mo...
As such, I'd be wary of using that dataset to train or evaluate models.
[1] https://huggingface.co/datasets/roneneldan/TinyStories
[2] https://huggingface.co/datasets/roneneldan/TinyStories/discu...
But the other question I have is about the license. The tokenizer.py file is identical, and the rest is very similar, with just minor adjustments here and there.
Can they just take this Apache 2-licensed code, change it a bit, and offer it as MIT? They are clearly not the original author.
As far as expressive power goes, it shouldn't make a difference for the models in common use, but I could totally imagine models where it improves readability.
Didn't we drop 2 pi somewhere?