[1]: https://github.com/meta-llama/llama3/blob/main/llama/model.p...
Is it `assert(0 <= (1 < ndim))` or `assert((0 <= 1) < ndim)`, or something even stranger like `assert(0 <= 1) < ndim`?
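For what it's worth, Python's grammar settles this: comparison operators chain, so the two parenthesized readings are not equivalent to each other or to the chained form. A quick sketch (the `ndim` value is just illustrative):

```python
ndim = 3  # illustrative value

# Python parses `0 <= 1 < ndim` as a *chained comparison*:
# it means (0 <= 1) and (1 < ndim), with the middle operand
# evaluated only once.
assert (0 <= 1 < ndim) == ((0 <= 1) and (1 < ndim))

# The explicitly parenthesized readings mean different things:
print((0 <= 1) < ndim)  # compares the boolean True (== 1) with ndim
print(0 <= (1 < ndim))  # compares 0 with a boolean; always True
```

So `assert(0 <= 1 < ndim)` is really `assert (0 <= 1) and (1 < ndim)`, with the parentheses around the whole expression being redundant.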
Now, that simplicity can be deceiving - there is a lot of conceptual interconnectedness within these models. They've been put together "just so," if you will.
If you look at the source code to nanoGPT and compare it to Llama3, the most remarkable thing (when you look past the superficial name changes) is just how similar they are.
If I recall correctly the primary differences are:
- The MLP: Llama3 uses SwiGLU vs the more "traditional" `x = x + proj(gelu(expand(x)))` in GPT2
- The token encoders, which is arguably external to the model
- Attention: Llama3 uses Grouped Query Attention, vs full Multi-Head Attention in GPT2
- Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
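The MLP difference is easy to see side by side. A minimal NumPy sketch of the two variants (weight names are illustrative, not the actual parameter names from either codebase, and the residual add is omitted):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU / "swish": x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gpt2_mlp(x, w_expand, w_proj):
    # GPT-2 style: project up, nonlinearity, project back down
    return gelu(x @ w_expand) @ w_proj

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Llama-style SwiGLU: a gated unit, SiLU on the gate branch,
    # elementwise product with a parallel "up" projection
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, h = 8, 32
x = rng.standard_normal((4, d))
y1 = gpt2_mlp(x, rng.standard_normal((d, h)), rng.standard_normal((h, d)))
y2 = swiglu_mlp(x, rng.standard_normal((d, h)),
                rng.standard_normal((d, h)), rng.standard_normal((h, d)))
assert y1.shape == y2.shape == (4, d)
```

Note that SwiGLU carries three weight matrices where the GPT-2 MLP has two, which is why Llama-family models shrink the hidden dimension to keep parameter counts comparable.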
They were published more than five years apart. On the one hand, progress has been breathtaking, truly astounding. On the other hand, it's almost exactly the same model. Goes to show just how much is in the training data.
So "simple" is a fuzzy term here, but yes, the entropic complexity is in the data, not the algorithms.
Related to the so-called "Bitter lesson".
Edit: the sister comment pointed out what I failed to express: RLHF and training are also algorithms, and their applications and implementations are probably much more complex than the code that evaluates a given prompt.
So basically, "models" (trained NNs) are also an example for the equivalence of code and data.
Fixed data used by code (the trained model) is code in itself, even when it is not directly written by humans or in a human-readable language.
Edit edit: don't forget to count the imported maths code :) but I assume this is not relevant to the "it's just matrix multiplications" overall argument
But in a sense, the 300 lines of Llama code are essentially just lines of math. And reading through any math proof will show you that any particular line can hide large amounts of complexity.
This can be true with code with more tedious operations, but those lines are a smaller fraction of the overall code base by definition.
Even the "tedious" parts of the llama code can hide large complexity. Setting a learning rate with a schedule might require reading a paper or two for your particular architecture.
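For example, the warmup-plus-cosine-decay schedule common to these training recipes looks trivial once written down, but the choice of every constant traces back to the papers. A sketch (all the numbers here are illustrative placeholders, not any model's actual hyperparameters):

```python
import math

def lr_schedule(step, max_lr=3e-4, min_lr=3e-5,
                warmup_steps=2000, total_steps=100_000):
    """Linear warmup followed by cosine decay to a floor."""
    if step < warmup_steps:
        # ramp linearly from ~0 up to max_lr over the warmup
        return max_lr * (step + 1) / warmup_steps
    # cosine decay from max_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

One line in the training loop, but the warmup length, peak, floor, and decay horizon each interact with batch size and architecture in ways the code itself never explains.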
But yes, once you parse all the math and the theory, the lines are kinda simple matmul and forward lol.
I tried to make sense of it but couldn't.
The crux of their behavior comes from their learned weights which are gigabytes and can cost millions to obtain via training.
[1] https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...
[2] https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...
> All models support sequence length up to 8192 tokens, but we pre-allocate the cache according to max_seq_len and max_batch_size values. So set those according to your hardware.
[0] https://github.com/meta-llama/llama3/tree/14aab0428d3ec3a959...
[1]: https://github.com/dfdx/fabrique/blob/main/fabrique/llama/mo...
As such, I'd be wary of using that dataset to train or evaluate models.
[1] https://huggingface.co/datasets/roneneldan/TinyStories
[2] https://huggingface.co/datasets/roneneldan/TinyStories/discu...
But the other question I have is about the license. The tokenizer.py file is identical, and the rest is very similar, with just minor adjustments here and there.
Can they just take this Apache 2-licensed code, change it a bit, and offer it as MIT? They are clearly not the original author.
As far as expressive power goes, it shouldn't make a difference for the models in common use, but I could totally imagine models where it improves readability.
Didn't we drop 2 pi somewhere?