Now, that simplicity can be deceiving - there is a lot of conceptual interconnectedness within these models. They've been put together "just so," if you will.
If you look at the source code of nanoGPT and compare it to Llama3, the most remarkable thing (once you look past the superficial name changes) is just how similar they are.
If I recall correctly the primary differences are:
- The MLP: Llama3 uses SwiGLU vs the more "traditional" `x = x + proj(gelu(expand(x)))` in GPT2
- The token encoders, which are arguably external to the model
- Attention: Llama3 uses Grouped Query Attention, vs full Multi-Head Attention in GPT2
- Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
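The differences above are small enough that they can be sketched functionally. Here is a toy, stdlib-only sketch of each one - the scalar "weights" and head counts are invented for illustration (real models use learned weight matrices), but the functional forms match the descriptions above:

```python
import math

def gelu(x):
    # GPT2's nonlinearity (tanh approximation of GELU)
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU ("swish"), the gate nonlinearity inside SwiGLU
    return x * (1 / (1 + math.exp(-x)))

def gpt2_mlp(x, w_expand, w_proj):
    # GPT2-style MLP: expand -> gelu -> project
    return w_proj * gelu(w_expand * x)

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Llama3-style MLP: an extra gate path, multiplied in elementwise
    return w_down * (silu(w_gate * x) * (w_up * x))

def layernorm(xs, eps=1e-5):
    # GPT2: subtract the mean, then divide by the standard deviation
    mu = sum(xs) / len(xs)
    var = sum((v - mu) ** 2 for v in xs) / len(xs)
    return [(v - mu) / math.sqrt(var + eps) for v in xs]

def rmsnorm(xs, eps=1e-5):
    # Llama3: no mean subtraction, just divide by the root-mean-square
    rms = math.sqrt(sum(v * v for v in xs) / len(xs) + eps)
    return [v / rms for v in xs]

# Grouped Query Attention changes the bookkeeping, not the core math:
# several query heads share one key/value head, shrinking the KV cache.
# Head counts here are illustrative, roughly Llama3-8B-shaped.
n_q_heads, n_kv_heads = 32, 8
queries_per_kv = n_q_heads // n_kv_heads
print(queries_per_kv)  # 4 query heads share each KV head
```

Note that each swap is local: the residual stream, the block ordering, and the overall training loop are untouched, which is why the two codebases read so similarly.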
They were published more than five years apart. On the one hand, progress has been breathtaking, truly astounding. On the other hand, it's almost exactly the same model. Goes to show just how much is in the training data.