In the official llama implementation, the constant beta has been removed: https://github.com/facebookresearch/llama/blob/main/llama/mo...
In the blog's log output we see several lines like "feedforward.1.beta', 0.0", which means that during training beta degenerated to 0, whereas it should have stayed constant at 1.
One method I've seen people use to identify these kinds of mistakes is precisely matching model outputs against a reference implementation. HuggingFace does this with tiny-random models: these models have randomized weights, but the output is expected to match exactly; if it doesn't, that's an indicator of a bug. But this approach only works for bugs that arise during inference; detecting issues in data processing, optimizers, or anything that only happens during training is more challenging.
The reference paper for rotary embedding is Roformer https://arxiv.org/pdf/2104.09864v4.pdf
First, you shouldn't rotate the values, only the keys and queries. This is wrong: v_out = (torch.bmm(v.transpose(0, 1), self.R[:m, ...])).transpose(0, 1)
Second, you shouldn't apply multi-head attention, which has additional inner projection weights that will mess with the rotations you have just applied. This is wrong: activations, attn_weights = self.multihead(q_out, k_out, v_out)
Instead you should use scaled_dot_product_attention(q_out, k_out, v_out).
Third, each attention head should be treated the same way, and every head should use the same rotation frequencies.
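To make those three points concrete, here's a minimal NumPy sketch (function names and shapes are mine, purely illustrative; a real PyTorch implementation would use torch.nn.functional.scaled_dot_product_attention and per-head reshaping). Note that only q and k are rotated, and the attention step has no extra learned projections:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim).
    Channel pairs are rotated by an angle depending on the position index."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per channel pair
    angles = np.outer(np.arange(seq), freqs)    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def scaled_dot_product_attention(q, k, v):
    """Plain attention: softmax(q k^T / sqrt(d)) v, no inner weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out = scaled_dot_product_attention(rope(q), rope(k), v)  # v is NOT rotated
```

Since rope is a pure rotation, it preserves the norm of each q/k row, which is an easy sanity check to assert in your own code.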
Personally I think it's because of their autoregressive, ODE-like nature, but who am I to say anything on that. ;PPPP
- embedding 65 -> 128
- linear 128 -> 128
- ReLU
- linear 128 -> 65
But since there's no non-linearity at all between the first two layers, and they both are linear... the second one is totally useless. This model is effectively a "classical" single hidden layer MLP. And in terms of FLOPS, it's wasting 128*128 = 16k operations out of a total of 128*128 + 65*128 = 24k operations.
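You can verify the collapse directly: the embedding lookup composed with the 128 -> 128 linear layer is exactly one merged linear map. A NumPy sketch (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((65, 128))   # embedding table (a linear map on one-hot tokens)
W = rng.standard_normal((128, 128))  # the redundant middle linear layer

x = np.zeros(65)
x[17] = 1.0                          # one-hot encoding of token id 17

two_layer = (x @ E) @ W              # embedding followed by the linear layer
one_layer = x @ (E @ W)              # a single merged 65 -> 128 map
```

Both give the same result, so the extra layer adds capacity in parameters but not in expressiveness.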
So what's the best fix here: adding a ReLU or SwiGLU between the embedding and the first linear layer, or just deleting the linear layer? Presumably the embedding layer is required to convert token indices into embedding vectors, so you can't get rid of that; it has a special structure.
Alternatively, you could indeed put a ReLU or another non-linearity between the embedding and the linear layer. You'd get a different model with more layers and more parameters; since the given dataset is pretty large, I'm quite sure this would improve accuracy, but without testing it's impossible to know. Normalisation also acts as a kind of non-linearity, yet when the author adds it, it barely helps accuracy at all. So who knows; sometimes (often) neural networks are counter-intuitive…
Particularly:
"Use .shape religiously. assert and plt.imshow are your friends." Thank you. You should always assert pre and post conditions of shape. (Do bear or typeguard allow you to do this using decorators?)
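As far as I know, jaxtyping combined with a runtime checker such as beartype or typeguard does let you express shape pre/postconditions declaratively via decorators. Here's a dependency-free sketch of the underlying idea; check_shapes is a hypothetical helper of my own, not a library API:

```python
import functools
import numpy as np

def check_shapes(in_shape, out_shape):
    """Hypothetical decorator asserting shape pre- and postconditions."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(x):
            assert x.shape == in_shape, f"expected input {in_shape}, got {x.shape}"
            y = fn(x)
            assert y.shape == out_shape, f"expected output {out_shape}, got {y.shape}"
            return y
        return wrapper
    return deco

@check_shapes(in_shape=(4, 128), out_shape=(4, 65))
def head(x):
    # project hidden states to vocabulary logits
    return x @ np.zeros((128, 65))
```

The real libraries go further by parsing shapes out of the type annotations themselves, so the contract lives in the signature rather than a decorator argument.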
Some nits:
"Before you even look at the paper, pick a small, simple, and fast model that you've done in the past. Then make a helper function to evaluate the model qualitatively." Don't you mean quantitatively? So that you establish a numerical baseline against which you can compare the more advanced method.
"Start by picking apart different components of the paper, and then implementing them one-by-one, training and evaluating as you go." Can you be precise about what you mean here? A lot of work is like: "Okay, we tried 10 things [for unspecified reasons], some major and some minor, to get our final result, and here's an ablation study to show how much we lose if we remove each piece." If you had said: "Implement the meat first (the major architectural change fundamental to the work, i.e. the ablation-study line item all the way at the bottom, with no seasoning or spices on it)", then yeah, that's a good place to start. But if you start with a broccoli recipe, switch to a meat recipe, and taste it halfway through cooking before you've even flipped it, you're not going to learn much. This sort of advice is better framed as: "Evaluate each time you make an atomic change to the approach, prioritizing changes in the order that had the most impact in the ablation study, from easiest to hardest, respecting the DAG in which certain changes can be made."
You can push some of this directly into Python type annotations thanks to https://peps.python.org/pep-0646/.
e.g.
@overload
def mean(a: ndarray[float, Dim1, *Shape], axis: Literal[0]) -> ndarray[float, *Shape]: ...
@overload
def mean(a: ndarray[float, Dim1, Dim2, *Shape], axis: Literal[1]) -> ndarray[float, Dim1, *Shape]: ...

Ultimately, though, I don't think Python will be nearly as good at this as Julia, whose type system can easily ensure matrix sizes make sense.
It's fine; I too waited a bit before adopting ReLU by default over Tanh for all hidden, non-final (not outputting a probability) layers.
A token is a unique integer identifier for a piece of text. The simplest tokenization scheme is just Unicode where one character gets one integer, however LLMs have a limited number of token IDs available for use (the vocabulary), so a more common approach is to glue characters together into common fragments. This post just uses the subset of ASCII needed by TinyShakespeare.
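For illustration, a minimal character-level scheme along those lines (variable names are my own, not necessarily the post's):

```python
text = "To be, or not to be"
vocab = sorted(set(text))                     # the unique characters
stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}      # token id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("to be")
```

Encoding then decoding round-trips exactly, which is the property the later subword schemes (BPE etc.) also preserve.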
The "loss function" is just a measure of how similar the model's prediction is to the ground truth. Lower loss = better predictions. Different tasks have different loss functions, e.g. edit distance might be one (but not a good one). During training you compute the loss and will generally visualize it on a chart. Whilst the line is heading downwards your NN is getting better, so you can keep training.
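For next-token prediction the usual loss is cross-entropy: the negative log-probability the model assigns to the correct token. A minimal NumPy sketch (not the post's exact code):

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax the raw scores, then take -log of the correct token's probability."""
    logits = logits - logits.max()             # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target])

# A model confident in token 0, scored against two different ground truths:
confident_right = cross_entropy(np.array([5.0, 0.0, 0.0]), target=0)
confident_wrong = cross_entropy(np.array([5.0, 0.0, 0.0]), target=1)
```

The loss is near zero when the model is confidently right and large when it is confidently wrong, which is exactly the "lower is better" behaviour you watch on the training chart.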
PyTorch is a library for working with neural networks and tensors. A tensor is either a single number (0 dimensions, a scalar), an array of numbers (1 dimension, a vector), or a multi-dimensional array of numbers where the 2-dimensional case is called a matrix. But a tensor can have any number of dimensions. PyTorch has a relatively large amount of magic going on in it via reflection and other things, so don't expect the code to make much intuitive sense. It's building a computation graph that can be later executed on the GPU (or CPU). The tutorial is easy to read!
A neural network is a set of neurons, each of which has a number called the bias, and connections between them, each with an associated weight. Numbers (activations) flow from input neurons through the connections, being multiplied by the weights along the way; at each neuron the incoming numbers are summed and the bias is added before the result is emitted to the next layer. The weights and biases are the network's parameters and encode its knowledge.
A linear layer is a set of input neurons connected to a set of output neurons, where every input is connected to every output. It's one of the simplest kinds of neural network structure. If you ever saw a diagram of a neural network pre-2010 it probably looked like that. The size of the input and output layers can be different.
ReLU is an activation function. It's just Math.max(0, x) i.e. it sets all negative numbers to zero. These are placed on the outputs of a neuron and are one of those weird mathematical hacks where I can't really explain why it's needed, but introducing "kinks" in the function helps the network learn. Exactly what "kinks" work best is an open area of exploration and later the author will replace ReLU with a newer more complicated function.
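Putting the last few definitions together, here's a single-hidden-layer forward pass in NumPy (weights are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(4)  # linear layer: 3 inputs -> 4 hidden
W2, b2 = rng.standard_normal((4, 2)), np.zeros(2)  # linear layer: 4 hidden -> 2 outputs

def relu(x):
    return np.maximum(0, x)  # the "kink": negative activations clipped to zero

x = np.array([1.0, -2.0, 0.5])   # input activations
hidden = relu(x @ W1 + b1)       # weights multiply, biases add, then the kink
out = hidden @ W2 + b2           # final linear layer
```

Without the relu in the middle, this would collapse into a single linear map, which is why the kink matters.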
Gradients are, roughly, numeric derivatives computed during training; they're used to update the model's parameters and make it more accurate.
Batch normalization is a way to process the numbers as they flow through the network, which helps the network learn better.
Positional encodings help the network understand the positions of tokens relative to each other, expressed in the form of a vector.
The `@` infix operator in Python is an alias for the __matmul__ method and is used as a shorthand for matrix multiplication (there are linear algebra courses on YouTube that are quite good if you want to learn this in more detail).
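A quick illustration with NumPy arrays, which implement __matmul__:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

C = A @ B            # dispatches to A.__matmul__(B): matrix multiplication
D = np.matmul(A, B)  # the functional spelling of the same operation
```

Both spellings give the same result; `@` is just nicer to read when expressions chain several multiplications.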
An epoch is a complete pass over the training dataset. NNs need to be shown the data many times to fully learn, so you repeat the dataset. A batch is how many items from the dataset are fed to the network before the parameters are updated. These sorts of numbers are called hyperparameters, because they're things you can fiddle with, but the word "parameters" was already taken by the weights/biases.
Attention is the magic that makes LLMs work. There are good explanations elsewhere, but briefly it processes all the input tokens in parallel to compute some intermediate tensors, and those are then used in a second stage to emit a series of output tokens.
It might be good to include context like "Andrej Karpathy, the researcher and science communicator", so it's clearer that this refers to a person whose posts are worth looking at.
> A token is a unique integer identifier for a piece of text.
A token is a word fragment that's common enough to be useful on its own. E.g., "writing", "written", and "writer" all share "writ", so "writ" would be an individual token, and "writer" might be tokenized as "writ" and "er".
An embedding is where each token ID gets turned into a vector of numbers the network can work with.
character sequence (string) -> token (small integer) -> embedding (vector of floats)
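The last step of that pipeline is just a table lookup. A NumPy sketch, with sizes borrowed from the post's model (65 tokens, 128-dimensional embeddings; the random table stands in for learned weights):

```python
import numpy as np

vocab_size, emb_dim = 65, 128
rng = np.random.default_rng(0)
table = rng.standard_normal((vocab_size, emb_dim))  # the learned embedding table

tokens = np.array([3, 17, 3])  # token ids produced by the tokenizer
vectors = table[tokens]        # embedding lookup: one row per token, shape (3, 128)
```

Identical token ids always map to identical vectors; it's purely a row lookup.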
vocab = sorted(list(set(lines)))

Because when you compose linear functions you get linear functions. So having everything linear is a waste of all layers but one.
In order for this not to happen, you need nonlinearity.
Any pointers / references / books that you’ve found particularly helpful in your learning journey?
I know about Karpathy’s video series (and accompanying repos). Anything else come to mind? Thanks!
I would say that learning how to actually build NNs is likely not that important. What's far more important is knowing how to use LLMs via an API or library, which is of course 1% coding (because the API is so easy) and 99% figuring out what their limits are, how best to integrate them into workflows, how to design textual "protocols" to communicate with the AI, how to test non-deterministic systems, and so on. Learning how to train a model from scratch is fun, but getting competitive results is too expensive, so pragmatism requires focusing on being a user for now.
This is about the model itself; training is another aspect. But usually, once the hyperparameters are more or less similar, training should be fine if the model is correct.
IMO PyTorch tensors should also have their device statically typed; right now you get a run-time error if you try to multiply a tensor in CPU memory by one in GPU memory.