Understand how transformers work by demystifying the math behind them (opens in new tab)

(osanseviero.github.io)

470 pointsLaserPineapple2y ago137 comments

137 comments

The whole "mystery" of transformer is that instead of a linear sequence of static weights times values in each layer, you now have 3 different matrices that are obtained from the same input through multiplication of learned weights, and then you just multiply the matrices together. I.e more parallelism which works out nice, but very restrictive since the attention formula is static.

We arent going to see more progress until we have a way to generalize the compute graph as a learnable parameter. I dunno if this is even possible in the traditional sense of gradients due to chaotic effects (i.e small changes reflect big shifts in performance), it may have to be some form of genetic algorithm or pso that happens under the hood.

logicchains2y ago

>The whole "mystery" of transformer is that instead of a linear sequence of static weights times values in each layer, you now have 3 different matrices that are obtained from the same input through multiplication of learned weights, and then you just multiply the matrices together. I.e more parallelism which works out nice, but very restrictive since the attention formula is static.

That's not it at all. What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them. The big theoretical advantage over RNNs (which were used for sequences prior to transformers), is that transformers support this in a lossless way, as each element has full access to all the information in every other element in the sequence (or at least all the ones that occurred before it in time sequences). RNNs and "linear transformers" on the other hand compress past values, so generally the last element of a long sequence will not have access to all the information in the first element of the sequence (unless the RNN internal state was really really big so it didn't need to discard any information).

ActorNightly2y ago

>What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them.

They do that in theory. In practice, its just all matrix multiplication. You could easily structure a transformer as a bunch of fully connected deep layers and it would be mathematically equivalent, just computationally inefficient.

necroforest2y ago

> We arent going to see more progress until we have a way to generalize the compute graph as a learnable parameter

That's a bold statement since a ton of progress has been made without learning the compute graph.

nomel2y ago

From my naive perspective, there seems to be a plateau, that everyone is converging on, somewhere between ChatGPT 3.5 and 4 level of performance, with some suspecting that the implementation of 4 might involve several expert models, which would already be extra sauce, external to the LLM. This, combined with the observation that generative models converge to the same output, given the same training data, regardless of architecture (having trouble finding the link, it was posted here some weeks ago), external secret sauce, outside the model, might be where the near term gains are.

I suppose we'll see in the next year!

manojlds2y ago

We already have competitors to Transformers

https://arxiv.org/abs/2312.00752

1 more reply

uoaei2y ago

A ton of progress can be made climbing a tree, but if your goal is reaching the moon it becomes clear pretty quickly that climbing taller trees will never get you there.

nethi2y ago

True, but it is the process of climbing trees that gives the insight whether taller trees help or not and if not, what to do next.

1 more reply

gpderetta2y ago

With enough thrust, even p̵i̵g̵s̵ trees can fly.

ActorNightly2y ago

We have made progress in efficiency, not functionality. Instead of searching google or stack overflow or any particular documentation, we just go to Chatgpt.

Information compression is cool, but I want actual AI.

danielmarkbruce2y ago

The idea that there has been no progress in functionality is silly.

Your whole brain might just be doing "information compression" by that analogy. An LLM is sort of learning concepts. Even Word2Vec "learned" than king - male + female = queen and that's a small model that's really just one part (not exact, but similar) of a transformer.

1 more reply

p1esk2y ago

Fascinating. What’s “actual AI”?

4 more replies

ex3ndr2y ago

This is basically this - it can learn ignore some paths, and amplify something more important, then you can just cut this paths without sensible loss of quality. The problem is that you are not going to win anything from this - non-matrix multiplication would be slower or the same.

ActorNightly2y ago

The issue is that you are thinking of this in terms of information compression, which is what LLMs are.

Im more concerned with an LLM having the ability to be trained to the point where a subset of the graph represents all the nand gates necessary for a cpu and ram, so when you ask it questions it can actually run code to compute them accurately instead of offering a statistical best guess, i.e decompression after lossy compression.

exe342y ago

Just give it a computer? Even a virtual machine. It can output assembly code or high level code that gets compiled.

1 more reply

sroussey2y ago

Well, just remember that NAND gates are made of transistors themselves which are a statistical model of a sort… just designed to appear digital when combined to that NAND level.

This is why I am very interested in analog again—quantum stuff is statistical already, so why go from statistical (analog) to digital (huge drop off of performance, e.g. just look at basic addition in a ALU) and back to statistical. Very interested. Not sure if it will ever be worth it, but can’t rule it out.

lambdatronics2y ago

>a way to generalize the compute graph as a learnable parameter.

Agreed. Seems analogous with how human mental processes are used to solve the kind of problems we'd like LLMs to solve (going beyond "language processing" which transformers do well, to actual reasoning which they can only mimic). Although you risk it becoming a Turing machine by giving it flow control & then training is a problem as you say. Perhaps not intractable though.

kevindamm2y ago

hyperparameter tuning does already go some of the way towards learning the compute graph, though very constrained and with a lot more training required.

knightoffaith2y ago

How can gradient descent work on compute graphs when the space of compute graphs is discrete?

enriquto2y ago

> How can gradient descent work on compute graphs when the space of compute graphs is discrete?

You can un-discretize the space of compute graphs by interpolating its points by simplices. More precisely, each graph is a subgraph of the complete graph, and the subgraph is identified by the indicator function of its edges whose values are either 0 or 1. By using weighted edges with values between 0 and 1, the space of all graphs (with the same number of vertices) becomes continuous and connected, and you can gradient move around it in small steps.

Of course, "compute graphs" are more general beasts than "graphs", but it is likely that the same idea will apply. At least, for a reasonably large class of compute graphs.

tnecniv2y ago

It can’t. There’s no gradient since it’s not a sufficiently nice space for them. You can use gradient free methods but I’d be shocked if there was an efficient enough way to do that

ActorNightly2y ago

I don't know if it can in the traditional sense of back propagation.

I think that Hebbian Learning is going to make a comeback at some point and time which will be used to connect static subgraphs to to other subgraphs subgraphs, which can be trained either separately or on the fly.

blackbear_2y ago

Perhaps in a way similar to this paper: https://arxiv.org/abs/1806.09055

knightoffaith2y ago

I wonder why this hasn't taken off.

1 more reply

api2y ago

Genetic algorithms figured out GI the first time, but it took a while.

mdp20212y ago

Could you please expand?

api2y ago

Evolution built our brains.

Though to be fair, actual biological evolution is more complex than simple genetic algorithms. More like evolution strategies with meta-parameter-learning and adaptive rate tuning among other things.

cyberpewpew2y ago

can you explain exactly what you mean by this? I understand what a compute graph is, but I'm not getting the idea of making a learnable parameter.

cyberpewpew2y ago

nevermind, after looking at my own question 3-4 times it clicked.

mikeknoop2y ago

Couldn't find your contact info. Email me?

enriquto2y ago

For a dryer, more formal and succinct approach, see "The Transformer Model in Equations" [0], by John Thickstun. The whole thing fits in a single page, using standard mathematical notation.

[0] https://johnthickstun.com/docs/transformers.pdf

EnnioEvo2y ago

Finally, thank you so much! Was it so difficult? Isn't 7 lines of mathematical notation way better than pages of qualitative pub talking? I don't really understand these ML researchers, it always looks like they have never studied mathematics at all.

wardedVibe2y ago

Thank god, I've had to cobble something like this together for my own notes a couple of times trying to parse papers and was never quite sure if I was missing something.

alexmolas2y ago

> Uh oh! We’re getting NaNs! It seems our values are too high, and when being passed to the next encoder, they end up being too high and exploding! This is called gradient explosion.

As far as I understand this is wrong. You're not computing gradients at any point, so this is no gradient explosion. I believe the problem is with the implementation of softmax, here [0] you have an explanation of how to implement a numerically stable softmax.

[0]: https://jaykmody.com/blog/stable-softmax/

osanseviero2y ago

Yes, you're correct. I tried to connect a common training problem (gradient explosion and vanishing gradient) with the issue of softmax being sensitive to large values. I agree it's misleading/inaccurate, so will rewrite that part.

That said, the whole neural network will be sensible to large values, so it won't be fixed by a numerically stable softmax. The normalization is a key aspect for the network to work.

quickthrower22y ago

Transformer tutorials might be the new monad tutorial. A hard concept to get, but one you need to struggle with (and practice some examples) to understand. So a bit like much of computer science :-).

hdhfjkrkrme2y ago

The moment you understand the Transformer you become incapable of explaining it.

csdvrx2y ago

> Transformer tutorials might be the new monad tutorial. A hard concept to get,

A hard concept?

But a monad is just a monoid in the category of endofunctors, so what's the problem?

amelius2y ago

Waiting for a blogpost titled "You could have invented transformers".

dogline2y ago

Six paragraphs in, and I already have questions.

> Hello -> [1,2,3,4] World -> [2,3,4,5]

The vectors are random, but they look like they have a pattern here. Does the 2 in both vector mean something? Or, is it the entire set that makes it unique?

dan-robertson2y ago

The number reuse is just the author being a bit lazy. You could estimate how similar these vectors are by seeing if they point in similar directions or by calculating the angle between them. Here they are about 60° apart and somewhat the same direction, but a lot of this is that the author didn’t want to put in any negative numbers in the example so vectors end up being a bit more similar than they would be really.

That the numbers are reused isn’t meaningful here: a 1 in the first position is quite unrelated to a 1 in the second (as no convolutions are done over this vector)

dogline2y ago

Thank you. I guess I need to back up. This is a vector, not just an identifier, and direction and angle seem important. I need to look up how the encoding is normally done, since this isn't obvious if you haven't worked in this domain before.

kevindamm2y ago

The encoding is typically learned, and if possible is part of the ANN so that it can be adjusted along with the other parameters.

A good place to start on that topic is the word2vec paper.

smaddox2y ago

That isn't a very good example. The vectors for each token are randomly initialized with each element taken from the normal distribution. After training, similar words will have some cosine similarity, but almost never as much cosine similarity as [1,2,3,4] and [2,3,4,5].

heisenburgzero2y ago

Not completely related. Does anyone know where I can find articles / papers that discuss why transformers, while acting as merely "next token predictor" can handle questions with: 1. Unknown words (or subwords/tokens) that are not seen in the training dataset. Example: Create a table with "sdsfs_ff", "fsdf_value" as columns in pandas. 2. Create examples(unseen in training dataset) and tell the LLM to provide similar output.

I have a feeling it should be a common question, but I just can't find the keyword to search.

PS. If anyone has any links with thoroughly discussion about positional embedding, that would be great. I never got a satisfying answer about the usage of sine / cosine and (multiplication vs addition)

treyd2y ago

If I had to guess, single characters are able to be encoded as tokens, but there's more "bandwidth" in the model being dedicated to handling them and there's less semantic meaning encoded in them "natively" compared to tokens for concrete words. If it decides to, it can recreate unknown sequences by copying over the tokens for the single letters or create them if it makes sense.

heisenburgzero2y ago

I think some earlier NLP applications have something called "Unknown token", which they will replace any unseen word. But for recent implementations, I don't think they are being used anymore.

It still baffles me why such stochastic parrot / next token predictor, will recognize these "Unseen combinations of tokens" and reuse them in response.

stormfather2y ago

This helped me understand but not well enough to explain it yet: https://transformer-circuits.pub/2022/in-context-learning-an...

1 more reply

stevenhuang2y ago

Everything falls into place once you understand that LLMs are indeed learning hierarchical concepts inherent in the structured data it has been trained on. These concepts exist in a high dimensional latent space. Within this space is the concept of nonsense/gibberish/placeholder, which your sequence of unseen tokens map to. Then it combines this with the concept of SQL tables, resulting in hopefully the intended answer.

drdeca2y ago

P(X_1=x_1, X_2=x_2, X_3=x_3) = P(X_3=x_3 | X_1=X_1, X_2=x_2) • P(X_1=x_1, X_2=x_2) = P(X_3=x_3 | X_1=X_1, X_2=x_2) • P(X_2=x_2 | X_1=x_1) • P(X_1=x_1)

That is to say: Having a correct conditional probability distribution over the next token conditional on the previous tokens, produces a correct probability distribution over sequences of tokens.

And, “correct probability distribution over sequences of tokens” (or, “correct conditional probability distribution over sequences of tokens, conditional on whatever)”, can be... well, you can describe pretty much any kind of input/output behavior in those terms.

So, “it works by predicting the next token” is, at least in principle, not much of a constraint on what kinds of input/output behavior it can have?

So, whatever impressive thing it does, is not really in conflict with its output being produced from the probability distribution P(X_{n+1}=x_{n+1} | X_1=x_1, ..., X_n=x_n) (“predicting the next token”)

thewarrior2y ago

It’s not reproducing exact strings in the training data but patterns and patterns of patterns.

Next token prediction is more intelligent than it sounds

Rastonbury2y ago

Reminds me of that person who asked chatgpt to make its own language with vocab and grammar rules and translate back and forth, it blew my mind

leereeves2y ago

> The complexity comes from the number of steps and the number of parameters.

Yes, it seems like a transformer model simple enough for us to understand isn't able to do anything interesting, and a transformer complex enough to do something interesting is too complex for us to understand.

I would love to study something in the middle, a model that is both simple enough to understand and complex enough to do something interesting.

calebkaiser2y ago

You might be interested, if you aren't already familiar, in some of the work going on in the mechanistic interpretability field. Neel Nanda has a lot of approachable work on the topic: https://www.neelnanda.io/mechanistic-interpretability

leereeves2y ago

I was not familiar with it, and that does look fascinating, thank you. If anyone else is interested, this guide "Concrete Steps to Get Started in Transformer Mechanistic Interpretability" on his site looks like a great place to start:

https://www.neelnanda.io/mechanistic-interpretability/gettin...

PeterisP2y ago

I would assume that the boundaries of those ranges are such that the middle in between those extremes is something that is already too complex for a human to properly understand while still too small to be able to do anything interesting.

bsenftner2y ago

Hard to understand when concepts are used without definition or introduction. The Encoder section just begins without any description of what it is or where is sets in an overall process. I grasp what the author is trying to do, but the post misses basic essay structures such as introducing ideas and explaining them before using them, rending the entire post confusing if one is not already a student and half understands the topic before reading.

jongjong2y ago

As someone who has written an ANN from scratch and hasn't used TensorFlow, I still find this description confusing.

I asked ChatGPT to explain how to modify a basic ANN to implement self-attention without using the terms Matrix or Vector and it gave me a really simple explanation. Though I haven't tried to implement it yet.

I prefer to think of everything in terms of nodes, weights and layers. Matrices and vectors just makes it harder to relate to what's happening in the ANN.

The way I'm used to writing ANNs, each input node is a scalar but the feed forward algorithm looks like vector-matrix multiplication since you multiply all the input nodes by the weights then sum them up... Anyway, I feel like I'm approaching these descriptions with the wrong mindset. Maybe I lack the necessary background.

snaxsnaxsnax2y ago

Hmmm, yes, I know some of these words.

sam161802y ago

Is there an error in the positional encoding example? For example when calculating PE(1, 3), I'd expect i = 1 as 3 = 2 * 1 + 1

So for “World”

PE(1, 0) = sin(1 / 10000^(2*0 / 4)) = sin(1 / 10000^0) = sin(1) ≈ 0.84

PE(1, 1) = cos(1 / 10000^(2*0 / 4)) = cos(1 / 10000^0) = cos(1) ≈ 0.54

PE(1, 2) = sin(1 / 10000^(2*1 / 4)) = sin(1 / 10000^.5) ≈ 0.01

PE(1, 3) = cos(1 / 10000^(2*1 / 4)) = cos(1 / 10000^.5) ≈ 1

I also wondered if these formulae were devised with 1-based indexing in mind (though I guess for larger dimensions it doesn't make much difference), as the paper states

> The wavelengths form a geometric progression from 2π to 10000 · 2π

That led me to this chain of PRs - https://github.com/tensorflow/tensor2tensor/pull/177 - turns out the original code was actually quite different to that stated in the paper. I guess slight variations in how you calculate this encoding doesn't affect things too much?

remexre2y ago

Should the line

    Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z)

in Decoder step 7 instead be

    Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z_self_attention)

? Also, is layer_norm missing in Decoder step 8...

bloopernova2y ago

Do LLMs use neural nets? If so, what makes up the "neuron"? i.e. Is there a code structure that underlies the neuron, or is it "just" fancy math?

osanseviero2y ago

Just math, and not even that fancy.

Let's say you want to predict if you'll pass an exam based on how many hours you studied (x1) and how many exercises you did (x2). A neuron will learn a weight for each variable (w1 and w2). If the model learns w1=0.5 and w2=1, the model will provide more importance to the # of exercises.

So if you study for 10 hours and only do 2 exercises, the model will do x1w1 + x2w2=10x0.5 + 2x1 = 7. The neuron then outputs that. This is a bit (but not much) simplified - we also have a bias term and an activation to process the output.

Congrats! We built our first neuron together! Have thousands of these neurons in connected layers, and you suddenly have a deep neural network. Have billions or trillions of them, you have an LLM :)

abrichr2y ago

The “neuron” in a neural network is just a non linear function of the weighted sum of the inputs (plus a bias term).

See the “definition” section in https://en.wikipedia.org/wiki/Perceptron .

dartos2y ago

Transformers can be considered a kind of neural network.

It’s mainly fancy math. With tools like PyTorch or tensorflow, you use python to describe a graph of computations which gets compiled down into optimized instructions.

There are some examples of people making transformers and other NN architectures in about 100 lines of code. I’d google for those to see what these things look like in code.

The training loop, data, and resulting weights are where the magic is.

The code is disappointingly simple.

bloopernova2y ago

  > The code is disappointingly simple.

I absolutely adore this sentence, it made me laugh to imagine coders or other folks looking at the code and thinking "That's it?!? But that's simple!"

Although it feels a little similar to some of the basic reactions that go to make up DNA: start with simple units that work together to form something much more complex.

(apologies for poor metaphors, I'm still trying to grasp some of the concepts involved with this)

edgyquant2y ago

Yes neural networks, and even the math required to build them, are very simple calc 1 stuff generally. It’s more coming up with these models that takes powerful intuition

dartos2y ago

I spent a solid month very confused after reading up on how to implement some basic neural networks.

I was sure I missed something, so I didn’t even try to implement it since I was so sure I missed the complicated bit.

But no, all the complexity is in the mathematical implications

theonlybutlet2y ago

Yes to both, the "neuron" would basically be a weighted parameter. A parameter is an expression, it's a mathematical representation of a token and it's probabilistic weighting (theyre translated from input or to output token lists entering and exiting the model). Usually tokens are pre-set small groups of character combinations like "if " or "cha" that make up a word/sentence. The recorded path your value takes down the chain of probabilities would be the "neural pathway" within the wider "neural network".

Someone please correct me if I'm wrong or my terminology is wrong.

pmayrgundter2y ago

This is all true in a neutral net, but Transformers aren't Neural Nets in the traditional sense. I was under that impression originally, but there's not a back propagation or Hebbian learning here, which were the key bits of biomimicry that earned classic NNs their name.

Transformers do have coefficients that are fit, but that's more broad.. could be used for any sort of regression or optimization, and not necessarily indicative of biological analogs.

So I think the terms "learned model" of "weights" are malapropisms for Transformers, carried over from deep nets because of structural similarities, like many layers, and the development workflow.

The functional units in Transformer's layers have lost their orginal biological inspiration and functional analog. The core function in Transformers is more like autoencoding/decoding (concepts from info theory) and model/grammar-free translation, with a unique attention based optimization. Transformers were developed for translation. The magic is smth like "attending" to important parts of the translation inputs&outputs as tokens are generated, maybe as a kind of deviation on pure autoencoding, due to the bias from the .. learned model :) See I can't even escape it.

Attention as a powerful systemic optimization is the actual closer bit of neuro/bio-insporation here.. but more from Cog Psych than micro/neuro anatomy.

Btw, not only is attention a key insight for Transformers, but it's an interesting biographical note that the lead inventor of it, Jakob Uzkereit, went on to work on a bio-AI startup after Google.

kmmlng2y ago

> This is all true in a neutral net, but Transformers aren't Neural Nets in the traditional sense. I was under that impression originally, but there's not a back propagation or Hebbian learning here, which were the key bits of biomimicry that earned classic NNs their name.

Hebbian learning has never been used with much success in training neural nets. Backpropagation is not bio-inspired, but backpropagation is certainly used to train transformers.

1 more reply

theonlybutlet2y ago

Thanks for your reply, you raise a very good point, transformer models are a lot more complex. I'd argue conceptually they're the same, just the data and process is more abstracted. Autoencoded data implies using efficient representations, basically semantically abstracted data and opting for measures like back propagation through time.

1 more reply

brcmthrowaway2y ago

Does the human brain use transformers?

mirekrusin2y ago

Yes, through ie. services like openai's chatgpt.

exe342y ago

No, but they can both implement a language virtual machine which appears to be able to produce intelligent behaviour with unknown bounds.

__loam2y ago

Anyone telling you it does is a fraud.

utopcell2y ago

Here [1] are some "frauds" from Stanford University, Oxford University and University College London telling you exactly that.

From their abstract:

``One of the most exciting and promising novel architectures, the Transformer neural network, was developed without the brain in mind. In this work, we show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since it is closely related to current hippocampal models from neuroscience. We additionally show the transformer version offers dramatic performance gains over the neuroscience version.``

[1] https://arxiv.org/abs/2112.04035

__loam2y ago

Making the claim that transformers are a good candidate model for certain neural pathways is a pretty different claim than saying the brain is literally using transformers.

1 more reply

willy_k2y ago

I’m assuming you are asking if the brain uses transformer-like structures or otherwise exhibits similar behavior. I don’t know, but it does share some processes with simpler ML ideas, and I’d be very interested to see if it uses anything resembling a transformer.

patcon2y ago

Forward-forward algorithm is more like the brain. As I understand, backpropagation transformers require storing data, doing calculations on that aggregate, and sending it back through, which no neural structures can do anything like.

https://medium.com/@Mosbeh_Barhoumi/forward-forward-algorith...

nostrademons2y ago

Yes. I still remember Optimus Prime and Bumblebee from my youth.

dartos2y ago

What?

MrVandemar2y ago

Definitely more than meets the eye.

godelski2y ago

I love and hate these.

I love them because they do give another resource at explaining models such as transformers and I think this one is pretty well done (note: you really need to do something about the equation in 4.2...)

First, the critique is coming from love. Great work, so I don't want it to be taken as I'm saying anything it isn't.

Why I hate these is that they are labeled as "math behind" but I think this is not quite fitting. This is the opposite of the complaint I made about the Introduction to DL post the other day[0]. The issue isn't that there isn't math, but contextually it is being labeled as a mathematical approach but I'm not seeing anything that distinguishes it as deeper than what you'd get from Karpathy's videos or the Annotated Transformer (I like this more than illustrated). There's nothing wrong with that, but just think it might mislead people, especially as there is a serious lack of places to find a much deeper mathematical explanation behind architectures and the naming makes it harder to find for those that are looking for that, because they'll find these posts. Simply, complaint is about framing.

To be clear, the complaint is just about the subtitle, because the article is good and a useful resource for people seeking to learn attention and transformers. But let me try to clarify some of what would I personally (welcome to disagree, it is an opinion) more accurately be representative of " demystifying all the math behind them":

- I would include a much deeper discussion of both embedding and positional embedding. The former you should at minimum be discussion how it is created and discussing the dequantization. This post may give a reader the impression that this is not taking place (there is ambiguity between distinction of embedding vs tokenization and embedding, this looks to just briefly mention tokenization. I specifically think a novice might take away that the dequantization is happening due to the positional encoding, and not in the embedding). The tokenization and embedding is a vastly underappreciated and incredibly important aspect of making discrete models work (not just LLMs or LMs. Principle is more general).

- Same goes for the positional embedding which I have only in a handful of cases seen discussed and taken rather matter of factly. For a mathematical explanation you do need to explain the idea behind generating unique signals for each position, explain why we need a a high frequency, and it is worth mentioning how this can be learnable (often with similar results, which is why most don't bother), and other forms like rotational. The principle is far more general than even a Fourier Series (unmentioned!). The continuous aspect also matters a lot here, and we (often) don't want discritized positional encoding. If this isn't explained it feels rather arbitrary, and in some ways it is but others it isn't.

- The attention mechanism is vastly under-explained, though I understand why. There are many approaches to tackle this, some from graphs, some from category theory, and many others. They're all valuable pieces to the puzzle. But at minimum I think there needs to be a clear identification as to what the dot product is doing, the softmax, the scale (see softmax tempering), and why we then have the value. Their key-query-value names were not chosen at random and the database analogy is quite helpful. Maybe many don't understand the relationship of dot products and angles between vectors? But this can even get complex as we would expect values to go to 0 in high dimensions (which they kinda do if you look at the attention matricies post learning which often look highly diagonal and why you can initialize them as diagonally spiked for sometimes faster training). This would be a great place to bring up how there might be some surprising aspects to the attention mechanism considering matrices represent affine transformations of data (linear) and we might not see the non-linearity here (softmax) or understand why softmax works better than other non-linears or normalizers (try it yourself!).

- There's more but I've written a wall. So I'll just say we can continue for the residuals (also see META's 3 Things Everyone Should Know About Vision Transformers, in the Deit repo), why we have pre-norm as opposed to the original post-norm (which it looks like post norm is being used!), the residuals (knot theory can help here a bit), and why we have the linear layer (similarly the unknotting discussion helps, especially quantifying why we like a 4x ratio, but isn't absolutely necessary).

Idk, are people interested in these things? I know most people aren't, and there's absolutely nothing wrong with that (you can still build strong models without this knowledge, but it is definitely helpful). I do feel that we often call these things black boxes but they aren't completely opaque. They sure aren't transparent, especially through scale, but they aren't "black" either. (Allen-Zhu & Li's Physics of LLMs is a great resource btw and I'd love if other users posted/referenced more things they liked. I purposefully didn't link btw)

So, I do like the post, and I think it has good value (and certainly there is always value in teaching to learn!), but I disagree with the HN title and post's subtitle.

[0] https://news.ycombinator.com/item?id=38834244

osanseviero2y ago

Hey @godelski! Author of the blog post here.

I really appreciate you taking the time to provide all this feedback. This feedback + additional resources are extremely useful.

I agree that the subtitle is not as accurate as it could be. I'll revisit it! As for content updates, I've been doing some additional updates in the last days based on feedback (e.g. more info about tokenization and the token embeddings). Although diving in some of your suggestions is likely out of scope for this article, I in particular agree that expanding the attention mechanism content (e.g. the analogy with databases or explaining what is dot product) would increase the quality of the article. I will look into expanding this!

I also think a more rigorous, separate mathematical exploration into attention mechanisms and recent advancements would be a great tool for the ecosystem.

Once again, thank you for all the amazing feedback!

godelski2y ago

Hey, I'm glad you found it useful. I know it is hard to take critique, but I did enjoy the post. I truly do mean the critique is coming from a place of love. And I hope the comment helps others find more (I guess I'm writing a blog post now). I do feel there is often this gap between nearly no math and way too much math that causes a lot of people to come away with "you don't need math for ML" which is... idk... partially correct but not? haha. I'm a bit mathy of a person so you just caught a pet peeve of mine. I definitely agree what I said is out of scope for how you wrote but I will stand with my subtitle critique ;) I still do like the article though

And I just realized we're in a slack channel together haha (I don't think we've ever talked though). I poked around your website and saw you're at HF. Love you guys to death. You all also have tons of awesome blog posts and you're one of the most useful forces in ML. So I really do appreciate all the work.

thrtythreeforty2y ago

Can you link a resource that is able to adequately explain why they're called Key, Query, and Value? Every explanation I've read eventually handwaved this. It feels like understanding why they're named that is key (heh) to understanding the concept, rather than just blindly implementing matmul.

pests2y ago

https://stats.stackexchange.com/questions/421935/what-exactl...

It mentions it comes from the original Attention Is All You Need paper and goes on into more detail.

It seems to be named exactly as you would expect. Key/Value as in KV store, with Query being the term being retrived.

CanuckPro2y ago

> Idk, are people interested in these things?

Yes, absolutely. Would be awesome to read deeper.

hnfong2y ago

Seconded.

godelski2y ago

Looks like I'll be writing a blog lol.

naitgacem2y ago

Reading the title I thought this was about electrical transformers :p

Although this is HN but my background is still stronger.

And by the way, is it worth it to invest time to get some idea about this whole AI field? I'm from a compE background

dataking2y ago

> is it worth it to invest time to get some idea about this whole AI field? I'm from a compE background

Might be worth thinking about how it will specifically affect your field of expertise. Jensen Huang says your job won't be taken over by an AI but by a human using an AI.

Luechkt2y ago

I knew there was more than meets the eye.

dingclancy2y ago

I am loving the Quarto website. I see more Python users using Quarto for publishing.

adamnemecek2y ago

It’s a renormalization process. It can be modelled as a convolution in a Hopf algebra.

chpatrick2y ago

It's simple, monads are just monoids in the category of endofunctors...

nemo85512y ago

There I was all excited to show off some of my electrical chops on HN.

Not today.

rzzzt2y ago

Does mystified math lie beyond behind how the ratio of input and output voltages is equal to the ratio of the primary and secondary windings? Can it be derived from Maxwell's equations?

Off to a search...

nihzm2y ago

Not really mystified in any sense of the word but for more precise calculations of circuit diagrams with transformers you can formulate a system of coupled ODEs which can be written in matrix form leading to the very nice mathematics of matrix differential equations.

amelius2y ago

I bet that an LLM (which uses transformers) can explain those aspects of a transformer to you.

wardedVibe2y ago

Or just read Wikipedia, which I'm sure it will crib from poorly... https://en.wikipedia.org/wiki/Transformer

nemo85512y ago

I seen an LLM, or maybe another variant of “AI” [0] a while back that could aid design of electronic circuits by having a pool of data sheets added for referencing.

As you were querying specs for a board at component level it could give you a schematic, I think, with citations to the actual data sheets.

I suppose the same scale up could be used for systems that needed a varying number of specific power supplies.

[0] https://www.flux.ai/p

1 more reply

j / k navigate · click thread line to collapse

137 comments

ActorNightly2y ago

logicchains2y ago

ActorNightly2y ago

necroforest2y ago

> We arent going to see more progress until we have a way to generalize the compute graph as a learnable parameter

That's a bold statement since a ton of progress has been made without learning the compute graph.

nomel2y ago

I suppose we'll see in the next year!

manojlds2y ago

We already have competitors to Transformers

https://arxiv.org/abs/2312.00752

1 more reply

uoaei2y ago

A ton of progress can be made climbing a tree, but if your goal is reaching the moon it becomes clear pretty quickly that climbing taller trees will never get you there.

nethi2y ago

True, but it is the process of climbing trees that gives the insight whether taller trees help or not and if not, what to do next.

1 more reply

gpderetta2y ago

With enough thrust, even p̵i̵g̵s̵ trees can fly.

ActorNightly2y ago

We have made progress in efficiency, not functionality. Instead of searching google or stack overflow or any particular documentation, we just go to Chatgpt.

Information compression is cool, but I want actual AI.

danielmarkbruce2y ago

The idea that there has been no progress in functionality is silly.

1 more reply

p1esk2y ago

Fascinating. What’s “actual AI”?

4 more replies

ex3ndr2y ago

ActorNightly2y ago

The issue is that you are thinking of this in terms of information compression, which is what LLMs are.

exe342y ago

Just give it a computer? Even a virtual machine. It can output assembly code or high level code that gets compiled.

1 more reply

sroussey2y ago

Well, just remember that NAND gates are made of transistors themselves which are a statistical model of a sort… just designed to appear digital when combined to that NAND level.

lambdatronics2y ago

>a way to generalize the compute graph as a learnable parameter.

kevindamm2y ago

hyperparameter tuning does already go some of the way towards learning the compute graph, though very constrained and with a lot more training required.

knightoffaith2y ago

How can gradient descent work on compute graphs when the space of compute graphs is discrete?

enriquto2y ago

> How can gradient descent work on compute graphs when the space of compute graphs is discrete?

Of course, "compute graphs" are more general beasts than "graphs", but it is likely that the same idea will apply. At least, for a reasonably large class of compute graphs.

tnecniv2y ago

It can’t. There’s no gradient since it’s not a sufficiently nice space for them. You can use gradient free methods but I’d be shocked if there was an efficient enough way to do that

ActorNightly2y ago

I don't know if it can in the traditional sense of back propagation.

blackbear_2y ago

Perhaps in a way similar to this paper: https://arxiv.org/abs/1806.09055

knightoffaith2y ago

I wonder why this hasn't taken off.

1 more reply

api2y ago

Genetic algorithms figured out GI the first time, but it took a while.

mdp20212y ago

Could you please expand?

api2y ago

Evolution built our brains.

Though to be fair, actual biological evolution is more complex than simple genetic algorithms. More like evolution strategies with meta-parameter-learning and adaptive rate tuning among other things.

cyberpewpew2y ago

can you explain exactly what you mean by this? I understand what a compute graph is, but I'm not getting the idea of making a learnable parameter.

cyberpewpew2y ago

nevermind, after looking at my own question 3-4 times it clicked.

mikeknoop2y ago

Couldn't find your contact info. Email me?

enriquto2y ago

For a dryer, more formal and succinct approach, see "The Transformer Model in Equations" [0], by John Thickstun. The whole thing fits in a single page, using standard mathematical notation.

[0] https://johnthickstun.com/docs/transformers.pdf

EnnioEvo2y ago

wardedVibe2y ago

Thank god, I've had to cobble something like this together for my own notes a couple of times trying to parse papers and was never quite sure if I was missing something.

alexmolas2y ago

> Uh oh! We’re getting NaNs! It seems our values are too high, and when being passed to the next encoder, they end up being too high and exploding! This is called gradient explosion.

[0]: https://jaykmody.com/blog/stable-softmax/

osanseviero2y ago

That said, the whole neural network will be sensible to large values, so it won't be fixed by a numerically stable softmax. The normalization is a key aspect for the network to work.

quickthrower22y ago

Transformer tutorials might be the new monad tutorial. A hard concept to get, but one you need to struggle with (and practice some examples) to understand. So a bit like much of computer science :-).

hdhfjkrkrme2y ago

The moment you understand the Transformer you become incapable of explaining it.

csdvrx2y ago

> Transformer tutorials might be the new monad tutorial. A hard concept to get,

A hard concept?

But a monad is just a monoid in the category of endofunctors, so what's the problem?

amelius2y ago

Waiting for a blogpost titled "You could have invented transformers".

dogline2y ago

Six paragraphs in, and I already have questions.

> Hello -> [1,2,3,4] World -> [2,3,4,5]

The vectors are random, but they look like they have a pattern here. Does the 2 in both vector mean something? Or, is it the entire set that makes it unique?

dan-robertson2y ago

That the numbers are reused isn’t meaningful here: a 1 in the first position is quite unrelated to a 1 in the second (as no convolutions are done over this vector)

dogline2y ago

kevindamm2y ago

The encoding is typically learned, and if possible is part of the ANN so that it can be adjusted along with the other parameters.

A good place to start on that topic is the word2vec paper.

smaddox2y ago

heisenburgzero2y ago

I have a feeling it should be a common question, but I just can't find the keyword to search.

treyd2y ago

heisenburgzero2y ago

I think some earlier NLP applications have something called "Unknown token", which they will replace any unseen word. But for recent implementations, I don't think they are being used anymore.

It still baffles me why such stochastic parrot / next token predictor, will recognize these "Unseen combinations of tokens" and reuse them in response.

stormfather2y ago

This helped me understand but not well enough to explain it yet: https://transformer-circuits.pub/2022/in-context-learning-an...

1 more reply

stevenhuang2y ago

drdeca2y ago

P(X_1=x_1, X_2=x_2, X_3=x_3) = P(X_3=x_3 | X_1=X_1, X_2=x_2) • P(X_1=x_1, X_2=x_2) = P(X_3=x_3 | X_1=X_1, X_2=x_2) • P(X_2=x_2 | X_1=x_1) • P(X_1=x_1)

That is to say: Having a correct conditional probability distribution over the next token conditional on the previous tokens, produces a correct probability distribution over sequences of tokens.

So, “it works by predicting the next token” is, at least in principle, not much of a constraint on what kinds of input/output behavior it can have?

thewarrior2y ago

It’s not reproducing exact strings in the training data but patterns and patterns of patterns.

Next token prediction is more intelligent than it sounds

Rastonbury2y ago

Reminds me of that person who asked chatgpt to make its own language with vocab and grammar rules and translate back and forth, it blew my mind

leereeves2y ago

> The complexity comes from the number of steps and the number of parameters.

I would love to study something in the middle, a model that is both simple enough to understand and complex enough to do something interesting.

calebkaiser2y ago

leereeves2y ago

https://www.neelnanda.io/mechanistic-interpretability/gettin...

PeterisP2y ago

bsenftner2y ago

jongjong2y ago

As someone who has written an ANN from scratch and hasn't used TensorFlow, I still find this description confusing.

I prefer to think of everything in terms of nodes, weights and layers. Matrices and vectors just makes it harder to relate to what's happening in the ANN.

snaxsnaxsnax2y ago

Hmmm, yes, I know some of these words.

sam161802y ago

Is there an error in the positional encoding example? For example when calculating PE(1, 3), I'd expect i = 1 as 3 = 2 * 1 + 1

So for “World”

PE(1, 0) = sin(1 / 10000^(2*0 / 4)) = sin(1 / 10000^0) = sin(1) ≈ 0.84

PE(1, 1) = cos(1 / 10000^(2*0 / 4)) = cos(1 / 10000^0) = cos(1) ≈ 0.54

PE(1, 2) = sin(1 / 10000^(2*1 / 4)) = sin(1 / 10000^.5) ≈ 0.01

PE(1, 3) = cos(1 / 10000^(2*1 / 4)) = cos(1 / 10000^.5) ≈ 1

I also wondered if these formulae were devised with 1-based indexing in mind (though I guess for larger dimensions it doesn't make much difference), as the paper states

> The wavelengths form a geometric progression from 2π to 10000 · 2π

remexre2y ago

Should the line

    Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z)

in Decoder step 7 instead be

    Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z_self_attention)

? Also, is layer_norm missing in Decoder step 8...

bloopernova2y ago

Do LLMs use neural nets? If so, what makes up the "neuron"? i.e. Is there a code structure that underlies the neuron, or is it "just" fancy math?

osanseviero2y ago

Just math, and not even that fancy.

Congrats! We built our first neuron together! Have thousands of these neurons in connected layers, and you suddenly have a deep neural network. Have billions or trillions of them, you have an LLM :)

abrichr2y ago

The “neuron” in a neural network is just a non linear function of the weighted sum of the inputs (plus a bias term).

See the “definition” section in https://en.wikipedia.org/wiki/Perceptron .

dartos2y ago

Transformers can be considered a kind of neural network.

It’s mainly fancy math. With tools like PyTorch or tensorflow, you use python to describe a graph of computations which gets compiled down into optimized instructions.

There are some examples of people making transformers and other NN architectures in about 100 lines of code. I’d google for those to see what these things look like in code.

The training loop, data, and resulting weights are where the magic is.

The code is disappointingly simple.

bloopernova2y ago

  > The code is disappointingly simple.

I absolutely adore this sentence, it made me laugh to imagine coders or other folks looking at the code and thinking "That's it?!? But that's simple!"

Although it feels a little similar to some of the basic reactions that go to make up DNA: start with simple units that work together to form something much more complex.

(apologies for poor metaphors, I'm still trying to grasp some of the concepts involved with this)

edgyquant2y ago

Yes neural networks, and even the math required to build them, are very simple calc 1 stuff generally. It’s more coming up with these models that takes powerful intuition

dartos2y ago

I spent a solid month very confused after reading up on how to implement some basic neural networks.

I was sure I missed something, so I didn’t even try to implement it since I was so sure I missed the complicated bit.

But no, all the complexity is in the mathematical implications

theonlybutlet2y ago

Someone please correct me if I'm wrong or my terminology is wrong.

pmayrgundter2y ago

Transformers do have coefficients that are fit, but that's more broad.. could be used for any sort of regression or optimization, and not necessarily indicative of biological analogs.

So I think the terms "learned model" of "weights" are malapropisms for Transformers, carried over from deep nets because of structural similarities, like many layers, and the development workflow.

Attention as a powerful systemic optimization is the actual closer bit of neuro/bio-insporation here.. but more from Cog Psych than micro/neuro anatomy.

Btw, not only is attention a key insight for Transformers, but it's an interesting biographical note that the lead inventor of it, Jakob Uzkereit, went on to work on a bio-AI startup after Google.

kmmlng2y ago

Hebbian learning has never been used with much success in training neural nets. Backpropagation is not bio-inspired, but backpropagation is certainly used to train transformers.

1 more reply

theonlybutlet2y ago

1 more reply

brcmthrowaway2y ago

Does the human brain use transformers?

mirekrusin2y ago

Yes, through ie. services like openai's chatgpt.

exe342y ago

No, but they can both implement a language virtual machine which appears to be able to produce intelligent behaviour with unknown bounds.

__loam2y ago

Anyone telling you it does is a fraud.

utopcell2y ago

Here [1] are some "frauds" from Stanford University, Oxford University and University College London telling you exactly that.

From their abstract:

[1] https://arxiv.org/abs/2112.04035

__loam2y ago

Making the claim that transformers are a good candidate model for certain neural pathways is a pretty different claim than saying the brain is literally using transformers.

1 more reply

willy_k2y ago

patcon2y ago

https://medium.com/@Mosbeh_Barhoumi/forward-forward-algorith...

nostrademons2y ago

Yes. I still remember Optimus Prime and Bumblebee from my youth.

dartos2y ago

What?

MrVandemar2y ago

Definitely more than meets the eye.

godelski2y ago

I love and hate these.

First, the critique is coming from love. Great work, so I don't want it to be taken as I'm saying anything it isn't.

So, I do like the post, and I think it has good value (and certainly there is always value in teaching to learn!), but I disagree with the HN title and post's subtitle.

[0] https://news.ycombinator.com/item?id=38834244

osanseviero2y ago

Hey @godelski! Author of the blog post here.

I really appreciate you taking the time to provide all this feedback. This feedback + additional resources are extremely useful.

I also think a more rigorous, separate mathematical exploration into attention mechanisms and recent advancements would be a great tool for the ecosystem.

Once again, thank you for all the amazing feedback!

godelski2y ago

thrtythreeforty2y ago

pests2y ago

https://stats.stackexchange.com/questions/421935/what-exactl...

It mentions it comes from the original Attention Is All You Need paper and goes on into more detail.

It seems to be named exactly as you would expect. Key/Value as in KV store, with Query being the term being retrived.

CanuckPro2y ago

> Idk, are people interested in these things?

Yes, absolutely. Would be awesome to read deeper.

hnfong2y ago

Seconded.

godelski2y ago

Looks like I'll be writing a blog lol.

naitgacem2y ago

Reading the title I thought this was about electrical transformers :p

Although this is HN but my background is still stronger.

And by the way, is it worth it to invest time to get some idea about this whole AI field? I'm from a compE background

dataking2y ago

> is it worth it to invest time to get some idea about this whole AI field? I'm from a compE background

Might be worth thinking about how it will specifically affect your field of expertise. Jensen Huang says your job won't be taken over by an AI but by a human using an AI.

Luechkt2y ago

I knew there was more than meets the eye.

dingclancy2y ago

I am loving the Quarto website. I see more Python users using Quarto for publishing.

adamnemecek2y ago

It’s a renormalization process. It can be modelled as a convolution in a Hopf algebra.

chpatrick2y ago

It's simple, monads are just monoids in the category of endofunctors...

nemo85512y ago

There I was all excited to show off some of my electrical chops on HN.

Not today.

rzzzt2y ago

Does mystified math lie beyond behind how the ratio of input and output voltages is equal to the ratio of the primary and secondary windings? Can it be derived from Maxwell's equations?

Off to a search...

nihzm2y ago

amelius2y ago

I bet that an LLM (which uses transformers) can explain those aspects of a transformer to you.

wardedVibe2y ago

Or just read Wikipedia, which I'm sure it will crib from poorly... https://en.wikipedia.org/wiki/Transformer

nemo85512y ago

I seen an LLM, or maybe another variant of “AI” [0] a while back that could aid design of electronic circuits by having a pool of data sheets added for referencing.

As you were querying specs for a board at component level it could give you a schematic, I think, with citations to the actual data sheets.

I suppose the same scale up could be used for systems that needed a varying number of specific power supplies.

[0] https://www.flux.ai/p

1 more reply

j / k navigate · click thread line to collapse