For those reading it and going through each step: if by chance you get stuck on why there are 48 elements in the first array, please refer to model.py in minGPT [1]
It's an architectural decision that would be great to mention in the article, since people without much context might get lost
[1] https://github.com/karpathy/minGPT/blob/master/mingpt/model....
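A minimal numpy sketch of where that number likely comes from (toy sizes based on minGPT's tiny "gpt-nano" config, where n_embd = 48; minGPT itself uses PyTorch, so this is only an illustration of the shapes):

```python
import numpy as np

# In minGPT's model.py, a single linear layer (c_attn) projects each
# token's embedding into Q, K and V at once, so its output width is
# 3 * n_embd. With n_embd = 48 in the tiny demo config, that's where
# a 48-element vector per token comes from.
n_embd, seq_len = 48, 5

x = np.random.randn(seq_len, n_embd)           # token embeddings
w_attn = np.random.randn(n_embd, 3 * n_embd)   # c_attn weight (bias omitted)

qkv = x @ w_attn                               # (seq_len, 144)
q, k, v = np.split(qkv, 3, axis=-1)            # three (seq_len, 48) arrays
print(q.shape, k.shape, v.shape)               # (5, 48) (5, 48) (5, 48)
```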
I've recently finished an unorthodox kind of visualization / explanation of transformers. It's sadly not interactive, but it does have some maybe unique strengths.
First, it gives array axes semantic names, represented in the diagrams as colors (which this post also uses). So the sequence axis is red, the key feature dimension is green, the multihead axis is orange, etc. This helps you show quite complicated array circuits and get an immediate feeling for what is going on and how different arrays are being combined with each other. Here's a pic of the full multihead self-attention step, for example:
https://math.tali.link/raster/052n01bav6yvz_1smxhkus2qrik_07...
It also uses a kind of generalized tensor network diagrammatic notation -- if anyone remembers Penrose's tensor notation, it's like that but enriched with colors and some other ideas. Underneath, these diagrams are string diagrams in a particular category, though you don't need to know that (nor do I even explain it!).
Here's the main blog post introducing the formalism: https://math.tali.link/rainbow-array-algebra
Here's the section on perceptrons: https://math.tali.link/rainbow-array-algebra/#neural-network...
Here's the section on transformers: https://math.tali.link/rainbow-array-algebra/#transformers
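The "named axes as colors" idea has a plain-code analogue in einsum subscripts, where each letter plays the role of a named axis. A toy sketch (sizes are my own, not from the post) of multi-head attention written that way:

```python
import numpy as np

# Subscript letters as named axes -- h: head, s/t: sequence (query/key
# positions), d: per-head key dim, v: per-head value dim.
h, s, d, v = 2, 4, 8, 8
q = np.random.randn(h, s, d)
k = np.random.randn(h, s, d)
val = np.random.randn(h, s, v)

scores = np.einsum('hsd,htd->hst', q, k) / np.sqrt(d)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)      # softmax over the key axis
out = np.einsum('hst,htv->hsv', weights, val)
print(out.shape)   # (2, 4, 8)
```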
The frustration for the curious is that there is more than you can ever learn. You encounter something new and exciting, but then you realize that to really get to the spot where you can contribute will take at least a year or six, and that will require dropping other priorities.
Really nice stuff.
Since X now hides replies for non-logged-in users, here is a nitter link for those without an account (like me) who might want to see the full thread.
https://nitter.net/BrendanBycroft/status/1731042957149827140
Good visualization precedes good discoveries in many branches of science, I think.
(see my profile for a longer, potentially more silly description ;) )
Not only does it have the visualization, but it's interactive, has explanations for each item, has excellent performance, and is open source: https://github.com/bbycroft/llm-viz/blob/main/src/llm
Another interesting visualization related thing: https://github.com/shap/shap
Unlike traditional neural networks with fixed weights, self-attention layers adaptively weight connections between inputs based on context. This allows transformers to accomplish in a single layer what would take traditional networks multiple layers.
1. There are the model weights, aka the parameters. These are what get adjusted during training to do the learning part. They always exist.
2. There are attention weights. These are part of the transformer architecture and they “weight” the context of the input. They are ephemeral. Used and discarded. Don’t always exist.
They are both typically 32-bit floats in case you’re curious but still different concepts.
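A toy numpy illustration of the distinction (names and sizes are mine): the model weights are fixed parameters, while the attention weights are recomputed from every input and then thrown away.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Model weights: learned parameters that persist across inputs.
W_q, W_k = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def attention_weights(x):
    # Attention weights: ephemeral, a fresh matrix for each input x.
    q, k = x @ W_q, x @ W_k
    scores = q @ k.T / np.sqrt(d)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

a = attention_weights(rng.standard_normal((3, d)))
print(a.shape)   # (3, 3): one weight for each pair of positions
```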
Oh well... it seems like it's more confusing than I thought https://www.merriam-webster.com/wordplay/when-to-use-weigh-a...
On the training side I wouldn't be surprised if they were bf16 rather than fp32.
I think learning it by following the historical development can be helpful. E.g. in this case, learn the concept of attention, specifically cross-attention, first. That is this paper: Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", 2014, https://arxiv.org/abs/1409.0473
That paper introduces it. But even that is maybe quite dense, and to really grasp it, it helps to reimplement those things.
It's always dense, because those papers have space constraints set by the conferences, max 9 pages or so. To get a more detailed overview, you can study the authors' code, or other resources. There is a lot out there now on these topics, whole books, etc.
Imagine you read one comment in some forum, posted in a long conversation thread. It wouldn’t be obvious what’s going on unless you read more of the thread right?
A single paper is like a single comment, in a thread that goes on for years and years.
For example, why don't papers explain what tokens/vectors/embedding layers are? Well, they did already, except that comment in the thread came in 2013 with the word2vec paper!
You might think, wth? To keep up with this, someone would have to spend a huge part of their time just reading papers. So yeah, that's kind of what researchers do.
The alternative is to try to find where people have distilled down the important information or summarized it. That’s where books/blogs/youtube etc come in.
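For what that earlier "comment in the thread" established, here is an embedding layer in miniature (a toy sketch, my own names and sizes): a tokenizer maps strings to integer ids, and the embedding layer is just a learned lookup table of vectors indexed by those ids.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}
emb = np.random.randn(len(vocab), 8)   # (vocab_size, embedding_dim), learned

tokens = [vocab[w] for w in "the cat sat".split()]
vectors = emb[tokens]                  # plain row lookup, no matmul needed
print(vectors.shape)   # (3, 8)
```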
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
(I'm sorry)
where
- Q = X @ W_Q [query]
- K = X @ W_K [key]
- V = X @ W_V [value]
- X [input]
hence
attn_head_i = softmax(Q @ K.T / sqrt(d_k)) @ V [the normalizing term is sqrt(d_k), the key dimension]
Each head corresponds to a different concurrent routing system. The transformer just adds normalization and MLP feature-learning parts around that.
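The formulas above, run end to end for one head (a minimal numpy sketch; the shapes are my own toy choice, and the random matrices stand in for learned W_Q, W_K, W_V):

```python
import numpy as np

n, d_model, d_k = 5, 8, 8
rng = np.random.default_rng(1)
X = rng.standard_normal((n, d_model))                        # input
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)              # normalizing term: sqrt(d_k)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)    # softmax over keys
attn_head = weights @ V
print(attn_head.shape)   # (5, 8)
```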
This [1] post from 2021 goes over attention mechanisms as applied to RNN / LSTM networks. It's visual and goes into a bit more detail, and I've personally found RNN / LSTM networks easier to understand intuitively.
[1] https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-at...
Wrote about it here: https://about.xethub.com/blog/visualizing-ml-models-github-n...
What an exciting time to be learning about LLMs. Everyday I come across a new resource, and everything is free!
So much depth; initially I thought it was "just" a 3D model. The animations are amazing.
But for me it's really broken haha
1) When you zoom, the cursor doesn't stay in the same position relative to some projected point.
2) Panning also doesn't pin the cursor to a projected point; there's just a hacky multiplier there based on zoom.
The main issue is that I'm storing the view state as target (on 2D plane) + euler angles + distance. Which is easy to think about, but issues 1 & 2 are better solved by manipulating a 4x4 view matrix. So would just need a matrix -> target-vector-pair conversion to get that working.
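A hedged sketch of that matrix -> target/angles/distance conversion (my own assumptions: a right-handed look-at view matrix V = [R | t] with the camera looking down -Z in view space, and the pan target on a ground plane z = 0; a real implementation would also guard against a horizontal view direction, where the ray never hits the plane):

```python
import numpy as np

def decompose_view(V):
    R, t = V[:3, :3], V[:3, 3]
    eye = -R.T @ t                             # camera position in world space
    fwd = -R.T @ np.array([0.0, 0.0, 1.0])     # world-space view direction
    dist = -eye[2] / fwd[2]                    # ray / ground-plane intersection
    target = eye + dist * fwd                  # pan target on the z = 0 plane
    yaw = np.arctan2(fwd[1], fwd[0])
    pitch = np.arcsin(np.clip(fwd[2], -1.0, 1.0))
    return target, (yaw, pitch), dist

V = np.eye(4)
V[2, 3] = -5.0                                 # camera at (0,0,5) looking down
target, angles, dist = decompose_view(V)
print(target, dist)   # [0. 0. 0.] 5.0
```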
https://connectivity.brain-map.org/3d-viewer?v=1&types=PLY&P...
This pinning to one single plane works really well in this particular case because what you are showing is mostly a flat thing anyway, so you don't have much reason to put the view direction close to the plane. A straightforward extension of this behaviour would be to add a few more planes, like one for each of the cardinal directions, and just switch which plane the panning happens in; might be interesting to try for a more rounded object.
To me the zoom seems to do what is expected; zooming around the cursor position tends to be disorienting in 3D anyway, though maybe I didn't understand the problem you're describing.
Within the transformer section:
> As is common in deep learning, it's hard to say exactly what each of these layers is doing, but we have some general ideas: the earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.
That is the problem and yet these black boxes are just as explainable as a magic scroll.
For decades we’ve puzzled at how the inner workings of the brain work, and though we’ve learned a lot, we still don’t fully understand it. So, we figure, we’ll just make an artificial brain and THEN we’ll be able to figure it out.
And here we are, finally a big step closer to an artificial brain and once again, we don’t know how it works :)
(Although to be fair, we’re spending all of our efforts making the models better and better, and not on learning their low-level behaviors. Thankfully, when we decide to study them, it’ll be a wee bit less invasive and actually doable, in theory.)
Add to that the fact that a model is trained actively, the weights are given by humans, and the possible realm of outputs is heavily moderated by an army of faceless, low-paid workers; I don't see any semblance of a brain, but a very high-maintenance indexing engine sold to the marketing departments of the world as a "brain".
Also, speaking of transformers: they usually append their output tokens to the input and process everything again. Can we optimize this, so that we don't redo the same calculations on the same input tokens?
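The standard answer here is KV caching: keep each past token's key and value vectors around, so a new token only computes attention against the cache instead of reprocessing the whole prefix. A toy single-head sketch (names and sizes are my own):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W_k, W_v, W_q = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []

def step(x_new):                      # x_new: (d,) embedding of newest token
    k_cache.append(x_new @ W_k)       # cached once, never recomputed
    v_cache.append(x_new @ W_v)
    q = x_new @ W_q                   # only the new token needs a query
    K, V = np.stack(k_cache), np.stack(v_cache)
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V                      # attention output for the new token

out1 = step(rng.standard_normal(d))
out2 = step(rng.standard_normal(d))
print(out2.shape, len(k_cache))   # (8,) 2
```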
I wasn't expecting it to get quite this popular, so hadn't handled this (rather major) edge case.
Check here : https://get.webgl.org/webgl2/
I get the same fail in both firefox and (chromium-based) qutebrowser.
The web is where useful error messages go to die.
This is why I love Hacker News!
If you are a teaching faculty at a university, it is against your own interests to invest time to develop novel teaching materials. The exception might be writing textbooks, which can be monetized, but typically are a net-negative endeavor.
Also, according to his home page, Mr. Bycroft earned a BSc in Mathematics and Physics from the University of Canterbury in 2012. It's true that this page isn't the direct result of a professor or a university course, and it's also true that it's not a completely separate thing. It seems clear that his university education had a big role to play in creating this.
If big competitive grants and competitive salaries went to people with demonstrated ability like the engineer of this viz, there would be fewer STEM dropouts in colleges and more summer learning! Also, in technical trades like green construction, solar, HVAC, building retrofits, data center operations and the like, people would get further, and it would be a more diverse bunch.
Why does YouTube sometimes have better content than professionally produced media? It's a really long tail of creators and educators.
https://www.youtube.com/c/ProfGhristMath
This person is amazing.
Add that at research universities, they have to do research.
Also add in that at many schools, way too many students are just there to clock in, clock out, and get a piece of paper that says they did it. Way too few are there to actually get an education. This has very real consequences on the morale of the instructors. When your students don't care, it's very hard for you to care. If your students aren't willing to work hard, why are you willing to work hard? Because you're paid so well?
I know plenty of instructors who would love to do things like this, but when are they going to? When are they going to find the time to learn the skills necessary to build an interactive web app? You think everyone outside of comp sci and similar disciplines just naturally knows how to build these types of apps?
I could go on, but the tl;dr of it is: educators are overworked, underpaid, and don't have enough time in the day.