For those reading it and going through each step: if by chance you get stuck on why there are 48 elements in the first array, please refer to model.py in minGPT [1]
It's an architectural decision that would be great to mention in the article, since people without much context might get lost
[1] https://github.com/karpathy/minGPT/blob/master/mingpt/model....
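A minimal numpy sketch of where that number likely comes from (toy sizes based on minGPT's tiny "gpt-nano" config, where n_embd = 48; minGPT itself uses PyTorch, so this is only an illustration of the shapes):

```python
import numpy as np

# In minGPT's model.py, a single linear layer (c_attn) projects each
# token's embedding into Q, K and V at once, so its output width is
# 3 * n_embd. With n_embd = 48 in the tiny demo config, that's where
# a 48-element vector per token comes from.
n_embd, seq_len = 48, 5

x = np.random.randn(seq_len, n_embd)           # token embeddings
w_attn = np.random.randn(n_embd, 3 * n_embd)   # c_attn weight (bias omitted)

qkv = x @ w_attn                               # (seq_len, 144)
q, k, v = np.split(qkv, 3, axis=-1)            # three (seq_len, 48) arrays
print(q.shape, k.shape, v.shape)               # (5, 48) (5, 48) (5, 48)
```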
I've recently finished an unorthodox kind of visualization / explanation of transformers. It's sadly not interactive, but it does have some maybe unique strengths.
First, it gives array axes semantic names, represented in the diagrams as colors (which this post also uses). So the sequence axis is red, the key feature dimension is green, the multihead axis is orange, etc. This helps you show quite complicated array circuits and get an immediate feeling for what is going on and how different arrays are being combined with each other. Here's a pic of the full multihead self-attention step, for example:
https://math.tali.link/raster/052n01bav6yvz_1smxhkus2qrik_07...
It also uses a kind of generalized tensor network diagrammatic notation -- if anyone remembers Penrose's tensor notation, it's like that but enriched with colors and some other ideas. Underneath, these diagrams are string diagrams in a particular category, though you don't need to know that (nor do I even explain it!).
Here's the main blog post introducing the formalism: https://math.tali.link/rainbow-array-algebra
Here's the section on perceptrons: https://math.tali.link/rainbow-array-algebra/#neural-network...
Here's the section on transformers: https://math.tali.link/rainbow-array-algebra/#transformers
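The "named axes as colors" idea has a plain-code analogue in einsum subscripts, where each letter plays the role of a named axis. A toy sketch (sizes are my own, not from the post) of multi-head attention written that way:

```python
import numpy as np

# Subscript letters as named axes -- h: head, s/t: sequence (query/key
# positions), d: per-head key dim, v: per-head value dim.
h, s, d, v = 2, 4, 8, 8
q = np.random.randn(h, s, d)
k = np.random.randn(h, s, d)
val = np.random.randn(h, s, v)

scores = np.einsum('hsd,htd->hst', q, k) / np.sqrt(d)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)      # softmax over the key axis
out = np.einsum('hst,htv->hsv', weights, val)
print(out.shape)   # (2, 4, 8)
```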
The frustration for the curious is that there is more than you can ever learn. You encounter something new and exciting, but then you realize that to really get to the spot where you can contribute will take at least a year or six, and that will require dropping other priorities.
Really nice stuff.
Since X now hides replies for non-logged-in users, here is a nitter link for those without an account (like me) who might want to see the full thread.
https://nitter.net/BrendanBycroft/status/1731042957149827140
Good visualization precedes good discoveries in many branches of science, I think.
(see my profile for a longer, potentially more silly description ;) )
Not only does it have the visualization, but it's interactive, has explanations for each item, has excellent performance, and is open source: https://github.com/bbycroft/llm-viz/blob/main/src/llm
Another interesting visualization related thing: https://github.com/shap/shap
Unlike traditional neural networks with fixed weights, self-attention layers adaptively weight connections between inputs based on context. This allows transformers to accomplish in a single layer what would take traditional networks multiple layers.
1. There are the model weights, aka the parameters. These are what get adjusted during training to do the learning part. They always exist.
2. There are attention weights. These are part of the transformer architecture and they “weight” the context of the input. They are ephemeral. Used and discarded. Don’t always exist.
They are both typically 32-bit floats in case you’re curious but still different concepts.
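A toy numpy illustration of the distinction (names and sizes are mine): the model weights are fixed parameters, while the attention weights are recomputed from every input and then thrown away.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Model weights: learned parameters that persist across inputs.
W_q, W_k = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def attention_weights(x):
    # Attention weights: ephemeral, a fresh matrix for each input x.
    q, k = x @ W_q, x @ W_k
    scores = q @ k.T / np.sqrt(d)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

a = attention_weights(rng.standard_normal((3, d)))
print(a.shape)   # (3, 3): one weight for each pair of positions
```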
Oh well... it seems like it's more confusing than I thought https://www.merriam-webster.com/wordplay/when-to-use-weigh-a...
On the training side I wouldn't be surprised if they were bf16 rather than fp32.
I think learning it by following the historical development can be helpful. E.g. in this case, learn the concept of attention, specifically cross-attention, first. That is this paper: Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", 2014, https://arxiv.org/abs/1409.0473
That paper introduces it. But even that is maybe quite dense, and to really grasp it, it helps to reimplement those things.
It's always dense, because those papers have space constraints set by the conferences, max 9 pages or so. To get a more detailed overview, you can study the authors' code, or other resources. There is a lot out there now on these topics, whole books, etc.
Imagine you read one comment in some forum, posted in a long conversation thread. It wouldn’t be obvious what’s going on unless you read more of the thread right?
A single paper is like a single comment, in a thread that goes on for years and years.
For example, why don't papers explain what tokens/vectors/embedding layers are? Well, they did already, except that comment in the thread came in 2013 with the word2vec paper!
You might think, wth? To keep up with this, someone would have to spend a huge part of their time just reading papers. So yeah, that's kind of what researchers do.
The alternative is to try to find where people have distilled down the important information or summarized it. That’s where books/blogs/youtube etc come in.
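For what that earlier "comment in the thread" established, here is an embedding layer in miniature (a toy sketch, my own names and sizes): a tokenizer maps strings to integer ids, and the embedding layer is just a learned lookup table of vectors indexed by those ids.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}
emb = np.random.randn(len(vocab), 8)   # (vocab_size, embedding_dim), learned

tokens = [vocab[w] for w in "the cat sat".split()]
vectors = emb[tokens]                  # plain row lookup, no matmul needed
print(vectors.shape)   # (3, 8)
```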
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
(I'm sorry)
where
- Q = X @ W_Q [query]
- K = X @ W_K [key]
- V = X @ W_V [value]
- X [input]
hence
attn_head_i = softmax(Q @ K.T / sqrt(d_k)) @ V [the normalizing term is sqrt(d_k), the key dimension]
Each head corresponds to a different concurrent routing system. The transformer just adds normalization and MLP feature-learning parts around that.
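The formulas above, run end to end for one head (a minimal numpy sketch; the shapes are my own toy choice, and the random matrices stand in for learned W_Q, W_K, W_V):

```python
import numpy as np

n, d_model, d_k = 5, 8, 8
rng = np.random.default_rng(1)
X = rng.standard_normal((n, d_model))                        # input
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)              # normalizing term: sqrt(d_k)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)    # softmax over keys
attn_head = weights @ V
print(attn_head.shape)   # (5, 8)
```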
This [1] post from 2021 goes over attention mechanisms as applied to RNN / LSTM networks. It's visual and goes into a bit more detail, and I've personally found RNN / LSTM networks easier to understand intuitively.
[1] https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-at...
Wrote about it here: https://about.xethub.com/blog/visualizing-ml-models-github-n...
What an exciting time to be learning about LLMs. Everyday I come across a new resource, and everything is free!
So much depth; initially I thought it was "just" a 3D model. The animations are amazing.
But for me it's really broken haha
1) When you zoom, the cursor doesn't stay in the same position relative to some projected point.
2) Panning also doesn't pin the cursor to a projected point; there's just a hacky multiplier there based on zoom.
The main issue is that I'm storing the view state as target (on 2D plane) + euler angles + distance. Which is easy to think about, but issues 1 & 2 are better solved by manipulating a 4x4 view matrix. So would just need a matrix -> target-vector-pair conversion to get that working.
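A hedged sketch of that matrix -> target/angles/distance conversion (my own assumptions: a right-handed look-at view matrix V = [R | t] with the camera looking down -Z in view space, and the pan target on a ground plane z = 0; a real implementation would also guard against a horizontal view direction, where the ray never hits the plane):

```python
import numpy as np

def decompose_view(V):
    R, t = V[:3, :3], V[:3, 3]
    eye = -R.T @ t                             # camera position in world space
    fwd = -R.T @ np.array([0.0, 0.0, 1.0])     # world-space view direction
    dist = -eye[2] / fwd[2]                    # ray / ground-plane intersection
    target = eye + dist * fwd                  # pan target on the z = 0 plane
    yaw = np.arctan2(fwd[1], fwd[0])
    pitch = np.arcsin(np.clip(fwd[2], -1.0, 1.0))
    return target, (yaw, pitch), dist

V = np.eye(4)
V[2, 3] = -5.0                                 # camera at (0,0,5) looking down
target, angles, dist = decompose_view(V)
print(target, dist)   # [0. 0. 0.] 5.0
```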
https://connectivity.brain-map.org/3d-viewer?v=1&types=PLY&P...
This pinning to one single plane works really well in this particular case because what you are showing is mostly a flat thing anyway, so you don't have much reason to put the view direction close to the plane. A straightforward extension of this behaviour would be to add a few more planes, like one for each of the cardinal directions, and just switch which plane the panning happens in; might be interesting to try for a more rounded object.
To me the zoom seems to do what is expected; zooming around the cursor position tends to be disorienting in 3D anyway, though maybe I didn't understand the problem you're describing.
Within the transformer section:
> As is common in deep learning, it's hard to say exactly what each of these layers is doing, but we have some general ideas: the earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.
That is the problem and yet these black boxes are just as explainable as a magic scroll.
For decades we’ve puzzled at how the inner workings of the brain work, and though we’ve learned a lot, we still don’t fully understand it. So, we figure, we’ll just make an artificial brain and THEN we’ll be able to figure it out.
And here we are, finally a big step closer to an artificial brain and once again, we don’t know how it works :)
(Although to be fair, we’re spending all of our efforts making the models better and better, and not on learning their low-level behaviors. Thankfully, when we decide to study them, it’ll be a wee bit less invasive and actually doable, in theory.)
Add to that the fact that a model is trained actively, the weights are given by humans, and the possible realm of outputs is heavily moderated by an army of faceless, low-paid workers; I don't see any semblance of a brain, but a very high-maintenance indexing engine sold to the marketing departments of the world as a "brain".
Also, speaking of transformers: they usually append their output tokens to the input and process everything again. Can we optimize this, so that we don't redo the same calculations on the same input tokens?
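The standard answer here is KV caching: keep each past token's key and value vectors around, so a new token only computes attention against the cache instead of reprocessing the whole prefix. A toy single-head sketch (names and sizes are my own):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W_k, W_v, W_q = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []

def step(x_new):                      # x_new: (d,) embedding of newest token
    k_cache.append(x_new @ W_k)       # cached once, never recomputed
    v_cache.append(x_new @ W_v)
    q = x_new @ W_q                   # only the new token needs a query
    K, V = np.stack(k_cache), np.stack(v_cache)
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V                      # attention output for the new token

out1 = step(rng.standard_normal(d))
out2 = step(rng.standard_normal(d))
print(out2.shape, len(k_cache))   # (8,) 2
```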
I wasn't expecting it to get quite this popular, so hadn't handled this (rather major) edge case.
Check here : https://get.webgl.org/webgl2/
I get the same fail in both firefox and (chromium-based) qutebrowser.
The web is where useful error messages go to die.
This is why I love Hacker News!
If you are a teaching faculty at a university, it is against your own interests to invest time to develop novel teaching materials. The exception might be writing textbooks, which can be monetized, but typically are a net-negative endeavor.
Also, according to his home page, Mr. Bycroft earned a BSc in Mathematics and Physics from the University of Canterbury in 2012. It's true that this page isn't the direct result of a professor or a university course, and it's also true that it's not a completely separate thing. It seems clear that his university education had a big role to play in creating this.
If big competitive grants and competitive salaries went to people with demonstrated ability like the engineer of this viz, there would be fewer STEM dropouts in colleges and more summer learning! Also, in technical trades like green construction, solar, HVAC, building retrofits, data center operations and the like, people would get further, and it would be a more diverse bunch.
Why does YouTube sometimes have better content than professionally produced media? It's a really long tail of creators and educators.
https://www.youtube.com/c/ProfGhristMath
This person is amazing.
Add that at research universities, they have to do research.
Also add in that at many schools, way too many students are just there to clock in, clock out, and get a piece of paper that says they did it. Way too few are there to actually get an education. This has very real consequences on the morale of the instructors. When your students don't care, it's very hard for you to care. If your students aren't willing to work hard, why are you willing to work hard? Because you're paid so well?
I know plenty of instructors who would love to do things like this, but when are they going to? When are they going to find the time to learn the skills necessary to build an interactive web app? You think everyone outside of comp sci and similar disciplines just naturally knows how to build these types of apps?
I could go on, but the tl;dr of it is: educators are overworked, underpaid, and don't have enough time in the day.