> The "Warp Scheduler" is a SIMD vector unit like the TPU VPU with 32 lanes, called "CUDA Cores"
It's not clear from the above what a "CUDA core" (singular) _is_ -- this is the archetypical "let me explain things to you" error most people make, usually in good faith -- if I don't know the material and am out to understand it, then you have gotten me to read all of it without ever making clear the very objects of your explanation.
And so, with these kinds of "compounding errors", the people the piece was likely targeted at are really none the wiser, while those who already have a good grasp of the concepts being explained, like what a CUDA core actually is, already know most of what the piece is trying to explain anyway.
My advice to everyone who starts out with a back-of-the-envelope cheatsheet and then decides to publish it "for the good of mankind", e.g. on GitHub: please be surgically precise with your terms -- the terms are your trading cards, then come the verbs, etc. I mean, this is all writing 101, but it's evidently a rare thing. Don't mix and match terms, don't conflate them (the reader will do it for you many times over for free if you're sloppy), and be diligent with analogies.
Evidently, the piece may have been written to help those already familiar with TPU terminology -- it mentions "MXU" but there's no telling what that is.
I understand this is a tall order, but the piece is long, and all the effort that was put in could have been complemented with minimal extra hypertext, like annotating abbreviations such as "MXU".
I can always ask $AI to do the equivalent for me, which is a tragedy according to some.
> please be surgically precise with your terms
There's always a tension between precision in every explanation and the "moral" truth. I can say "a SIMD (Single Instruction Multiple Data) vector unit like the TPU VPU with 32 ALUs (SIMD lanes) which NVIDIA calls CUDA Cores", which starts to get unwieldy and even then leaves terms like vector units undefined. I try to use footnotes liberally, but you have to believe the reader will click on them. Sidenotes are great, but hard to make work in HTML.
For terms like MXU, I was intending this to be a continuation of the previous several chapters which do define the term, but I agree it's maybe not reasonable to assume people will read each chapter.
There are other imprecisions here, like the term "Warp Scheduler" is itself overloaded to mean the scheduler, dispatch unit, and SIMD ALUs, which is kind of wrong but also morally true, since NVIDIA doesn't have a name for the combined unit. :shrug:
I agree with your points and will try to improve this more. It's just a hard set of compromises.
> Each SM is broken up into 4 identical quadrants, which NVIDIA calls SM subpartitions, each containing a Tensor Core, 16k 32-bit registers, and a SIMD/SIMT vector arithmetic unit called a Warp Scheduler, whose lanes (ALUs) NVIDIA calls CUDA Cores.
And right after:
> CUDA Cores: each subpartition contains a set of ALUs called CUDA Cores that do SIMD/SIMT vector arithmetic.
So, in your defense and to my shame -- you *did* do better than I was able to infer from first glance. And I can take absolutely no issue with a piece elaborating on an originally "vague" sentence later on -- we need to read top to bottom, after all.
Much of the difficulty with laying out knowledge in the written word comes from inherent constraints, like having to defer detail to "further down" at the expense of the "bird's eye view". I mean, there is a reason writing is hard, and technical writing perhaps more so, in a way. You're doing much better than a lot of other stuff I've had to learn with, so I can only thank you for having done as much as you already have.
To be more constructive still, I agree the border between clarity and utility isn't always clearly drawn. But I think you can think of it as a service to your readers -- go with precision, I say -- if you really presuppose the reader should know SIMD, chances are they can grok a new definition like "SIMD lane" if you define it _once_ and _well_. You don't need to be "unwieldy" through repetition -- the first time may be hard, but you only need to do it once.
I am rambling. I do believe there are better and worse ways to impart this kind of knowledge in writing, but I obviously don't have the answers either, so my criticism was in part unconstructive, just an outcry of mild frustration once I started conflating things from the get-go, before I decided to give it a more thorough read.
One last thing though: I always like it when a follow-up article starts with a preamble along the lines of "In the previous part of the series..." so new visitors can simultaneously become aware there's prior knowledge that may be assumed _and_ navigate their way to the desired point in the series, perhaps all the way to the start. That frees you from e.g. wanting to annotate abbreviations in every part, if you want to avoid doing that.
2) What are your thoughts on links to the wiki articles under things such as "SIMD" or "ALUs" for the precise meaning while using the metaphors in your prose?
Most novices tend to Google and end up on Wikipedia for the trees. It's harder to find the forest.
> This article assumes you've read [this] and [this] and understand [this topic] and [this topic too]
I'm not sure that's helpful, and I don't put everything in. Those links might also have further links saying you need X, Y, and Z. But at least there is a trail showing where to start.
A CUDA core is basically a SIMD lane on an actual core of an NVIDIA GPU.
For a longer version of this answer: https://stackoverflow.com/a/48130362/1593077
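To make that concrete, here's a tiny sketch of my own (not from the linked answer or the article): each thread of a kernel effectively occupies one SIMD lane -- one "CUDA core" -- for the instructions it executes, and 32 consecutive threads form a warp that issues together.

```cuda
// Hypothetical demo: derive each thread's warp and lane within its block.
#include <cstdio>

__global__ void lane_demo(int *out) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;   // which of the warp's 32 lanes ("CUDA cores") this thread maps to
    int warp = threadIdx.x / 32;   // which warp within the block
    out[tid] = warp * 100 + lane;  // e.g. thread 37 -> warp 1, lane 5 -> 105
}

int main() {
    int *d_out;
    cudaMalloc(&d_out, 64 * sizeof(int));
    lane_demo<<<1, 64>>>(d_out);   // one block of 64 threads = 2 warps
    int h_out[64];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("thread 37 -> %d\n", h_out[37]);  // prints 105
    cudaFree(d_out);
    return 0;
}
```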
I think you want a metaphor that doesn't also depend on its literal meaning.
should have most of it
From the resource intro:
> Expected background: We’re going to assume you have a basic understanding of LLMs and the Transformer architecture but not necessarily how they operate at scale.
I suppose this doesn’t require any knowledge about how computers work, but core CPU functionality seems…reasonable?
While I can understand the imprecise point, I found myself very impressed by the quality of the writing. I don't envy making digestible prose about the differences between GPUs and TPUs.
It's hard to find skills that don't have a degree of provincialism. It's not a great feeling, but you move on. IMO, don't over-idealize the concept of general knowledge to your detriment.
I think we can also untangle the open-source part from the general/provincial. There is more to the world worth exploring.
Stories of exploring DOS often ended up at hex editing and assembly.
Best to learn with whatever options are accessible, plenty is transferable.
While there is an embarrassment of options to learn from or with today, the greatest gaffe would be to overlook learning altogether.
There are enough people for whom it's worth it, even if just for tinkering, and I'm sure you are aware of that.
It reads a bit like "You shouldn't use it because..."
Learning about Nvidia GPUs will teach you a lot about other GPUs as well, and there are a lot of tutorials about the former, so why not use it if it interests you?
The software is proprietary, and easy to ignore if you don't plan to write low-level optimizations for NVIDIA.
However, the hardware architecture is worth knowing. All GPUs work roughly the same way (especially on the compute side), and the CUDA architecture is still fundamentally the same as it was in 2007 (just with more of everything).
It dictates how shader languages and GPU abstractions work, regardless of whether you're using proprietary or open implementations. It's very helpful to understand peculiarities of thread scheduling, warps, different levels of private/shared memory, etc. There's a ridiculous amount of computing power available if you can make your algorithms fit the execution model.
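As a hedged illustration of "fitting the execution model" (a sketch of mine, not something from the article): a block-wide sum reduction that stages data in the SM's shared memory and lets the whole block cooperate, instead of every thread hitting global memory independently.

```cuda
// Sketch: per-block sum reduction using on-chip shared memory (assumes 256-thread blocks).
#include <cstdio>

__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];              // fast memory shared by the block's threads
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? in[gid] : 0.0f;  // coalesced load from global memory
    __syncthreads();

    // Tree reduction in shared memory: log2(256) = 8 steps instead of 256 serial adds.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0]; // one partial sum per block
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, (n / 256) * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));  // stand-in data; real code would copy inputs here
    block_sum<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

The per-block partial sums in `out` would then be reduced again, or copied back and finished on the host.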
Sounds good on paper, but unfortunately I've had numerous issues with these "abstractors". For example, PyTorch had serious problems on Apple Silicon even though technically it should "just work" by hiding the implementation details.
In reality, what ends up happening is that some features in JAX, PyTorch, etc. are designed with CUDA in mind, and Apple Silicon is an afterthought.
Work keeps us humble enough to be open to learn.
When CUDA rose to prominence, were there any viable alternatives?
Better not learn CUDA then.
For most IT folks it doesn't make much sense.
Shamelessly: I’m open to work if anyone is hiring.
Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
For example, we don't know how many actual functional units an SM has; we don't know if the "tensor core" even _exists_ as a piece of hardware, or whether there's just some kind of orchestration of other functional units; and IIRC we don't know what exactly happens at the sub-warp level w.r.t. issuing and such.
What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.
In games, you can get NDA disclosures about architectural details that are closer to those docs. But I’ve never really seen any vendor (besides Intel) disclose this stuff publicly.
Much of the info on compute and bandwidth is from that and other architecture whitepapers, as well as the CUDA C++ programming guide, which covers a lot of what this article shares, in particular chapters 5, 6, and 7. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
There’s plenty of value in third parties distilling these into short-form versions and writing their own takes on this, but this article wouldn’t have been possible without NVIDIA’s docs, so the speculation, FUD, and shade are perhaps unjustified.
Yes, it's all TPU-focussed (other than this most recent part), but a lot of what it discusses are general principles you can apply elsewhere (or it's easy enough to see how you could generalise them).
Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.
Last I looked they would require the host to synthesize a suitable instruction stream for this on-the-fly with no existing tooling to do so efficiently.
An example where this would be relevant is the LLM inference prefill stage, with an (activated) MoE expert count on the order of, or a small integer factor smaller than, the prompt length, where you'd want to load only the needed experts and load each one at most once per layer.
In my experience you can get roughly an 8x speed improvement.
Turning a 4-second web response into half a second can be game-changing. But it is a lot easier to use a WebSocket and show a spinner, or cache the result in the background.
Running a GPU in the cloud is expensive
Mind you, I did all the long-range force stuff, which is difficult to work with over multiple nodes at the best of times.
They leave some performance on the table, but they gain flexible compilers.
The majority of NVidia's profits (almost 90%) do come from data center, most of which is going to be neural net acceleration, and I'd have to assume that they have optimized their data center products to maximize performance for typical customer workloads.
I'm sure that Microsoft would provide feedback to Nvidia if they felt changes were needed to better compete with Google in the cloud compute market.
Modern GPUs are still just SIMD with good predication support at the ISA level.
> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.
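A rough, unofficial sketch of what that difference feels like in practice: under SIMT you write per-thread scalar code and are free to branch per thread; if threads within the same 32-wide warp take different branches, the hardware runs both paths with the non-participating lanes masked off, which is why it's still "morally" SIMD underneath.

```cuda
// Hypothetical kernel showing warp divergence under the SIMT model.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f)         // lanes of the same warp may disagree on this data-dependent branch...
        x[i] = sqrtf(x[i]);  // ...so the warp executes this path with some lanes masked off,
    else
        x[i] = 0.0f;         // ...and then this path with the remaining lanes masked off.
}
```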
How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)
> There are plans to release a PDF version; need to fix some formatting issues + convert the animated diagrams into static images.
I don't see anything on the page about it, has there been an update on this? I'd love to put this on my e-reader.