
lelanthran 11mo ago

> The trick is memory bandwidth - not just the amount of VRAM - is important for LLM inference.

I'm not really knowledgeable about this space, so maybe I'm missing something:

Why does the bus performance affect token generation? I would expect it to cause a slow startup when loading the model, but once the model is loaded, just how much bandwidth can the token generation possibly use?

Token generation is completely on the card using the memory on the card, without any bus IO at all, no?

IOW, I'm trying to think of what IO the card is going to need for token generation, and I can't think of any other than returning the tokens (which, even on a slow 100MB/s transfer, is still going to be about 100x the rate at which tokens are being generated).


stevenhuang 11mo ago

During inference, each token passes through each parameter of the model as matrix-vector products. Then, as context grows, each new token also passes through all current context tokens as matrix-vector products.

This means bandwidth requirements grow as context sizes grow.

For datacenter workloads, batching can be used to make efficient use of this memory bandwidth and make things compute-bound instead.
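A rough way to see both effects is to count the bytes that have to be read from VRAM for each generated token. This is only a sketch: the function name, the 16 GB weight size (an 8B-parameter model at 2 bytes per parameter), and the 128 KiB of KV cache per context token are illustrative assumptions, not measurements.

```python
def bytes_read_per_token(weight_bytes, kv_bytes_per_token, context_len, batch_size=1):
    """Approximate VRAM bytes read to produce one token for one sequence.

    Assumptions (hypothetical model of the workload):
    - every weight is read once per decoding step, and that read is
      shared by all sequences in the batch;
    - each sequence reads its own KV cache, so that part is not shared.
    """
    shared_weights = weight_bytes / batch_size
    own_kv_cache = kv_bytes_per_token * context_len
    return shared_weights + own_kv_cache


WEIGHT_BYTES = 16 * 10**9      # assumed: 8B parameters at 2 bytes each
KV_BYTES_PER_TOKEN = 128 * 1024  # assumed: 128 KiB of KV cache per token

empty_context = bytes_read_per_token(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, 0)
long_context = bytes_read_per_token(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, 8000)
long_context_batched = bytes_read_per_token(
    WEIGHT_BYTES, KV_BYTES_PER_TOKEN, 8000, batch_size=32
)

print(f"no context, batch 1:     {empty_context / 1e9:.2f} GB per token")
print(f"8000 tokens, batch 1:    {long_context / 1e9:.2f} GB per token")
print(f"8000 tokens, batch 32:   {long_context_batched / 1e9:.2f} GB per token")
```

Under these assumptions the per-token traffic rises with context length, and batching cuts only the weight-read share, since each sequence still has to read its own context.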

lelanthran (OP) 11mo ago

[I'm still not understanding]

It seems to me that even if you pass in a long context on every prompt, the time to transfer that context is still tiny compared to the execution time on the processor/GPU/tensorcore/etc.

Let's say I load up a model of 12GB on my 12GB VRAM GPU. I pass in a prompt with 1MB of context, which causes a response of 500KB after 1s. That's still only 1.5MB of IO transferred in 1s, which kept the GPU busy for 1s. Increasing the prompt is going to increase the duration to a response accordingly.

Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.

imtringued 11mo ago

1MB of context can maybe hold 10 tokens depending on your model.

For reference, llama 3.2 8B used to take 4 KiB per token per layer. At 32 layers that is 128 KiB per token, or 8 tokens per MiB of KV cache (context). If your context holds 8000 tokens including responses, then you need around 1GB.
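The arithmetic above can be reproduced in a few lines. The 4 KiB per token per layer figure is consistent with storing keys and values for 8 KV heads of dimension 128 at 2 bytes per element; those three values are an assumption about where the number comes from, not something stated in the thread.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache size for a single token across all layers."""
    # Factor of 2: one key vector and one value vector per KV head.
    per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    return per_layer * n_layers


# Assumed configuration (hypothetical, chosen to match 4 KiB per layer).
per_token = kv_cache_bytes_per_token(
    n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2
)

tokens_per_mib = (1024 * 1024) // per_token
context_bytes = 8000 * per_token

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"Tokens per MiB:     {tokens_per_mib}")
print(f"8000-token context: {context_bytes / 1e9:.2f} GB")
```

So a 1MB context is only a handful of tokens, while a realistic 8000-token context adds roughly a gigabyte that has to be read again for every new token.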

> Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.

Matrix-vector multiplication implies a single floating-point multiplication and addition (2 flops) per parameter. Your GPU can do way more flops than that without using tensor cores at all. In fact, this workload bores your GPU to death.
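To put a number on "bores your GPU to death", compare how many flops the workload needs per byte of weights read with how many flops the GPU could do per byte it can fetch. The peak compute and bandwidth figures below are made-up round numbers for a hypothetical card, not the specs of any real product, and 2 bytes per parameter is an assumption.

```python
def arithmetic_intensity(bytes_per_param):
    """Flops needed per byte of weights read when generating one token.

    Each parameter contributes one multiply and one add (2 flops).
    """
    return 2 / bytes_per_param


def machine_balance(peak_flops, mem_bandwidth):
    """Flops the GPU can execute in the time it takes to read one byte."""
    return peak_flops / mem_bandwidth


# Assumed: weights stored at 2 bytes per parameter.
workload = arithmetic_intensity(bytes_per_param=2)

# Hypothetical GPU: 40 Tflop/s of compute, 500 GB/s of VRAM bandwidth.
gpu = machine_balance(peak_flops=40e12, mem_bandwidth=500e9)

is_memory_bound = workload < gpu

print(f"workload needs {workload:.0f} flop per byte read")
print(f"GPU can do     {gpu:.0f} flop per byte read")
print(f"memory-bound:  {is_memory_bound}")
```

With these assumed numbers the compute units could do about 80 times more work than the memory system can feed them, which is why the GPU sits mostly idle during single-stream token generation.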

zargon 11mo ago

> I feel that the GPU is still the bottleneck here, not the bus performance.

PCIe bus performance is basically irrelevant.

> Token generation is completely on the card using the memory on the card, without any bus IO at all, no?

Right. But the GPU can't instantaneously access data in VRAM. It has to be copied from VRAM to GPU registers first. For every token, the entire contents of VRAM have to be copied to the GPU to be computed. It's a memory-bound process.

Right now there's about an 8x difference in memory bandwidth between low-end and high-end consumer cards (e.g., 4060 Ti vs 5090). Moving up to a B200 more than doubles that performance again.

jononor 11mo ago

GPU memory bandwidth is the limiting factor, not PCIe bandwidth. The memory bandwidth is critical because the models rely on getting all the parameters from memory to do computation, and there is a low amount of computation per parameter, so memory tends to be the bottleneck.
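If every parameter has to be read once per token, memory bandwidth sets a hard ceiling on generation speed. Here is a minimal sketch of that ceiling for the 12GB model from the original question; the two bandwidth values are hypothetical stand-ins for a slower and a faster card, and the function name is made up for illustration.

```python
def max_tokens_per_second(mem_bandwidth_bytes_per_s, model_bytes):
    """Upper bound on tokens/s when all weights are read once per token.

    Ignores KV cache reads and compute time, so real throughput is lower.
    """
    return mem_bandwidth_bytes_per_s / model_bytes


MODEL_BYTES = 12e9  # the 12GB model from the question above

# Hypothetical VRAM bandwidths in bytes per second.
bandwidths = {"slower card": 300e9, "faster card": 1800e9}

ceilings = {
    name: max_tokens_per_second(bw, MODEL_BYTES)
    for name, bw in bandwidths.items()
}

for name, tps in ceilings.items():
    print(f"{name}: at most {tps:.0f} tokens/s")
```

The ceiling scales linearly with VRAM bandwidth and has nothing to do with PCIe, which only matters for loading the model and returning the tokens.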
