https://github.com/vllm-project/vllm/releases/tag/v0.7.1
MHA is still faster in low QPS regime apparently.
https://neuralmagic.com/blog/enhancing-deepseek-models-with-...
Also published this month was theoretical proof showing that for the same KV Cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.
I am very curious to see how well-optimized Deepseek's code is compared to leading LLM serving softwares like vLLM or SGLang.
(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)
It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.
Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.
> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.
So, which one is it then?
From the authors of FlashAttention:
> This [decoding] operation has been optimized with FlashAttention (v1 and v2 recently) in the training case, where the bottleneck is the memory bandwidth to read and write the intermediate results
And then they continue with:
> However, these optimizations don’t apply directly to the inference case, because the bottlenecks are different. For training, FlashAttention parallelizes across the batch size and query length dimensions. During inference, the query length is typically 1 ... With a batch size of 1, FlashAttention will use less than 1% of the GPU!
And then they come up with a different proposal, FlashDecoding, that optimizes for inference time:
> Our new approach Flash-Decoding is based on FlashAttention, and adds a new parallelization dimension: the keys/values sequence length. It combines the benefits of the 2 approaches from above. Like FlashAttention, it stores very little extra data to global memory, however it fully utilizes the GPU even when the batch size is small, as long as the context length is large enough.
Link: https://crfm.stanford.edu/2023/10/12/flashdecoding.html
If Deepseek R1 had used standard MHA, they would need 1749KB per token for KV cache storage. This means that once the conversation reaches ~46,000 tokens, the KV cache will have exceeded the entire storage capacity of a single H100.
Using MLA, each token now consumes 125KB. This means you can hit ~640,000 tokens (2x Ulysses) before overflowing.
https://verticalserve.medium.com/group-query-attention-58283...
Unrelated, it's always impressed me how Singapore buys 15% of the world's h100's. Really is the AI development capital of the world.
Singapore is the billing location, not the shipping location, which makes sense because they’re the HQ of a lot of companies in the region.
https://www.tomshardware.com/tech-industry/deepseek-gpu-smug...
>Achieving up to 3000 GB/s in memory-bound configuration and 580 TFLOPS in computation-bound configuration on H800 SXM5, using CUDA 12.6.
Another country has X because they were expected (in the terms of their purchase) to not sell it to an adversary. So yes they’re supposed to honor that agreement and are not supposed to trade that particular thing X with each other. Not doing so invites sanctions and other consequences. Is it worth the risk just to do business with a dictatorship? Probably not.
(Showing my lack of breadth of knowledge in the ecosystem (s))
Inference providers like Fireworks, or major clouds, can use this to reduce their cost, if they don't already have a replication with similar perf.
vLLM and SGLang may integrate this to be faster at serving DeepSeek-V2/V2.5/V3/R1 on H100/H800s.
I believe that's why they didn't release this back then, this is part of their "moat" (pretty weak tho) and it only benefits competitors.
Open sourcing this after being very popular may indicate that they don't want all the users to use their API/Chat and now want the world to serve it instead? Idk.
There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this and the ones who are able to build or adapt projects like this which are deep into hardware systems will be the most sort after.
Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.
You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.
Do you feel GenAI coding is substantially different from the lineage of 4GL to 'low code' approaches?
Reason I'm asking is because despite all promises al suffered from what Spolsky coined the 'leaky abstraction' problem.
Once something goes wrong, the user is left without recourse in a sea of additional complexity created by the tooling that was meant to not have to deal with it in the first place.
My own opinion is that GenAI is different because of (a) its recursive reflexive potential (you can use the tool itself to help you past the failure) and (b) it shifts the input out of the necessity for algorithmic/systemic thinking (which may come as a surprise to the audience here but my experience has taught me is alien to dare I say the majority of people).
Now don't get me wrong. We have not reached the point where (a)+(b) make it to where you don't need application layer devs, but we are definitely seeing some progress.
As for going deeper into the stack to "escape" AI, I would venture that is probably a non starter as the deeper you go the more constrained the domain is, so your escape strategy relies on AI reasoning making little progress, where AI reasoning has always been more successful in smaller well defined spaces.
I do agree that if you are "only" a developer, you will have to be in some sort of tightly defined niche, and how long those niches survive is anyone's guess.
Incidentally, I put the word "only" in quotes because I morally and aesthetically appreciate the strength of someone who can write to spec. I have no interest in demeaning the effort it takes to do so. I have worked with supposedly senior developers who ignore specs completely, even when the specs are done by a technical person and include details / unit tests.
Yes, this unfortunately does mean a reduction in the less skilled workforce, but frankly that's an on the whole good thing. Does anyone really enjoy writing and testing boilerplate day in day out for low pay, it's the same as the old white collar pushing paper around until retirement...
"99% written by DeepSeek-R1" according to the author.
I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.