DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs

Link: https://crfm.stanford.edu/2023/10/12/flashdecoding.html

rfoo 1y ago

You've got it backwards. After FlashAttention, it's the decoding part being bound mainly by memory access. With FA as long as you have enough batch size you can push training/prefill to be compute-bound.
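A back-of-the-envelope roofline sketch may help with the memory-bound versus compute-bound distinction. It assumes BF16 K/V, a head dimension of 128, and H100 SXM datasheet peaks (about 989 dense BF16 TFLOPS and 3.35 TB/s); it ignores Q/output traffic and softmax FLOPs. The function and parameter names are made up for illustration.

```python
# Roofline-style estimate for attention over a cached context.
# Assumed peaks (H100 SXM datasheet, dense BF16); not measured values.
PEAK_FLOPS = 989e12   # FLOP/s
PEAK_BW = 3.35e12     # bytes/s


def attention_intensity(queries_per_kv_row, head_dim=128, bytes_per_elem=2):
    """FLOPs per byte of K/V read, for one cached token.

    queries_per_kv_row: how many query vectors reuse the same K/V row
    (query length, times any query heads that share that K/V).
    """
    # q.k and p.v each cost 2 * head_dim FLOPs per query vector.
    flops = 4 * head_dim * queries_per_kv_row
    # One K row plus one V row is read from memory.
    bytes_read = 2 * head_dim * bytes_per_elem
    return flops / bytes_read


# Intensity at which the kernel stops being limited by memory bandwidth.
ridge = PEAK_FLOPS / PEAK_BW

for n in (1, 64, 512):
    ai = attention_intensity(n)
    bound = "compute" if ai >= ridge else "memory"
    print(f"{n:4d} queries per K/V row: {ai:6.1f} FLOP/byte -> {bound}-bound")
```

Under these assumptions a single decode query gets about 1 FLOP per byte, far below the ridge point of roughly 295, while a long prefill reuses each K/V row enough to cross it.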

menaerus 1y ago

I don't think I got it backwards, I believe what I said is correct - FA does not improve inference time.

From the authors of FlashAttention:

> This [decoding] operation has been optimized with FlashAttention (v1 and v2 recently) in the training case, where the bottleneck is the memory bandwidth to read and write the intermediate results

And then they continue with:

> However, these optimizations don’t apply directly to the inference case, because the bottlenecks are different. For training, FlashAttention parallelizes across the batch size and query length dimensions. During inference, the query length is typically 1 ... With a batch size of 1, FlashAttention will use less than 1% of the GPU!

And then they come up with a different proposal, FlashDecoding, that optimizes for inference time:

> Our new approach Flash-Decoding is based on FlashAttention, and adds a new parallelization dimension: the keys/values sequence length. It combines the benefits of the 2 approaches from above. Like FlashAttention, it stores very little extra data to global memory, however it fully utilizes the GPU even when the batch size is small, as long as the context length is large enough.
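The extra parallelization dimension described in that quote can be sketched in plain Python. This is only an illustration of the idea (split the keys/values into chunks, attend to each chunk independently, then merge the partial results with a log-sum-exp rescale), not the actual CUDA kernel; all function names here are hypothetical.

```python
import math


def attend_chunk(q, keys, values):
    """Attention over one KV chunk.

    Returns (unnormalized output, max score, sum of exp(score - max)).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, m, sum(weights)


def split_kv_attention(q, keys, values, chunk_size):
    """Process chunks independently, then merge with a log-sum-exp rescale.

    On a GPU each chunk would be handled by a separate thread block.
    """
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    global_max = max(m for _, m, _ in partials)
    # Rescale every chunk's statistics to the shared maximum before summing.
    denom = sum(s * math.exp(m - global_max) for _, m, s in partials)
    dim = len(values[0])
    return [
        sum(out[d] * math.exp(m - global_max) for out, m, _ in partials) / denom
        for d in range(dim)
    ]


def reference_attention(q, keys, values):
    """Ordinary softmax attention over the whole context, for comparison."""
    out, _, total = attend_chunk(q, keys, values)
    return [o / total for o in out]
```

Both functions should agree up to floating-point error for any chunk size, which is what makes the split safe: only a few scalars per chunk (the max and the sum) need to be written out besides the partial output.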

https://www.nvidia.com/en-us/data-center/h100/

albertzeyer 1y ago

I also just read that paper. But I wonder, even though MLA is strictly more powerful, do you really gain by that in experiments? This paper doesn't really do many experimental comparisons. GQA on the other side should still be faster (no need for an extra linear transformation).

helloericsf OP 1y ago

X: https://x.com/deepseek_ai/status/1893836827574030466

BF16 support

Paged KV cache (block size 64)

3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800
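For readers unfamiliar with the term, a paged KV cache stores each sequence's keys/values in fixed-size blocks and keeps a per-sequence block table. The toy sketch below shows only the bookkeeping, with the block size of 64 taken from the announcement; the class and method names are hypothetical and are not FlashMLA's API.

```python
BLOCK_SIZE = 64  # tokens per physical block, as quoted in the announcement


class PagedKVCache:
    """Toy block table: maps logical token positions to physical blocks."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # last block is full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("KV cache is out of free blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def locate(self, seq_id, position):
        """Translate a logical token position into (physical_block, offset)."""
        if not 0 <= position < self.seq_lens.get(seq_id, 0):
            raise IndexError("position is outside the stored sequence")
        table = self.block_tables[seq_id]
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and the attention kernel only needs the block table to find where each token's K/V lives.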

WithinReason 1y ago

That's 90% bandwidth efficiency and 60% compute efficiency
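Those percentages can be roughly reproduced as follows. The peak figures are assumptions taken from the H100 SXM datasheet (3.35 TB/s HBM3 bandwidth, about 989 dense BF16 TFLOPS), applied to the H800 on the premise, stated in a reply below, that the two share the same memory bandwidth and max FLOPS.

```python
# Assumed peaks (H100 SXM datasheet); the thread itself does not state them.
PEAK_BANDWIDTH_GBPS = 3350.0  # 3.35 TB/s HBM3
PEAK_BF16_TFLOPS = 989.0      # dense BF16, without sparsity


def efficiency(achieved, peak):
    """Fraction of the theoretical peak that was actually reached."""
    return achieved / peak


bandwidth_eff = efficiency(3000.0, PEAK_BANDWIDTH_GBPS)
compute_eff = efficiency(580.0, PEAK_BF16_TFLOPS)
print(f"bandwidth: {bandwidth_eff:.1%}, compute: {compute_eff:.1%}")
# prints "bandwidth: 89.6%, compute: 58.6%"
```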

helloericsf OP 1y ago

They don't have H100s. Wink, wink.

rfoo 1y ago

They have H800s, which have exactly the same memory bandwidth and max FLOPS.

https://verticalserve.medium.com/group-query-attention-58283...

FL33TW00D 1y ago

It seems to me that MLA will become the standard from here on out.

If Deepseek R1 had used standard MHA, they would need 1749KB per token for KV cache storage. This means that once the conversation reaches ~46,000 tokens, the KV cache will have exceeded the entire storage capacity of a single H100.

Using MLA, each token now consumes 125KB. This means you can hit ~640,000 tokens (2x Ulysses) before overflowing.
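The token counts follow from simple division. The sketch below takes the per-token sizes quoted above at face value and assumes an 80 GB H100 with decimal units (1 KB = 1000 bytes); it also ignores the memory taken by the model weights, so real limits would be lower.

```python
HBM_BYTES = 80 * 10**9  # assumed: 80 GB H100, decimal units


def max_tokens(bytes_per_token, capacity=HBM_BYTES):
    """Number of tokens whose KV cache fits if the whole GPU held nothing else."""
    return capacity // bytes_per_token


mha_tokens = max_tokens(1_749_000)  # 1749 KB per token, as quoted for MHA
mla_tokens = max_tokens(125_000)    # 125 KB per token, as quoted for MLA

print(mha_tokens)               # 45740
print(mla_tokens)               # 640000
print(mla_tokens / mha_tokens)  # roughly 14x more context
```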

ur-whale 1y ago

For those who wonder ... it's somewhat likely that MLA means Multi-head Latent Attention

https://paperswithcode.com/method/multi-head-attention

eigenvalue 1y ago

Nice, probably saved a bunch of FANG devs a lot of hours of work trying to knock this off.

nicce 1y ago

There were likely some startups that tried to sell the same thing…

anon389r58r58 1y ago

You mean like Modular?

nicce 1y ago

Or Silo AI (as an example of why) : https://www.silo.ai/blog/amd-to-acquire-silo-ai-to-expand-en...

imranq 1y ago

Dang only forward passes. The real secret was in the backward pass! I was also curious to learn how they implemented the dualpipe scheduler

rfoo 1y ago

Do they even have an optimized backward? It looks like optimizations like this aren't needed during training. Their V2 paper also suggests so.

mohsen1 1y ago

I'm confused. Weren't there sanctions against Chinese companies about Hopper GPUs? Are they just admitting that they had access to H100 against the US sanctions?!

thot_experiment 1y ago

Just the H100. The H800 is a region-specific version of the card for China with shitty NVLink bandwidth, which makes it rougher for building big clusters, but DeepSeek was able to mitigate the impact of that by being clever (rumored to have made significant use of PTX assembly instead of just using CUDA; we'll probably find out in the releases this week).

https://www.tomshardware.com/tech-industry/deepseek-gpu-smug...

ahofmann 1y ago

It isn't illegal for Chinese companies to buy H100 cards. It is illegal for USA companies to sell them to China. So the "admit" part wouldn't be on China's side.

jofzar 1y ago

It's also totally legal to sell H100 cards to a country that is very close to China.

Unrelated, it's always impressed me how Singapore buys 15% of the world's H100s. Really is the AI development capital of the world.

janalsncm 1y ago

It’s funny how this claim is able to make the rounds. I originally heard it here: https://m.youtube.com/watch?v=_1f-o0nqpEI

Singapore is the billing location, not the shipping location, which makes sense because they’re the HQ of a lot of companies in the region.

xbmcuser 1y ago

Not really. Singapore is a trading hub; a lot of multinational companies have regional offices or head offices in Singapore, so if the head office buys anything for anywhere, the purchase will show up as Singapore. Despite Nvidia showing such large revenue from Singapore, the actual number of GPUs shipped to Singapore is not that high. Not that some of the GPUs are not going to China, but there is a valid reason for Nvidia's Singapore revenue numbers.

samvher 1y ago

I can't tell if you're insinuating that Singapore is a pass-through for H100s heading towards China or whether there is some significant development taking place in Singapore that I'm unaware of?

amelius 1y ago

Also breaking the law to growth-hack happens all the time, see Uber.

kridsdale1 1y ago

And the British East India Company

Tiberium 1y ago

H800 is the export variant that they had access to. They directly reference it in the repo:

> Achieving up to 3000 GB/s in memory-bound configuration and 580 TFLOPS in computation-bound configuration on H800 SXM5, using CUDA 12.6.

WiSaGaN 1y ago

H20 is a Hopper GPU, and it is allowed to be sold in China.

jonplackett 1y ago

Can everyone stop downvoting people just for asking questions - this isn’t Stack Overflow!

feverzsj 1y ago

The secret ingredient is smuggling.

tasuki 1y ago

I'd be very careful when using that word in this situation. If China wants X, and another country has X, who are you to say they shouldn't trade with each other?

blackeyeblitzar 1y ago

Why does anyone need to be careful using that word? What a bizarre way to try to intimidate someone over speech.

Another country has X because they were expected (in the terms of their purchase) to not sell it to an adversary. So yes they’re supposed to honor that agreement and are not supposed to trade that particular thing X with each other. Not doing so invites sanctions and other consequences. Is it worth the risk just to do business with a dictatorship? Probably not.

quantum_state 1y ago

We should forget about the sanction BS … it damages US industry when it has money to make while motivating others to be more self reliant and build the product to compete …

randomNumber7 1y ago

Donald Trump?

7952 1y ago

Do you think that would be morally wrong? Honest question.

amelius 1y ago

No, especially considering that they open sourced everything. (not OP)

Also, they could have outsourced the computation to a subsidiary company in the US, I suppose.

rob_c 1y ago

Great work! Any plans to integrate with pyT or TF, I wonder?

(Showing my lack of breadth of knowledge in the ecosystem (s))

behnamoh 1y ago

Open AI is back!

echelon 1y ago

The real "Open" AI.

fsndz 1y ago

DeepSeek is just the gift that keeps on giving. I now agree with people who say open source AI will win: https://open.substack.com/pub/transitions/p/deepseek-is-comi...

baq 1y ago

Open sourcing is the runner-up’s way to ensure the current best player doesn’t steal the whole market. The elephant in the room is obviously the cluster size required, it hardly matters for normal people that the weights are free. We needed more efficiency breakthroughs.

https://sakana.ai/ai-cuda-engineer/

mclau156 1y ago

Was really hoping we could get flash games back with AI

kridsdale1 1y ago

Ask an LLM to write you some ActionScript3

syntex 1y ago

What can I do with that?

rfoo 1y ago

Probably nothing.

Inference providers like Fireworks, or major clouds, can use this to reduce their cost, if they don't already have a replication with similar perf.

vLLM and SGLang may integrate this to be faster at serving DeepSeek-V2/V2.5/V3/R1 on H100/H800s.

I believe that's why they didn't release this back then, this is part of their "moat" (pretty weak tho) and it only benefits competitors.

Open sourcing this after being very popular may indicate that they don't want all the users to use their API/Chat and now want the world to serve it instead? Idk.

rvz 1y ago

This is the minimum bar that I expect very elite programmers should be striving for in the age of AI. DeepSeek should be studied as an example, and this is only just the first of many projects from them.

There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the ones who are able to build or adapt projects like this, which are deep into hardware systems, will be the most sought after.

Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.

You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.

PeterStuer 1y ago

Honest question:

Do you feel GenAI coding is substantially different from the lineage of 4GL to 'low code' approaches?

Reason I'm asking is because, despite all promises, all suffered from what Spolsky coined the 'leaky abstraction' problem.

Once something goes wrong, the user is left without recourse in a sea of additional complexity created by the tooling that was meant to not have to deal with it in the first place.

My own opinion is that GenAI is different because of (a) its recursive reflexive potential (you can use the tool itself to help you past the failure) and (b) it shifts the input out of the necessity for algorithmic/systemic thinking (which may come as a surprise to the audience here but my experience has taught me is alien to dare I say the majority of people).

Now don't get me wrong. We have not reached the point where (a)+(b) make it to where you don't need application layer devs, but we are definitely seeing some progress.

As for going deeper into the stack to "escape" AI, I would venture that is probably a non starter as the deeper you go the more constrained the domain is, so your escape strategy relies on AI reasoning making little progress, where AI reasoning has always been more successful in smaller well defined spaces.

jbm 1y ago

It's an interesting opinion, but I read the exact same opinions about JS developers in 2008 too.

I do agree that if you are "only" a developer, you will have to be in some sort of tightly defined niche, and how long those niches survive is anyone's guess.

KeplerBoy 1y ago

What do you mean with "only" developer? Someone who just knows how to code when given a spec but lacking domain knowledge (in this case ai math and hardware optimization) and larger context?

jbm 1y ago

Personally, I don't think having deep domain knowledge is as important. However, being able to write the spec based on interactions with the client / customer / stakeholders is. (The AI Math and hardware optimization "never" being doable by an AI seems like an arbitrary distinction to justify one's choices.)

Incidentally, I put the word "only" in quotes because I morally and aesthetically appreciate the strength of someone who can write to spec. I have no interest in demeaning the effort it takes to do so. I have worked with supposedly senior developers who ignore specs completely, even when the specs are done by a technical person and include details / unit tests.

menaerus 1y ago

I agree that DeepSeek continues to prove themselves as a great example of engineering but the number of job positions requiring this type of knowledge IME is typically very very low so I am not sure if this would be the right advice to follow. Though I wish it was different.

rob_c 1y ago

Yeah, you're hitting the nail on the head. Low tier coding work can be reduced and the high end developers can now avoid boiler plate type coding problems and get back to high level work at reengineering complex frameworks.

Yes, this unfortunately does mean a reduction in the less skilled workforce, but frankly that's an on the whole good thing. Does anyone really enjoy writing and testing boilerplate day in day out for low pay, it's the same as the old white collar pushing paper around until retirement...

beernet 1y ago

LLM generated comments are so 2024

BoorishBears 1y ago

Nothing about that comment implies it's LLM generated, and it's bizarre how it's being received since it's a pretty reasonable take.

rnewme 1y ago

I don't find it a reasonable take, it's like saying stackoverflow.com is taking developer jobs by making it easy to code, we better develop new stackoverflow.com

WithinReason 1y ago

AI is already writing optimized GPU code:

mirekrusin 1y ago

Comments around that page suggest it's more of a facepalm than anything else.

CamperBob2 1y ago

x2 speed increase for ggml by optimizing SIMD: https://github.com/ggml-org/llama.cpp/pull/11453

"99% written by DeepSeek-R1" according to the author.

https://github.com/vllm-project/vllm/releases/tag/v0.7.1

m3kw9 1y ago

MHGA making hopper great again

deyiao 1y ago

I heard their inferencing framework is way lower than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vLLM or llama.cpp?

reissbaker 1y ago

By "lower" you mean cheaper/better?

I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.

nialv7 1y ago

I think they meant lower level.

bee_rider 1y ago

It seems hard to guess. Could be lower level, lower performance, or lower compute cost.

helloericsf OP 1y ago

What do you mean by "lower"? To my understanding, they will open 5 infra related repos this week. Let's revisit your comparison question on Friday.

find0x90 1y ago

I don't see any use of PTX, might be in one of the other repos they plan to release.

DesiLurker 1y ago

Right, I think PTX use is a bigger deal than it's getting coverage for. This creates an opening for other vendors to get their foot in with PTX to LLVM-IR translation for existing CUDA kernels.

feverzsj 1y ago

Maybe. Apple ditched them in China, because their infra can't handle large scale users.

helloericsf OP 1y ago

Don't think the decision is based on infra, or any technical reasons. It's more on the service support side. How does a 200-person company support 44M iPhone users in China?

chvid 1y ago

Is that true? I thought Apple was going to use their own infrastructure.

tw1984 1y ago

DeepSeek doesn't have any experience supporting a 50 million user base. That was the reason cited by Apple a few weeks ago.


108 comments

refibrillator 1y ago

vLLM supports MLA for Deepseek models as of 3 weeks ago. 3x higher generation throughput and 10x token memory capacity.

MHA is still faster in low QPS regime apparently.

https://neuralmagic.com/blog/enhancing-deepseek-models-with-...

https://arxiv.org/pdf/2502.07864

shihab 1y ago

For future readers, note that those 3x and 10x figures are compared to vLLM's own previous release, and NOT compared to Deepseek's implementation.

I am very curious to see how well-optimized Deepseek's code is compared to leading LLM serving software like vLLM or SGLang.

lhl 1y ago

(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)

It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.

menaerus 1y ago

FL33TW00D 1y ago

You have it backwards.

menaerus 1y ago

> Decode is memory bound.

> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.

So, which one is it then?
