Training and prefill are compute-bound; decode is memory-bound. FlashAttention massively increases the arithmetic intensity of naive MHA (by fusing the softmax and avoiding materializing the full attention matrix in HBM), so you can remain compute-bound at lower batch sizes during decode.
It depends on the batch size and the accelerator you're running on! Decode is *typically* memory-bound unless you can reach high batch sizes (in the hundreds), which is hard in serving because of the tension between large batches and low time-to-first-token (TTFT).
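A minimal roofline sketch makes the "batch sizes in the hundreds" threshold concrete. The hardware numbers below are assumptions (roughly H100-class bf16 peak FLOP/s and HBM bandwidth), not figures from the discussion above; the model is the standard simplification where a decode-step weight matmul streams the weight matrix once and its arithmetic intensity grows linearly with batch size.

```python
# Roofline sketch: at what decode batch size does a weight matmul
# become compute-bound? Hardware numbers are illustrative assumptions
# (H100-like bf16 peak and HBM bandwidth).
PEAK_FLOPS = 989e12   # assumed peak bf16 FLOP/s
HBM_BW = 3.35e12      # assumed HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / HBM_BW  # FLOP/byte needed to be compute-bound

def matmul_intensity(batch: int, d_model: int, bytes_per_param: int = 2) -> float:
    """Arithmetic intensity of a [batch, d] x [d, d] matmul in bf16.

    FLOPs: 2 * batch * d^2 (one multiply-add per weight per batch row).
    Bytes: dominated by streaming the d x d weight matrix once at
    small batch, so intensity simplifies to ~batch for bf16 weights.
    """
    flops = 2 * batch * d_model**2
    bytes_moved = bytes_per_param * d_model**2
    return flops / bytes_moved

for b in (1, 32, 256, 512):
    ai = matmul_intensity(b, d_model=8192)
    bound = "compute" if ai > ridge else "memory"
    print(f"batch={b:4d}  intensity={ai:6.0f} FLOP/B  -> {bound}-bound")
```

With these assumed numbers the ridge point sits near 300 FLOP/byte, so batch 256 is still memory-bound while batch 512 crosses into compute-bound, matching the "hundreds" intuition. The same exercise with KV-cache reads (unique per sequence, so intensity does not grow with batch) shows why decode attention itself stays memory-bound regardless of batch size.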