Training and prefill are compute-bound; decode is memory-bound. FlashAttention massively increases the arithmetic intensity of naive MHA (by fusing the softmax and avoiding materializing the full attention matrix in HBM), so you can remain compute-bound at lower batch sizes during decode.
It depends on the batch size and the accelerator you're running on! Decode is *typically* memory-bound unless you can reach high batch sizes (in the hundreds), which is hard in serving because of the tension between large batches and low time-to-first-token (TTFT).
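A minimal roofline sketch makes the "batch sizes in the hundreds" threshold concrete. The hardware numbers below are assumptions (roughly H100-class bf16 peak FLOP/s and HBM bandwidth), not figures from the discussion above; the model is the standard simplification where a decode-step weight matmul streams the weight matrix once and its arithmetic intensity grows linearly with batch size.

```python
# Roofline sketch: at what decode batch size does a weight matmul
# become compute-bound? Hardware numbers are illustrative assumptions
# (H100-like bf16 peak and HBM bandwidth).
PEAK_FLOPS = 989e12   # assumed peak bf16 FLOP/s
HBM_BW = 3.35e12      # assumed HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / HBM_BW  # FLOP/byte needed to be compute-bound

def matmul_intensity(batch: int, d_model: int, bytes_per_param: int = 2) -> float:
    """Arithmetic intensity of a [batch, d] x [d, d] matmul in bf16.

    FLOPs: 2 * batch * d^2 (one multiply-add per weight per batch row).
    Bytes: dominated by streaming the d x d weight matrix once at
    small batch, so intensity simplifies to ~batch for bf16 weights.
    """
    flops = 2 * batch * d_model**2
    bytes_moved = bytes_per_param * d_model**2
    return flops / bytes_moved

for b in (1, 32, 256, 512):
    ai = matmul_intensity(b, d_model=8192)
    bound = "compute" if ai > ridge else "memory"
    print(f"batch={b:4d}  intensity={ai:6.0f} FLOP/B  -> {bound}-bound")
```

With these assumed numbers the ridge point sits near 300 FLOP/byte, so batch 256 is still memory-bound while batch 512 crosses into compute-bound, matching the "hundreds" intuition. The same exercise with KV-cache reads (unique per sequence, so intensity does not grow with batch) shows why decode attention itself stays memory-bound regardless of batch size.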