undefined | Better HN

0 pointsmenaerus1y ago0 comments

> Decode is memory bound.

> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.

So, which one is it then?

0 comments

It depends on the batch size and the accelerator you're running on! Decode is *typically* memory bound unless you can hit high batch sizes (in the hundreds), which is hard during serving due to the contention between batch size and low TTFT.

https://jax-ml.github.io/scaling-book/inference/ - good read!

j / k navigate · click thread line to collapse

0 comments

FL33TW00D1y ago

https://jax-ml.github.io/scaling-book/inference/ - good read!

j / k navigate · click thread line to collapse