menaerus 1y ago

True. I wrote some software that does these calculations for me, in addition to the ones I already have on paper, and I confused two different graphs.

So the final number would be ~0.6 GFLOPS (self-attention across heads) + ~0.15 GFLOPS (attention) + ~1 GFLOPS (ffwd), which in total is ~2 GFLOPS per layer, give or take.

Bandwidth-wise, the ~1 GB number I previously gave was also wrong (llama3-70B has 8 KV heads). With more precise calculations, that figure is ~0.6 GB per layer.

So, at batch_size=1, FP8 precision, 1024 tokens, during the decode phase with KV-cache, we need ~2 GFLOPS of compute and ~0.6 GB of bandwidth per layer. Still looks compute-bound to me.

rfoo 1y ago

> Still looks compute-bound to me.

An H100 has 3.3 TB/s of HBM bandwidth on paper and ~1000 TFLOPS of bf16 compute on paper. That's 1:300, i.e. the hardware can do about 300 FLOPs for every byte it reads. 0.6 GB vs ~2 GFLOPS is 1:3. Tell me, how is this compute-bound?

(Also, your number, even after accounting for GQA, is still off. You usually can't store the KV cache in fp8.)
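
The ratio argument above can be checked with a few lines of Python. This is only a rough roofline-style sketch: it plugs in the approximate "on paper" figures quoted in this thread (not measured values), and the function name `classify` is made up for illustration. It compares the workload's FLOPs per byte against the hardware's FLOPs per byte, and also the time each resource would need on its own.

```python
# Approximate figures quoted in the thread; none of these are measurements.
PEAK_FLOPS = 1000e12     # ~1000 TFLOPS bf16 on an H100 (on paper)
PEAK_BANDWIDTH = 3.3e12  # ~3.3 TB/s HBM bandwidth (on paper)
LAYER_FLOPS = 2e9        # ~2 GFLOP of work per layer (menaerus's estimate)
LAYER_BYTES = 0.6e9      # ~0.6 GB of memory traffic per layer (menaerus's estimate)


def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Hypothetical helper: compare workload intensity with the hardware ridge point."""
    intensity = flops / bytes_moved         # FLOPs the workload does per byte moved
    ridge = peak_flops / peak_bandwidth     # FLOPs the hardware can do per byte moved
    compute_time = flops / peak_flops       # seconds if only compute mattered
    memory_time = bytes_moved / peak_bandwidth  # seconds if only bandwidth mattered
    bound = "compute-bound" if intensity > ridge else "memory-bound"
    return intensity, ridge, compute_time, memory_time, bound


intensity, ridge, compute_time, memory_time, bound = classify(
    LAYER_FLOPS, LAYER_BYTES, PEAK_FLOPS, PEAK_BANDWIDTH
)
print(f"workload intensity: {intensity:.1f} FLOPs/byte")
print(f"hardware ridge:     {ridge:.0f} FLOPs/byte")
print(f"compute time:       {compute_time * 1e6:.1f} us per layer")
print(f"memory time:        {memory_time * 1e6:.1f} us per layer")
print(f"verdict:            {bound}")
```

With these inputs the workload sits far below the ridge point, so the time to stream the bytes dominates the time to do the math. Real kernels never hit either peak, so the absolute times are optimistic, but the gap between the two ratios is large enough that the verdict should not change.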
