So, the final number would be ~0.6 GFLOPS (self-attention across heads) + ~0.15 GFLOPS (attention) + ~1 GFLOPS (ffwd) which in total give or take is ~2 GFLOPS per-layer.
Bandwidth-wise, the ~1GB number I previously gave was also wrong (llama3-70B has 8 KV heads). Now, with more precise calculations that figure is ~0.6 GB per-layer.
So, at batch_size=1, FP8 precision, 1024 tokens, during the decode phase with KV-cache, we need ~2GFLOPS of compute and ~0.6GB of bandwidth per each layer. Still looks compute-bound to me.