You're confusing two things.
Classic softmax attention aka Softmax(Q K^T/sqrt(d_k))V consists of two matrix multiplications.
This means QK^T=O and then softmax(O/sqrt(d_k)V.
The matrix O is quadratic with respect to the number of input tokens. Writing the O matrix to main memory is bound by the maximum bandwidth of your memory.
Then it has to be read out again to be multiplied against V.
What flash attention does is change the algorithm. Flash attention is numerically similar to softmax attention, but not equivalent. The changed algorithm allows you to fuse the independent kernels.
Instead of writing out the O matrix to main memory, its softmax is calculated against V immediately. The double memory roundtrip is now gone. This in itself does not change the fact that both softmax attention and flash attention are quadratic with respect to the input, but it sure as hell improves the speed of "prefill".
If you tile the Q, K, V matrices into n blocks each, you will still have to load O(n^2) blocks.
But here is the thing. Matrix multiplication is an operation with a significant amount of shared data. This means the multipliers calculating the dot products are being fed from the same flip flops, or the data is shifted around via a systolic array. You end up in a situation with an insignificant memory load, but a massive amount of arithmetic.
In addition to that, you have all the tokens already, so the MLPs at the end of the layer can be processed as GEMM instead of GEMV.
This is why "prefill" is compute intensive instead of memory intensive.
During token generation, you need to perform attention for the next token, with all the tokens already in the KV cache. You load n entries from the KV cache, then do GEMV on the MLP and you have to do this over and over again in a sequential fashion. This means that memory bandwidth is the deciding factor for token generation.
Now here is a caveat: if SRAM is limited Vs your TOPS, then it is possible that even flash attention is memory bound, but for a different reason. It's memory bound, because the maximum tile size that can be held in SRAM can be processed faster than it takes to load it from system memory or VRAM and you are performing a quadratic amount of tile loading operations. This is only noticeable near the extreme top end of context lengths between 32k and 128k tokens.