KeplerBoy 2y ago

Just for clarification: the GTX 1080 has 20 SMs with 128 FPUs each. Each FPU can perform 2 FLOPs per cycle (one fused multiply-add). Combined with the base clock of 1607 MHz, we land on the advertised 8.2 TFLOP/s.
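As a quick check of that arithmetic, here is a minimal sketch that multiplies out the figures quoted above. The constant and function names are mine, chosen for illustration only.

```python
# Peak single-precision throughput from the figures quoted in the comment.
SM_COUNT = 20            # streaming multiprocessors on the GTX 1080
FPUS_PER_SM = 128        # FP32 units per SM
FLOPS_PER_FMA = 2        # a fused multiply-add counts as two FLOPs
CLOCK_HZ = 1607e6        # 1607 MHz


def peak_tflops(sm_count, fpus_per_sm, flops_per_cycle, clock_hz):
    """Theoretical peak in TFLOP/s: units * FLOPs per cycle * cycles per second."""
    return sm_count * fpus_per_sm * flops_per_cycle * clock_hz / 1e12


PEAK = peak_tflops(SM_COUNT, FPUS_PER_SM, FLOPS_PER_FMA, CLOCK_HZ)
print(f"{PEAK:.2f} TFLOP/s")  # prints "8.23 TFLOP/s"
```

Note that the thread count never enters the formula; only the number of FPUs and the clock do.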

The fact that each SM can keep up to 2048 threads resident (CUDA's maximum block size on that card is 1024) doesn't do much for the theoretical FLOPS. Only a fraction of those threads can execute at any given time. The others are idle or waiting on their memory requests. Keeping them resident is what hides much of the memory latency.

pavlov 2y ago

For sure. Just counting threads doesn't give anything like a complete picture of performance.

It's still somewhat interesting because threads are a low-level programming primitive. If you can come up with work for 40k simultaneous threads, you can use the GPU effectively. For some tasks this parallelization is obvious (an HD video frame has about 2 million pixels, and shading them independently is trivial), and of course often it's anything but.
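To put the pixel example in numbers, here is a rough sketch comparing a 1920x1080 frame with the threads the card can keep resident. It assumes 2048 resident threads per SM (which is what the 40k figure implies for 20 SMs) and a hypothetical 16x16 thread block with one thread per pixel. The "waves" figure is a simplification: real occupancy also depends on register and shared-memory use.

```python
SM_COUNT = 20
THREADS_PER_SM = 2048        # assumption: resident-thread limit per SM
FRAME_W, FRAME_H = 1920, 1080
BLOCK_W, BLOCK_H = 16, 16    # hypothetical 2D thread block, one thread per pixel


def ceil_div(a, b):
    """Integer division rounded up, as used to size a CUDA launch grid."""
    return -(-a // b)


resident_threads = SM_COUNT * THREADS_PER_SM      # 40960, the "40k" figure
pixels = FRAME_W * FRAME_H                        # 2073600, about 2 million

# The grid must cover the whole frame, so partial blocks are rounded up.
blocks_x = ceil_div(FRAME_W, BLOCK_W)             # 120
blocks_y = ceil_div(FRAME_H, BLOCK_H)             # 68 (67.5 rounded up)
launched_threads = blocks_x * blocks_y * BLOCK_W * BLOCK_H

# How many times over the frame fills the resident-thread capacity.
waves = launched_threads / resident_threads

print(resident_threads, pixels, launched_threads, waves)
```

Under these assumptions a single HD frame supplies roughly fifty times more independent work items than the card can hold at once, which is why per-pixel shading keeps a GPU busy so easily.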
