It's weird; I looked up whether AMD has any benchmarks of the 405B on the MI300X, and came across this one --
https://dstack.ai/blog/amd-mi300x-inference-benchmark/#token... From my understanding, it can get up to around 2,500 tokens/s? Both are 8x setups (H200 and MI300X).