
awnihannun 3mo ago

Right, my comment was mostly about decoding speed. For prefill you can get a speedup too, but there you are less latency bound.

In our benchmarks with MLX / mlx-lm it's as much as 3.5x for token generation (decoding) at batch size 1 over 4 machines. In that case you are memory bandwidth bound, so sharding the model and KV cache 4 ways means each machine only needs to access 1/4 as much memory.
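To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of that bandwidth-bound model. It assumes each decode step costs roughly (bytes read per machine) / (memory bandwidth) plus a fixed per-token overhead for communication. The function names, the 40 GB model, the 2 GB KV cache, and the 800 GB/s bandwidth are all made-up illustration values, not numbers from the benchmarks; only the 4 machines and the 3.5x figure come from the comment above.

```python
def decode_step_time(model_bytes, kv_cache_bytes, bandwidth_bytes_per_s,
                     num_machines=1, overhead_s=0.0):
    """Estimated seconds per generated token at batch size 1.

    Assumes decoding is purely memory-bandwidth bound: every step reads
    this machine's shard of the weights and KV cache once, then pays a
    fixed overhead (e.g. inter-machine communication).
    """
    bytes_per_machine = (model_bytes + kv_cache_bytes) / num_machines
    return bytes_per_machine / bandwidth_bytes_per_s + overhead_s


def implied_overhead(single_step_s, num_machines, observed_speedup):
    """Per-token overhead that would explain an observed speedup
    falling short of the ideal num_machines-times speedup."""
    return single_step_s / observed_speedup - single_step_s / num_machines


# Hypothetical hardware and model sizes, for illustration only.
MODEL_BYTES = 40e9      # assumed weight footprint
KV_BYTES = 2e9          # assumed KV cache footprint
BANDWIDTH = 800e9       # assumed memory bandwidth per machine, bytes/s

t1 = decode_step_time(MODEL_BYTES, KV_BYTES, BANDWIDTH, num_machines=1)
t4_ideal = decode_step_time(MODEL_BYTES, KV_BYTES, BANDWIDTH, num_machines=4)
ideal_speedup = t1 / t4_ideal

# Overhead that would turn the ideal speedup into the reported 3.5x.
overhead = implied_overhead(t1, num_machines=4, observed_speedup=3.5)
t4_real = decode_step_time(MODEL_BYTES, KV_BYTES, BANDWIDTH,
                           num_machines=4, overhead_s=overhead)
real_speedup = t1 / t4_real

print(f"ideal speedup over 4 machines: {ideal_speedup:.2f}x")
print(f"implied per-token overhead:    {overhead * 1e3:.3f} ms")
print(f"speedup with that overhead:    {real_speedup:.2f}x")
```

Under this simple model, the ideal speedup equals the number of machines, and the gap between that and 3.5x is whatever per-token cost does not shrink with sharding.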


liuliu 3mo ago

Oh! That's great to hear. Congrats! Now, I want to get the all-to-all primitives ready in s4nnc...
