elcritch
24d ago
Running inference requires sharing intermediate matrix results between nodes. Faster networking speeds that up.
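For scale, here is a back-of-envelope sketch of how much data one token's hidden state carries across a node boundary in a pipeline-parallel split. The hidden size and precision are assumptions (roughly Llama-70B-shaped), not figures from the thread:

```python
# Back-of-envelope: bytes crossing one pipeline-parallel boundary per token.
# Model shape below is a hypothetical example, not from the discussion.
hidden_size = 8192        # assumed model dimension
bytes_per_value = 2       # fp16/bf16 activations

# During decode, each boundary passes one hidden-state vector per token.
payload = hidden_size * bytes_per_value
print(payload)            # 16384 bytes, i.e. 16 KiB per token per boundary

# At ~8 tokens/s that is only ~128 KiB/s of raw data per boundary,
# so decode speed is dominated by link latency, not bandwidth.
tokens_per_s = 8
print(payload * tokens_per_s)  # 131072 bytes/s
```

The point of the sketch: per-token traffic between pipeline stages is tiny, which is why latency (round trips per generated token) rather than raw throughput tends to be the bottleneck.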
wokkel
24d ago
I read (though I can't find the source anymore) that the information sent from layer to layer is minimal. The actual matrix work happens within a layer. They are not doing matrix multiplication over the network (that would be insane latency-wise).
elcritch
OP
23d ago
The LLM/transformer attention layers require an O(n^2) operation across all tokens, which does require significant bandwidth.
Yes, the latency hurts performance; that's why it's only achieving ~8 tok/s.
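A small sketch of that quadratic growth (the head count here is an assumed placeholder, not a figure from the thread): the attention score matrix is n × n per head, so doubling the context length quadruples the work involved:

```python
# Attention computes an n x n score matrix per head, so the number of
# score values grows quadratically with context length n.
def attn_score_values(n_tokens: int, n_heads: int = 64) -> int:
    # n_heads = 64 is an assumed example value
    return n_heads * n_tokens * n_tokens

# Doubling the context quadruples the score-matrix size.
print(attn_score_values(1024) / attn_score_values(512))  # 4.0
```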