elcritch
24d ago
Running inference requires sharing intermediate matrix results between nodes. Faster networking speeds that up.
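For scale, here is a back-of-envelope sketch of how much data one token's hidden state carries across a node boundary in a pipeline-parallel split. The hidden size and precision are assumptions (roughly Llama-70B-shaped), not figures from the thread:

```python
# Back-of-envelope: bytes crossing one pipeline-parallel boundary per token.
# Model shape below is a hypothetical example, not from the discussion.
hidden_size = 8192        # assumed model dimension
bytes_per_value = 2       # fp16/bf16 activations

# During decode, each boundary passes one hidden-state vector per token.
payload = hidden_size * bytes_per_value
print(payload)            # 16384 bytes, i.e. 16 KiB per token per boundary

# At ~8 tokens/s that is only ~128 KiB/s of raw data per boundary,
# so decode speed is dominated by link latency, not bandwidth.
tokens_per_s = 8
print(payload * tokens_per_s)  # 131072 bytes/s
```

The point of the sketch: per-token traffic between pipeline stages is tiny, which is why latency (round trips per generated token) rather than raw throughput tends to be the bottleneck.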
wokkel
24d ago
I read (though I can't find the source anymore) that the information sent from layer to layer is minimal. The actual matrix work happens within a layer. They are not doing matrix multiplication over the network (that would be insane latency-wise).
elcritch
OP
23d ago
The LLM/transformer attention layers require an O(n^2) operation across all tokens, which does require significant bandwidth.
Yes, the latency hurts performance; that's why it's only achieving ~8 tok/s.
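A small sketch of that quadratic growth (the head count here is an assumed placeholder, not a figure from the thread): the attention score matrix is n × n per head, so doubling the context length quadruples the work involved:

```python
# Attention computes an n x n score matrix per head, so the number of
# score values grows quadratically with context length n.
def attn_score_values(n_tokens: int, n_heads: int = 64) -> int:
    # n_heads = 64 is an assumed example value
    return n_heads * n_tokens * n_tokens

# Doubling the context quadruples the score-matrix size.
print(attn_score_values(1024) / attn_score_values(512))  # 4.0
```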