How do you get the weights for the right set of experts for a given batch of tokens into fast memory at the right time?
The set of activated experts is only known after routing, at which point you need the weights immediately, and performance will be very poor if they still have to be fetched across PCIe.
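
To make the timing problem concrete, here is a minimal sketch in plain Python. It assumes a top-k router, and every name and number in it is hypothetical: the function names, the expert count, the 100 MB per expert and the 25 GB/s link bandwidth are placeholders, not measurements. It shows that the list of experts to fetch only exists once routing has run, and gives a rough lower bound on the resulting stall.

```python
import random


def route_top_k(router_logits, k):
    """Return, per token, the indices of the k highest-scoring experts."""
    return [
        sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        for row in router_logits
    ]


def activated_experts(assignments):
    """Union of experts that any token in the batch was routed to."""
    return sorted({e for token in assignments for e in token})


def transfer_time_ms(num_experts, bytes_per_expert, link_bytes_per_s):
    """Lower bound on the time to copy that many experts over the link."""
    return 1e3 * num_experts * bytes_per_expert / link_bytes_per_s


# Hypothetical sizes, for illustration only.
NUM_EXPERTS, TOP_K, BATCH = 8, 2, 4

random.seed(0)
logits = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(BATCH)]

# The set of needed experts is only known after this step.
assignments = route_top_k(logits, TOP_K)
needed = activated_experts(assignments)

resident = {0, 1}  # assumption: experts already sitting in fast memory
missing = [e for e in needed if e not in resident]

# Assumed 100 MB per expert and 25 GB/s of effective link bandwidth.
stall_ms = transfer_time_ms(len(missing), 100e6, 25e9)
print(f"needed={needed} missing={missing} stall>={stall_ms:.1f} ms")
```

Under these assumed numbers each missing expert costs about 4 ms of transfer, and that cost sits on the critical path because the copy cannot start before the router has produced `assignments`.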