I'm not really knowledgeable about this space, so maybe I'm missing something:
Why does the bus performance affect token generation? I would expect it to cause a slow startup when loading the model, but once the model is loaded, just how much bandwidth can the token generation possibly use?
Token generation happens entirely on the card, using the memory on the card, without any bus IO at all, no?
IOW, I'm trying to think of what IO the card is going to need for token generation, and I can't think of any other than returning the tokens (which, even over a slow 100 MB/s link, could still be transferred at about 100x the rate at which tokens are being generated).
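
To put rough numbers on that last point, here's a back-of-envelope sketch. The 4 bytes per token ID and the 50 tokens/s generation rate are my own assumptions for illustration, not measurements, and it only counts the traffic for returning token IDs, on the assumption above that everything else stays on the card.

```python
# Back-of-envelope: bus bandwidth needed just to return generated tokens.
# All numbers below are assumptions for illustration, not measurements.

BYTES_PER_TOKEN_ID = 4               # assumed: one 32-bit integer per token ID
TOKENS_PER_SECOND = 50               # assumed generation rate
BUS_BYTES_PER_SECOND = 100 * 10**6   # the "slow" 100 MB/s link mentioned above


def output_bandwidth(tokens_per_second, bytes_per_token=BYTES_PER_TOKEN_ID):
    """Bytes per second needed to ship token IDs back to the host."""
    return tokens_per_second * bytes_per_token


def headroom(bus_bytes_per_second, needed_bytes_per_second):
    """How many times over the bus could carry the needed traffic."""
    return bus_bytes_per_second / needed_bytes_per_second


needed = output_bandwidth(TOKENS_PER_SECOND)
print(f"needed: {needed} B/s")
print(f"headroom: {headroom(BUS_BYTES_PER_SECOND, needed):,.0f}x")
```

If those assumptions are anywhere near right, the margin is far larger than 100x, which is why I can't see the output tokens alone being the bottleneck.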