Most recent ha-ha moment: I kept wondering if it was normal that my cluster was only able to process 4 requests per second per vLLM engine (just seemed really low to me).
I realized a better metric is in-flight requests... Each engine is processing 70 requests at any given time, streaming tokens for over 30s.