True, having multiple CPUs should help. Though in my experience it's quite easy in Linux for one CPU stalled in the kernel to block a wide range of other tasks, presumably because of contended locks.
And yes, the better approach is to use something like virtio, which is designed for VM-to-host communication, but VM implementations tend to emulate lowest-common-denominator hardware for the highest compatibility (the 8250 is everywhere: basically every OS supports it, and its interface is very commonly emulated even in modern hardware). The issue is that, because it's ancient hardware, every tiny buffer basically costs a full interrupt routine (checking flags and copying data), with multiple traps into the VM's emulation of the device. Doing something like rep outsb would be only a minor optimisation on top of that. The driver already does the most significant thing, which is to keep copying data in the interrupt routine until the flag is unset, instead of waiting for the interrupt to re-trigger for each chunk of data.
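To put rough numbers on that, here's a toy model (not real driver code) that counts interrupts and trapped port accesses for the transmit side. The `send` function, the 16-byte FIFO, and the idea that the emulated transmitter is ready again whenever the handler rechecks the status register are all my own assumptions for illustration; a real hypervisor may behave differently.

```python
def send(data, fifo_size=16, drain_in_handler=True):
    """Return (interrupts, port_accesses) needed to transmit `data`.

    Toy model of an 8250-style UART under emulation. Assumptions:
    - every register read/write is one trapped port access;
    - the FIFO holds `fifo_size` bytes (16 as on a 16550A);
    - the emulated transmitter is empty again whenever LSR is rechecked.
    """
    interrupts = 0
    port_accesses = 0
    pos = 0
    while pos < len(data):
        interrupts += 1              # device raises "transmitter empty"
        port_accesses += 1           # handler reads IIR to find the cause
        while pos < len(data):
            port_accesses += 1       # handler reads LSR and sees it can send
            burst = min(fifo_size, len(data) - pos)
            port_accesses += burst   # one single-byte write to THR per byte
            pos += burst
            if not drain_in_handler:
                break                # return and wait for the next interrupt
    return interrupts, port_accesses


if __name__ == "__main__":
    payload = bytes(64)
    print(send(payload, drain_in_handler=True))   # keep copying in the handler
    print(send(payload, drain_in_handler=False))  # one chunk per interrupt
```

For 64 bytes with a 16-byte FIFO, this model gives 1 interrupt and 69 port accesses when the handler keeps draining, versus 4 interrupts and 72 port accesses when it waits for a re-trigger per chunk. What each interrupt or trap actually costs isn't modelled here, since that depends on the hypervisor.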
(On real hardware, doing something like DMA is a much better way to optimise this, but it's probably not a good emulation target because DMA controllers vary wildly from platform to platform.)