This insight is why I stopped trying to use CA as my underlying computational substrate in genetic programming experiments. It is much, much cheaper to run something like brainfuck programs than it is to simulate CA on a practical computer.
A switch statement over 8 instructions contained in a contiguous byte array will essentially teleport its way through your CPU's architecture.
I feel like CA (single, or multi-state) would work quite well on dedicated hardware, how big could the grid even be? I may be missing the obvious, but it does seem easier to scale compared to cores and manual dispatch.
But otherwise yeah, not the most efficient on current CPUs.
To be fair, a one-dimensional CA is effectively a sort of UTM with a weird program counter and instruction set. I think the more useful CAs will tend to be of the higher dimensional variety (beyond 2D/3D). Simulating a CA in hyperspace seems problematic even if you are intending to purpose build the hardware.