I feel like CA (single- or multi-state) would work quite well on dedicated hardware. How big could the grid even be? I may be missing something obvious, but it does seem easier to scale than adding cores and dispatching work manually.
But otherwise yeah, not the most efficient on current CPUs.
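
To make the scaling point a bit more concrete, here's a rough sketch, assuming CA means cellular automata. The `step` function, the 1D layout, the wraparound edges, and rule 110 are all just my picks for illustration. Each cell's next state depends only on itself and its two neighbours, so in principle every cell could be its own tiny circuit updating in lockstep; the Python loop below is only simulating that serially.

```python
def step(cells, rule=110):
    """One synchronous update of a 1D binary CA with wraparound edges."""
    n = len(cells)
    out = []
    for i in range(n):
        # Each cell reads only itself and its two immediate neighbours.
        left = cells[(i - 1) % n]
        mid = cells[i]
        right = cells[(i + 1) % n]
        # The three bits select one bit of the rule number (Wolfram numbering).
        index = (left << 2) | (mid << 1) | right
        out.append((rule >> index) & 1)
    # All new states come from the old grid, so iterations are independent.
    return out


cells = [0, 0, 0, 1, 0, 0, 0, 0]
for _ in range(4):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```

Since no iteration of the loop depends on another, nothing needs to be dispatched or scheduled; a bigger grid would just mean more copies of the same local rule, at least as far as this toy version goes.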