I also ran it on x86 (AMD Ryzen 5600X) and got similar results; everything ran faster on a 4.6 GHz chip, but ChaCha8 stayed around 2 cpb while PCG improved a little to about 0.7 cpb.
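For anyone translating these figures into wall-clock time, here is a minimal sketch of the conversion. It assumes each call produces 8 bytes (one `uint64`) and that the core actually holds 4.6 GHz for the whole run. Both are assumptions for illustration, and the helper names are hypothetical.

```go
package main

import "fmt"

// cyclesPerByte converts a measured time per call (in nanoseconds) into
// cycles per byte, given the clock speed in GHz and the bytes produced per call.
func cyclesPerByte(nsPerCall, clockGHz float64, bytesPerCall int) float64 {
	return nsPerCall * clockGHz / float64(bytesPerCall)
}

// nanosPerCall is the inverse: the time per call implied by a cpb figure.
func nanosPerCall(cpb, clockGHz float64, bytesPerCall int) float64 {
	return cpb * float64(bytesPerCall) / clockGHz
}

func main() {
	// Assumed: 8 bytes per call, steady 4.6 GHz clock.
	fmt.Printf("ChaCha8 at 2.0 cpb: %.2f ns per call\n", nanosPerCall(2.0, 4.6, 8))
	fmt.Printf("PCG at 0.7 cpb:     %.2f ns per call\n", nanosPerCall(0.7, 4.6, 8))
}
```

Under those assumptions the gap is roughly 3.5 ns versus 1.2 ns per 64-bit value on average.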
I do have a ~10-year-old Celeron NUC lying around that I could use to test older/lower-end hardware. I can also publish the (admittedly very simple) program I'm using.
It's also worth noting that cpb measures average throughput, not variance or latency. Go's buffered ChaCha8 implementation amortizes to 2 cpb, but when the 32-element buffer is exhausted, generating the next number is much more expensive than generating any of the previous 31. I wouldn't say the performance is good enough for every use case.