kbolino · 2y ago

The built-in benchmarking tool in Go runs with parallelism set to 2x the number of cores. That should make evident any code that scales poorly. It is a tunable parameter, though; I can try 4x cores, etc.

I also ran it on x86 (AMD Ryzen 5600X) and got similar results; everything ran faster on a 4.6 GHz chip, but ChaCha8 stayed around 2 cpb while PCG improved a little to about 0.7 cpb.

I do have a ~10-year-old Celeron NUC lying around that I could use to test older/lower-end hardware. I can also publish the (admittedly very simple) program I'm using.

It's also worth noting that cpb is still a measure of average throughput, not variance or latency. Go's buffered ChaCha8 implementation amortizes to 2 cpb, but when the 32-element buffer is exhausted, generating the next number is much more expensive than generating any of the previous 31. I wouldn't say the performance is good enough for every use case.

kbolino (OP) · 2y ago

Ok, I was able to figure out a few more things. First of all, the benchmark runner does not necessarily exploit the parallelism of GOMAXPROCS out of the box. Second, GOMAXPROCS seems to default to 1x logical cores; I last ran it on a machine where logical cores = 2x physical cores. I adjusted the benchmark code to use RunParallel and adjusted the parallelism of each run.

Testing on Celeron J3455 @ 1.5 GHz (4 physical and logical cores) gave me PCG at 1.2 cpb and ChaCha8 at 2.6 cpb with cpu=1, but PCG stayed relatively constant across cpu=1,2,4,8 (worst was 1.8 cpb) while ChaCha8 slowed to 6.5 cpb at cpu=2 and 7.5 cpb at cpu=4 and cpu=8.

Back on my M1 Mac (8 logical and physical? cores), both ChaCha8 and PCG generally got better with more cores. ChaCha8 got down to 0.76 cpb at cpu=4 (then regressed a bit at cpu=8) while PCG got down to 0.26 cpb at cpu=8.

I don't think any of these results rule ChaCha8 out completely, though again I'm looking from the perspective of video games, which generally monopolize a machine while running.

pclmulqdq · 2y ago

The important question to benchmark for small buffer sizes isn't actually cycles per byte, it's micro-ops per byte. You are sort of getting there by adding threads, but if you want to measure micro-ops per byte by measuring cycles per byte, your benchmarking code should run a number of parallel implementations of the generator on each thread (and stick to 1 thread per physical core, too). Each generator will have a "roof" where more generators per thread doesn't cost any more in terms of cycles/byte. I am assuming that for PCG, that roof is around 4-ish on an old CPU, and may be as high as 6 or 8 on your MacBook, while the roof for ChaCha will be close to 1.

ChaCha exploits instruction-level parallelism to get its speed. PCG doesn't: it has a chain of instructions that must be executed sequentially. That means that when the PCG generator executes, it leaves gaps in the instruction stream that can be filled with other instructions from the game. ChaCha leaves far fewer such gaps, so it causes a slowdown in the game that is more significant than what your benchmark suggests.

I'm going to do this and write a blog post about it (although I don't have a MacBook), so we may be able to compare results.

kbolino (OP) · 2y ago

OK, this makes sense. There are only 4 FP/SIMD units in the M1 and I'm guessing just one in the Celeron.

I look forward to reading your blog about this.
