Also, someone else figured out that we can just use an AND instruction instead of a cmp. That gives us this version:
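#include <stddef.h>
#include <stdint.h>

int run_switches(const char *s, const size_t n) {
    int res = 0;
    uint8_t tmp = 0;
    /* Leftover n % 128 bytes first; at most 127, so tmp cannot overflow. */
    for (int i = n & 127; i--; ++s)
        tmp += 1 & *s;
    res += tmp;
    /* Then full 128-byte blocks, each counted in an 8-bit accumulator. */
    for (int i = n >> 7; i--;) {
        tmp = 0;
        for (int j = 128; j--; ++s)
            tmp += 1 & *s;
        res += tmp;
    }
    /* res counts the bytes whose low bit is set. */
    return 2 * res - n;
}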
This runs at 111 GB/s, up from 4.5 GB/s in the blog post. I'm going to try really hard to put this problem down now and work on something more productive.
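
For anyone wondering why the AND works: assuming the input contains only 's' and 'p', 's' is 0x73 and 'p' is 0x70 in ASCII, so the low bit is 1 for 's' and 0 for 'p'. Counting low bits therefore counts the 's' bytes, and (number of 's') - (number of 'p') is 2 * count - n. Here is a minimal sketch that checks this against a plain compare-based loop. The names run_switches_ref, run_switches_and and self_check are made up for illustration, and this is the scalar idea only, not the fast version:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference: +1 for 's', -1 for anything else.
   Assumes the input contains only 's' and 'p'. */
int run_switches_ref(const char *s, size_t n) {
    int res = 0;
    for (size_t i = 0; i < n; ++i)
        res += (s[i] == 's') ? 1 : -1;
    return res;
}

/* Same result via the low bit: 's' is 0x73 (bit 0 set),
   'p' is 0x70 (bit 0 clear). */
int run_switches_and(const char *s, size_t n) {
    int count_s = 0;
    for (size_t i = 0; i < n; ++i)
        count_s += 1 & s[i];
    /* count_s - (n - count_s) == 2 * count_s - n */
    return 2 * count_s - (int)n;
}

/* Tiny sanity check: both variants must agree on a few inputs. */
void self_check(void) {
    const char *inputs[] = { "", "s", "p", "sspspppss", "pppp" };
    const size_t lens[]  = { 0, 1, 1, 9, 4 };
    for (size_t k = 0; k < sizeof lens / sizeof lens[0]; ++k)
        assert(run_switches_ref(inputs[k], lens[k]) ==
               run_switches_and(inputs[k], lens[k]));
}
```

The two only agree when every byte is 's' or 'p'; any other byte with an odd code would be counted as an 's' by the AND variant.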