int run_switches(const char *s) {
int res = 0;
uint8_t tmp = 0;
size_t n = strlen(s);
for (size_t i = n & 127; i--; ++s)
tmp += (*s == 's');
res += tmp;
for (size_t j = n >> 7; j--;) {
tmp = 0;
for (size_t i = 128; i--; ++s)
tmp += (*s == 's');
res += tmp;
}
return 2 * res - n;
}Edit: Also, there's an off by one error. should be:
#include <stddef.h>
#include <stdint.h>
int run_switches(const char *s, const size_t n) {
int res = 0;
uint8_t tmp = 0;
for (int i = n & 127; i--; ++s)
tmp += *s == 's';
res += tmp;
for (int size = n >> 7; size--;) {
tmp = 0;
for (int i = 128; i--; ++s)
tmp += *s == 's';
res += tmp;
}
return 2 * res - n + 1;
}
~90GB/s on my machine, compared to 4.5GB/s for his best effort on his blog. So 20x as fast.Which tricks in there are worth playing around with more widely?
Is the uint8_t just "no point in using something bigger" or does it likely help the compiler? Does/can the signedness matter as well as the size?
Ditto looping downwards -- is it often likely to improve things? Can it generalize to pointer/iterator ranges, or is it often worth trying to phrase them in terms of array/index accesses instead?
I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.
Not checking the 'p' character by comparison is an easy optimization to understand.
Any places to read about this sort of thing, or any tricks or guidelines that come to mind? I write a fair bit of performance-sensitive code but it's all probably 20x slower than it could be because I have no intuition for what transformations compilers will do beyond "this prob gets inlined" etc.
The magic in this case is the compiler autovectorizer. Making the length of the loop a loop invariant allows the autovectorizer to kick in.
The reason "blocking" by accumulating on uint8_t helps further is that it allows the compiler to accumulate on 8 bit SIMD lanes, instead 32 bit SIMD lanes. The same operation on 8 bit SIMD lanes will, to a first approximation, do x4 the work per cycle.
In a good world you could use just uint_fast8_t and compiler would optimize this question for you. In real world I don't think compilers are smart enough, or there are too many other constraints limiting them :(
Also, someone else figured out that we can just use an and instruction instead of cmp. That gives us this version:
#include <stddef.h>
#include <stdint.h>
int run_switches(const char *s, const size_t n) {
int res = 0;
uint8_t tmp = 0;
for (int i = n & 127; i--; ++s)
tmp += 1 & *s;
res += tmp;
for (int i = n >> 7; i--;) {
tmp = 0;
for (int j = 128; j--; ++s)
tmp += 1 & *s;
res += tmp;
}
return 2 * res - n;
}
This is 111GB/s, up from 4.5GB/s in the blog. I'm going to try really hard to put this problem down now and work on something more productive.n>>7 is equal to n/(2^7), and is a faster way to divide with a power-of-two.
The reason it helps performance is because it allows the compiler to accumulate in byte sized SIMD variables instead of int sized SIMD variables. My system has AVX-512 so 64 byte wide SIMD registers. With the non-blocking version, the compiler will load 16 chars into ints in a 64 byte ZMM register, then check if it's an 's', and then increment if so. With the blocked version, with the uint8_t tmp variable, the compiler will load 64 chars into uint8_ts in a 64 byte ZMM register instead. But there's a problem; we're gonna overflow the variables. So the compiler will stop every 128 iterations, and then move the 64 byte uint8_t accumulation variable into 4 64 byte int accumlations registers and sum them all up. Then do the next 128 iterations.
I'm pretty sure a similar thing will happen with SSE or AVX2 but I didn't check.
In general for x86, unaligned writes are worth doing work to avoid, but reads are in most situations not really an issue.