
skavi 2y ago

Am I missing something, or does this not really account for alignment? Is the compiler doing smarter loop splitting?

nwallin 2y ago

You're correct, it does not account for alignment.

It helps performance because it lets the compiler accumulate in byte-sized SIMD lanes instead of int-sized ones. My system has AVX-512, so the SIMD registers are 64 bytes wide. With the non-blocked version, the compiler loads 16 chars, widened to ints, into a 64-byte ZMM register, checks each one for an 's', and increments if so. With the blocked version and its uint8_t tmp variable, the compiler instead loads 64 chars as uint8_ts into a 64-byte ZMM register. But there's a problem: we're gonna overflow the byte-sized accumulators. So the compiler stops every 128 iterations, widens the 64-byte uint8_t accumulator into four 64-byte int accumulation registers, and sums them all up. Then it does the next 128 iterations.

I'm pretty sure a similar thing will happen with SSE or AVX2 but I didn't check.
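
For readers who want to see the shape of the two loops being compared, here is a minimal sketch in C. It is not the code from the article: the function names `count_s_naive` and `count_s_blocked` and the `BLOCK` constant are made up for illustration, and the block size of 128 is simply taken from the iteration count mentioned above. Whether a compiler actually emits the byte-lane accumulation described here depends on the compiler, flags, and target, and was not checked for this sketch.

```c
#include <assert.h>   /* static_assert */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size. It must stay at or below 255 so that the
   uint8_t counter cannot wrap around inside a single block. */
#define BLOCK 128

static_assert(BLOCK <= UINT8_MAX, "uint8_t accumulator would overflow");

/* Non-blocked version: one int accumulator for the whole input. */
int count_s_naive(const char *data, size_t len) {
    int total = 0;
    for (size_t i = 0; i < len; i++)
        total += (data[i] == 's');
    return total;
}

/* Blocked version: a uint8_t accumulator per block of at most BLOCK
   characters, folded into the int total after each block. */
int count_s_blocked(const char *data, size_t len) {
    int total = 0;
    size_t i = 0;
    while (i < len) {
        /* End of the current block, clamped to the end of the input. */
        size_t end = (len - i > BLOCK) ? i + BLOCK : len;
        uint8_t tmp = 0;
        for (; i < end; i++)
            tmp += (data[i] == 's');
        total += tmp;   /* widen the byte counter into the int total */
    }
    return total;
}
```

Both functions return the same count for any input. Keeping `BLOCK` at or below 255 is what guarantees the `uint8_t` counter cannot wrap within a block, even when every character matches.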

Tuna-Fish 2y ago

I think it's just reading unaligned. That costs only a ~2x loss of throughput from L1, and the moment the problem is large enough that the working set no longer reliably fits in L1, even that stops mattering.

In general for x86, unaligned writes are worth doing work to avoid, but reads are in most situations not really an issue.
