You're correct, it does not account for alignment.
The reason it helps performance is because it allows the compiler to accumulate in byte sized SIMD variables instead of int sized SIMD variables. My system has AVX-512 so 64 byte wide SIMD registers. With the non-blocking version, the compiler will load 16 chars into ints in a 64 byte ZMM register, then check if it's an 's', and then increment if so. With the blocked version, with the uint8_t tmp variable, the compiler will load 64 chars into uint8_ts in a 64 byte ZMM register instead. But there's a problem; we're gonna overflow the variables. So the compiler will stop every 128 iterations, and then move the 64 byte uint8_t accumulation variable into 4 64 byte int accumlations registers and sum them all up. Then do the next 128 iterations.
I'm pretty sure a similar thing will happen with SSE or AVX2 but I didn't check.