The assembly in the article looks like it was compiled without optimization, which is going to change the exact numbers quite a bit. Contrary to what it sounds like, "without optimization" really means something like "output some strange dialect of antiquated assembly that's 10 times slower than it needs to be". It's really only useful if you are debugging the compiler. In particular, it's pushing all the variables onto the stack even though it doesn't need to, so that they are there in case you need to check them manually. 99% of the time you want to be compiling with -O2, -O3, or -Ofast.
With gcc -O3 for x64, here's what they look like. The functions are shorter, faster, and in some ways easier to understand:
align_1:
leal 7(%rdi), %eax
andl $-8, %eax
ret
The first integer arg always comes in %rdi/%edi. The compiler has recognized the idiom, and transformed it into (x + 7) & (0b11..11000). LEA is "load effective address", and in this case is a slightly shorter way of writing ADD, but no faster.

align_2:
cmpl $8, %edi
movl $8, %eax
jbe .L6
.L5:
addl $8, %eax
cmpl %eax, %edi
ja .L5
rep ret
.L6:
rep ret
"If x < 8 (jbe == jump before or equal), return 8. Otherwise keep adding 8 until it's greater (jump after), and return that". The return value is whatever is in %eax on return. The 'rep ret' is a funny way of saying return, apparently done to be slightly faster on ancient AMD processors. At this point it's cruft. And there is no real reason to have two return statements --- that's the kind of wacky thing compilers like to do even when optimizing. align_3:
movl %edi, %edi
subq $8, %rsp
cvtsi2sdq %rdi, %xmm0
mulsd .LC0(%rip), %xmm0
call ceil
mulsd .LC1(%rip), %xmm0
addq $8, %rsp
cvttsd2siq %xmm0, %rax
ret
.LC0:
.long 0
.long 1069547520
.LC1:
.long 0
.long 1075838976
The first 'movl' wipes out any bits greater than 32 if they are there. This is because x was defined as an int, and not a long. In general, you might be better off always using long unless there is a specific reason not to. Then we convert the int to floating point. Floating point on modern systems uses the 128-bit XMM registers that are also used for vectors. We put it into %xmm0, because that is the register for the first floating point or vector arg. Keeping the result in %xmm0, multiply by what is presumably a floating point constant approximately equal to 1/ALIGN (multiplication being much faster than division; depending on the value of the alignment, that "approximately" might be a significant fact), then call ceil(). The result is also returned in %xmm0. Then multiply by a second constant equal to ALIGN. Convert that back to a 64-bit "signed integer quadword" (siq) in %rax and return it.

The trickiness with optimized benchmarking (which is probably why you used the mangled unoptimized version) is that the compiler has a tendency to optimize the functions out completely, so you end up testing an empty loop. Your choices are either to work in assembly directly so that what you see is what you get, or to be crafty and come up with something that prevents the compiler from doing so: a checksum, or perhaps a comparison whose result you know but hopefully the compiler doesn't.
Here are the timings in cycles on a Haswell system. I compiled with '-fno-inline'; allowing inlining would be faster, but might skew the results more in this case, even though function calls are fast on x64.
baseline: 5.3 cycles for a loop that returns x and compares against zero.
align_1: 5.0 cycles including the loop. Yes, in this case adding the two instructions of the function actually took fewer cycles than the baseline loop alone. It's more than ILP voodoo, it's a whole way of life!
align_2: 62 cycles per iteration. Horrible, but not as bad as the unoptimized mess. But it's worth pointing out that while you don't want to use a loop here, there are much faster loops. Perhaps while (x % align != 0) { x++; }? This gets you down to 11 cycles, only 6 more than the baseline. Interestingly, it's this fast because the processor apparently can perfectly predict the number of loop iterations after the first 30, which is a little scary.
align_3: 19 cycles per call. Faster than the silly loop, but really not a way you want to approach this problem unless you've really thought through all the floating point details.
I put my code up at https://gist.github.com/nkurz/985470b01b999e67d04b. At the bottom are more numbers provided by Likwid for things like number of instructions, number of branch errors, etc.
I don't agree. That's trading off useful cache for almost invisible micro optimizations.
As a general rule, it's always better to use the smallest size possible.
Most of the time the argument starts in a register, is passed in a register, and returned in a register. Using a smaller size often just means that the compiler adds some unneeded conversions as in this case. Usually this doesn't matter, but when it does, the benefit is almost always in favor of the simpler rule of always using 64-bit variables.
The loop version works poorly for large queue sizes, and if I saw this in production code I'd replace it immediately.
I'd be okay with a division and re-multiplication (likely going to be optimized by the compiler, though it's foolish to depend on this).
I don't see any problem with the mask/not/and solution. This has been a programming idiom for decades, and should be no more mysterious than (say) a doubly-linked list.
I asked around the room and the consensus is that a decade of C has brain damaged me.
If we want to avoid bit twiddling, how about this:
(mqhp->mq_maxsz + MQ_ALIGNSIZE - 1) / MQ_ALIGNSIZE * MQ_ALIGNSIZE;
I would expect a compiler to emit the same instructions as the original, but at worst you would have an add and two shifts rather than jumps or complex float operations.
Assume MQ_ALIGNSIZE is a power of two. Then if N is the power, MQ_ALIGNSIZE is 1<<N. Then you can replace divisions by right shifts, and multiplications by left shifts, and turn the expression into (value+(1<<N)-1)>>N<<N.
Consider x>>N<<N. The (unsigned - i.e., guaranteed zero-filling) right shift discards the bottom N bits, and then the left shift puts the remaining bits back where they were. The net result is to mask off the bottom N bits of x.
Another way of putting that would be x&~((1<<N)-1) - hopefully that is clear. In our example, 1<<N is MQ_ALIGNSIZE. So we have x&~(MQ_ALIGNSIZE-1), or, overall, (x+MQ_ALIGNSIZE-1)&~(MQ_ALIGNSIZE-1).
I suppose it's possible a compiler could transform this, or the divide-and-multiply version, into the add-then-and version, given enough things that are compile-time constants. But personally if I knew the alignment would be a power of two - which it pretty much always is - I'd just write the code I want. Life is usually simpler that way.
I was further stunned by the seeming naivety of the author's align_2() implementation, but then it does get the job done, eventually. My naive approach would've been roughly align_3(), but using integer math and modulus.
On the other hand, younger team members were recently stunned by code I did for translation of a binary protocol into more easily handled pieces using what I think of as typical idioms, so this article might get sent around Monday morning.
But you are right, this is indeed a very obvious line of code once you understand it. Thank you for your comments.
E.g. modern C code tends to use "int" or "long" all over the place without considering if "short" or "char" is likely to be sufficient.
Similarly, you'll find bit fiddling wherever it is possible, including a tendency to use bit shift instead of multiply/divide even for constant factors where one might think the compiler would optimize it to shifts (many early compilers didn't).
So it's not necessarily an idiom of systems programming specifically, so much as an idiom amongst "old school" C programmers in general. The extent to which it has carried through to more modern code varies, though. I suspect systems programming has an overall higher ratio of "old school" C developers than application code does.
"Sufficient" is bit odd choice of word here. Word-sized types should be used unless there is a good reason not to. See nkurzs comment for a great example for this:
> The first 'movl' wipes out any bits greater than 32 if they are there. This is because x was defined as an int, and not a long.
I have banned myself from using printfs to figure out things like this. Instead I would use a debugger and breakpoints to view the live variables in their different data formats.
In GDB, that's p/t for binary and p/d for decimal.
I've done this before, but usually take the remainder modulo 8 rather than bitwise-ANDing with ~7 as the final step.
uintptr_t p = (uintptr_t)x;
p += 7;
p -= p % 8;
Now that I've written it out (via uintptr_t, since modulo isn't defined on raw pointers), I suppose it's not any clearer than the mask method:
uintptr_t p = (uintptr_t)x;
p += 7;
p &= ~(uintptr_t)7;

Curiously, it's also the area where strong static typing breaks down, as it is sometimes preferred to view the same set of bits as a value of a different type to be able to manipulate it in another way.