Those algorithms matter even more when you're dividing by a value that is only known at runtime but stays the same for large parts of the program. That's where libdivide comes in.
> libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. [...] divide SIMD vectors by runtime constants. [1]
> libdivide.h is a header-only C/C++ library for optimizing integer division. Integer division is one of the slowest instructions on most CPUs e.g. on current x64 CPUs a 64-bit integer division has a latency of up to 90 clock cycles whereas a multiplication has a latency of only 3 clock cycles. libdivide allows you to replace expensive integer division instructions by a sequence of shift, add and multiply instructions that will calculate the integer division much faster.
> On current CPUs you can get a speedup of up to 10x for 64-bit integer division and a speedup of up to 5x for 32-bit integer division when using libdivide. libdivide also supports SSE2, AVX2 and AVX512 vector division which provides an even larger speedup. You can test how much speedup you can achieve on your CPU using the benchmark program. [2]
[1] https://libdivide.com/ [2] https://github.com/ridiculousfish/libdivide
Keep in mind that this doesn't use the fact that we know the input is between 0 and 63.
You can use __builtin_assume for this: https://godbolt.org/z/K4jKhxnTq
Say you’re animating a jump and the character slows down towards the apex. I’ve seen an approach where it’s just “every X frames halve the speed until it’s zero.” This is done with the `SLA` (shift left, LSB set to 0) and `SRL` (shift right, MSB set to 0) operations, which are very fast (8 cycles!). The bonus is that these opcodes set the Zero flag when the result is zero, so you know your jump animation (up) is done (assuming no ceiling collision) without having to check the value of the byte.
(I’ve also seen a set of hard coded speeds to animate through, and cases where they don’t change the speed at all, making the animation feel very rigid. It’s amazing what little details are important)
Though I do have some math-related problems in there. I spent several hours on it but I couldn't figure out how the SBC instruction should affect the carry flag. No matter what I do I can't pass the test ROMs.
But even with that, many of my childhood games are 100% playable. Writing a video game console emulator is a rewarding experience!
While the article goes down a different rabbit hole, this also happens to be a neat example of (n-1)(n+1) = n² - 1, where n = 8.
Which is one of those things I haven't managed to find an actual use for, but feel like I should have, since you can use it with 0xFF (15 × 17), 0xFFFF (255 × 257), and so on.
Start from

    1/b - 1/a = (1/b)(1/a)(a-b)

Move 1/a to the right hand side:

    1/b = 1/a + (1/b)(1/a)(a-b)
                 [X]

Substitute the RHS for the 1/b marked with [X] and you get

    1/b = 1/a + ((a-b)/a^2)(1 + (1/b)(a-b))

Repeat and eventually you get

    1/b = sum over n = 0...inf of (a-b)^n / a^(n+1)

For example with b=3 and a=2, you get

    1/3 = 1/2 - 1/4 + 1/8 - 1/16 + ...
Word of warning though, this method tends to produce nasty carry errors.

Lookup table with 64 entries, anyone?
Compare here for the DEC PDP-1 (1959) and related algorithms:
[0] https://www.masswerk.at/spacewar/inside/insidespacewar-pt6-g...
In cases where you need a floored result, or need it rounded in a certain direction, or know that the divisor is always positive, the compiler will often give you sub-optimal assembly. And sometimes the compiler just randomly fails to inline or constant-fold and outputs a big fat IDIV.
Also, if you have to debug with optimizations disabled, the compiler will give you deliberately naive code, which can make the program you're debugging unusably slow. So you often end up hand-optimizing for that case.
Of course, this depends on where the hot path is, but I've had to do a lot of optimization for code that gets run billions of times per second. I used to think compilers were really smart, but after staring at enough assembly output and learning all their tricks, they don't seem that smart anymore. Especially Microsoft's compiler; I've seen it output redundant division instructions!
int div7 (int x) { return 9 - __builtin_popcountll (0x4081020408102040ULL >> x); }
https://godbolt.org/z/Wj6P6jYGM
I think any monotonic function on 0 .. 63 can be written this way, so should be possible to fold in the outer 63 - nlz expression, too.
The miss cost is, therefore, not really relevant: if this code is on the hot path at all, the cost of a single cache miss is amortized across millions of calls. Your lookup table will be in cache as surely as the code that reads it is.
(I don't know what language is used here. I'm just assuming something like C.)
`(63 - nlz) / 7` presumably tells you how many bytes you need to read for the varint. The length could be signaled by the number of leading zeros in the first byte (nlz), and each subsequent byte could set a high continuation bit (à la UTF-8 and protobuf varints), thus providing 7 bits of information per byte.
I'm totally speculating but this is generally what I'd expect from the expression `(63 - nlz) / 7` in the context of varints. Continuation bits and leading-zero-counters are redundant but this wouldn't be the first time I've seen something encode redundant information.
    (v + (v << 3) + 9) >> 6

Alas, the commenting system on their website seems broken, so I can't ask there.
You are right that micro-benchmarks are a bit suspicious, but they are better than nothing.
Btw, have a look at https://godbolt.org/z/zMarEnYP5 to see what Clang comes up with on its own.
#include <stdint.h>
uint64_t div(uint64_t nlz) {
__builtin_assume(nlz <= 63);
return (63 - nlz) / 7;
}
div: # @div
xori a0, a0, 63
andi a1, a0, 255
li a2, 37
mul a1, a1, a2
srli a1, a1, 8
subw a0, a0, a1
slli a0, a0, 56
srli a0, a0, 57
add a0, a0, a1
srli a0, a0, 2
ret
This is RISC-V assembly, just for fun; x86 is also fascinating. I haven't analysed it in detail, but it looks like Clang doesn't mind using a multiplication.
Most of it is pattern recognition and learning common idioms, like multiplying by 0x01010101..., extracting substrings of bits ((x >> y) & z), thinking of values as sets of integers and so on.
Also bitwise manipulations lend themselves well to function composition, so try to keep an eye out for that. For example, if you have a fast function that checks if any byte is zero, you can generalize it to any value by applying xor beforehand. Similarly, the fast bit matrix transpose (which absolutely terrified me) is simply applying the least number of "shift a subset of bits" operations such that each bit travels the required distance.
It sounds a little woo-woo but going from representing something on paper (or worse, a display) to doing it with my hands really changed it for me. Humans likely evolved intelligence because we stood up and used our hands to manipulate our environment, I suspect we more fully engage our brains in important ways by doing something with our hands when we can.
You might start appreciating the imperial system of measurement though. You've been warned.
E.g. in decimal you can multiply by ten by appending a zero, so in binary you multiply by 2 the same way; left-shifting by 1 is what appends the zero.
Or: in decimal you can approximate log base ten by counting digits, so in binary you can approximate log2 by counting bits up to the most significant one. Etc...
Dividing a number by 7 using bitwise operations can be done using a technique called "magic numbers": precompute a constant that turns the division into a multiply and a shift.

Here's how it can be done:

First, we need a "magic number" that approximates 1/7 with a power-of-two denominator. Pick a shift s and let m = ceil(2^s / 7). Then for non-negative x in a bounded range,

    x / 7 == (x * m) >> s

because x*m/2^s overshoots the exact x/7 by too little to change the floor. With s = 8 we get m = ceil(256/7) = 37, and (x * 37) >> 8 is exact for all x in 0..63 (easy to verify exhaustively); this multiply-and-shift is the same shape of sequence compilers and libdivide emit for constant divisors.

Here's some example code in Python that implements this division algorithm:

    def div_by_7(x):
        m, s = 37, 8           # m = ceil(2**s / 7)
        return (x * m) >> s    # exact for 0 <= x <= 63

Note that this algorithm assumes the input x is a non-negative integer in the supported range; a larger range needs a larger shift and magic number, and negative numbers need extra care.