Those algorithms matter even more when you're dividing by a value that is only known at runtime but stays the same for large parts of the program. That's where libdivide comes in.
> libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. [...] divide SIMD vectors by runtime constants. [1]
> libdivide.h is a header-only C/C++ library for optimizing integer division. Integer division is one of the slowest instructions on most CPUs e.g. on current x64 CPUs a 64-bit integer division has a latency of up to 90 clock cycles whereas a multiplication has a latency of only 3 clock cycles. libdivide allows you to replace expensive integer division instructions by a sequence of shift, add and multiply instructions that will calculate the integer division much faster.
> On current CPUs you can get a speedup of up to 10x for 64-bit integer division and a speedup of up to 5x for 32-bit integer division when using libdivide. libdivide also supports SSE2, AVX2 and AVX512 vector division which provides an even larger speedup. You can test how much speedup you can achieve on your CPU using the benchmark program. [2]
[1] https://libdivide.com/ [2] https://github.com/ridiculousfish/libdivide
Keep in mind that this doesn't use the fact that we know the input is between 0 and 63.
You can use __builtin_assume for this: https://godbolt.org/z/K4jKhxnTq
Say you’re animating a jump and the character slows down towards the apex. I’ve seen an approach where it’s just “every X frames halve the speed until it’s zero.” This is done with the `SLA` (shift left, LSB set to 0) and `SRL` (shift right, MSB set to 0) operations, which are very fast (8 cycles!). The bonus is that these opcodes set the Zero flag when the result is zero, so you know your jump animation (up) is done (assuming no ceiling collision) without having to check the value of the byte.
(I’ve also seen a set of hard coded speeds to animate through, and cases where they don’t change the speed at all, making the animation feel very rigid. It’s amazing what little details are important)
Though I do have some math-related problems in there. I spent several hours on it but I couldn't figure out how the SBC instruction should affect the carry flag. No matter what I do I can't pass the test ROMs.
But even with that, many of my childhood games are 100% playable. Writing a video game console emulator is a rewarding experience!
While the article goes down a different rabbit hole, this also happens to be a neat example of (n-1)(n+1) = n² - 1, where n = 8.
Which is one of those things I haven't managed to find an actual use for, but feel like I should have, since you can use it with 0xFF (15 × 17), 0xFFFF (255 × 257), and so on.
Start from

    1/b - 1/a = (1/b)(1/a)(a-b)

Move 1/a to the right hand side:

    1/b = 1/a + (1/b)(1/a)(a-b)
                 [X]

Substitute the RHS for the 1/b marked with [X] and you get

    1/b = 1/a + ((a-b)/a^2)(1 + (1/b)(a-b))

Repeat and eventually you get

    1/b = sum over n = 0...inf of (a-b)^n / a^(n+1)

For example with b=3 and a=2, you get

    1/3 = 1/2 - 1/4 + 1/8 - 1/16 + ...
Word of warning though, this method tends to produce nasty carry errors.

Lookup table with 64 entries, anyone?
Compare here for the DEC PDP-1 (1959) and related algorithms:
[0] https://www.masswerk.at/spacewar/inside/insidespacewar-pt6-g...
In cases where you need a floored result, or need it rounded in a certain direction, or know that the divisor is always positive, the compiler will often give you sub-optimal assembly. And sometimes the compiler just randomly fails to inline or constant-fold and outputs a big fat IDIV.
Also, if you have to debug with optimizations disabled, the compiler will give you deliberately naive code, which can make the program you're debugging unusably slow. So you often end up hand-optimizing for that case.
Of course, this depends on where the hot path is, but I've had to do a lot of optimization for code that gets run billions of times per second. I used to think compilers were really smart, but after staring at enough assembly output and learning all their tricks, they don't seem that smart anymore. Especially Microsoft's compiler; I've seen it output redundant division instructions!
int div7 (int x) { return 9 - __builtin_popcountll (0x4081020408102040ULL >> x); }
https://godbolt.org/z/Wj6P6jYGM
I think any monotonic function on 0 .. 63 can be written this way, so should be possible to fold in the outer 63 - nlz expression, too.
The miss cost is, therefore, not really relevant: if this code is on the hot path at all, the cost of a single cache miss is amortized across millions of calls. Your lookup table will be in cache as surely as the code that reads it is.
(I don't know what language is used here. I'm just assuming something like C.)
`(63 - nlz) / 7` presumably tells you how many bytes you need to read for the varint. The length could be signaled by the number of leading zeros in the first byte (nlz), and each subsequent byte could set a high continuation bit (à la UTF-8 and protobuf varints), thus providing 7 bits of information per byte.
I'm totally speculating but this is generally what I'd expect from the expression `(63 - nlz) / 7` in the context of varints. Continuation bits and leading-zero-counters are redundant but this wouldn't be the first time I've seen something encode redundant information.
    (v + (v << 3) + 9) >> 6

Alas, the commenting system on their website seems broken, so I can't ask there.
You are right that micro-benchmarks are a bit suspicious, but they are better than nothing.
Btw, have a look at https://godbolt.org/z/zMarEnYP5 to see what Clang comes up with on its own.
#include <stdint.h>
uint64_t div(uint64_t nlz) {
__builtin_assume(nlz <= 63);
return (63 - nlz) / 7;
}
div: # @div
xori a0, a0, 63
andi a1, a0, 255
li a2, 37
mul a1, a1, a2
srli a1, a1, 8
subw a0, a0, a1
slli a0, a0, 56
srli a0, a0, 57
add a0, a0, a1
srli a0, a0, 2
ret
This is RISC-V assembly, just for fun; x86 is also fascinating. I haven't analysed it in detail, but it looks like Clang doesn't mind using a multiplication.
Most of it is pattern recognition and learning common idioms, like multiplying by 0x01010101..., extracting substrings of bits ((x >> y) & z), thinking of values as sets of integers and so on.
Also bitwise manipulations lend themselves well to function composition, so try to keep an eye out for that. For example, if you have a fast function that checks if any byte is zero, you can generalize it to any value by applying xor beforehand. Similarly, the fast bit matrix transpose (which absolutely terrified me) is simply applying the least number of "shift a subset of bits" operations such that each bit travels the required distance.
It sounds a little woo-woo but going from representing something on paper (or worse, a display) to doing it with my hands really changed it for me. Humans likely evolved intelligence because we stood up and used our hands to manipulate our environment, I suspect we more fully engage our brains in important ways by doing something with our hands when we can.
You might start appreciating the imperial system of measurement though. You've been warned.
E.g. in decimal you can multiply by ten by appending a zero, so in binary you multiply by 2 the same way; left-shifting by 1 is what appends the zero.
Or: in decimal you can approximate log base ten by counting digits, so in binary you can approximate log2 by counting bits up to the most significant one. Etc...
Dividing a number by 7 using bitwise operations can be done using a technique called "magic numbers": precompute a constant that turns the division into a multiply and a shift.

Here's how it can be done:

First, we need a "magic number" that approximates 1/7 with a power-of-two denominator. Pick a shift s and let m = ceil(2^s / 7). Then for non-negative x in a bounded range,

    x / 7 == (x * m) >> s

because x*m/2^s overshoots the exact x/7 by too little to change the floor. With s = 8 we get m = ceil(256/7) = 37, and (x * 37) >> 8 is exact for all x in 0..63 (easy to verify exhaustively); this multiply-and-shift is the same shape of sequence compilers and libdivide emit for constant divisors.

Here's some example code in Python that implements this division algorithm:

    def div_by_7(x):
        m, s = 37, 8           # m = ceil(2**s / 7)
        return (x * m) >> s    # exact for 0 <= x <= 63

Note that this algorithm assumes the input x is a non-negative integer in the supported range; a larger range needs a larger shift and magic number, and negative numbers need extra care.