Branchless Lomuto Partitioning (opens in new tab)

(orlp.net)

38 pointskillcoder2y ago21 comments

21 comments

There was a recent post by Voultapher from the sort-research-rs project on Branchless Lomuto Partitioning

https://github.com/Voultapher/sort-research-rs/blob/main/wri...

Discussion here:

https://news.ycombinator.com/item?id=38528452

This post by orlp (creator of Pattern-defeating Quicksort and Glidesort) was linked to in the above post, and I found both to be interesting.

kazinator2y ago

What can you actually sort branch-free? Doesn't the comparison have to be inlined, and tiny, not to mention itself branchless? Forget string keys.

IPv4 addresses in a routing table or something.

Pointers, by increasing or decreasing address. Useful if we want to compact the objects in memory.

It's useful to sort floating-point values in a certain way before adding them together, to avoid adding a low magnitude X to a big magnitude Y such that X disappears.

orlp2y ago

> Doesn't the comparison have to be inlined, and tiny, not to mention itself branchless?

Yes.

This generally means floats and integers, or small combinations thereof. Note that this is after projection of the comparison operator, so things can remain branchless when sorting e.g. strings by length.

The inlining generally isn't an issue in languages like C++ and Rust.

> Forget string keys.

You can sort string keys (mostly) branchlessly with radix sorting, but yes for comparison sorting you can forget it.

anonymoushn2y ago

For string keys, I think you could do multi-key quicksort with almost no branches, but the partitioning becomes more troublesome because you need to separate elements that are <, =, and > your pivot.

orlp2y ago

Author here if anyone has any questions.

stavros2y ago

Thanks for the article! I'm finding it hard to understand this part:

> That is, there is a single iterator scanning the array from left-to-right (jj). If the element is found to belong in the left partition, it is swapped with the first element of the right partition (tracked by ii), otherwise it is left where it is.

So you start left, the first element on the left partition belongs in the left partition, so you swap it into the right partition? And what about the element that was in the right partition, when do you check where that one belongs?

orlp2y ago

> So you start left, the first element on the left partition belongs in the left partition, so you swap it into the right partition?

Yes and no. You temporarily do swap it into the right partition, but then increment the pointers which redefines where the right partition is: you swap it with the first element of the right partition before incrementing both i and j. In essence what this does is move the entire right partition one step to the right, while putting the previously unknown element before it.

Consider this example:

    l l l l r r r r r ? ? ? ?
            ^         ^
            i         j

Now if the first ? was an r element, you could simply increment j and be done with it. But suppose it was an l element, then you have this scenario:

    l l l l r r r r r l ? ? ?
            ^         ^
            i         j

Note that "the first element of the right partition" is equivalent to "the first element after the left partition". Does it now make more sense that swapping the first element of the right partition and the unknown element (v[i], v[j]) is the right thing to do? After our swap we have this:

            +- swap --+
            v         v
    l l l l l r r r r r ? ? ?
            ^         ^
            i         j

So now incrementing i and j both fixes our invariant:

    l l l l l r r r r r ? ? ?
              ^         ^
              i         j

> And what about the element that was in the right partition, when do you check where that one belongs?

That one still belongs in the right partition, which is why we increment both i and j.

1 more reply

lifthrasiir2y ago

Will it be viable to use SIMD to further optimize the algorithm when the predicate is simple enough? For example, one can imagine that you take v[j] .. v[j+N-1] and somehow weed them into two partitions u[0] .. u[k] and u[k+1] .. u[N-1] (probably in SIMD registers), which get pasted accordingly at v[i] or v[j]. The current assembly indeed looks very tight but a higher throughput at the expense of longer code is a common theme in SIMD, I guess.

orlp2y ago

SIMD partitioning is definitely a thing, but I think it is ill-suited for Lomuto-style partitioning. I think fulcrum (what crumsort uses) or out-of-place partitioning (in both cases the output buffers for < and >= are distinct) like glidesort does is the most amenable to SIMDization.

Then partitioning is 'simply' a vector comparison, two masked compressing stores (through shuffles or _mm_mask_compressstoreu_epi32) with one of the masks inverted, and counting how many elements were smaller with _mm256_movemask_epi8 and a popcnt.

For an out-of-place partition you can interleave the loops of one going left-to-right and one going right-to-left to increase instruction-level parallelism.

1 more reply

JonChesterfield2y ago

Is the assembly the inner loop or the whole function? The function looks like it could be restructured into a tail call - the trailing swap is quite like the loop body - but I'd be surprised if a compiler managed that transform.

orlp2y ago

The assembly is only for the inner loop, I did not include the preamble/postamble of the functions.

You can look at the full assembly here: https://cpp.godbolt.org/z/zzzTh47PG. But the full assembly isn't a fair comparison right now because I did not bother to convert both functions to have the same signature (Andrey's version assumes the pivot is in the array and selects it in the function).

Virtually all the time for any non-trivial input is spent in the inner loop though.

kazinator2y ago

Lomuto partitioning is flawed though; when implementing Quicksort, you want the original Hoare one.

Has someone attempted branchless Hoare partitioning?

orlp2y ago

> Lomuto partitioning is flawed though; when implementing Quicksort, you want the original Hoare one.

Can you elaborate? It is 'flawed' in that it requires more moves than Hoare as I write in the conclusion, but as the benchmarks show for a variety of input sizes, types and distributions it is superior despite the extra data movement.

> Has someone attempted branchless Hoare partitioning?

I mention a paper and three implementations (one of which is my own) that implement some variety of branchless Hoare partitioning in one way or another in the section on Hoare partitioning.

kazinator2y ago

Another problem is that if you have an array of identical values, Hoare contrived things so that his two partitioning pointers meet in the middle, ensuring O(n log n) behavior. Lomuto produces a degenerate partition, resulting in the quadratic worst case.

1 more reply

j / k navigate · click thread line to collapse

21 comments

killcoderOP2y ago

There was a recent post by Voultapher from the sort-research-rs project on Branchless Lomuto Partitioning

https://github.com/Voultapher/sort-research-rs/blob/main/wri...

Discussion here:

https://news.ycombinator.com/item?id=38528452

This post by orlp (creator of Pattern-defeating Quicksort and Glidesort) was linked to in the above post, and I found both to be interesting.

kazinator2y ago

What can you actually sort branch-free? Doesn't the comparison have to be inlined, and tiny, not to mention itself branchless? Forget string keys.

IPv4 addresses in a routing table or something.

Pointers, by increasing or decreasing address. Useful if we want to compact the objects in memory.

It's useful to sort floating-point values in a certain way before adding them together, to avoid adding a low magnitude X to a big magnitude Y such that X disappears.

orlp2y ago

> Doesn't the comparison have to be inlined, and tiny, not to mention itself branchless?

Yes.

The inlining generally isn't an issue in languages like C++ and Rust.

> Forget string keys.

You can sort string keys (mostly) branchlessly with radix sorting, but yes for comparison sorting you can forget it.

anonymoushn2y ago

For string keys, I think you could do multi-key quicksort with almost no branches, but the partitioning becomes more troublesome because you need to separate elements that are <, =, and > your pivot.

orlp2y ago

Author here if anyone has any questions.

stavros2y ago

Thanks for the article! I'm finding it hard to understand this part:

orlp2y ago

> So you start left, the first element on the left partition belongs in the left partition, so you swap it into the right partition?

Consider this example:

    l l l l r r r r r ? ? ? ?
            ^         ^
            i         j

Now if the first ? was an r element, you could simply increment j and be done with it. But suppose it was an l element, then you have this scenario:

    l l l l r r r r r l ? ? ?
            ^         ^
            i         j

            +- swap --+
            v         v
    l l l l l r r r r r ? ? ?
            ^         ^
            i         j

So now incrementing i and j both fixes our invariant:

    l l l l l r r r r r ? ? ?
              ^         ^
              i         j

> And what about the element that was in the right partition, when do you check where that one belongs?

That one still belongs in the right partition, which is why we increment both i and j.

1 more reply

lifthrasiir2y ago

orlp2y ago

For an out-of-place partition you can interleave the loops of one going left-to-right and one going right-to-left to increase instruction-level parallelism.

1 more reply

JonChesterfield2y ago

orlp2y ago

The assembly is only for the inner loop, I did not include the preamble/postamble of the functions.

Virtually all the time for any non-trivial input is spent in the inner loop though.

kazinator2y ago

Lomuto partitioning is flawed though; when implementing Quicksort, you want the original Hoare one.

Has someone attempted branchless Hoare partitioning?

orlp2y ago

> Lomuto partitioning is flawed though; when implementing Quicksort, you want the original Hoare one.

> Has someone attempted branchless Hoare partitioning?

I mention a paper and three implementations (one of which is my own) that implement some variety of branchless Hoare partitioning in one way or another in the section on Hoare partitioning.

kazinator2y ago

1 more reply

j / k navigate · click thread line to collapse