https://github.com/Voultapher/sort-research-rs/blob/main/wri...
Discussion here:
https://news.ycombinator.com/item?id=38528452
This post by orlp (creator of Pattern-defeating Quicksort and Glidesort) was linked to in the above post, and I found both to be interesting.
IPv4 addresses in a routing table or something.
Pointers, by increasing or decreasing address. Useful if we want to compact the objects in memory.
It's useful to sort floating-point values in a certain way before adding them together, to avoid adding a low-magnitude X to a big-magnitude Y such that X disappears.
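For example, summing in order of increasing magnitude lets small addends accumulate before they meet a value large enough to absorb them (a minimal sketch; the function names and values are my own illustration):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Plain left-to-right summation.
double naive_sum(std::vector<double> v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Sort by increasing magnitude first, so the small addends accumulate
// before they meet a large value whose ulp would absorb them.
double magnitude_sorted_sum(std::vector<double> v) {
    std::sort(v.begin(), v.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    return std::accumulate(v.begin(), v.end(), 0.0);
}
```

With {1e16, 1.0, 1.0, 1.0, 1.0} the naive sum stays at 1e16 (the ulp of 1e16 is 2, so each lone 1.0 rounds away), while the magnitude-sorted sum yields 1e16 + 4.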
Yes.
This generally means floats and integers, or small combinations thereof. Note that this is after projection of the comparison operator, so things can remain branchless when sorting e.g. strings by length.
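For instance, sorting strings by length projects each comparison onto an integer key, so the comparison itself can stay branchless even though the elements are strings (a sketch; the function name is mine):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sort strings by length: the comparator only touches size(), an
// integer, so the comparison can compile down to a branchless
// flag/cmov pattern even though the sorted elements are strings.
void sort_by_length(std::vector<std::string>& v) {
    std::sort(v.begin(), v.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() < b.size();
              });
}
```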
The inlining generally isn't an issue in languages like C++ and Rust.
> Forget string keys.
You can sort string keys (mostly) branchlessly with radix sorting, but yes for comparison sorting you can forget it.
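A minimal MSD (most-significant-digit-first) radix sort sketch for byte strings, my own illustration rather than any implementation the comment refers to: each pass buckets by one byte with counting, so no element-to-element comparisons occur at all.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Bucket index of character d of s; bucket 0 means "string ended",
// so shorter strings sort before their extensions.
static int char_at(const std::string& s, size_t d) {
    return d < s.size() ? static_cast<unsigned char>(s[d]) + 1 : 0;
}

// Sort v[lo, hi) starting from byte position d.
void msd_radix_sort(std::vector<std::string>& v, size_t lo, size_t hi, size_t d) {
    if (hi - lo <= 1) return;
    const int R = 257;                          // 256 byte values + end-of-string
    std::vector<size_t> count(R + 1, 0);
    for (size_t i = lo; i < hi; ++i) count[char_at(v[i], d) + 1]++;
    for (int b = 0; b < R; ++b) count[b + 1] += count[b];
    std::vector<size_t> start(count);           // bucket start offsets
    std::vector<std::string> aux(hi - lo);
    for (size_t i = lo; i < hi; ++i) aux[count[char_at(v[i], d)]++] = std::move(v[i]);
    std::move(aux.begin(), aux.end(), v.begin() + lo);
    for (int b = 1; b < R; ++b)                 // bucket 0 (exhausted strings) is done
        msd_radix_sort(v, lo + start[b], lo + start[b + 1], d + 1);
}
```

In practice real implementations fall back to comparison sorting for small buckets; this sketch recurses all the way down for brevity.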
> That is, there is a single iterator scanning the array from left-to-right (j). If the element is found to belong in the left partition, it is swapped with the first element of the right partition (tracked by i), otherwise it is left where it is.
So you start left, the first element on the left partition belongs in the left partition, so you swap it into the right partition? And what about the element that was in the right partition, when do you check where that one belongs?
Yes and no. You do temporarily swap it into the right partition, but then you increment the pointers, which redefines where the right partition is: you swap it with the first element of the right partition before incrementing both i and j. In essence this moves the entire right partition one step to the right, while putting the previously unknown element just before it.
Consider this example:
  l l l l r r r r r ? ? ? ?
          ^         ^
          i         j
Now if the first ? was an r element, you could simply increment j and be done with it. But suppose it was an l element, then you have this scenario:

  l l l l r r r r r l ? ? ?
          ^         ^
          i         j
Note that "the first element of the right partition" is equivalent to "the first element after the left partition". Does it now make more sense that swapping the first element of the right partition and the unknown element (v[i], v[j]) is the right thing to do? After our swap we have this:

          +--swap---+
          v         v
  l l l l l r r r r r ? ? ?
          ^         ^
          i         j
So now incrementing i and j both fixes our invariant:

  l l l l l r r r r r ? ? ?
            ^         ^
            i         j
> And what about the element that was in the right partition, when do you check where that one belongs?

That one still belongs in the right partition, which is why we increment both i and j.
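The loop above can then be written without branching on the comparison: swap v[i] and v[j] unconditionally, and advance i by the comparison result as data rather than as a branch (a sketch of the idea in C++; names are mine):

```cpp
#include <cstddef>

// Branchless Lomuto partition: maintains v[0, i) < pivot and
// v[i, j) >= pivot. The swap happens unconditionally; when v[j] was
// already an r element it merely rotates elements within the right
// partition, so the invariant still holds. Only the increment of i
// depends on the comparison, as a 0/1 value, not a branch.
size_t lomuto_partition(int* v, size_t n, int pivot) {
    size_t i = 0;
    for (size_t j = 0; j < n; ++j) {
        int x = v[j];
        v[j] = v[i];
        v[i] = x;
        i += static_cast<size_t>(x < pivot);
    }
    return i; // number of elements smaller than the pivot
}
```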
Then partitioning is 'simply' a vector comparison, two masked compressing stores (through shuffles or _mm_mask_compressstoreu_epi32) with one of the masks inverted, and counting how many elements were smaller with _mm256_movemask_epi8 and a popcnt.
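A scalar model of one 8-wide step may make the shape clearer (my own illustration of just the mask/compress/popcount logic; the vector version replaces all of these loops with one compare, two compressing stores, and a popcnt; __builtin_popcount is a GCC/Clang builtin):

```cpp
#include <cstdint>

// Scalar model of one 8-wide vectorized partition step: build a
// comparison mask, "compress" the smaller elements to one output and
// the rest to the other (the same mask, inverted), and count the
// smaller ones via popcount.
int partition_block8(const int32_t* in, int32_t pivot,
                     int32_t* out_small, int32_t* out_big) {
    unsigned mask = 0;
    for (int k = 0; k < 8; ++k)
        mask |= static_cast<unsigned>(in[k] < pivot) << k;

    int ns = 0, nb = 0;
    for (int k = 0; k < 8; ++k) {
        if (mask & (1u << k)) out_small[ns++] = in[k];
        else                  out_big[nb++]   = in[k];
    }
    return __builtin_popcount(mask); // == ns, how far to advance the small side
}
```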
For an out-of-place partition you can interleave two loops, one going left-to-right and one going right-to-left, to increase instruction-level parallelism.
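Sketched roughly (my own illustration, not a tuned implementation): each iteration runs two independent chains of work, one consuming from each end of the input, whose loads, compares and stores the CPU can overlap.

```cpp
#include <cstddef>

// Out-of-place partition with two interleaved scans: one from the left
// end of the input, one from the right end. Both write smaller elements
// to the front of the output and larger ones to the back. The two
// chains are data-independent within an iteration, so they can execute
// in parallel on an out-of-order core.
size_t partition_out_of_place(const int* in, size_t n, int pivot, int* out) {
    size_t lo = 0, hi = n;   // next free slot at each end of out
    size_t a = 0, b = n;     // left and right read cursors
    while (a < b) {
        // chain 1: element from the left side
        int x = in[a++];
        bool xs = x < pivot;
        out[xs ? lo : hi - 1] = x;
        lo += xs;
        hi -= !xs;
        if (a >= b) break;
        // chain 2: element from the right side
        int y = in[--b];
        bool ys = y < pivot;
        out[ys ? lo : hi - 1] = y;
        lo += ys;
        hi -= !ys;
    }
    return lo; // number of elements smaller than the pivot
}
```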
You can look at the full assembly here: https://cpp.godbolt.org/z/zzzTh47PG. But the full assembly isn't a fair comparison right now because I did not bother to convert both functions to have the same signature (Andrey's version assumes the pivot is in the array and selects it in the function).
Virtually all the time for any non-trivial input is spent in the inner loop though.
Has someone attempted branchless Hoare partitioning?
Can you elaborate? It is 'flawed' in that it requires more moves than Hoare, as I write in the conclusion, but as the benchmarks show, it is superior for a variety of input sizes, types, and distributions despite the extra data movement.
> Has someone attempted branchless Hoare partitioning?
I mention a paper and three implementations (one of which is my own) that implement some variety of branchless Hoare partitioning in one way or another in the section on Hoare partitioning.