undefined | Better HN

0 pointsjanwas3y ago0 comments

Thanks for sharing the pointers. I hadn't come across that, will read soon :)

0 comments

Interestingly from the portability angle, this was all written in C# but with the SIMD intrinsics there he was still able to obtain ~SOTA performance.

The repo I linked is his C++ version of the C# original.

It was submitted to the C# standard library but was rejected (as an observer, the rejection was disappointing) – had it been accepted C# would have been in the interesting position of having a faster sort than any native language standard library, at least for integer keys.

janwasOP3y ago

I've read parts 1-5, there doesn't seem to be a part6. Thanks again for the link, it is regrettable that there is duplication of effort. His approach is quite similar to ours (including HeapSort, sorting network, unrolled in-place partition), and re-invents what he calls "double-pumped partitioning" (introduced by Bramas).

Copying the blog post to arxiv and citing some related work so it comes up in alerts and cited-by searches would [have made/make] this good work visible to a wider circle, including the algorithms community.

It's nice to see a C++ version, this will be easier for me to benchmark, and it also seems to support other built-in types now, as well as AVX2+AVX-512 (though not other platforms). A quick list of differences based on reading the blog:

* Nifty idea for simplifying the branch in partition. (We also found that branchless code is worse)

* Major effort to make all reads vector-aligned. This could make a difference especially on AVX-512. I had tried two variants but it was difficult to make this safe for asan (no out of bounds reads allowed).

* The lookup table is 2KB, which could be halved using 32-bit broadcast + variable shift instead of PDEP.

* The sorting network seems to be half the size and bitonic. I haven't yet compared the number of permute instructions (not obvious from the source because it's generated).

* The pivot is only a median of three. We use a much larger sample.

j / k navigate · click thread line to collapse