The repo I linked is his C++ version of the C# original.
It was submitted to the C# standard library but was rejected (as an observer, the rejection was disappointing) – had it been accepted C# would have been in the interesting position of having a faster sort than any native language standard library, at least for integer keys.
Copying the blog post to arxiv and citing some related work so it comes up in alerts and cited-by searches would [have made/make] this good work visible to a wider circle, including the algorithms community.
It's nice to see a C++ version, this will be easier for me to benchmark, and it also seems to support other built-in types now, as well as AVX2+AVX-512 (though not other platforms). A quick list of differences based on reading the blog:
* Nifty idea for simplifying the branch in partition. (We also found that branchless code is worse)
* Major effort to make all reads vector-aligned. This could make a difference especially on AVX-512. I had tried two variants but it was difficult to make this safe for asan (no out of bounds reads allowed).
* The lookup table is 2KB, which could be halved using 32-bit broadcast + variable shift instead of PDEP.
* The sorting network seems to be half the size and bitonic. I haven't yet compared the number of permute instructions (not obvious from the source because it's generated).
* The pivot is only a median of three. We use a much larger sample.