Why not use whichever AVX the CPU has? Not a problem when using runtime dispatch :)
I'm not sure what you mean by that. You can't assume the presence of AVX or AVX2 without explicitly checking for it, because Intel was still disabling those features on new low-end Pentium and Celeron parts at least as recently as Comet Lake (2020). Sure, AVX2 support is much more widespread than AVX-512 support, but that has nothing to do with open source, and it's a bit strange to describe that in terms of reliability.
In short, what is being compared is djbsort's fixed-size (O(1)) sorting network vs. our full quicksort with pivot sampling, partitioning, then a sorting network.
This is because our sorting network size is 16 * elements_per_vector, i.e. 128 in this configuration.
[0] https://web.engr.oregonstate.edu/~mjb/cs575/ [1] https://media.oregonstate.edu/media/t/1_7wju0jtq [2] https://web.engr.oregonstate.edu/~mjb/cs575/Projects/proj04....
Does anyone have others to share?
The Wavefront Algorithm (WFA) flips the problem on its head by progressively exploring the best scoring alignment until a global alignment is attained. Then no more work needs to be done to fill the matrix. The total work is actually quadratic in sequence divergence rather than length, a huge improvement over SWG for almost all applications.
In WFA the data dependencies are trivial and compilers easily auto-vectorize the inner loop of the algorithm. It's also possible to implement this in linear memory relative to sequence divergence with a bidirectional approach (biWFA).
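To make the wavefront structure concrete, here is a sketch of WFA's unit-cost relative, Myers' O(ND) greedy diff (insertions/deletions only, no substitutions, so this is not the gap-affine WFA itself): each wavefront stores, per diagonal k, the furthest-reaching offset, and the tight inner extension loop just races down matching characters.

```cpp
#include <string>
#include <vector>

// Myers' O(ND) edit distance (insert/delete only): wavefront d stores the
// furthest-reaching offset V[k] on each diagonal k = x - y.
int wavefront_edit_distance(const std::string& a, const std::string& b) {
  const int n = static_cast<int>(a.size()), m = static_cast<int>(b.size());
  const int max_d = n + m;
  std::vector<int> V(2 * max_d + 2, 0);
  auto at = [&](int k) -> int& { return V[k + max_d]; };
  for (int d = 0; d <= max_d; ++d) {
    for (int k = -d; k <= d; k += 2) {
      // Choose the predecessor diagonal that reaches furthest.
      int x = (k == -d || (k != d && at(k - 1) < at(k + 1)))
                  ? at(k + 1)       // step down (insertion)
                  : at(k - 1) + 1;  // step right (deletion)
      int y = x - k;
      // Greedy extension along the diagonal: the cheap, tight inner loop.
      while (x < n && y < m && a[x] == b[y]) { ++x; ++y; }
      at(k) = x;
      if (x >= n && y >= m) return d;  // reached both ends: d edits suffice
    }
  }
  return max_d;
}
```

Note that the total work tracks the number of edits d, not the sequence lengths, which is the same "quadratic in divergence" property described above.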
All this is to say that vectorization and SIMD hardware are cool, but new theory and approaches can completely overwhelm their potential benefits.
Manual
https://www.agner.org/optimize/vcl_manual.pdf
GitHub
https://github.com/vectorclass/version2
Even for trivial stuff like linebreaking input files, massive speedups are there for the taking.

Sort of tangentially related: what sort of tools are you using for disassembly in 2022? Is there anything better than objdump today? What I really want is a version of godbolt that reads compile_commands.json and can show me the godbolt-style color-coded disassembly for an entire binary, any function. I find that when I'm asking these sorts of questions I'm wasting a lot of time fitting my problem into godbolt just to get a rough idea of what's coming out, because it's so much faster (maybe 10x) to read the color-tagged version from godbolt than it is to do the same with objdump.
Edit: along the same lines, what can people do (or rather, what do you do) about situations like this: https://github.com/google/highway/blob/master/hwy/examples/b... where things might get better in the future but for now we have to implement it another way? How do you avoid regressions and how do you track workarounds? I want some sort of database that tries competing implementations on every build and emits a compiler warning when things get better. How are you dealing with this?
I'm also a huge fan of Godbolt/Compiler Explorer. Highway is integrated there, so you can just copy in your functions. Here's the last throwaway test that I was using: https://gcc.godbolt.org/z/5azbK95j9
> things might get better in the future but for now we have to implement it another way
There are several possible answers. 1) For the issue you pointed to, that we cannot have arrays of N vectors, it's a reasonable workaround to instead allocate an array of N vectors' worth of elements, and trust the compiler to elide unnecessary Load/Store. This often works using clang, and would avoid having to manually unroll here. I do prefer to minimize the amount of compiler magic required for good performance, though, so typically we're unrolling manually as shown in that code.
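A scalar analogue of that workaround, with no SIMD library dependency (kLanes and kVectors are hypothetical stand-ins for Lanes(d) and N in real code): work through a flat aligned buffer of N vectors' worth of elements, and let the optimizer keep the buffer in registers and elide the buffer loads/stores.

```cpp
#include <cstddef>

// Hypothetical lane/vector counts standing in for Lanes(d) and N.
constexpr std::size_t kLanes = 8;
constexpr std::size_t kVectors = 4;

// Scale `n` floats by 2, with n a multiple of kVectors * kLanes. Instead of
// an "array of N vectors" (not portably expressible), the loop body uses a
// flat aligned buffer of N vectors' worth of elements; a vectorizing
// compiler is free to keep the whole buffer in registers.
void scale_by_two(float* data, std::size_t n) {
  alignas(64) float buf[kVectors * kLanes];
  const std::size_t block = kVectors * kLanes;
  for (std::size_t i = 0; i < n; i += block) {
    for (std::size_t j = 0; j < block; ++j) buf[j] = data[i + j];  // "Load"
    for (std::size_t j = 0; j < block; ++j) buf[j] *= 2.0f;        // compute
    for (std::size_t j = 0; j < block; ++j) data[i + j] = buf[j];  // "Store"
  }
}
```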
2) If there are compiler bugs, typically the workarounds have to stay in because in the open-source world people are still using that compiler years later.
Automatically detecting when things get better is an interesting idea but I am not yet aware of such infrastructure.
Sometimes the state of the art is not found in another paper but somewhere else, e.g., there is vxsort by Dan Shechter (damageboy):
https://github.com/damageboy/vxsort-cpp/tree/master
He uses a similar approach and while I'm not sure how it compares to the older Blacher et al version, I expect it to be in the ballpark.
He's written an excellent series of blog posts on the design & implementation:
The white paper is calling out the memory bandwidth as a major bottleneck, which is where I have a question.
The M1 Max sports an unusually wide (512-bit) memory bus coupled with 6.4 GHz DDR5 unified memory which, according to the available empirical evidence, allows a single CPU core to achieve circa 100 GB/s memory transfer speed. M1 cores also feature a very large L1 D-cache as well as a large L2 cache (48 MB has been reported). Results, however, are approximately 2.4x lower for the M1 Max. I do realise that NEON SIMD processing will always be slower than AVX-512-based processing due to the 4x difference in vector size, but isn't the M1 Max supposed to perform somewhat faster than the observed figures, since the faster and much wider memory bus would partially compensate for NEON's inefficiency?
Other than the vector size difference between NEON and AVX-512, would you attribute the difference in performance to the small test batch size (~4/8/16 MB for the 32/64/128-bit unit sizes used in the test) being able to fit in the L2 cache nearly entirely, or to GCC being unaware of cache line sizes on the M1 Pro/Max and therefore underutilising the cache or scheduling instructions inefficiently, or would you purely attribute it to NEON having not aged gracefully to meet current data processing demands?
Thank you.
L2 caches are typically partitioned and private to a core. If that also applies to the M1 Max, then each core would only access 3 MB, thus the working set is larger than cache as intended.
I do believe NEON is the limiting factor here. I haven't looked into how many IPC we should expect, but even if it is 4 (the number of 128-bit NEON units), Skylake is often reaching 2 (with 512-bit vectors), so the measurement that an M1 core is about half as fast as SKX for this use case seems plausible.
Thanks for open sourcing!
One requirement is C++11 or later - some people have asked about supporting C but I believe that's infeasible.
Highway instead aims for a path that each platform can implement efficiently. For example, ReorderWidenMulAccumulate bridges differences between NEON and x86 which user code need not care about if they just want a dot product, and CountTrue is efficient on both platforms without requiring NEON to go to the extra trouble of matching the exact x86 movmskb semantics. Also, Highway uses only intrinsics and does not rely on autovectorization (+), which makes for more predictable performance.
+ except in the EMU128 target, which is only used if SIMD/intrinsics are disabled
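As a scalar reference for what an op like ReorderWidenMulAccumulate ultimately computes (a sketch; the function name here is illustrative, not the Highway API): int16 products are widened to int32 before accumulation, and a SIMD implementation may add the partial products in a different lane order, which doesn't matter when the caller only wants the reduced dot product.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference for a widening multiply-accumulate dot product: each
// int16 product is widened to int32 so it cannot overflow, and the running
// sum is kept wider still. A SIMD implementation is free to sum the partial
// products in any lane order.
int64_t dot_widen_i16(const int16_t* a, const int16_t* b, std::size_t n) {
  int64_t sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  return sum;
}
```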
In short, it turns out not to help for single core with vectors, but a few initial passes of ips4o (with 64..256-way partitioning) is faster for parallel sorts.
I don't know what's worth looking at!
Thanks!
Some of the things on my short to medium-term todo list include:
- CompressStore for 8-bit lanes (before Icelake, that will likely require splitting up into runs of 8 bytes with one PSHUFB + store each)
- also ExpandLoad (analogous to CompressStore) and LeadingZeroCount
- the C++ STL isn't autovectorized much (http://0x80.pl/notesen/2021-01-18-autovectorization-gcc-clan...). Manual vectorization can help, we have some of the simpler algorithms already in hwy/contrib/algo, others such as unique, reverse, minmax_element, all_of are more interesting.
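For reference, the ExpandLoad semantics mentioned in the list above can be pinned down with a scalar model (a sketch of the semantics only, not Highway's actual signature): consecutive source values are distributed to the lanes whose mask bit is set, and unselected lanes are zeroed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of ExpandLoad: reads values sequentially from `src` and
// places them at the output positions where mask[i] is true; positions with
// mask[i] false are zeroed. Returns how many source values were consumed.
std::size_t expand_load(const int32_t* src, const std::vector<bool>& mask,
                        std::vector<int32_t>& out) {
  out.assign(mask.size(), 0);
  std::size_t consumed = 0;
  for (std::size_t i = 0; i < mask.size(); ++i)
    if (mask[i]) out[i] = src[consumed++];
  return consumed;
}
```

This is the inverse of CompressStore, which gathers the selected lanes into consecutive memory.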
EDIT: to be fair it does seem that Bramas' first published work on SIMD sorting (that I could find) predates djbsort, https://arxiv.org/abs/1704.08579. A comparison would still be nice though.
It looks like this algorithm is interesting in the specifics of how it works, eg. the use of a compress-store instruction for partitioning. The algorithm I mentioned above mostly focused on speeding up comparisons through the use of SIMD instructions, rather than changing the partitioning part of the QuickSort algorithm itself. Not sure how that compares to djbsort, I didn't see technical details of the algorithm on the page.
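A scalar sketch of partitioning expressed in compress-store terms (illustrative only; the actual vectorized code processes a whole vector per step): elements below the pivot are packed contiguously from a left cursor, the rest from a right cursor, and a SIMD version replaces the per-element branch with one compress-store per side.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Partition `in` around `pivot` into `out`: values < pivot are packed to a
// left cursor, the rest to a right cursor. Returns the number of elements
// < pivot. A SIMD version does the same with one compress-store per side
// per vector instead of a branch per element.
std::size_t partition_compress(const std::vector<int32_t>& in, int32_t pivot,
                               std::vector<int32_t>& out) {
  out.resize(in.size());
  std::size_t left = 0, right = in.size();
  for (int32_t v : in) {
    if (v < pivot) out[left++] = v;
    else out[--right] = v;
  }
  return left;
}
```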
This is not my field, so I can't judge the relevance of this, but the author cites the state of the art. It just seems that djbsort is not the state of the art and therefore does not need to be cited.
I still don't understand the hype around Bernstein; he is quite relevant in cryptography/cryptography implementations, but not everything he does is gold.
[1]: https://raphlinus.github.io/gpu/2020/09/05/stack-monoid.html [2]: https://arxiv.org/abs/2205.11659
In other words, do you have to be more careful when integrating these wide operators to preserve some resources for other operations?
If you're really taking a microscope to performance, the main hazards would be intermittently using AVX for only a few instructions, because that might lead to the CPU stopping for a few microseconds to turn the power on and off on the functional units. If you're using them heavily the overall thermal situation might cause a core or package-wide clock rate degradation, but if you have a use case for sustained AVX-512 usage this is likely to be a good tradeoff.
But to my knowledge they use different registers, and when properly pipelined they don't hog the CPU cache like unoptimized algorithms that constantly round trip to RAM.
If it is a compute-heavy code that was spilling to a stack and now can fit in the SIMD registers, there could be a speed up with less memory traffic. This is analogous to a program's working set either fitting in RAM or spilling to swap while it runs, but at a different level of the memory hierarchy. In the extreme case, your working set could be a fixed set of variables in expensive loops for some sequential algorithm, and so the compiler register placement ability and number of registers available could be the difference between effectively O(1) or O(n) memory accesses with respect to the number of compute steps.
Of course, you can find such transitions between registers/L1 cache, L1/L2 cache, L2/L3 cache (if present), local/remote RAM (for modern NUMA machines), etc. This happens as you transition from a pure CPU-bound case to something with bigger data structures, where the program focus has to move to different parts of the data to make progress. Naively holding other things equal, an acceleration of a CPU-bound code will of course mean you churn through these different data faster, which means more cache spills as you pull in new data and have to displace something else. Going back to the earlier poster's question, this cache spilling can have a detrimental effect on other processes sharing the same cache or NUMA node, just like one extremely bloated program can cause a virtual memory swapping frenzy and hurt every other program on the machine.
One form of "pipelining" optimization is to recognize when your loop is repeatedly fetching a lot of the same input data in multiple iterations and changing that to use variables/buffers to retain state between iterations and avoid redundant access. This often happens in convolution algorithms on arrays of multi-dimensional data, e.g. image processing and neural nets and many cellular-automata style or grid-based simulations. Your algorithm slides a "sampling window" along an array to compute an output value as a linear combination of all the neighboring values within the window using some filtering kernel (a fixed pattern of multiplicative scalars/weights). Naive code keeps loading this whole window once for each iteration of the loop to address one output position. It is effectively stateless except for the loops to visit every output. Pipelined code would manage a window buffer and fetch and replace only the fringe of the buffer while holding and reusing inputs from the previous iterations. While this makes the loops more stateful, it can dramatically reduce the number of memory reads and shift a memory-bound code back into a CPU-bound regime.
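As a minimal sketch of that stateful-window idea (a 1D moving sum, deliberately tiny compared to the convolution case): rather than re-reading all K inputs per output, carry the running sum across iterations and touch only the fringe, one element entering and one leaving the window.

```cpp
#include <cstddef>
#include <vector>

// Sliding-window moving sum over windows of size k. The naive version
// re-reads k inputs per output; this "pipelined" version keeps the window
// state (the running sum) and performs one add and one subtract per output.
std::vector<float> moving_sum(const std::vector<float>& x, std::size_t k) {
  std::vector<float> out;
  if (x.size() < k || k == 0) return out;
  float sum = 0.0f;
  for (std::size_t i = 0; i < k; ++i) sum += x[i];  // prime the window
  out.push_back(sum);
  for (std::size_t i = k; i < x.size(); ++i) {
    sum += x[i] - x[i - k];  // fetch only the fringe; reuse the rest
    out.push_back(sum);
  }
  return out;
}
```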
Other major optimizations for these N-dimensional convolutions are done by the programmer/designer: some convolution kernels are "decomposable" and the expensive multi-dimensional operation can be strength-reduced to a sequence of 1-dimensional convolutions with appropriately chosen filter kernels that produce the same mathematical result. This optimization has nothing to do with SIMD but simplifies the work to embody the operation: fewer input values to read (e.g. a 1D line of neighbors instead of a 2D box) and fewer arithmetic operations to do all the multiply-and-add steps that produce the output. Imagine a 2D filter that operates on an 8x8 window. The 2D convolution has to sample and combine 64 input neighbors per output, while the decomposed filter does 8 neighbors in each axis, and so one quarter as many steps total after one pass for the X axis and one pass for the Y axis.
Naively, decomposition is done as separate 1D passes over the data, and so performs more memory writes and allocates an additional intermediate array between the original input and final output. This is often a big win in spite of the extra memory demands. It's a lot more coding, but this could also be "pipelined" to some degree in N dimensions to avoid pushing the entire intermediate results arrays through main memory. Approaches differ, but you can make different tradeoffs for how many intermediate results to store or buffer versus redundant calculation in a less stateful manner.
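A sketch of the decomposition for the simplest separable kernel, a KxK box sum (valid region only; the grid type and pass structure here are illustrative): a horizontal pass into an intermediate array followed by a vertical pass reproduces the naive 2D result with 2K adds per output instead of K*K.

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<float>>;

// Naive KxK box sum over the valid region: K*K reads and adds per output.
Grid box2d(const Grid& in, std::size_t k) {
  const std::size_t h = in.size(), w = in[0].size();
  Grid out(h - k + 1, std::vector<float>(w - k + 1, 0.0f));
  for (std::size_t r = 0; r + k <= h; ++r)
    for (std::size_t c = 0; c + k <= w; ++c)
      for (std::size_t dr = 0; dr < k; ++dr)
        for (std::size_t dc = 0; dc < k; ++dc)
          out[r][c] += in[r + dr][c + dc];
  return out;
}

// Decomposed version: a horizontal K-sum pass into an intermediate array,
// then a vertical K-sum pass; 2*K adds per output in total.
Grid box2d_separable(const Grid& in, std::size_t k) {
  const std::size_t h = in.size(), w = in[0].size();
  Grid tmp(h, std::vector<float>(w - k + 1, 0.0f));
  for (std::size_t r = 0; r < h; ++r)
    for (std::size_t c = 0; c + k <= w; ++c)
      for (std::size_t dc = 0; dc < k; ++dc)
        tmp[r][c] += in[r][c + dc];
  Grid out(h - k + 1, std::vector<float>(w - k + 1, 0.0f));
  for (std::size_t r = 0; r + k <= h; ++r)
    for (std::size_t c = 0; c < tmp[0].size(); ++c)
      for (std::size_t dr = 0; dr < k; ++dr)
        out[r][c] += tmp[r + dr][c];
  return out;
}
```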
Generally, as you make the above kinds of optimizations your code becomes more "block-based", loading larger chunks of data with fewer changes to new "random" memory locations. This is very much like how databases and filesystems optimized their access to disk storage in prior decades to avoid random seeks for individual bytes or blocks and to instead use clever structures to support faster loading of sets of blocks or pages. When this is successful, your program does most of its memory access sequentially and achieves higher throughput that is bound by total memory bus bandwidth rather than random access latency limitations.
The final form of "pipelining" matters once you really hit your stride with these optimized, block-based programs. All this access again can cause cache spilling as you sustain high bandwidth memory access. But if you can provide the right hints to the CPU, or if the CPU is clever enough to detect the pattern, it can shift into a different cache management regime. Knowing that you will only access each block of data once and then move on (because you are holding the reusable state in registers), the CPU can use a different prefetching and spilling policy to both optimize for your next memory access and to avoid churning the whole cache with these values you will only read once and then ignore. This reduces the impact of the code on other concurrent tasks which can still enjoy more reasonable caching behavior in spite of the busy neighbor.
If you want your code to have a low perturbation on other concurrent work done by the system, implementing it in an inefficient way doesn't usually help with that goal, since then your code will just be executing all the time because it takes a long time to finish. And you still won't have good control of the execution resources compared to using normal tools like OS scheduling policies to address the goal.
I've actually thought all this stuff has already been solved?
Is there a point in still trying to improve efficiency of sorting?
I can do that! I'd love such a challenge if there was a point to it!
On the one hand this means if you can come up with a better way to do it in some case that's awesome and could be very valuable. On the other hand it means a lot of people have thought pretty hard about it and the next improvement may be hard to come up with unless you pick a tight enough subset of the problem.
Definitely go for it though :)
So, in FreePascal there's some subpar pattern matching/string search function. It annoyed me so greatly that I started researching the "state of the art", and apparently they're all crap. My fixed-size pattern matching solution fits into less than 256 bytes, not counting setup code, which is minimal.
And here start the problems for me, because I have no way of actually comparing my solution to "the state of the art": it doesn't offer me downloadable, compiled benchmarks I can just run against my own. Learning a new language (like, I don't speak C and certainly never will), installing a compiler, getting through all the walls that will inevitably be in my way to getting it to run, etc. ... totally breaks my time budget, and thus I drop the problem.
Basically, what I'm looking for is:
Give me a problem that's worth tackling, with a defined set of parameters and a compiled solution I can compare my own against and I will beat it eventually. I've tried doing this on my own, but that's just a big no-go on many levels, so I'm dependent on someone handing something to me.
Like ... the QuickSort. I have no way of comparing my potential own solution against "the state of the art", which is - as far as I know - pretty much all that's holding me back.
It would be more likely to show up in something like Apache Arrow, which is designed columnar to leverage these sorts of tricks.
Those are things that are super simple, once you get past the scary mystique of SIMD.
It’s stuff like this that should be getting all the witch burning :D
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
Radix sort may actually be faster if you know that only a few bytes are guaranteed to distinguish keys, but that's difficult to guarantee/assume at the library level.
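As a sketch of why that helps (the function and parameter names here are illustrative): an LSD radix sort does one stable counting pass per key byte, so if only the low `key_bytes` bytes can distinguish keys, the remaining passes can be skipped entirely.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort on uint32 keys: one stable counting pass per key byte,
// least-significant byte first. If only the low `key_bytes` bytes differ
// between keys, passing a smaller `key_bytes` skips the useless passes.
void radix_sort_u32(std::vector<uint32_t>& a, int key_bytes = 4) {
  std::vector<uint32_t> tmp(a.size());
  for (int pass = 0; pass < key_bytes; ++pass) {
    const int shift = pass * 8;
    std::array<std::size_t, 257> offsets{};  // offsets[b+1] counts bucket b
    for (uint32_t v : a) ++offsets[((v >> shift) & 0xFF) + 1];
    for (int b = 0; b < 256; ++b) offsets[b + 1] += offsets[b];  // prefix sum
    for (uint32_t v : a) tmp[offsets[(v >> shift) & 0xFF]++] = v;
    a.swap(tmp);
  }
}
```

Each pass is stable, which is what lets later (more significant) passes preserve the order established by earlier ones.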
That's pretty disingenuous. The standard implementation is known to be terribly slow because of constraints. They should compare against current state of the art.
Well, to me it sounds like C++ std::sort is simply lacking an optimization which could very easily be implemented in the next iteration of the compiler/linker. I had high hopes this would be a real breakthrough algorithm, not the lack of a simple optimization using some instruction set the compiler/linker didn't implement. If they wanted, CPU vendors could make an even more awesome and faster instruction set, say SIMD2, which would leave this one in the dust.
It's definitely interesting to discuss std::sort benefitting from such optimizations, but "very easy" seems rather optimistic.