Thinking parallel gets you enormous speed benefits for any number of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-base64/
I wouldn't be holding my breath though - proper support of high-level portable SIMD abstraction requires quite a lot of compiler complexity due to how wide (heh) the API surface of SIMD extensions is in most ISAs, and because of details necessary to get right to keep data in appropriate (vector and/or mask) registers. This, naturally, goes in the complete opposite direction to the design philosophy of Go's compiler. Instead, you are supposed to write a custom Go ASM syntax, with byte literals used to encode opcodes if they are not natively supported (which is common).
If you're interested in what high-effort SIMD implementation in this kind of language looks like, take a look at C#'s cross-platform Vector API: https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd... (uses platform intrisics, but showcases that you can go one abstraction level lower, retaining the same Vector128<T> type if you need to specialize a particular part of your algorithm for a platform, without having to maintain separate copy for each one)
Here's high-effort vectorized CRC64 implementation that uses these: https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945... (performs as fast as C++-based mnemonic variant)
Mojo's effort in bringing portable SIMD abstraction to Python audience is commendable. I'm looking forward to open-sourcing of it to try it out!
For anyone's curious, the reason I'm mostly talking about C# above is that its Vector API is the most accessible and mature portable SIMD abstraction that is part of standard library / comes out of box among most other options - you really only need to install SDK and `dotnet new console` to start working with it over more complex alternatives.
Though Haswell seems like a pretty obsolete platform to optimize for at this point. Even Skylake will be a decade old next year.
AVX-512 is no longer ubiquitous on Intel servers, but only on new AMD servers.
Even earlier, there were cheap Intel servers with CPUs using Atom cores, for example the Denverton, Snow Ridge, Denverton Refresh, Snow Ridge Refresh, Parker Ridge and Arizona Beach series of server CPUs. None of these supported AVX-512 and many did not support even AVX.
However, now, after the launch of the Sierra Forest server CPUs, which will be followed next year by the Clearwater Forest server CPUs, the Atom cores have expanded up to the biggest Intel server CPUs. While such server CPUs are intended for applications where computations using array operations are less important, like Web servers or the hosting of many small virtual machines, the fragmentation of the Intel ISA is extremely annoying, especially when AMD demonstrates how they can implement the same ISA, but at different levels of performance (by varying the number of vector pipelines and the maximum achievable clock frequency) both in laptop CPUs and in desktop/server CPUs and both in compact cores with low clock frequency and in big cores with high clock frequency.
At least for me, the lack of AVX-512 support is the reason that made me stop buying Intel CPUs already some years ago, even if there are some features of the Intel CPUs that I prefer over the AMD CPUs (like TSC deadline), but none of those can compensate the lack of AVX-512 support.
The greater width of AVX-512 is not its greater advantage, but the mask registers and a more complete set of instructions, which simplify many algorithms. Therefore when Intel will support AVX10/256 across all their CPUs, that will partially restore the competitivity of the Intel CPUs, but that is not likely to happen before 2026.
2) chunks with too many small numbers cannot be processed with just 2 shuffle-adds
3) (not mentioned in the post) HighLoad limits the size of the source code you can submit, so you can't put all possible values in the look-up table
Which would be much faster than the checked add (adc). Does any hardware support such checked SIMD arithmetic already?
Or can you still assume that most arithmetic is still broken in most languages/libraries.
At each step from Larrabee to Skylake Server, some instructions have been lost, because the initial set of instructions was more complete in order to enable the writing of efficient GPU algorithms, while later Intel believed that for a general-purpose CPU they can reduce the costs by omitting some of those instructions.
(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)
Among the original Larrabee instructions that have been deleted, was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.
Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per each result number), so it would have been easy to also add to Larrabee/AVX-512 an addition instruction with signed overflow flags stored in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands in such a way that detecting signed overflow would be possible with the unsigned overflow bits, though that requires multiple instructions, slowing down a lot the algorithm.
However in this problem the numbers that are added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.
That's terrifying