I wish it had a few more builtins for commonly supported operations without me having to write inline assembly (e.g., runtime LUTs are basically untenable for implementing something like bolt [0] without inline asm), but otherwise the abstraction level is about where I'd like it to be. I usually prefer it to gcc intrinsics, fully inline asm, and other such shenanigans.
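For context, the missing builtin here is a per-lane table lookup (x86 `pshufb` / ARM `tbl`). A scalar sketch of the semantics, with names of my own invention (and ignoring `pshufb`'s zeroing-on-high-bit behavior):

```rust
// Scalar sketch of the 16-entry per-lane lookup that bolt-style
// algorithms need; in SIMD this is a single pshufb/tbl instruction,
// but portable abstractions lack a builtin for it.
fn lut16(table: &[u8; 16], indices: &[u8]) -> Vec<u8> {
    indices
        .iter()
        .map(|&i| table[(i & 0x0F) as usize]) // index by the low 4 bits
        .collect()
}

fn main() {
    let table: [u8; 16] = core::array::from_fn(|i| (i * 2) as u8);
    // index 18 addresses lane 2 via its low nibble
    assert_eq!(lut16(&table, &[3, 18]), vec![6, 4]);
}
```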
https://blog.habets.se/2024/04/Rust-is-faster-than-C.html and code at https://github.com/ThomasHabets/zipbrute/blob/master/rust/sr... showed me getting a 3x speedup using portable SIMD on my first attempt.
One of my goals in writing these articles is to learn, so feedback is more than welcome!
As an additional note, aarch64 always has NEON (similar to how x86-64 always has SSE2; the extensions worth dispatching on would be SVE on aarch64 and AVX/AVX2/AVX-512 on x86-64), so there's no point in checking for it dynamically.
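As a concrete sketch of that dispatch logic (the function name is mine): on x86-64 only post-SSE2 extensions need a runtime check, while on aarch64 NEON can be assumed unconditionally:

```rust
// Baseline vs. dispatched SIMD features per architecture (sketch).
#[cfg(target_arch = "x86_64")]
fn simd_level() -> &'static str {
    // SSE2 is part of the x86-64 baseline; only newer extensions
    // are worth a runtime check before dispatching.
    if is_x86_feature_detected!("avx2") { "avx2" } else { "sse2" }
}

#[cfg(target_arch = "aarch64")]
fn simd_level() -> &'static str {
    "neon" // NEON is mandatory on aarch64, so no runtime check
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn simd_level() -> &'static str {
    "scalar"
}

fn main() {
    println!("dispatching to: {}", simd_level());
}
```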
> One of my goals in writing these articles is to learn, so feedback is more than welcome!
When I went into the Rust playground to see the assembly output for the Cumulative Sum example, I could only get it to show the compiler warnings, not the actual assembly. I'm probably doing something wrong, but for me this was a barrier that detracted from the article. I'd suggest incorporating the assembly directly into the article, while keeping the playground link for people who are more dedicated/competent than I am.
Godbolt is a better choice for looking at asm anyway. https://rust.godbolt.org/z/3Y9ovsoz9
Seems written by an LLM for the most part.
Without an easy-to-use SIMD abstraction, many* of .NET's CoreLib functions would have been significantly slower.
* UTF-8 validation, text encoding/decoding, conversion to/from hex bytes, copying data, zeroing, various checksum and hash functions, text/element counting, searching, advanced text search with multiple algorithms under SearchValues type used by Regex engine, etc.
SIMD in Pure Python
https://www.da.vidbuchanan.co.uk/blog/python-swar.html
Don't let the "SIMD in Python" section fool you; it's a short stop at NumPy before putting it aside.
As to whether LLVM actually takes advantage of this effectively, I don't know. I know that we do supply the necessary attributes to LLVM in most cases, but I haven't looked at the individual transform and optimization passes to see whether they take advantage of this (e.g. emitting movdqa vs. falling back to movdqu).
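For illustration, the attribute in question is an alignment guarantee; a minimal Rust sketch (the struct name is mine) of data the backend *could* load with movdqa rather than movdqu:

```rust
// A 16-byte-aligned buffer: with this guarantee visible to LLVM,
// the backend may emit aligned loads (movdqa) instead of the
// unaligned form (movdqu) on x86. Whether it actually does is the
// open question in the comment above.
#[repr(align(16))]
struct Aligned16([f32; 8]);

fn sum(a: &Aligned16) -> f32 {
    a.0.iter().sum()
}

fn main() {
    let a = Aligned16([1.0; 8]);
    assert_eq!(core::mem::align_of::<Aligned16>(), 16);
    assert_eq!(sum(&a), 8.0);
}
```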
There is also a traditional SIMD extension (P, I think?), but it isn't finished. Most of the focus has been on the vector extension.
I am wondering how and if Rust will support these vector processing extensions.
Isn't SIMD a subset of vector processors?
For that matter, can anybody here provide a proper and useful distinction between the two, that is, SIMD and vector ISAs?
You imply it's because it's vector-length agnostic, but you could take, e.g., the SSE encoding and, apart from a few instructions, make it operate on SIMD registers of any length. Wouldn't that also be vector-length agnostic, as long as software can query the vector length? I think most people wouldn't call that a vector ISA, so how is it substantially different from dispatching to different implementations for SSE, AVX, and AVX-512?
I've also seen people say it's about the predication, which would make AVX-512 a vector ISA.
I've seen others say it's about resource usage and vector chaining, but that is just an implementation detail and can be used, or not, on traditional SIMD ISAs to the same extent as on vector ISAs.
There is still a difference in the binutils, because SSE4 and AVX2 and AVX-512 have different instruction encodings per length.
But yes, it is possible to write VL-agnostic code for both SIMD and vector, and indeed the same user code written with Highway works on both SIMD and RISC-V.
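To make the VL-agnostic idea concrete, here's a scalar Rust sketch (names are mine) in which the lane count is a runtime parameter rather than baked into the instruction encoding, which is structurally what SVE/RVV-style code looks like:

```rust
// Width-agnostic sum: `lanes` stands in for a vector length queried
// at runtime (RVV vl / SVE), instead of being fixed by the encoding
// as in SSE (4 x f32) or AVX (8 x f32).
fn sum_vl_agnostic(data: &[f32], lanes: usize) -> f32 {
    let mut acc = vec![0.0f32; lanes]; // one partial sum per lane
    let mut chunks = data.chunks_exact(lanes);
    for chunk in &mut chunks {
        // a vector unit would do this whole inner loop in one instruction
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    // horizontal reduction of the lanes, plus the scalar tail
    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
}

fn main() {
    let data: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    // same result regardless of the width we "run" at
    assert_eq!(sum_vl_agnostic(&data, 4), 55.0);
    assert_eq!(sum_vl_agnostic(&data, 8), 55.0);
}
```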
The P extension is intended more for embedded microcontrollers, for which the V extension would be too expensive. It reuses the GPRs at whatever width they are (32 or 64 bits).
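The GPR-reuse idea is essentially hardware SWAR; a software sketch of the same style (the helper name is mine):

```rust
// SWAR-style packed add: four independent 8-bit lanes inside one
// 32-bit GPR, with masking to stop carries crossing lane boundaries.
// The P extension provides this kind of operation in hardware.
fn add_u8x4(a: u32, b: u32) -> u32 {
    const H: u32 = 0x8080_8080; // high bit of each byte lane
    let low = (a & !H).wrapping_add(b & !H); // add low 7 bits per lane
    low ^ ((a ^ b) & H) // recombine the high bits without carry-out
}

fn main() {
    assert_eq!(add_u8x4(0x0102_0304, 0x0101_0101), 0x0203_0405);
    // 0xFF + 0x01 wraps within its own lane and leaves neighbors alone
    assert_eq!(add_u8x4(0x00FF_0000, 0x0001_0000), 0x0000_0000);
}
```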
https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
Here's an example of "checked" sum over a span of integers that uses platform-specific vector width:
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
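In Rust terms, the scalar semantics of that "checked" sum look like this (a sketch; the linked .NET version additionally vectorizes the accumulation at the platform's vector width):

```rust
// Overflow-checked sum: returns None instead of wrapping, which is
// the scalar equivalent of .NET's checked arithmetic (there, an
// OverflowException is thrown instead).
fn checked_sum(data: &[i32]) -> Option<i32> {
    data.iter().try_fold(0i32, |acc, &x| acc.checked_add(x))
}

fn main() {
    assert_eq!(checked_sum(&[1, 2, 3]), Some(6));
    assert_eq!(checked_sum(&[i32::MAX, 1]), None); // overflow detected
}
```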
Other examples:
CRC64 https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
Hamming distance https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
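For a flavor of the Hamming-distance one, a popcount-based sketch (function name mine) that compilers typically lower to dedicated population-count instructions:

```rust
// Hamming distance = number of differing bits: XOR, then popcount
// per element. count_ones lowers to popcnt on x86 and cnt on ARM.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "inputs must be the same length");
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones()).sum()
}

fn main() {
    assert_eq!(hamming(&[0b1010], &[0b0011]), 2); // bits 0 and 3 differ
    assert_eq!(hamming(b"abc", b"abc"), 0);
}
```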
The default syntax is a bit ugly in my opinion, but it can be significantly improved with helper methods, as here, where the code is a port of simdutf's UTF-8 code point counting: https://github.com/U8String/U8String/blob/main/Sources/U8Str...
There are more advanced scenarios. The Bepuphysics2 engine heavily leverages SIMD to perform as fast as PhysX's CPU back-end: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...
Note that practically none of these need to reach for platform-specific intrinsics (except for replacing movemask emulation with the efficient ARM64 alternative); they use the same path on all platforms, varied by vector width rather than by specific ISA.
There are no private SIMD APIs, save for the sequence-comparison intrinsic for unrolling against known lengths, which JIT/ILC does for spans and strings.
You also noted the lack of ability to express numeric properties of T within a generic context. This was indeed true; the limitation was eventually addressed by the generic math feature. There are INumber&lt;T&gt;, IBinaryInteger&lt;T&gt;, and others to constrain T on, which bring the comparison operators you were looking for.
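For readers coming from the Rust side, the analogous constraint there is an ordinary trait bound; a minimal sketch (names mine):

```rust
// Generic over any comparable, copyable element type: the same role
// that INumber<T>/IBinaryInteger<T> bounds play in C#'s generic math.
fn max_of<T: PartialOrd + Copy>(data: &[T]) -> Option<T> {
    data.iter().copied().reduce(|m, x| if x > m { x } else { m })
}

fn main() {
    assert_eq!(max_of(&[3i32, 9, 1]), Some(9)); // works for integers...
    assert_eq!(max_of(&[2.5f64, 0.5]), Some(2.5)); // ...and for floats
    assert_eq!(max_of::<i32>(&[]), None); // empty input has no max
}
```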
In general, knowledge of vectorized code has improved substantially within the community, and it is used far more liberally nowadays by those who are aware of it.