I wish it had a few more builtins for commonly supported operations without me having to write inline assembly (e.g., runtime LUTs are basically untenable for implementing something like bolt [0] without inline asm), but otherwise the abstraction level is about where I'd like it to be. I usually prefer it to gcc intrinsics, fully inline asm, and other such shenanigans.
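For context, the missing builtin here is a per-lane table lookup (x86 `pshufb` / ARM `tbl`). A scalar sketch of the semantics, with names of my own invention (and ignoring `pshufb`'s zeroing-on-high-bit behavior):

```rust
// Scalar sketch of the 16-entry per-lane lookup that bolt-style
// algorithms need; in SIMD this is a single pshufb/tbl instruction,
// but portable abstractions lack a builtin for it.
fn lut16(table: &[u8; 16], indices: &[u8]) -> Vec<u8> {
    indices
        .iter()
        .map(|&i| table[(i & 0x0F) as usize]) // index by the low 4 bits
        .collect()
}

fn main() {
    let table: [u8; 16] = core::array::from_fn(|i| (i * 2) as u8);
    // index 18 addresses lane 2 via its low nibble
    assert_eq!(lut16(&table, &[3, 18]), vec![6, 4]);
}
```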
https://blog.habets.se/2024/04/Rust-is-faster-than-C.html and code at https://github.com/ThomasHabets/zipbrute/blob/master/rust/sr... showed me getting a 3x speedup using portable SIMD on my first attempt.
One of my goals in writing these articles is to learn, so feedback is more than welcome!
As an additional note, aarch64 always has NEON (similar to how x86-64 always has SSE2; the extensions worth dispatching on would be SVE on aarch64 and AVX/AVX2/AVX-512 on x86-64), so there's no point in checking for it dynamically.
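As a concrete sketch of that dispatch logic (the function name is mine): on x86-64 only post-SSE2 extensions need a runtime check, while on aarch64 NEON can be assumed unconditionally:

```rust
// Baseline vs. dispatched SIMD features per architecture (sketch).
#[cfg(target_arch = "x86_64")]
fn simd_level() -> &'static str {
    // SSE2 is part of the x86-64 baseline; only newer extensions
    // are worth a runtime check before dispatching.
    if is_x86_feature_detected!("avx2") { "avx2" } else { "sse2" }
}

#[cfg(target_arch = "aarch64")]
fn simd_level() -> &'static str {
    "neon" // NEON is mandatory on aarch64, so no runtime check
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn simd_level() -> &'static str {
    "scalar"
}

fn main() {
    println!("dispatching to: {}", simd_level());
}
```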
> One of my goals in writing these articles is to learn, so feedback is more than welcome!
When I went into the Rust playground to see the assembly output for the Cumulative Sum example, I could only get it to show the compiler warnings, not the actual assembly. I'm probably doing something wrong, but for me this was a barrier that detracted from the article. I'd suggest incorporating the assembly directly into the article, while keeping the playground link for people who are more dedicated/competent than I am.
Godbolt is a better choice for looking at asm anyway. https://rust.godbolt.org/z/3Y9ovsoz9
Seems written by an LLM for the most part.
Without an easy-to-use SIMD abstraction, many* of .NET's CoreLib functions would have been significantly slower.
* UTF-8 validation, text encoding/decoding, conversion to/from hex bytes, copying data, zeroing, various checksum and hash functions, text/element counting, searching, advanced text search with multiple algorithms under SearchValues type used by Regex engine, etc.
SIMD in Pure Python
https://www.da.vidbuchanan.co.uk/blog/python-swar.html
Don't let the "SIMD in Python" section fool you; it's a short stop at NumPy before putting it aside.
As to whether LLVM actually takes advantage of this effectively, I don't know. I know that we do supply the necessary attributes to LLVM in most cases, but I haven't looked at the individual transform and optimization passes to see whether they take advantage of this (e.g. emitting movdqa vs. falling back to movdqu).
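For illustration, the attribute in question is an alignment guarantee; a minimal Rust sketch (the struct name is mine) of data the backend *could* load with movdqa rather than movdqu:

```rust
// A 16-byte-aligned buffer: with this guarantee visible to LLVM,
// the backend may emit aligned loads (movdqa) instead of the
// unaligned form (movdqu) on x86. Whether it actually does is the
// open question in the comment above.
#[repr(align(16))]
struct Aligned16([f32; 8]);

fn sum(a: &Aligned16) -> f32 {
    a.0.iter().sum()
}

fn main() {
    let a = Aligned16([1.0; 8]);
    assert_eq!(core::mem::align_of::<Aligned16>(), 16);
    assert_eq!(sum(&a), 8.0);
}
```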
There is also a traditional SIMD extension (P, I think?), but it isn't finished. Most of the focus has been on the vector extension.
I am wondering how and if Rust will support these vector processing extensions.
Isn't SIMD a subset of vector processors?
For that matter, can anybody here provide a proper and useful distinction between the two, that is, SIMD and vector ISAs?
You imply it's because it's vector-length agnostic, but you could take, e.g., the SSE encoding and, apart from a few instructions, make it operate on SIMD registers of any length. Wouldn't that also be vector-length agnostic, as long as software can query the vector length? I think most people wouldn't call that a vector ISA, so how is it substantially different from dispatching to different implementations for SSE, AVX, and AVX-512?
I've also seen people say it's about the predication, which would make AVX-512 a vector ISA.
I've seen others say it's about resource usage and vector chaining, but that is just an implementation detail and can be used, or not, on traditional SIMD ISAs to the same extent as on vector ISAs.
There is still a difference in the binutils, because SSE4 and AVX2 and AVX-512 have different instruction encodings per length.
But yes, it is possible to write VL-agnostic code for both SIMD and vector, and indeed the same user code written with Highway works on both SIMD and RISC-V.
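To make the VL-agnostic idea concrete, here's a scalar Rust sketch (names are mine) in which the lane count is a runtime parameter rather than baked into the instruction encoding, which is structurally what SVE/RVV-style code looks like:

```rust
// Width-agnostic sum: `lanes` stands in for a vector length queried
// at runtime (RVV vl / SVE), instead of being fixed by the encoding
// as in SSE (4 x f32) or AVX (8 x f32).
fn sum_vl_agnostic(data: &[f32], lanes: usize) -> f32 {
    let mut acc = vec![0.0f32; lanes]; // one partial sum per lane
    let mut chunks = data.chunks_exact(lanes);
    for chunk in &mut chunks {
        // a vector unit would do this whole inner loop in one instruction
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    // horizontal reduction of the lanes, plus the scalar tail
    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
}

fn main() {
    let data: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    // same result regardless of the width we "run" at
    assert_eq!(sum_vl_agnostic(&data, 4), 55.0);
    assert_eq!(sum_vl_agnostic(&data, 8), 55.0);
}
```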
The P extension is intended more for embedded microcontrollers, for which the V extension would be too expensive. It reuses the GPRs at whatever width they are (32 or 64 bits).
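The GPR-reuse idea is essentially hardware SWAR; a software sketch of the same style (the helper name is mine):

```rust
// SWAR-style packed add: four independent 8-bit lanes inside one
// 32-bit GPR, with masking to stop carries crossing lane boundaries.
// The P extension provides this kind of operation in hardware.
fn add_u8x4(a: u32, b: u32) -> u32 {
    const H: u32 = 0x8080_8080; // high bit of each byte lane
    let low = (a & !H).wrapping_add(b & !H); // add low 7 bits per lane
    low ^ ((a ^ b) & H) // recombine the high bits without carry-out
}

fn main() {
    assert_eq!(add_u8x4(0x0102_0304, 0x0101_0101), 0x0203_0405);
    // 0xFF + 0x01 wraps within its own lane and leaves neighbors alone
    assert_eq!(add_u8x4(0x00FF_0000, 0x0001_0000), 0x0000_0000);
}
```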
https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
Here's an example of "checked" sum over a span of integers that uses platform-specific vector width:
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
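In Rust terms, the scalar semantics of that "checked" sum look like this (a sketch; the linked .NET version additionally vectorizes the accumulation at the platform's vector width):

```rust
// Overflow-checked sum: returns None instead of wrapping, which is
// the scalar equivalent of .NET's checked arithmetic (there, an
// OverflowException is thrown instead).
fn checked_sum(data: &[i32]) -> Option<i32> {
    data.iter().try_fold(0i32, |acc, &x| acc.checked_add(x))
}

fn main() {
    assert_eq!(checked_sum(&[1, 2, 3]), Some(6));
    assert_eq!(checked_sum(&[i32::MAX, 1]), None); // overflow detected
}
```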
Other examples:
CRC64 https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
Hamming distance https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
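For a flavor of the Hamming-distance one, a popcount-based sketch (function name mine) that compilers typically lower to dedicated population-count instructions:

```rust
// Hamming distance = number of differing bits: XOR, then popcount
// per element. count_ones lowers to popcnt on x86 and cnt on ARM.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "inputs must be the same length");
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones()).sum()
}

fn main() {
    assert_eq!(hamming(&[0b1010], &[0b0011]), 2); // bits 0 and 3 differ
    assert_eq!(hamming(b"abc", b"abc"), 0);
}
```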
The default syntax is a bit ugly in my opinion, but it can be significantly improved with helper methods, as here, where the code is a port of simdutf's UTF-8 code point counting: https://github.com/U8String/U8String/blob/main/Sources/U8Str...
There are more advanced scenarios. The Bepuphysics2 engine heavily leverages SIMD to perform as fast as PhysX's CPU back-end: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...
Note that practically none of these need to reach for platform-specific intrinsics (except for replacing movemask emulation with the efficient ARM64 alternative); they use the same path on all platforms, varied by vector width rather than by specific ISA.
There are no private SIMD APIs, save for the sequence-comparison intrinsic for unrolling against known lengths, which JIT/ILC does for spans and strings.
You also noted the lack of ability to express numeric properties of T within a generic context. This was indeed true; the limitation was eventually addressed by the generic math feature. There are INumber&lt;T&gt;, IBinaryInteger&lt;T&gt;, and others to constrain T on, which bring the comparison operators you were looking for.
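For readers coming from the Rust side, the analogous constraint there is an ordinary trait bound; a minimal sketch (names mine):

```rust
// Generic over any comparable, copyable element type: the same role
// that INumber<T>/IBinaryInteger<T> bounds play in C#'s generic math.
fn max_of<T: PartialOrd + Copy>(data: &[T]) -> Option<T> {
    data.iter().copied().reduce(|m, x| if x > m { x } else { m })
}

fn main() {
    assert_eq!(max_of(&[3i32, 9, 1]), Some(9)); // works for integers...
    assert_eq!(max_of(&[2.5f64, 0.5]), Some(2.5)); // ...and for floats
    assert_eq!(max_of::<i32>(&[]), None); // empty input has no max
}
```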
In general, knowledge of vectorized code has improved substantially within the community, and it is used far more liberally nowadays by those who are aware of it.