The language is actually great with SIMD; you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions in FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
Mostly because of a very different memory model, of course. Arrays are first-class things, not hacks on top of pointers, and can't alias each other.
Anyway, auto-vectorization will never compete with intrinsics for performance, so this all seems rather silly. Try expressing pshufb in an auto-vectorizer. Or a bunch of optimal loads/swizzles to outperform gather.
"Hacks" feels like the wrong term to use here. Comparing managed objects surrounded by runtime to memory isn't a fair comparison.
Is this article really confused, or did I misunderstand it?
The thing that makes C/C++ a good language for SIMD is how easily it lets you control memory alignment.
Not sure for C
The fact that exp is implemented in hardware is not the argument. The argument is that exp is a library function, compiled separately, and thus the compiler cannot inline the function and fuse it with an array-wide loop, to detect later on an opportunity to generate the SIMD instructions.
It is true, however, that exp is implemented in hardware in the x86 world; to be fair, perhaps a C compiler only needs to represent that function as an intrinsic, to give itself a chance to later replace it with either the function call or some SIMD instruction. But I guess the standard doesn't provide that possibility?
I gesture to this in the blog post:
> In C, one writes a function, and it is exported in an object file. To appreciate why this is special, consider sum :: Num a => [a] -> a in Haskell. This function exists only in the context of GHC.
> ...
> perhaps there are more fluent methods for compilation (and better data structures for export à la object files).
This is especially surprising given that very few architectures have an `exp` in hardware. It's almost always done in software.
The function exp has an abstract mathematical definition as mapping from numbers to numbers, and you could implement that generically if the language allows for it, but in C you cannot because it's bound to the signature `double exp(double)` and fixed like that at compile time. You cannot use this function in a different context and pass e.g. __m256d.
Basically because C does not have function dispatch, it's ill-suited for generic programming. You cannot write an algorithm on scalar, and now pass arrays instead.
It is limited to cases where you know all overloads beforehand, and limited by C’s weak type system (you can’t have one overload for arrays of int and another for pointers to int, for example), but you can. https://en.cppreference.com/w/c/language/generic:
```c
#include <math.h>
#include <stdio.h>

// Possible implementation of the tgmath.h macro cbrt
#define cbrt(X) _Generic((X), \
    long double: cbrtl,       \
    default: cbrt,            \
    float: cbrtf              \
)(X)

int main(void)
{
    double x = 8.0;
    const float y = 3.375;
    printf("cbrt(8.0) = %f\n", cbrt(x));    // selects the default cbrt
    printf("cbrtf(3.375) = %f\n", cbrt(y)); // converts const float to float,
                                            // then selects cbrtf
}
```
It also, IMO, is a bit ugly, but that fits in nicely with the rest of C, as seen through modern eyes. There is also a drawback for lots of cases: you cannot overload operators. Implementing vector addition as ‘+’ isn’t possible, for example.
Many compilers partially support that as a language extension, though, see for example https://clang.llvm.org/docs/LanguageExtensions.html#vectors-....
https://stackoverflow.com/questions/73269625/create-memory-a...
Also StructLayout and FieldOffset.
I count ~8 different implementations [1], which demonstrates considerable commitment :) Personally, I prefer to write once with portable intrinsics.
https://github.com/simdutf/simdutf/tree/1d5b5cd2b60850954df5...
Portable intrinsics and “scalable” length-agnostic vectors are usually fine for straightforward mathy (type-1 [2]) SIMD code. But real SIMD trickery, the kind that starts as soon as you take your variable byte shuffle out of the stable, is rarely so kind.
[1] https://branchfree.org/2019/04/01/fitting-my-head-through-th...
[2] https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u...
The author is suggesting array languages as the solution, which are a separate category from functional languages.
When you get down to it, you're optimizing (searching) for some program that maximizes/minimizes some objective function with terms for error relative to specification/examples and size of synthesized program, while applying some form of cleverness to pruning the search space.
This is absolutely within the wheelhouse of SMT solvers, and something that they are used for.
SMT doesn't have to be used, but for implementation it enables iterating on the concept more quickly (see also, more broadly: prototyping in Prolog), and in other cases it's simply the most effective tool for the job. So, it tends to get a lot of play.
> Does what you expect: Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. The resulting code is more predictable and robust to code changes/compiler updates than autovectorization.
So C compilers are not a good place to start if one wants to write a compiler for an array language (which naturally expresses SIMD calculations). Which is what I point out in the last paragraph of the blog post:
> To some extent this percolates compilers textbooks. Array languages naturally express SIMD calculations; perhaps there are more fluent methods for compilation (and better data structures for export à la object files).
As an array implementer I've thought about the issue a lot and have been meaning to write a full page on it. For now I have some comments at https://mlochbaum.github.io/BQN/implementation/versusc.html#... and the last paragraph of https://mlochbaum.github.io/BQN/implementation/compile/intro....
"Modularity is the enemy of performance"
If you want optimal performance, you have to collapse the layers. Look at Deepseek, for example.
Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use (inline) assembler functions. Or design your own language.
If you implement a scalar expf in a vectorizer friendly way, and it's visible to the compiler, then it could also be vectorized: https://godbolt.org/z/zxTn8hbEe
E.g. you avoid lookup tables when you can, or only use smaller ones you know fit in one or two SIMD registers. gcc and clang can't vectorize it as is, but they do if you remove the branches that handle infinity and over/under-flow.
In the godbolt link I copied the musl expf implementation, and icx was able to vectorize it even though it uses a LUT too large for SIMD registers.
#pragma omp simd and equivalents will encourage the compiler to vectorize a specific loop and produce a warning if a loop isn't vectorized.
https://ocw.mit.edu/courses/6-945-adventures-in-advanced-sym...
Some people have vectorized successfully with C, even with all the hacks/pointers/union/opaque business. It requires careful programming, for sure. The ffmpeg cases are super good examples of how compiler misses happen, and how to optimize for full throughput in those cases. Worth a look for all compiler engineers.
https://github.com/dezashibi-c/a-simd_in_c
Copyright goes to Navid Dezashibi.
```
generic
type T is private;
Aligned : Bool := True;
function Inverse_Sqrt_T (V : T) return T;
function Inverse_Sqrt_T (V : T) return T is
Result : aliased T;
THREE : constant Real := 3.0;
NEGATIVE_HALF : constant Real := -0.5;
VMOVPS : constant String := (if Aligned then "vmovaps" else "vmovups");
begin
Asm (Clobber => "xmm0, xmm1, xmm2, xmm3, memory",
Inputs => (Ptr'Asm_Input ("r", Result'Address),
Ptr'Asm_Input ("r", V'Address),
Ptr'Asm_Input ("r", THREE'Address),
Ptr'Asm_Input ("r", NEGATIVE_HALF'Address)),
Template => VMOVPS & " (%1), %%xmm0 " & E & -- xmm0 ← V
" vrsqrtps %%xmm0, %%xmm1 " & E & -- xmm1 ← Reciprocal sqrt of xmm0
" vmulps %%xmm1, %%xmm1, %%xmm2 " & E & -- xmm2 ← xmm1 \* xmm1
" vbroadcastss (%2), %%xmm3 " & E & -- xmm3 ← NEGATIVE_HALF
" vfmsub231ps %%xmm2, %%xmm0, %%xmm3 " & E & -- xmm3 ← (V - xmm2) \* NEGATIVE_HALF
" vbroadcastss (%3), %%xmm0 " & E & -- xmm0 ← THREE
" vmulps %%xmm0, %%xmm1, %%xmm0 " & E & -- xmm0 ← THREE \* xmm1
" vmulps %%xmm3, %%xmm0, %%xmm0 " & E & -- xmm0 ← xmm3 \* xmm0
VMOVPS & " %%xmm0, (%0) "); -- Result ← xmm0
return Result;
end;
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_2D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_3D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_4D);
```

```C
vector_3d vector_inverse_sqrt(const vector_3d* v) {
...
vector_4d vector_inverse_sqrt(const vector_4d* v) {
vector_4d out;
static const float THREE = 3.0f; // 0x40400000
static const float NEGATIVE_HALF = -0.5f; // 0xbf000000
__asm__ (
// Load the input vector into xmm0
"vmovaps (%1), %%xmm0\n\t"
"vrsqrtps %%xmm0, %%xmm1\n\t"
"vmulps %%xmm1, %%xmm1, %%xmm2\n\t"
"vbroadcastss (%2), %%xmm3\n\t"
"vfmsub231ps %%xmm2, %%xmm0, %%xmm3\n\t"
"vbroadcastss (%3), %%xmm0\n\t"
"vmulps %%xmm0, %%xmm1, %%xmm0\n\t"
"vmulps %%xmm3, %%xmm0, %%xmm0\n\t"
"vmovups %%xmm0, (%0)\n\t" // Output operand
:
: "r" (&out), "r" (v), "r" (&THREE), "r" (&NEGATIVE_HALF) // Input operands
: "xmm0", "xmm1", "xmm2", "memory" // Clobbered registers
);
return out;
}
```

If only GPU makers could standardise an extended ISA like AVX on CPUs, and we could all run SIMD or SIMT code without needing any libraries, just our compilers.
(Oh, that's SIMT. Carry on then.)
C functions can't be vectorized? WTF are you talking about? You can certainly pass vector registers to functions.
Exp can also be vectorized; AVX-512 even includes specific instructions to make it easier. (There is no direct exp instruction on most hardware; it is generally a sequence of instructions.)