The language is actually great with SIMD; you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions in FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
Mostly because of a very different memory model, of course. Arrays are first-class things, not hacks on top of pointers, and can't alias each other.
Anyway, auto-vectorization will never compete with intrinsics for performance, so this all seems rather silly. Try expressing pshufb in an auto-vectorizer. Or a bunch of optimal loads/swizzles to outperform gather.
"Hacks" feels like the wrong term to use here. Comparing managed objects surrounded by runtime to memory isn't a fair comparison.
Is this article really confused, or did I misunderstand it?
The thing that makes C/C++ a good language for SIMD is how easily it lets you control memory alignment.
Not sure for C
The fact that exp is implemented in hardware is not the argument. The argument is that exp is a library function, compiled separately, and thus the compiler cannot inline the function and fuse it with an array-wide loop, to detect later on an opportunity to generate the SIMD instructions.
It is true, however, that exp is implemented in hardware in the x86 world; to be fair, perhaps a C compiler only needs to represent that function as an intrinsic, to give itself a chance to later replace it with either the function call or some SIMD instruction. But I guess the standard doesn't provide that possibility?
I gesture to this in the blog post:
> In C, one writes a function, and it is exported in an object file. To appreciate why this is special, consider sum :: Num a => [a] -> a in Haskell. This function exists only in the context of GHC.
> ...
> perhaps there are more fluent methods for compilation (and better data structures for export à la object files).
This is especially surprising given that very few architectures have an `exp` in hardware. It's almost always done in software.
The function exp has an abstract mathematical definition as mapping from numbers to numbers, and you could implement that generically if the language allows for it, but in C you cannot because it's bound to the signature `double exp(double)` and fixed like that at compile time. You cannot use this function in a different context and pass e.g. __m256d.
Basically because C does not have function dispatch, it's ill-suited for generic programming. You cannot write an algorithm on scalar, and now pass arrays instead.
It is limited to cases where you know all overloads beforehand, and limited by C’s weak type system (you can’t have one overload for arrays of int and another for pointers to int, for example), but you can. https://en.cppreference.com/w/c/language/generic:
```c
#include <math.h>
#include <stdio.h>

// Possible implementation of the tgmath.h macro cbrt
#define cbrt(X) _Generic((X), \
    long double: cbrtl,       \
    default: cbrt,            \
    float: cbrtf              \
)(X)

int main(void)
{
    double x = 8.0;
    const float y = 3.375;
    printf("cbrt(8.0) = %f\n", cbrt(x));    // selects the default cbrt
    printf("cbrtf(3.375) = %f\n", cbrt(y)); // converts const float to float,
                                            // then selects cbrtf
}
```
It also, IMO, is a bit ugly, but that fits in nicely with the rest of C, as seen through modern eyes. There is also a drawback for lots of cases: you cannot overload operators. Implementing vector addition as ‘+’ isn’t possible, for example.
Many compilers partially support that as a language extension, though, see for example https://clang.llvm.org/docs/LanguageExtensions.html#vectors-....
https://stackoverflow.com/questions/73269625/create-memory-a...
Also StructLayout and FieldOffset.
I count ~8 different implementations [1], which demonstrates considerable commitment :) Personally, I prefer to write once with portable intrinsics.
https://github.com/simdutf/simdutf/tree/1d5b5cd2b60850954df5...
Portable intrinsics and “scalable” length-agnostic vectors are usually fine for straightforward mathy (type-1 [2]) SIMD code. But real SIMD trickery, the kind that starts as soon as you take your variable byte shuffle out of the stable, is rarely so kind.
[1] https://branchfree.org/2019/04/01/fitting-my-head-through-th...
[2] https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u...
The author is suggesting array languages as the solution, which are a separate category from functional languages.
When you get down to it, you're optimizing (searching) for some program that maximizes/minimizes some objective function with terms for error relative to specification/examples and size of synthesized program, while applying some form of cleverness to pruning the search space.
This is absolutely within the wheelhouse of SMT solvers, and something that they are used for.
SMT doesn't have to be used, but for implementation it enables iterating on the concept more quickly (see also, more broadly: prototyping in Prolog), and in other cases it's simply the most effective tool for the job. So, it tends to get a lot of play.
> Does what you expect: Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. The resulting code is more predictable and robust to code changes/compiler updates than autovectorization.
So C compilers are not a good place to start if one wants to write a compiler for an array language (which naturally expresses SIMD calculations). Which is what I point out in the last paragraph of the blog post:
> To some extent this percolates compilers textbooks. Array languages naturally express SIMD calculations; perhaps there are more fluent methods for compilation (and better data structures for export à la object files).
As an array implementer I've thought about the issue a lot and have been meaning to write a full page on it. For now I have some comments at https://mlochbaum.github.io/BQN/implementation/versusc.html#... and the last paragraph of https://mlochbaum.github.io/BQN/implementation/compile/intro....
"Modularity is the enemy of performance"
If you want optimal performance, you have to collapse the layers. Look at Deepseek, for example.
Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use (inline) assembler functions. Or design your own language.
If you implement a scalar expf in a vectorizer friendly way, and it's visible to the compiler, then it could also be vectorized: https://godbolt.org/z/zxTn8hbEe
E.g. you avoid lookup tables when you can, or only use smaller ones you know fit in one or two SIMD registers. gcc and clang can't vectorize it as is, but they do if you remove the branches that handle infinity and over/under-flow.
In the godbolt link I copied the musl expf implementation, and icx was able to vectorize it even though it uses a LUT too large for SIMD registers.
#pragma omp simd and equivalents will encourage the compiler to vectorize a specific loop and produce a warning if a loop isn't vectorized.
https://ocw.mit.edu/courses/6-945-adventures-in-advanced-sym...
Some people have vectorized successfully with C, even with all the hacks/pointers/union/opaque business. It requires careful programming, for sure. The ffmpeg cases are super good examples of how compiler misses happen, and how to optimize for full throughput in those cases. Worth a look for all compiler engineers.
https://github.com/dezashibi-c/a-simd_in_c
Copyright goes to Navid Dezashibi.
```
generic
type T is private;
Aligned : Bool := True;
function Inverse_Sqrt_T (V : T) return T;
function Inverse_Sqrt_T (V : T) return T is
Result : aliased T;
THREE : constant Real := 3.0;
NEGATIVE_HALF : constant Real := -0.5;
VMOVPS : constant String := (if Aligned then "vmovaps" else "vmovups");
begin
Asm (Clobber => "xmm0, xmm1, xmm2, xmm3, memory",
Inputs => (Ptr'Asm_Input ("r", Result'Address),
Ptr'Asm_Input ("r", V'Address),
Ptr'Asm_Input ("r", THREE'Address),
Ptr'Asm_Input ("r", NEGATIVE_HALF'Address)),
Template => VMOVPS & " (%1), %%xmm0 " & E & -- xmm0 ← V
" vrsqrtps %%xmm0, %%xmm1 " & E & -- xmm1 ← Reciprocal sqrt of xmm0
" vmulps %%xmm1, %%xmm1, %%xmm2 " & E & -- xmm2 ← xmm1 \* xmm1
" vbroadcastss (%2), %%xmm3 " & E & -- xmm3 ← NEGATIVE_HALF
" vfmsub231ps %%xmm2, %%xmm0, %%xmm3 " & E & -- xmm3 ← (V - xmm2) \* NEGATIVE_HALF
" vbroadcastss (%3), %%xmm0 " & E & -- xmm0 ← THREE
" vmulps %%xmm0, %%xmm1, %%xmm0 " & E & -- xmm0 ← THREE \* xmm1
" vmulps %%xmm3, %%xmm0, %%xmm0 " & E & -- xmm0 ← xmm3 \* xmm0
VMOVPS & " %%xmm0, (%0) "); -- Result ← xmm0
return Result;
end;
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_2D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_3D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_4D);
```

```C
vector_3d vector_inverse_sqrt(const vector_3d* v) {
...
vector_4d vector_inverse_sqrt(const vector_4d* v) {
vector_4d out;
static const float THREE = 3.0f; // 0x40400000
static const float NEGATIVE_HALF = -0.5f; // 0xbf000000
__asm__ (
// Load the input vector into xmm0
"vmovaps (%1), %%xmm0\n\t"
"vrsqrtps %%xmm0, %%xmm1\n\t"
"vmulps %%xmm1, %%xmm1, %%xmm2\n\t"
"vbroadcastss (%2), %%xmm3\n\t"
"vfmsub231ps %%xmm2, %%xmm0, %%xmm3\n\t"
"vbroadcastss (%3), %%xmm0\n\t"
"vmulps %%xmm0, %%xmm1, %%xmm0\n\t"
"vmulps %%xmm3, %%xmm0, %%xmm0\n\t"
"vmovups %%xmm0, (%0)\n\t" // Output operand
:
: "r" (&out), "r" (v), "r" (&THREE), "r" (&NEGATIVE_HALF) // Input operands
: "xmm0", "xmm1", "xmm2", "memory" // Clobbered registers
);
return out;
}
```

If only GPU makers could standardise an extended ISA like AVX on CPUs, and we could all run SIMD or SIMT code without needing any libraries, just our compilers.
(Oh, that's SIMT. Carry on then.)
C functions can't be vectorized? WTF are you talking about? You can certainly pass vector registers to functions.
Exp can also be vectorized; AVX-512 even includes specific instructions to make it easier. (There is no direct exp instruction on most hardware; it is generally a sequence of instructions.)