If you had just written your SIMD in CUDA 15 years ago, NVidia's compilers would have given you maximum performance across all NVidia GPUs, instead of forcing you to write and rewrite for SSE vs AVX vs AVX512.
GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up, btw. But the software advantage and the long-term benefits of rewriting into CUDA are plainly apparent.
Intel ISPC is a great project btw if you need high-level code that targets SSE, AVX, AVX512 and even ARM NEON, all from one codebase with automatic compilation across all the architectures.
-------
Intel's AVX512 is pretty good at a hardware level. But a software methodology for interacting with SIMD through GPU-like languages should be a priority.
Intrinsics are good for maximum performance but they are too hard for mainstream programmers.
It's pretty funny how NEON ended up in there. A former Intel employee decided to implement it for fun and submitted it as a pull request, which Intel quietly ignored for obvious reasons. But then another former Intel employee who still had commit rights merged the PR, and since the optics of publicly reverting it would be even worse than stonewalling, Intel begrudgingly let it stand (though they did revoke that dev's commit rights).
The CUDA style of writing parallel programs is nothing else than the use of the so-called "parallel do" a.k.a. "parallel for" program structure, which was already being discussed in 1963. Notable later evolutions of this concept appeared in "Communicating Sequential Processes" by C.A.R. Hoare (1978-08: "arrays of processes"), then in the programming language Occam, which was designed based on what Hoare had described, then in the OpenMP extension of Fortran (1997-10), then in the OpenMP extensions of C and C++ (1998-10).
Programming in CUDA does not bring anything new, except that, in comparison with e.g. OpenMP, some keywords are implicit and others are different, so the equivalence is not immediately obvious.
Programming for CPUs in the much older OpenMP is equivalent to programming in CUDA for GPUs.
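To make the equivalence concrete, here is a sketch of the same element-wise kernel in both styles (my own illustration, not taken from either spec):

    #include <cstddef>

    // OpenMP: the "parallel for" structure described above, with an explicit loop.
    void saxpy(float a, const float* x, float* y, std::ptrdiff_t n) {
        #pragma omp parallel for
        for (std::ptrdiff_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // CUDA expresses the same thing, with the loop implicit and the
    // index reconstructed from the thread's coordinates:
    //
    //   __global__ void saxpy(float a, const float* x, float* y, int n) {
    //       int i = blockIdx.x * blockDim.x + threadIdx.x;
    //       if (i < n) y[i] = a * x[i] + y[i];
    //   }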
The real innovation of NVIDIA has been the high quality of the CUDA compiler and the CUDA runtime GPU driver, which are able to distribute the work that must be done on the elements of an array over all the available cores, threads and SIMD lanes, in a manner that is transparent to the programmer, so in many cases the programmer is free to ignore the actual structure of the GPU that will run the program.
Previous compilers for OpenMP, or for other such parallel-programming language extensions, have been much less capable of producing efficient parallel programs without being tuned by the programmer for each hardware variant.
I'm not calling it new. I'm just saying that the intrinsics style is much, much harder than what Lisp*, DirectX HLSL, CUDA, OpenCL (etc., etc.) do.
A specialized SIMD language makes writing SIMD easier compared to the intrinsics style. Look at any CUDA code today and compare it to the AVX in the article above, and the difference becomes readily apparent.
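To illustrate, here is a dot product in intrinsics style (a sketch of my own, assuming AVX2+FMA and a length that is a multiple of 8); note the horizontal-reduction gymnastics at the end, which a CUDA-style language hides behind a library reduction:

    #include <immintrin.h>
    #include <cstddef>

    float dot_avx2(const float* a, const float* b, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb
        }
        // Reduce the 8 lanes to a single float.
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        return _mm_cvtss_f32(lo);
    }

In the CUDA style, the kernel body is just tmp[i] = a[i] * b[i] per thread, followed by a stock reduction.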
Larrabee was going to take over it all, or so it seemed when I enjoyed its presentation at GDCE 2009.
And then came the Skylake and Cannon Lake debacle...
First they pulled it from the consumer chips a fairly short time before launch. Then the server chips it was present in would downclock aggressively when you did use it, so you could get at best maybe 40% more performance, certainly far from the 2x+ it promised.
Ten years on and the AMD 9950X does a pretty good job with it, however.
So Larrabee lives on for... some reason. These E cores are well known to be modified Intel Atom cores, and those were modified Xeon Phi cores, which were Larrabee-based.
Just with... AVX512 disabled (lost when Xeon Phi turned into Intel Atom).
Intel's technical strategy is completely bonkers. In a bad way. Intel invented all this tech 10 to 20 years ago but fails to have a cohesive strategy to bring it to market. There are clearly smart people there, but somehow all the top-level decisions are just awful.
NVidia compilers would have compiled your code into something functional, but if you want to approach peak performance you need to at least tweak your kernels, and sometimes rewrite them from scratch. See for example the various MMA instructions that were introduced over time.
Edit: I see somebody made a similar comment and you addressed it. Sorry for the churn.
That's not true. For maximum performance you need to tune the code for a particular GPU model/architecture.
Intel has SSE/AVX/AVX2/AVX512, but CUDA has gone through something like 10 iterations of this (with increasing capabilities). Code written 15 years ago would not use modern capabilities like more flexible memory access or atomics.
But the CUDA -> PTX intermediate code has allowed for significantly more flexibility. For crying out loud, the entire machine code (aka SASS) of NVidia GPUs has been cycled out at least 4 times in the past decade (128-bit bundles, changes to instruction formats, acquire/release semantics, etc.).
It's amazing what backwards compatibility NVidia has achieved in the past 15 years thanks to this architecture. SASS changes dramatically from generation to generation, but the PTX intermediate code has stayed a stable, competitive target.
And the PTX-to-SASS compiler DOES do a degree of automatic fine-tuning between architectures. Nothing amazing, but it's a minor speed boost that has made PTX an easier 'assembly-like language' to build on top of.
Now if your vectors are INT8/FP8, you're supposed to shovel them into this accelerator via PCIe rather than packing them into registers for AVX512.
I wish they’d just pick an interface for vector ops and stick with it.
"too hard for mainstream programmers" seems overly pessimistic. I've run several workshops where devs have written dot-product kernels using Highway after 30 minutes of introduction.
What's your opinion on SYCL?
Thank you for saying it out loud. XLAT/XLATB of x86 is positively tame compared to e.g. vrgatherei16.vv/vrgather.vv.
As this would only use one lane, perhaps you could vectorize it if you have multiple of these to normalize.
From a SIMD perspective, it’s worth noting that on most platforms, the cost of computing one square root or two is the same. On modern x86 server CPUs, for instance, you can calculate up to 8 double-precision roots in parallel with identical latency. So there’s no additional cost in terms of performance.
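A minimal illustration with AVX512 intrinsics (assuming AVX512F and a length that is a multiple of 8): one vsqrtpd computes 8 double-precision roots at once, at roughly the latency of a single scalar square root.

    #include <immintrin.h>
    #include <cstddef>

    // Each iteration computes 8 double-precision square roots in one instruction.
    void sqrt_all(const double* in, double* out, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m512d v = _mm512_loadu_pd(in + i);
            _mm512_storeu_pd(out + i, _mm512_sqrt_pd(v));
        }
    }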
I hope this sheds some light on the design of my code.
PS: In a previous life, I did research in Astro- and Plasma Physics. While I don’t claim to remember all the Math, it’s usually more productive to ask for clarification than to assume ignorance ;)
The only two approaches that still make sense to me:
A. Writing serial vectorization-aware code in a native compiled language, hoping your compiler will auto-vectorize.
B. Implementing natively for every hardware platform, as the ISA differences are too big to efficiently abstract away anything beyond 128-bit-register float multiplication and addition.
This article is, in a way, an attempt to show how big the differences are, even for simple data-parallel floating-point tasks.
.NET has roughly three vector APIs:
- Vector<T>, a platform-defined-width vector type that exposes a common set of operations
- Vector64/128/256/512<T>, fixed-width vector types with a wider API than Vector<T>
- Platform intrinsics - basically immintrin.h
Notably, the platform intrinsics operate on the respective VectorXXX<T> types, which lets you write the common parts of an algorithm portably and apply platform intrinsics only in the specific areas where it makes sense. Also, some methods have 'Unsafe' and 'Native' variants that allow a vector op to exhibit platform-specific behavior (shuffles, for example), since in many situations that is still the desired output for the common case.
.NET's compiler produces codegen for these that is competitive with GCC and sometimes Clang. It's gotten particularly good at lowering AVX512.
Using C or C++ with vector extensions (GCC/Clang), or Rust (nightly) std::simd, is very easy, and you get code that is portable across different CPUs and ISAs.
But most importantly, they have a zero-cost fallback to CPU-specific intrinsics when you need them. An f32x8 can be passed at zero cost as __m256 to any core::arch::x86_64 intrinsic (or to the immintrin.h ones in C++ land).
You gain portable arithmetic and swizzles and SIMD vector types, but lose nothing. Not having to write everything for x86_64 and aarch64 is a huge win even if it doesn't quite cover everything.
Additionally, you can use wider vectors than your hardware supports; the compiler will split your f64x64 into 128-, 256-, or 512-bit registers as needed, depending on the compile target.
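A minimal sketch with GCC/Clang vector extensions (the type names are my own; compile with -mavx2 or similar): portable arithmetic on one line, a zero-cost cast into intrinsic land on the next, and a too-wide type the compiler splits for you.

    #include <immintrin.h>

    typedef float f32x8  __attribute__((vector_size(32)));  // 8 x f32 = one AVX register
    typedef float f32x16 __attribute__((vector_size(64)));  // wider than AVX2 hardware

    f32x8 sum_then_rsqrt(f32x8 a, f32x8 b) {
        f32x8 s = a + b;                            // portable arithmetic
        return (f32x8)_mm256_rsqrt_ps((__m256)s);   // zero-cost cast to an intrinsic
    }

    f32x16 add_wide(f32x16 a, f32x16 b) {
        return a + b;  // on an AVX2 target this is split into two 256-bit adds
    }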
You're writing C++ code but as if it was shader code.
I've seen impressive results with clang doing this sort of thing.
There's a good amount of stuff that can clearly utilize SIMD without much platform-specificness, but doesn't easily autovectorize - early-exit checks in a loop, packed bit boolean stuff, some data rearranging, probing hashmap checks, some very-short-variable-length-loop things. And while there might often be some parts that do just need to be entirely target-specific, they'll usually be surrounded by stuff that doesn't (the loop, trip count calculation, loads/stores, probably some arithmetic).
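The early-exit case, for example, might look like this by hand (my own AVX2 sketch, assuming the length is a multiple of 32): the loop is trivially data-parallel, but the early return usually defeats the autovectorizer.

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Find the index of the first zero byte, scanning 32 bytes per iteration.
    // The scalar version with an early `return i;` rarely autovectorizes.
    std::ptrdiff_t find_zero(const std::uint8_t* p, std::size_t n) {
        const __m256i zero = _mm256_setzero_si256();
        for (std::size_t i = 0; i < n; i += 32) {
            __m256i chunk = _mm256_loadu_si256((const __m256i*)(p + i));
            int mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, zero));
            if (mask != 0)  // early exit: some lane matched
                return (std::ptrdiff_t)(i + __builtin_ctz((unsigned)mask));
        }
        return -1;
    }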
You've not linked to or explained what Mojo is. There's also a lot going on with the different products mentioned: Modular, Unum Cloud, SimSIMD, which are not contextualised either. While I'm at it, where do the others come in (Ovadia, Lemire, Lattner)? You all worked on SimSIMD, I guess?
That said, this is a great article, thanks.
Edit: Mojo is a programming language with python-like syntax, and is a product by Modular: https://github.com/modularml/mojo
It's based on MLIR, a newer compiler framework that has been added to the LLVM umbrella of projects.
> MLIR is a newer compiler framework that allows Mojo to exploit higher level compiler passes unavailable in LLVM alone... It can often more effectively use certain types of CPU optimizations directly, like SIMD, with no direct intervention by a developer
Also, the feature set being all over the place (e.g. integer support is fairly recent) doesn't help either.
ISPC is a good idea, but the execution is meh... it's hard to set up and integrate.
Ideally you would want to be able to easily use this from other popular languages, like Java, Python, or JavaScript, without having to resort to linking a library written in C/C++.
Granted, language extensions may be required to approach something like that in an ergonomic way, but most somehow end up just mimicking what C++ does and expose a pseudo assembler.
Just like using SQL is much more sane than low level C APIs to handle BTree nodes.
The language extensions help, but code still requires too much low level expertise, with algorithms and data structures having to take SIMD/MIMD capabilities into account anyway.
https://github.com/VcDevel/Vc https://github.com/Applied-Scientific-Research/nvortexVc
Once you need more complex operations, you need to use the specific operations from System.Runtime.Intrinsics.(X86|ARM) based on the current architecture. And you need to adjust your implementation on the CPUs capabilities. There are still a lot of older x64 CPUs around that don't have AVX512 for example.
Superscalar is a real term (multiple operations in one clock tick due to parallel pipelines within a core). But hyperscalar is cringe to me. There are tons of words describing SIMD already; it's unclear why someone would make up a new word for an already existing concept.
Especially when a similar word (superscalar) is already defined and likely gets confused with this new one.
PS: Should be an easy patch, will update!
https://en.m.wikipedia.org/wiki/Hyperscale_computing
I.e., our SIMD implementation allows you to scale across different architectures/CPU revisions without having to rewrite assembly for each CPU?
Edit: Rereading, that does not make much sense...
On the image: https://www.modular.com/blog/understanding-simd-infinite-com...
SIMD on the CPU is most compelling to me due to the latency characteristics. You are nanoseconds away from the control flow. If the GPU needs some updated state regarding the outside world, it takes significantly longer to propagate this information.
For most use cases, the GPU will win the trade-off. But there is a reason you don't hear much about systems like order-matching engines using them.
Maximizing performance on a CPU today requires all the steps in the above article, and the article is actually very well written with regard to the 'mindset' needed to tackle a problem such as this.
It's a great article for people aiming to maximize the performance on Intel or AMD systems.
------
CPUs have the memory capacity advantage and will continue to hold said advantage for the foreseeable future (despite NVidia's NVLink and other technologies that try to bridge the gap).
And writing CPU code remains far easier than learning CUDA, even if these AVX intrinsics are harder than their CUDA equivalents.
Perhaps, more precisely, they also have quite an advantage on anything that needs caches and plays nicely with them? When I sliced my problem to maximize cache usage, I saw pretty clear scaling with core count: L1/L2 cache bandwidth is ~30GB/s per core, so e.g. a 32-core system starts to compete with the big consumer GPUs.
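The kind of slicing meant here might look roughly like this (my own simplified sketch; the tile size and the per-tile work are illustrative):

    #include <cstddef>

    // Process data in L2-sized tiles so the repeated passes hit each
    // core's own cache instead of contending for main-memory bandwidth.
    constexpr std::ptrdiff_t TILE = 64 * 1024 / sizeof(float);  // ~64 KiB of floats

    void process(float* data, std::ptrdiff_t n) {
        #pragma omp parallel for
        for (std::ptrdiff_t t = 0; t < n; t += TILE) {
            const std::ptrdiff_t end = t + TILE < n ? t + TILE : n;
            for (int pass = 0; pass < 4; ++pass)          // passes reuse the hot tile
                for (std::ptrdiff_t i = t; i < end; ++i)
                    data[i] = data[i] * 0.5f + 1.0f;      // stand-in for real work
        }
    }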
The one truly unfixable issue is round-trip latency.
Apple's neural engine shows that you can live in between those two worlds.
As you said, the trouble is the latency; the programming model is still great.
If you've got tens to hundreds of microseconds worth of workload, sure, get the GPU to do it.
But bear in mind that 1000 clocks at 4GHz is 250ns; there's still a sizeable region where even tight CPU/GPU integration isn't tight enough.