Array languages and SIMD are a match made in heaven. This should be the paradigm of choice for high-performance programming, but it's unfortunately pretty obscure.
Huh. I kinda figured the whole point of array programming languages was that the compiler doesn't have to guess which parts of the code are inherently parallel.
So for example, if your pattern is "do a small op to each part of a large block of data, then do another small op to each part of that block, etc.", then at least with CPU SIMD (e.g. AVX) you end up memory-bottlenecked.
However, if you can do a bunch of ops on the same small block of data before moving on to the next small block within the overall large block, then each small block can fit inside the L1 cache (or directly in registers), and that can run the CPU to its absolute limit.
Hence it becomes a game of scheduling. You already know what you need to optimise, but actually doing so gets really hard really fast. That said, things like MLIR (which is still very new) are making this easier to approach.
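To make the two schedules concrete, here's a rough sketch in NumPy. The function names, the particular chain of ops (multiply, add, sqrt) and the block size are all made up for illustration; it only shows the shape of the two loop orders and isn't a benchmark.

```python
import numpy as np

def two_pass(x):
    # Schedule 1: each op walks the whole large array before the next op starts,
    # so every step streams the full block through memory again.
    tmp = x * 2.0
    tmp = tmp + 1.0
    return np.sqrt(tmp)

def blocked(x, block=1024):
    # Schedule 2: run the whole chain of ops on one small block, then move on.
    # block=1024 float64 values (8 KiB) is an assumed size, picked so the input
    # chunk, the scratch buffer and the output chunk could plausibly sit in L1.
    out = np.empty_like(x)
    scratch = np.empty(block, dtype=x.dtype)
    for start in range(0, x.size, block):
        chunk = x[start:start + block]
        tmp = scratch[:chunk.size]          # the last chunk may be shorter
        np.multiply(chunk, 2.0, out=tmp)
        np.add(tmp, 1.0, out=tmp)
        np.sqrt(tmp, out=out[start:start + chunk.size])
    return out

x = np.linspace(0.0, 1.0, 100_003)  # length deliberately not a multiple of block
assert np.allclose(two_pass(x), blocked(x))
```

Both versions compute the same thing; only the order of memory traffic differs. Whether the blocked version actually wins depends on the hardware, the array size and the per-call overhead of the interpreter, which is exactly the scheduling problem: picking the block size and op grouping by hand gets tedious fast.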
NumPy is partly inspired by APL and its descendants. It's one of the few places where programmers commonly get the performance the hardware affords!
For people who prefer C-like syntax, there is ispc[2], which supports x86 AVX and ARM NEON programming via LLVM.