Better HN

zamadatix 4y ago

I'm not sure I'd say many compilers are even that great with SIMD these days, and that is easier than what the Itanium was asking of compilers.

There are real gains to be had from SIMD, but they tend to come from massively parallel data-processing workloads with specially written SIMD code or even hand-tuned assembly (image/video processing, neural networks), not from just feeding in a source file, compiling with the SIMD flag, and expecting meaningful gains.

cogman10 4y ago

The reverse is true.

SIMD is harder because you have to have a uniform operation across a set of data.

Imagine a for loop that looks like this

    int[] x, y, z;
    int[] p, d, q;

    for (int i = 0; i < size; ++i) {
       p[i] = x[i] / z[i];
       d[i] = z[i] * x[i];
       q[i] = y[i] + z[i];
    }

For SIMD, this is a complicated mess for the compiler to unravel. What the compiler would LIKE to do is turn this into 3 for loops and use the SIMD instructions to perform those operations in parallel.

The Itanium optimization, however, is a lot easier. The compiler can see that none of p, d, or q depends on the results of the previous stage (that is, q[i] doesn't depend on p[i]). As a result, the entire thing can be packed into a single operation.

Now, of course, modern out-of-order (OoO) processors can do the same optimization, so maybe it's not a huge win? Still, it would have been something worth exploring more (IMO), but market forces killed it. Moving that sort of optimization out of the processor hardware and into the compiler software seems like it could lead to some nice power/performance benefits.
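For illustration, the "3 for loops" split mentioned above is usually called loop fission. Here is a minimal sketch in C, assuming the six arrays never overlap and that z contains no zeros; the function names `loop_fused` and `loop_fissioned` are made up for this example.

```c
#include <stddef.h>

/* Original form: three independent statements in one loop body. */
void loop_fused(const int *x, const int *y, const int *z,
                int *p, int *d, int *q, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        p[i] = x[i] / z[i];
        d[i] = z[i] * x[i];
        q[i] = y[i] + z[i];
    }
}

/* Fissioned form: one loop per statement, so each loop applies a single
 * uniform operation. This is only equivalent because no output array
 * aliases an input or another output (an assumption of this sketch). */
void loop_fissioned(const int *x, const int *y, const int *z,
                    int *p, int *d, int *q, size_t size) {
    for (size_t i = 0; i < size; ++i) p[i] = x[i] / z[i];
    for (size_t i = 0; i < size; ++i) d[i] = z[i] * x[i];
    for (size_t i = 0; i < size; ++i) q[i] = y[i] + z[i];
}
```

Both functions write the same values; the split only changes the order in which the elements are computed.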

jcranmer 4y ago

That loop is actually nicely vectorizable, at least assuming that you replace int with float (there is no integer division vector instruction on x86).

All of the array accesses are uniform, so the resulting vector code is roughly:

  for (i = 0 .. size by vector width) {
    r0 = vector load x[i..i + vw]
    r1 = vector load y[i..i + vw]
    r2 = vector load z[i..i + vw]
    r3 = r0 / r2
    r4 = r2 * r0
    r5 = r1 + r2
    vector store r3 to p[i..i + vw]
    vector store r4 to d[i..i + vw]
    vector store r5 to q[i..i + vw]
  }

(and probably unroll the loop for good measure). No need to fission the loop to vectorize here.
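To make the pseudocode above concrete, here is a rough, portable C sketch of the same shape with float instead of int. It is not what any particular compiler emits: `VW` is a hypothetical vector width, each inner lane loop stands in for one vector instruction, and the scalar tail for leftover elements is an addition the pseudocode leaves out.

```c
#include <stddef.h>

#define VW 4 /* hypothetical vector width in floats (e.g. one 128-bit register) */

void loop_strip_mined(const float *restrict x, const float *restrict y,
                      const float *restrict z, float *restrict p,
                      float *restrict d, float *restrict q, size_t size) {
    size_t i = 0;

    /* Main loop: process VW elements per iteration. */
    for (; i + VW <= size; i += VW) {
        float r3[VW], r4[VW], r5[VW];
        /* One "vector" divide, multiply, and add over the loaded lanes. */
        for (size_t l = 0; l < VW; ++l) r3[l] = x[i + l] / z[i + l];
        for (size_t l = 0; l < VW; ++l) r4[l] = z[i + l] * x[i + l];
        for (size_t l = 0; l < VW; ++l) r5[l] = y[i + l] + z[i + l];
        /* Three "vector" stores. */
        for (size_t l = 0; l < VW; ++l) p[i + l] = r3[l];
        for (size_t l = 0; l < VW; ++l) d[i + l] = r4[l];
        for (size_t l = 0; l < VW; ++l) q[i + l] = r5[l];
    }

    /* Scalar tail for the last size % VW elements. */
    for (; i < size; ++i) {
        p[i] = x[i] / z[i];
        d[i] = z[i] * x[i];
        q[i] = y[i] + z[i];
    }
}
```

The `restrict` qualifiers state the no-overlap assumption that lets the loads, arithmetic, and stores be grouped this way without changing results.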

sifar 4y ago

And any VLIW compiler worth its salt would bundle the load, div/mul/ALU, and store into one instruction packet.

sifar 4y ago

>> For SIMD, this is a complicated mess for the compiler to unravel

This is trivially vectorizable for SIMD, and would fit nicely in a VLIW packet too. The only issue is that if any access hit a runtime memory stall, the entire pipeline would stall.

With predication, modern SIMD can even vectorize if conditions like the one below.

    int[] x, y, z;
    int[] p, d, q;

    for (int i = 0; i < size; ++i) {
       p[i] = x[i] / z[i];
       d[i] = z[i] * x[i];
       if (i > n) {
         q[i] = y[i] + z[i];
       } else {
         q[i] = y[i];
       }
    }
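As a rough illustration of what predication does to the if above, here is a scalar C sketch of the same idea: both outcomes are computed for every element and a mask selects which one is kept. The function names are made up, `n` is treated as an unsigned threshold, and the sketch assumes y[i] + z[i] never overflows (it is now computed even for elements that do not use it).

```c
#include <stddef.h>

/* Branching form, as in the loop above. */
void loop_branchy(const int *x, const int *y, const int *z,
                  int *p, int *d, int *q, size_t size, size_t n) {
    for (size_t i = 0; i < size; ++i) {
        p[i] = x[i] / z[i];
        d[i] = z[i] * x[i];
        if (i > n) {
            q[i] = y[i] + z[i];
        } else {
            q[i] = y[i];
        }
    }
}

/* Predicated form: no branch in the body. On real SIMD hardware the mask
 * would be a per-lane predicate and the select a blend/masked operation. */
void loop_predicated(const int *x, const int *y, const int *z,
                     int *p, int *d, int *q, size_t size, size_t n) {
    for (size_t i = 0; i < size; ++i) {
        p[i] = x[i] / z[i];
        d[i] = z[i] * x[i];
        int sum = y[i] + z[i];      /* computed for every element */
        int mask = -(int)(i > n);   /* all ones when i > n, else zero */
        q[i] = (sum & mask) | (y[i] & ~mask);
    }
}
```

Because the loop body is now the same straight-line sequence for every element, it has the uniform shape that vectorization needs.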

hajile 4y ago

VLIW architecture is so bad that AMD and Nvidia couldn't make it work well with embarrassingly parallel graphics code. AMD first moved from VLIW-5 to VLIW-4 because they couldn't find enough independent work to reliably keep the fifth unit busy.

AMD then followed Nvidia into the world of SIMD/SIMT because it offered better real-world performance for the majority of applications.

VLIW has been tried repeatedly only to be replaced with something that worked better.
