The code Rust generates for the naive solution uses mostly `ss` (scalar) instructions, whereas my two attempts using `_mm_dp_ps`, and `_mm_mul_ps` with `_mm_hadd_ps`, were both significantly slower even though they result in fewer instructions. I suspect that for a single dot product the overhead of moving data into and out of the `__m128` registers costs more than the vectorized math saves.
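For reference, a minimal sketch of the `_mm_dp_ps` attempt, assuming a hypothetical 16-byte-aligned `Vec4` wrapper (the actual types in the linked repo may differ):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Hypothetical padded vector: 4 f32 lanes, 16-byte aligned for _mm_load_ps.
#[repr(align(16))]
struct Vec4([f32; 4]);

// Dot product via the single dpps instruction (requires SSE4.1).
// Immediate 0xF1: multiply all four lanes, write the sum into lane 0.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn dot_dpps(a: &Vec4, b: &Vec4) -> f32 {
    let va = _mm_load_ps(a.0.as_ptr());
    let vb = _mm_load_ps(b.0.as_ptr());
    _mm_cvtss_f32(_mm_dp_ps::<0xF1>(va, vb))
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check makes calling the SSE4.1 intrinsic sound.
        if is_x86_feature_detected!("sse4.1") {
            let a = Vec4([1.0, 2.0, 3.0, 0.0]);
            let b = Vec4([4.0, 5.0, 6.0, 0.0]);
            let d = unsafe { dot_dpps(&a, &b) };
            println!("{}", d); // 32
        }
    }
}
```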
Naive Rust version output:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovss (%rdi), %xmm0
vmulss (%rsi), %xmm0, %xmm0
vmovsd 4(%rdi), %xmm1
vmovsd 4(%rsi), %xmm2
vmulps %xmm2, %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
vmovshdup %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
popq %rbp
retq
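The scalar output above plausibly comes from a naive dot product along these lines (the struct and function names here are placeholders, not necessarily the repo's actual code):

```rust
// Plain 3-component vector; rustc compiles the dot product below
// into the ss/ps mix shown in the listing above.
struct Vector3 {
    x: f32,
    y: f32,
    z: f32,
}

// Naive scalar dot product: three multiplies, two adds.
fn dot(a: &Vector3, b: &Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

fn main() {
    let a = Vector3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vector3 { x: 4.0, y: 5.0, z: 6.0 };
    println!("{}", dot(&a, &b)); // 32
}
```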
My handwritten version with `_mm_mul_ps` and `_mm_hadd_ps`:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovaps (%rdi), %xmm0
vmulps (%rsi), %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
popq %rbp
retq
Intuitively it feels like my version should be faster, but it isn't. In this code [0] I changed the struct from 3 f32 components to an array of 4 f32 elements to avoid having to build the array during the computation itself. The code also requires specific alignment to avoid segfaulting, which I guess might also affect performance.

0: https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-...
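A minimal sketch of the mul + double-hadd version that produces asm like the second listing, assuming a hypothetical 16-byte-aligned `Vec4` wrapper (names are illustrative, not the repo's actual API):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// 16-byte alignment so the aligned load (vmovaps) cannot segfault,
// matching the alignment requirement mentioned in the post.
#[repr(align(16))]
struct Vec4([f32; 4]);

// SIMD dot product mirroring the handwritten asm:
// one vmulps, two vhaddps, then extract the low lane.
// _mm_hadd_ps is an SSE3 instruction, hence the target_feature gate.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse3")]
unsafe fn dot_simd(a: &Vec4, b: &Vec4) -> f32 {
    let va = _mm_load_ps(a.0.as_ptr());   // vmovaps
    let vb = _mm_load_ps(b.0.as_ptr());
    let prod = _mm_mul_ps(va, vb);        // vmulps
    let sum1 = _mm_hadd_ps(prod, prod);   // vhaddps
    let sum2 = _mm_hadd_ps(sum1, sum1);   // vhaddps
    _mm_cvtss_f32(sum2)
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check makes calling the SSE3 intrinsic sound.
        if is_x86_feature_detected!("sse3") {
            let a = Vec4([1.0, 2.0, 3.0, 0.0]);
            let b = Vec4([4.0, 5.0, 6.0, 0.0]);
            let d = unsafe { dot_simd(&a, &b) };
            println!("{}", d); // 32
        }
    }
}
```

Padding the fourth lane with 0.0 keeps the horizontal sum correct while letting the loads stay full-width.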