The code Rust generates for the naive solution uses mostly `ss` (scalar) instructions, whereas my two attempts using `_mm_dp_ps`, and `_mm_mul_ps` with `_mm_hadd_ps`, were both significantly slower even though they result in fewer instructions. I suspect that for a single dot product the overhead of moving data into and out of the `__m128` registers costs more than the vectorized math saves.
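For reference, a minimal sketch of the `_mm_dp_ps` attempt, assuming a hypothetical 16-byte-aligned `Vec4` wrapper (the actual types in the linked repo may differ):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Hypothetical padded vector: 4 f32 lanes, 16-byte aligned for _mm_load_ps.
#[repr(align(16))]
struct Vec4([f32; 4]);

// Dot product via the single dpps instruction (requires SSE4.1).
// Immediate 0xF1: multiply all four lanes, write the sum into lane 0.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn dot_dpps(a: &Vec4, b: &Vec4) -> f32 {
    let va = _mm_load_ps(a.0.as_ptr());
    let vb = _mm_load_ps(b.0.as_ptr());
    _mm_cvtss_f32(_mm_dp_ps::<0xF1>(va, vb))
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check makes calling the SSE4.1 intrinsic sound.
        if is_x86_feature_detected!("sse4.1") {
            let a = Vec4([1.0, 2.0, 3.0, 0.0]);
            let b = Vec4([4.0, 5.0, 6.0, 0.0]);
            let d = unsafe { dot_dpps(&a, &b) };
            println!("{}", d); // 32
        }
    }
}
```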
Naive Rust version output:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovss (%rdi), %xmm0
vmulss (%rsi), %xmm0, %xmm0
vmovsd 4(%rdi), %xmm1
vmovsd 4(%rsi), %xmm2
vmulps %xmm2, %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
vmovshdup %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
popq %rbp
retq
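The scalar output above plausibly comes from a naive dot product along these lines (the struct and function names here are placeholders, not necessarily the repo's actual code):

```rust
// Plain 3-component vector; rustc compiles the dot product below
// into the ss/ps mix shown in the listing above.
struct Vector3 {
    x: f32,
    y: f32,
    z: f32,
}

// Naive scalar dot product: three multiplies, two adds.
fn dot(a: &Vector3, b: &Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

fn main() {
    let a = Vector3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vector3 { x: 4.0, y: 5.0, z: 6.0 };
    println!("{}", dot(&a, &b)); // 32
}
```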
My handwritten version with `_mm_mul_ps` and `_mm_hadd_ps`:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovaps (%rdi), %xmm0
vmulps (%rsi), %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
popq %rbp
retq
Intuitively it feels like my version should be faster, but it isn't. In this code [0] I changed the struct from 3 f32 components to an array of 4 f32 elements to avoid having to build the array during the computation itself. The code also requires specific alignment to avoid segfaulting, which I guess might also affect performance.

0: https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-...
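A minimal sketch of the mul + double-hadd version that produces asm like the second listing, assuming a hypothetical 16-byte-aligned `Vec4` wrapper (names are illustrative, not the repo's actual API):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// 16-byte alignment so the aligned load (vmovaps) cannot segfault,
// matching the alignment requirement mentioned in the post.
#[repr(align(16))]
struct Vec4([f32; 4]);

// SIMD dot product mirroring the handwritten asm:
// one vmulps, two vhaddps, then extract the low lane.
// _mm_hadd_ps is an SSE3 instruction, hence the target_feature gate.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse3")]
unsafe fn dot_simd(a: &Vec4, b: &Vec4) -> f32 {
    let va = _mm_load_ps(a.0.as_ptr());   // vmovaps
    let vb = _mm_load_ps(b.0.as_ptr());
    let prod = _mm_mul_ps(va, vb);        // vmulps
    let sum1 = _mm_hadd_ps(prod, prod);   // vhaddps
    let sum2 = _mm_hadd_ps(sum1, sum1);   // vhaddps
    _mm_cvtss_f32(sum2)
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check makes calling the SSE3 intrinsic sound.
        if is_x86_feature_detected!("sse3") {
            let a = Vec4([1.0, 2.0, 3.0, 0.0]);
            let b = Vec4([4.0, 5.0, 6.0, 0.0]);
            let d = unsafe { dot_simd(&a, &b) };
            println!("{}", d); // 32
        }
    }
}
```

Padding the fourth lane with 0.0 keeps the horizontal sum correct while letting the loads stay full-width.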