Nice try. You can’t escape being known as a verb now.
Everyone knows the tool as godbolt.
It's amazing how many more code generation questions occur to me now that there's so much less friction in getting the answers.
#pragma omp simd reduction(+:res)
as a more precise way to achieve vectorization in the reduction (compile with -fopenmp-simd to use only the SIMD directives without linking an OpenMP runtime): https://godbolt.org/z/17oTz1 Unfortunately, the pragma is not supported with the new-style class iterators in any released compiler, though it works in clang-trunk: https://godbolt.org/z/hbP11W Note that Clang disables floating-point contraction by default (so no vfmadd instructions), even though contracted operations are more accurate. One usually wants contraction enabled globally (-ffp-contract=fast), except when trying to reproduce, bit for bit, software compiled for pre-Haswell hardware.
This was my key takeaway from this article. Writing clear code that is easier to maintain will have good enough performance most of the time. I was particularly impressed with the devirtualization optimizations, and I'll be less likely to shy away from polymorphism in the future over performance concerns.
Most important: this optimization enables pipelined execution.
When people talk about a CPU executing an integer add instruction in ~1 cycle, what they actually mean is that adds complete at that rate when the CPU's pipelines are full.
If you have an 11-stage pipeline, a single add can have an end-to-end latency of ~11 cycles from fetch to retire... and your code can actually observe latencies like that, unless you write the _right_ code for it.
Then, looking at the code, it's not obvious where the infinite loop occurs.