I run the search many (thousands of) times and measure the total time spent, so the timer is called only twice. I search the same array, but for a different number each time, so the array should stay entirely within L1 cache. It's also important to use the result of the search, otherwise the compiler simply omits the code.
I use clock_gettime(), which takes on the order of 1 µs per call. The timestamp counter (which is faster) is not available to me because every CPU core has its own counter and they show different values, so Linux refuses to use it. The first core is 600 ms ahead of the other cores. How we cannot make a boring counter in 2025 amazes me. Or maybe this is intentional, to prevent using cheaper consumer hardware for professional purposes.
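To make the setup concrete, here is a minimal sketch of such a measurement loop. It is not the exact code behind the godbolt link. The names search_fn, linear_search and bench_ns_per_search are made up for illustration, and it assumes a sorted int array whose values lie roughly in [0, n], so that keys drawn from that range hit different positions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std=c11 */
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature: index of the first element >= key (n if none). */
typedef size_t (*search_fn)(const int *arr, size_t n, int key);

/* Plain linear search, included only so the sketch is self-contained. */
static size_t linear_search(const int *arr, size_t n, int key)
{
    size_t i = 0;
    while (i < n && arr[i] < key)
        i++;
    return i;
}

/* Runs `iters` searches between exactly two clock_gettime() calls and
 * returns the average time per search in nanoseconds. */
static double bench_ns_per_search(search_fn fn, const int *arr, size_t n,
                                  size_t iters, size_t *sink_out)
{
    struct timespec t0, t1;
    size_t sink = 0;          /* accumulates results so they count as used */
    uint32_t state = 12345u;  /* small LCG: a different key every iteration */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) {
        state = state * 1664525u + 1013904223u;
        /* Assumption: array values lie in [0, n], so keys come from there. */
        int key = (int)((state >> 16) % (uint32_t)(n + 1));
        sink += fn(arr, n, key);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sink_out = sink;         /* caller should print or otherwise consume it */
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Note that the key generation sits inside the timed region here, so it adds a small constant to every figure. Filling an array of keys before the first clock_gettime() call would avoid that.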
Here are some numbers:
Array size 8, 200 000 iterations, time ns/search: linear 8 ns, linear no-branch <1 ns (I wonder if it is an error), binary 11 ns, binary no-branch 7 ns.
Array size 128, 100 000 iterations, time ns/search: linear 34 ns, linear no-branch 22 ns, binary 35 ns, binary no-branch 16 ns.
As you can see, branches are what ruin the performance here. Every time the CPU mispredicts a branch, it wastes several cycles.
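For readers who have not seen the distinction, here is a hedged sketch of what "branch" and "no-branch" variants can look like. It is an illustration only and may differ from the code behind the godbolt link. All four function names are made up, and each returns the index of the first element >= key in a sorted array.

```c
#include <stddef.h>

/* Linear, branchy: the early exit is a data-dependent branch. */
static size_t linear_branchy(const int *arr, size_t n, int key)
{
    size_t i = 0;
    while (i < n && arr[i] < key)
        i++;
    return i;
}

/* Linear, no-branch: always scans the whole array and counts the elements
 * below key. The only branch left is the loop counter, which is predictable. */
static size_t linear_nobranch(const int *arr, size_t n, int key)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(arr[i] < key);
    return count;
}

/* Binary, branchy: which half to take depends on the data. */
static size_t binary_branchy(const int *arr, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Binary, no-branch: the iteration count depends only on n, and the
 * comparison just selects an offset. The compiler may turn the select into
 * a conditional move, but that is not guaranteed; check the disassembly. */
static size_t binary_nobranch(const int *arr, size_t n, int key)
{
    const int *base = arr;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        base += (base[half - 1] < key) ? half : 0;
        len -= half;
    }
    return (size_t)(base - arr) + (size_t)(n > 0 && base[0] < key);
}
```

The no-branch linear version does more work per call than the early-exit one, but a loop like this is easy for a compiler to vectorize.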
Looking at the disassembly on godbolt, I see that the compiler inlined and vectorized the measurement loop for the no-branch code, which might explain the numbers. The SIMD assembly is hard to read and I don't quite understand what it does: https://godbolt.org/z/dofKnT3W3
I'd be happy to hear about any mistakes in my benchmark. If you want to compile the code, I used gcc with "-O3" and maybe with "-march=native".