My naive, untested intuition is that there's only one meaningful difference: the former has to dump the entire pipeline on a miss, and the latter only has to nop a single instruction on a miss.
But maybe I'm missing something. I'll re-read his rant.
EDIT:
Linus rants a lot, but makes one concrete claim:
You can always replace it by
j<negated condition> forward
mov ..., %reg
forward:
and assuming the branch is AT ALL predictable (and 95+% of all branches
are), *the branch-over will actually be a LOT better for a CPU.*
So, I decided to test that.
[18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
--- loop2.s 2023-07-06 18:40:11.000000000 -0400
+++ loop4.s 2023-07-06 18:46:58.000000000 -0400
@@ -17,11 +17,15 @@
incq %rdi
xorl %edx, %edx
cmpb $115, %cl
- sete %dl
+ jne _run_switches_jmptgt1
+ mov $1, %dl
+_run_switches_jmptgt1:
addl %edx, %eax
xorl %edx, %edx
cmpb $112, %cl
- sete %dl
+ jne _run_switches_jmptgt2
+ mov $1, %dl
+_run_switches_jmptgt2:
subl %edx, %eax
testb %cl, %cl
jne LBB0_1
[18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
[18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
[18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
449000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.697 total
[18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
449000
./l4 1000 1 4.53s user 0.01s system 99% cpu 4.542 total
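I feel pretty confident that Linus has made a poor prediction about poor prediction here. Jumps are indeed slower: the branch-over version took 4.53s against 0.69s for the sete version, roughly 6.5x.
To be fair to Linus, since Clang and I are using sete here, not cmov, I also tested a cmov version (l5), and the difference between sete and cmov was insignificant: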
[19:53:12 user@boxer ~/src/looptest] $ time ./l2 1000 1
449000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.700 total
[19:53:15 user@boxer ~/src/looptest] $ time ./l5 1000 1
449000
./l5 1000 1 0.68s user 0.00s system 99% cpu 0.683 total
Jumps are slower.