cmpb $115, %cl
sete %dl
addl %edx, %eax
vs
cmpb $115, %cl
jne _run_switches_jmptgt1
mov $1, %dl
_run_switches_jmptgt1:
addl %edx, %eax
The argument for why `jne` might be faster is that in the former case, the CPU always executes a dependency chain of length 3: `cmpb` -> `sete` -> `addl`. Each of these instructions has to be computed one after the other, as `sete` depends on the result of `cmpb`, and `addl` depends on the result of `sete`.

With `jne`, the CPU might predict that the branch is not taken, in which case the dependency chain is `mov` -> `addl` (the `mov` of an immediate might be handled by register renaming?). Or it might predict that the branch is taken, in which case the dependency chain is just `addl`.
I guess you're arguing that the CPU should handle `sete` the same way? That is, instead of treating `addl` as dependent on the result, predict what `sete` does and start executing `addl` before `sete` finishes, rewinding if that went wrong?
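
For reference, here is roughly what the two sequences above correspond to at the C level. This is only a sketch: the function names `count_s_branchless` and `count_s_branchy` are made up for illustration, and a compiler is free to emit either the `sete` form or the `jne` form (or something else entirely) for both of them, so you'd have to check the generated assembly to know which one you actually got.

```c
#include <stddef.h>

/* Hypothetical helper: adds the 0/1 result of the comparison directly.
   A compiler may lower this to cmp -> sete -> add, so the add waits on
   the comparison result. 's' is ASCII 115, as in the assembly above. */
int count_s_branchless(const char *s) {
    int count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        count += (s[i] == 's');
    }
    return count;
}

/* Hypothetical helper: same result, written with an explicit branch.
   A compiler may lower this to cmp -> jne around the increment, which
   lets the branch predictor guess the outcome instead of waiting. */
int count_s_branchy(const char *s) {
    int count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == 's') {
            count += 1;
        }
    }
    return count;
}
```

Both functions always return the same count; the only thing that can differ is which instruction sequence ends up in the loop body, and therefore whether the add waits on the comparison or on a prediction.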