My recollection is that the branch predictor should mispredict the last branch of the loop. Unrolling eliminates that misprediction by replacing all of the loop iteration branches with a jump table branch that it will predict correctly more often. This is micro-optimizing to an extreme. I vaguely recall seeing a slight improvement in a microbenchmark that I attributed to this effect.