I might be miscounting, but it seems to me that a loop iteration of the slow implementation issues 6 instructions and a loop iteration of the fast one issues 21. The fast loop is unrolled 4 times, though (compare the loop counter increments), so it runs a quarter as many iterations. That works out to 21/4 = 5.25 instructions per original iteration versus 6, so over the whole loop it actually issues slightly fewer instructions.
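To make the arithmetic concrete, here is a rough sketch of the count. The 6 and 21 are my per-iteration counts from above; the trip count `n` and the helper `total_issued` are made up for illustration, and I'm assuming `n` is a multiple of 4 and ignoring any setup or remainder code outside the loop body.

```python
def total_issued(instr_per_iter: int, unroll: int, n: int) -> int:
    """Instructions issued by a loop covering n elements.

    instr_per_iter: instructions in one pass through the loop body
    unroll: how many original iterations one pass covers
    n: hypothetical element count, assumed to be a multiple of unroll
    """
    iterations = n // unroll
    return instr_per_iter * iterations


n = 1_000_000  # hypothetical trip count, not taken from the benchmark

slow = total_issued(6, 1, n)   # 6 instructions per element
fast = total_issued(21, 4, n)  # 21 instructions per 4 elements

print(slow, fast, fast / slow)
```

With these numbers the fast loop issues 87.5% of the instructions the slow one does, which is what I mean by "slightly fewer".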
edit: to be clear, I'm only arguing about two things that the original parent questioned: whether vectorization is free (it is, because wider ALUs need fewer instructions for the same work) and whether the second loop issues more instructions (it does not, as it is unrolled by 4).