It's hard to know exactly without staring at the code and knowing the exact CPU, because nothing stands out as a red flag. If you used an extra register, maybe you caused a spill (pushed something else out of registers and onto the stack). Maybe you made the loop code bigger and it no longer fit in the loop buffer (what Intel calls the loop stream detector, if that model had one). Maybe you hit a weird frontend decode issue and it could only decode N-1 instructions per clock in that line instead of N, and that was critical to the loop's performance. Maybe your code layout changed for another reason, or the memory layout, and you got some bad cache aliasing (too many hot addresses mapping to the same cache set and evicting each other).
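To make the cache aliasing case a bit more concrete, here's a rough sketch of how addresses map to cache sets. The geometry (32 KiB, 8-way, 64-byte lines) is my assumption, picked because it's common for L1D; your CPU may differ, and real hardware may index on physical rather than virtual addresses. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED cache geometry (hypothetical; check your CPU's actual values):
   32 KiB, 8-way set associative, 64-byte lines -> 64 sets. */
enum { LINE_BYTES = 64, WAYS = 8, CACHE_BYTES = 32 * 1024 };
enum { NUM_SETS = CACHE_BYTES / (LINE_BYTES * WAYS) };

/* Simplified model: the set is the cache-line number modulo the set count. */
static unsigned cache_set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Count how many of n addresses, spaced `stride` bytes apart starting at
   `base`, fall into the same set as `base` itself. */
static unsigned count_same_set(uintptr_t base, uintptr_t stride, unsigned n) {
    assert(n > 0);
    unsigned target = cache_set_index(base);
    unsigned same = 0;
    for (unsigned i = 0; i < n; i++) {
        if (cache_set_index(base + i * stride) == target) {
            same++;
        }
    }
    return same;
}
```

Under these assumptions, 16 buffers spaced exactly 4096 bytes apart all land in one set, which only has 8 ways, so they keep evicting each other. Pad the stride by one line (4096 + 64) and they spread across different sets. That's the kind of thing an innocent-looking layout change can flip.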
These things are knowable if you have enough curiosity and maybe masochism :)