Loop unrolling is one of those optimizations that really highlights the need for dynamic CPU optimizations like out-of-order execution and speculative execution. It's very difficult to statically pick the optimal amount of unrolling, especially if you want the generated code to keep performing well on future CPUs that implement the same ISA. Even when targeting a specific CPU model it's still difficult, since you don't know statically how many iterations the loop will run, what's currently in cache, what other code runs immediately before or after the loop, what's running at the same time on other threads, etc.
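
To make the tradeoff concrete, here's a rough sketch in C of the same reduction written rolled and hand-unrolled by 4. The function names and the unroll factor of 4 are my own picks for illustration, not a recommendation. Whether the unrolled version wins depends on exactly the things listed above (trip count, cache state, the specific core), and a compiler may well unroll or vectorize the simple loop on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one element per iteration. */
uint64_t sum_rolled(const uint32_t *a, size_t n) {
    assert(a != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Hand-unrolled by 4 (an arbitrary factor chosen for illustration).
 * Four independent accumulators break the single add dependency chain,
 * which gives an out-of-order core more work it can overlap. The cost
 * is larger code and a cleanup loop, which can dominate when n is small. */
uint64_t sum_unrolled4(const uint32_t *a, size_t n) {
    assert(a != NULL || n == 0);
    uint64_t t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;

    /* Main body: runs while at least 4 elements remain. */
    for (; n - i >= 4; i += 4) {
        t0 += a[i];
        t1 += a[i + 1];
        t2 += a[i + 2];
        t3 += a[i + 3];
    }

    uint64_t total = t0 + t1 + t2 + t3;

    /* Remainder: the 0 to 3 elements the unrolled body did not cover. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Both functions return the same result for any input; the only difference is the shape of the loop. If n is usually 2 or 3, the unrolled body never runs and you've only paid for the extra code, which is the kind of thing you can't know statically.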