The cost of bounds checking lies mostly in second-order effects: it makes vectorization harder, puts slightly more pressure on the instruction (and possibly data) cache, and requires more decode bandwidth. For the vast majority of programs these bottlenecks do not really matter.
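To make the vectorization point concrete, here is a minimal sketch in Rust (my choice of language; the function names are made up for illustration). In the indexed version every `b[i]` and `out[i]` carries a check that the compiler has to prove redundant or hoist before it can vectorize. The zipped version has no per-element checks to begin with. Whether either one actually gets vectorized depends on the compiler version and flags, so treat this as an illustration and not a benchmark.

```rust
/// Indexed version: `a[i]` is provably in range, but `b[i]` and `out[i]`
/// are each checked and panic if those slices are shorter than `a`.
fn add_indexed(a: &[u32], b: &[u32], out: &mut [u32]) {
    for i in 0..a.len() {
        out[i] = a[i].wrapping_add(b[i]);
    }
}

/// Zipped version: the iterators stop at the shortest slice, so there is
/// no per-element bounds check left in the loop body.
fn add_zipped(a: &[u32], b: &[u32], out: &mut [u32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x.wrapping_add(*y);
    }
}
```

Note that the two are not strictly equivalent: with mismatched lengths the indexed version panics, while the zipped one silently processes only the common prefix.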
Think about it this way: a machine with infinite execution units and memory bandwidth could, in principle, execute all iterations of a loop at the same time, in parallel.
That is, unless each loop iteration depends somehow on the result of the previous one. In that case only the independent instructions within an iteration can execute in parallel, and the loop is bound by its latency chain (especially when it involves memory accesses). This is often the case. Because branch prediction breaks control dependencies, a correctly predicted bounds check is not part of that dependency chain, so it is often free or nearly so. For more heavily optimized code, the assumption of infinite resources is of course not warranted, and execution bandwidth and possibly even memory bandwidth need to be taken into account.
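A small sketch of what I mean by a latency-chain-bound loop, again in Rust with hypothetical names. Each load in `chase` needs the result of the previous load, so the iterations cannot overlap. The bounds check on `next[i]` is a compare and a branch off to the side: as long as it is predicted correctly, the next load does not have to wait for it.

```rust
/// Follows a chain of indices through `next`. The loop-carried dependency
/// is load -> index -> load, so the loop runs at roughly one load latency
/// per step no matter how wide the machine is.
fn chase(next: &[usize], start: usize, steps: usize) -> usize {
    let mut i = start;
    for _ in 0..steps {
        // Checked index: panics if i >= next.len(). The check feeds only
        // a branch; the value of the following load does not depend on it.
        i = next[i];
    }
    i
}

/// For contrast: the loads here are independent of one another, so many
/// can be in flight at once and throughput limits start to matter instead.
fn gather_sum(data: &[u64], idx: &[usize]) -> u64 {
    let mut total = 0u64;
    for &j in idx {
        total = total.wrapping_add(data[j]); // checked index
    }
    total
}
```

For example, `chase(&[1, 2, 3, 0], 0, 6)` walks 0, 1, 2, 3, 0, 1, 2 and returns 2. In `gather_sum` the only loop-carried dependency is the cheap addition, which is the kind of loop where the extra instructions from the check are more likely to show up.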
There is an interesting talk titled ‘The death of optimizing compilers’ that argues that for most code these optimizations are almost completely meaningless, and that in the hot loops where they actually matter, compilers do poorly compared to humans (improvements of 100x or more are sometimes possible and left on the table). While I don’t completely agree with its points, the slides are well worth reading through.