IME, loops with induction variables (integer indexes) often produce better codegen than loops with iterators. Compare these two Rust functions for inverting bits: https://rust.godbolt.org/z/cE4vPdbdY
This got improved in Rust 1.65 just this month, but the point stands.
It might be a matter of how many times the passes get repeated: O2 does not rerun vectorisation after whatever pass manages to unroll everything, but O3 does.
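
For anyone who doesn't want to open the link, here is a rough sketch of the kind of pair being compared. This is not the exact code behind the godbolt URL; the function names and the `u8` element type are made up for illustration. Which version compiles to better assembly depends on the rustc version and the opt-level, so it is worth pasting both into Compiler Explorer and checking.

```rust
// Hypothetical example, not the code from the godbolt link.

/// Index-based loop: the induction variable `i` drives every access.
pub fn invert_indexed(data: &mut [u8]) {
    for i in 0..data.len() {
        // `i` is always < data.len(), so the bounds check can be elided
        // by the optimizer, though that is not guaranteed.
        data[i] = !data[i];
    }
}

/// Iterator-based loop: the same operation written with `iter_mut`.
pub fn invert_iter(data: &mut [u8]) {
    for byte in data.iter_mut() {
        *byte = !*byte;
    }
}
```

Both functions do the same thing (bitwise NOT on every byte in place), so any difference in the generated assembly comes from how the optimizer handles the two loop shapes rather than from the work itself.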