I know of a few programs that when compiled with -O2 work fine but break with -O3. This is because some optimizations that are applied are just not what the programmer expected or in the older days just broken working code.
Formally verifying the output is difficult because you need to prove the same operation is happening both times even if they are extremely different in what they are doing. I'm assuming that's where the difficulty comes in.