I looked at the asm generated from my original example, and the output differs a lot depending on the optimization level: gcc applies other optimizations when compiling with -O1.
I've been fighting the compiler to produce a minimal working example with subnormals, but haven't had any success.
Some things need to be taken into account (off the top of my head):
- Rounding. You don't want the accumulator to get stuck on the same number.
- The FPU can have accumulator registers that are wider than the floating-point type (for example, the 80-bit x87 registers versus a 64-bit double).
- Using more registers than the architecture has is not trivial because of register renaming and instruction reordering. The CPU might optimize in a way that the data never leaves those registers.
While trying to make an MWE, I found this code:
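    #include <stdio.h>

    int
    main ()
    {
      double x = 5e-324;   /* rounds to the smallest positive subnormal double */
      double acc = x;
      /* Caution: i is never initialized, so reading it is undefined behavior
         and the trip count can differ between -O0 and -O1. */
      for (size_t i; i < (1ul<<46); i++) {
        acc += x;
      }
      printf ("%e\n", acc);
      return 0;
    }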
It runs in a fraction of a second with -O0:
    gcc double.c -o double -O0
But it takes forever (I killed it after 5 minutes) with -O1:
    gcc double.c -o double -O1
I'm using gcc (Arch Linux 9.3.0-1) 9.3.0 on an i7-8700.
I also managed to write code that sometimes runs in 1 s but other times takes 30 s. Recompiling made no difference.
Floating point is hard.