It's not a compiler bug, it's not a problem in C/C++, or even in IEEE-754 floating point math. The MinGW64 thread library simply forgot to initialize the FPU correctly.
He'd have had the same problem if he'd written his code in assembly language and then run it in a MinGW64 thread.
TBH I credit dumb luck more than anything for finding the cause of this. It took many hours.
The equation shown is the commutative property, not the associative one. IIRC, addition in IEEE-754 is commutative.
The property that fails is (A+B)+C = A+(B+C).
I didn't want to go too far into the background of floating point math in the post as it wasn't ultimately the issue, just something I was mindful of.
> Update: Thanks to gus_massa and wiml @ HackerNews for pointing out I had used associative instead of commutative.... This means that floating point arithmetic is noncommutative [i]n that A + B != B + A.
Floating-point arithmetic is commutative, but not associative. So for floating-point numbers,
A + B = B + A always, but
A + (B + C) != (A + B) + C in many cases.
The problem comes from significant digits and rounding: A + B might round up to 1, while B + C might round down to zero, etc. The point remains that the order in which you do things actually does matter in floating-point arithmetic, but it's the order in which you compute the "inner parentheses," not the actual order of the numbers in (e.g.) the FPU registers.
The C standard library has functions to do this configuration; search for fesetenv. _fpreset is not a standard function, but at least in Wine's implementation [1] it just resets the relevant configuration registers to an initial state.
[1] https://github.com/wine-mirror/wine/blob/e909986e6ea5ecd49b2...
If the thread package was not setting up the "saved" state for a new thread before switching to it, that state may not be what was desired. The best fix would probably be to copy the parent's current FP state at the moment the thread is created.
On a long series of calculations changing the rounding mode could well be enough to cause the difference in results the poster saw.
I would tend to the opinion that redoing the calculation with different rounding modes is actually not a bad way to gauge how reliable the results are in the first place.
(the rounding mode is not the only thing affected by not saving and restoring the FP state correctly)
Though in general, IEEE-754 math does have a small amount of global state, for stuff like rounding modes and subnormals. Perhaps the spawned thread's fenv was unexpected.
I'm not super convinced by this blog post anyway, since it immediately confuses associativity with commutativity.
And yea you're right on the associative mistake. Will correct this :)
Floating-point rounding modes seem more likely to me. The author should be able to dump the floating point configuration to confirm, I'm sure.
Confirmed: It’s a bug in MinGW64.
"When I create a new thread and run floating point operations in that thread, I get slightly different answers."
Interesting!
Annex F
(normative)
IEC 60559 floating-point arithmetic
F.1 Introduction
This annex specifies C language support for the IEC 60559 floating-point standard [...] previously designated ANSI/IEEE 754-1985.

> The Clang compiler does not support IEC 559 math functionality. Clang also does not control and honor the definition of __STDC_IEC_559__ macro. Under specific options such as -Ofast and -ffast-math, the compiler will enable a range of optimizations that provide faster mathematical operations that may not conform to the IEEE-754 specifications. The macro __STDC_IEC_559__ value may be defined but ignored when these faster optimizations are enabled.
If you make use of Clang, you won't find support for Annex F, and in point of fact you can't even use a macro check to see whether it will support Annex F.
The story with GCC is... more complicated. It guarantees it'll follow Annex F for only some operations [0].
So the upshot is... most people can't tell when and how Annex F might be followed.
[0] https://gcc.gnu.org/onlinedocs/gcc-7.4.0/gcc/Floating-point-...
Intel C++ seems to be pretty naughty in that it effectively has fast-math on by default.
(That reminds me, I need to harangue the llvm-libc folks to add that pragma to their fpenv code for correctness sake).
This code was specifically C++ because we had the intention of integrating with libraries like CppAD and ADOL-C.
Examples are: Fortran, R, C#, Java, D and many others.
You'd need to refer to a language's specification to find out, though some aren't very well defined. Zig comes to mind; it supports the data types and offers a strict mode, but doesn't explicitly state its IEEE 754-2008 compliance.
Seriously? Anything but floating point if you care about precision and accuracy.