These operations are
1. Localized, not a function-wide or program-wide flag.
2. Completely safe — unlike -ffast-math, which includes assumptions such as "there are no NaNs", where violating the assumption is undefined behavior.
So what do these algebraic operations do? Well, one by itself doesn't do much of anything compared to a regular operation. But a sequence of them is allowed to be transformed using optimizations which are algebraically justified, as-if all operations are done using real arithmetic.
The library has some merit, but the goal you've stated here is given to you with 5 compiler flags. The benefit of the library is choosing when these apply.
This can be expanded in the future as LLVM offers more flags that fall within the scope of algebraically motivated optimizations.
feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
will cause your code to get a SIGFPE whenever a NaN crawls out from under a rock. Of course it doesn't work with fast-math enabled, but if you're unknowingly getting NaNs without fast-math, you obviously need to fix those before even trying fast-math. They can be hard to find, and feenableexcept() makes finding them a lot easier. Be very careful with it in production code though [1]. If you're in a DLL, changing the FPU exception flags is a big no-no (unless you're really, really careful to restore them when your code goes out of scope).
[1]: https://randomascii.wordpress.com/2016/09/16/everything-old-...
Not being able to auto-vectorize seems like a pretty critical bug given hardware trends that have been going on for decades now; on the other hand sacrificing platform-independent determinism isn't a trivial cost to pay either.
I'm not familiar with the details of OpenCL and CUDA on this front - do they have some way to guarantee a specific order of operations such that code always has a predictable result on all platforms and nevertheless parallelizes well on a GPU?
Most popular programming languages have the defect that they impose a sequential semantics even where it is not needed. There have been programming languages without this defect, e.g. Occam, but they have not become widespread.
Because nowadays only a relatively small number of users care about computational applications, this defect has not been corrected in any mainline programming language, though for some programming languages there are extensions that can achieve this effect, e.g. OpenMP for C/C++ and Fortran. CUDA is similar to OpenMP, even if it has a very different syntax.
The IEEE standard for floating-point arithmetic has been one of the most useful standards in all history. The reason is that both hardware designers and naive programmers have always had an incentive to cheat in order to obtain better results in speed benchmarks, i.e. to introduce errors in the results with the hope that this will not matter for users, who will be more impressed by the great benchmark results.
There are always users who need correct results more than anything else, and it can even be a matter of life and death. For the very limited-in-scope uses where correctness does not matter, i.e. mainly graphics and ML/AI, it is better to use dedicated accelerators, GPUs and NPUs, which are designed by prioritizing speed over correctness. For general-purpose CPUs, not being fully compliant with the IEEE standard is a serious mistake, because in most cases the consequences of such a choice are impossible to predict, especially for people without experience in floating-point computation, who are the most likely to attempt to bypass the standard.
Regarding CUDA, OpenMP and the like, by definition if some operations are parallelizable, then the order of their execution does not matter. If the order matters, then it is impossible to provide guarantees about the results, on any platform. If the order matters, it is the responsibility of the programmer to enforce it, by synchronization of the parallel threads, wherever necessary.
Whoever wants vectorized code should never rely on programming languages like C/C++ and the like, but they should always use one of the programming language extensions that have been developed for this purpose, e.g. OpenMP, CUDA, OpenCL, where vectorization is not left to chance.
Whether it's the standard's fault or the language's fault for following the standard in a way that prevents auto-vectorization is splitting hairs; the whole point of the standard is to have predictable and usually fairly low-error ways of performing these operations, which only works when the order of operations is defined. That very aim is the problem; to the extent the standard is harmless when ordering guarantees don't exist, you're essentially applying some of those tricky -ffast-math sub-optimizations.
But to be clear in any case: there are obviously cases where order of operations is relevant enough that accuracy-altering reorderings are not valid. It's just that those are rare enough that for many of these features I'd much prefer that to be the opt-in behavior, not opt-out. There's absolutely nothing wrong with having a classic IEEE 754 mode, and I expect it's an essential feature in some niche corner cases.
However, given the obviously huge application of massively parallel processors and algorithms that accept rounding errors (or sometimes, conversely, overly precise results!), clearly most software is willing to accept rounding errors in order to run efficiently on modern chips. It just so happens that none of the languages that map floats to IEEE 754 floats in a straightforward fashion are any good at that, which seems like a bad trade-off.
There could be multiple types of floats instead; or code-local flags that delineate special sections that need precise ordering; or perhaps even expressions that clarify how much error the user is willing to accept and then just let the compiler do some but not all transformations; and perhaps even other solutions.
We have memory ordering functions to let compilers know the atomic operation preference of the programmer… couldn’t we do the same for maths and in general a set of expressions?
AFAIK GPU code is basically always written as scalar code acting on each "thing" separately, which the hardware then, in effect, loops over as a whole, the same way multithreading would (i.e. with no ordering guaranteed at all), so you physically cannot write code that would need operation reordering to vectorize. You just can't write an equivalent of "for (each element in list) accumulator += element;" (or, well, you can, by writing that and running just one thread of it, but that's going to be slower than even the non-vectorized CPU equivalent, assuming the driver respects IEEE 754).
This is slightly obfuscated by not using a keyword like "for" or "do", by the fact that the body of the loop (the "kernel") is written in one place and the header of the loop (which gives the ranges for the loop indices) is written in another place, and by the fact that the loop indices have standard names.
A "parallel for" may have as well a syntax identical with a sequential "for". The difference is that for the "parallel for" the compiler knows that the iterations are independent, so they may be scheduled to be executed concurrently.
NVIDIA has always been greatly annoying in inventing a huge number of new terms that are just new words for concepts used for decades in the computing literature, with no apparent purpose except obfuscating how their GPUs really work. Worse, AMD has imitated NVIDIA by inventing its own terms corresponding to NVIDIA's, but once again different.
Well, all standards are bad when you really get into them, sure.
But no, the problem here is that floating point code is often sensitive to precision errors. Relying on rigorous adherence to a specification doesn't fix precision errors, but it does guarantee that software behavior in the face of them is deterministic. Which 90%+ of the time is enough to let you ignore the problem as a "tuning" thing.
But no, precision errors are bugs. And the proper treatment for bugs is to fix the bugs and not ignore them via tricks with determinism. But that's hard, as it often involves design decisions and complicated math (consider gimbal lock: "fixing" that requires understanding quaternions or some other orthogonal orientation space, and that's hard!).
So we just deal with it. But IMHO --ffast-math is more good than bad, and projects should absolutely enable it, because the "problems" it discovers are bugs you want to fix anyway.
Or just avoiding gimbal lock by other means. We went to the moon using Euler angles, but I don't suppose there's much of a choice when you're using real mechanical gimbals.
What's wrong with fun, safe math optimizations?!
(:
Is any IEEE standards committee working on FP alternatives, for example Unum and Posit [1][2]?
[1] Unum & Posit:
[2] The End of Error:
https://www.oreilly.com/library/view/the-end-of/978148223986...
https://www.forth.com/starting-forth/5-fixed-point-arithmeti...
With 32- and 64-bit numbers, you can just scale decimals up. So, Torvalds was right. In dangerous contexts (super-precise medical doses) FP may have good reasons to exist, but I am not completely sure.
Also, both Forth and Lisp traditions suggest using exact rationals before floating-point numbers. Even the toy Lisps from https://t3x.org have rationals. In Scheme, you have both exact->inexact and inexact->exact, which convert rationals to FP and vice versa.
If you have a Linux/BSD distro, you may already have Guile installed as a dependency.
Hence, run it and then:
scheme@(guile-user)> (inexact->exact 2.5)
$2 = 5/2
scheme@(guile-user)> (exact->inexact (/ 5 2))
$3 = 2.5
Thus, in Forth, I have a good set of q{+,-,*,/} operations for rationals (custom coded, literally four lines) and they work great for a good 99% of cases. As for irrational numbers, NASA used up to 16 decimals, and the old 355/113 can be precise enough for 99.99% of the pieces built on Earth. Maybe not for astronomical distances, but hey...
In Scheme:
scheme@(guile-user)> (exact->inexact (/ 355 113))
$5 = 3.1415929203539825
In Forth, you would just use : pi* 355 113 m*/ ;
with great precision for most of the objects being measured against.

“The problem is how FTZ is actually implemented on most hardware: it is not set per-instruction, but is instead controlled by the floating-point environment: more specifically, by the floating-point control register, which on most systems is set at the thread level: enabling FTZ will affect all other operations in the same thread.”
“GCC with -funsafe-math-optimizations enables FTZ (and its close relation, denormals-are-zero, or DAZ), even when building shared libraries. That means simply loading a shared library can change the results in completely unrelated code, which is a fun debugging experience.”
I find the discussion of -fassociative-math particularly interesting, because I assume that most people writing code that translates a mathematical formula into a simulation will not know which order of operations would be most accurate, and will simply codify their own derivation of the equation being simulated (which could have operations in any order). So if this switch changes your results, it probably means you should take a long hard look at the equations you're simulating and at which ordering gives you the most correct results.
That said I appreciate that the considerations might be quite different for libraries and in particular simulations for mathematics.
Then all other math will be fast-math, except where annotated.
Not sure how that interacts with this fast math thing, I don't use C
ffast-math is sacrificing both the first and the second for performance. Compilers usually sacrifice the first for the second by default with things like automatic FMA contraction. This isn't a necessary trade-off; it's just easier.
There's very few cases where you actually need accuracy down to the ULP though. No robot can do anything meaningful with femtometer+ precision, for example. Instead you choose a development balance between reproducibility (relatively easy) and accuracy (extremely hard). In robotics, that will usually swing a bit towards reproducibility. CAD would swing more towards accuracy.
I find the robotics example quite surprising in particular. I think the precision of most input sensors is less than 16 bits. If your inputs have that much noise on them, how come you need so much precision in your calculations?
I work in audio software and we have some comparison tests that compare the audio output of a chain of audio effects with a previous result. If we make some small refactoring of the code and the compiler decides to re-organize the arithmetic operations then we might suddenly get a slightly different output. So of course we disable fast-math.
One thing we do enable though, is flushing denormals to zero. That is predictable behavior and it saves some execution time.
Previous discussion: Beware of fast-math (Nov 12, 2021, https://news.ycombinator.com/item?id=29201473)
https://www.jviotti.com/2017/12/05/an-introduction-to-adas-s...
http://www.ada-auth.org/standards/22rm/html/RM-3-5-7.html
http://www.ada-auth.org/standards/22rm/html/RM-A-5-3.html
Ada also has fixed point types.
EDIT: I am now reading Goldberg 1991
Double edit: Kahan Summation formula. Goldberg is always worth going back to.
Could be wrong but that’s my gut feeling.
I'm surprised by the take that FTZ is worse than reassociation. FTZ being environmental rather than per instruction is certainly unfortunate, but that's true of rounding modes generally in x86. And I would argue that most programs are unprepared to handle subnormals anyway.
By contrast, reassociation definitely allows more optimization, but it also prohibits you from specifying the order precisely:
> Allow re-association of operands in series of floating-point operations. This violates the ISO C and C++ language standard by possibly changing computation result.
I haven't followed standards work in forever, but I imagine that the introduction of std::fma gets people most of the benefit. That, combined with something akin to volatile (if it actually worked), would probably be good enough for most people. Known numerically sensitive code paths would be carefully written, while the rest of the code base can effectively be "meh, don't care".
> This is perhaps the single most frequent cause of fast-math-related StackOverflow questions and GitHub bug reports
The second line above should settle the first.
If it’s not always correct, whoever chooses to use it chooses to allow error…
Sounds worse than worthless to me.
I've spent most of my career writing trading systems that have executed 100's of billions of dollars worth of trades, and have never had any floating point related bugs.
Using some kind of fixed point math would be entirely inappropriate for most HFT or scientific computing applications.
With fixed point and at least 2 decimal places, 10.01 + 0.01 is always exactly equal to 10.02. But with FP you may end up with something like 10.0199999999, and then you have to be extra careful anywhere you convert that to a string that it doesn't get truncated to 10.01. That could be logging (not great but maybe not the end of the world if that goes wrong), or you could be generating an order message and then it is a real problem. And either way, you have to take care every time you do that, as opposed to solving the problem once at the source, in the way the value is represented.
> Using some kind of fixed point math would be entirely inappropriate for most HFT or scientific computing applications.
In the case of HFT, this would have to depend very greatly on the particulars. I know the systems I write are almost never limited by arithmetical operations, either FP or integer.
If you need to be extremely fast (like FPGA fast), you don't waste compute transforming the fixed-point representation into floating point.
May I ask why? (generally curious)
Should I be running my accounting system on units of 10 billionths of a dollar?
But you probably should run your billing in fixed point or floating decimals with a billionth of a dollar precision, yes. Either that or you should consolidate the expenses into larger bunches.
https://ethereum.stackexchange.com/questions/158517/does-sol...
My best guess for the latter proposition is that people are reacting to the default float printing logic of languages like Java, which display a float as the shortest base-10 number that would correctly round to that value, which extremely exaggerates the effect of being off by a few ULPs. By contrast, C-style printf specifies the number of decimal digits to round to, so all the numbers that are off by a few ULPs are still correct.
[1] I'm not entirely sure about the COBOL mainframe applications, given that COBOL itself predates binary floating-point. I know that modern COBOL does have some support for IEEE 754, but that tells me very little about what the applications running around in COBOL do with it.
If you're summing up the cost of items in a webshop, then you're in the domain of accounting. If the result appears to be off by a single cent because of a rounding subtlety, then you're in trouble, because even though no one should care about that single cent, it will give the appearance that you don't know what you're doing. Not to mention the trouble you could get in for computing taxes wrong.
If, on the other hand, you're doing financial forecasting or computing stock price targets, then you're not in the domain of accounting, and using floating point for money is just fine.
I'm guessing from your post that your finance people are more like the latter. I could be wrong though - accountants do tend to use Excel.
It’s really more of a concern in accounting, when monetary amounts are concrete and represent real money movement between distinct parties. A ton of financial software systems (HFT, trading in general) deal with money in a more abstract way in most of their code, and the particular kinds of imprecision that FP introduces doesn’t result in bad business outcomes that outweigh its convenience and other benefits.
https://learn.microsoft.com/en-us/office/troubleshoot/excel/...
(It nevertheless happens to work just fine for most of what Excel is used for.)
I always have a wrapper class to put the logic of converting to whole currency units when and if needed, as well as when requirements change and now you need 4 digits past the decimal instead of 2, etc.
You can think of fixed point as equivalent to ieee754 floats with a fixed exponent and a two’s complement mantissa instead of a sign bit.
Make it work. Make it right. Make it fast.
Stop trying. Let their story unfold. Let the pain commence.
Wait 30 years and see them being frustrated trying to tell the next generation.
A similar warning applies to -O3. If an optimization in -O3 were to reliably always give better results, it wouldn't be in -O3; it'd be in -O2. So blindly compiling with -O3 also doesn't seem like a great idea.
-Ofast is the 'dangerous' one. (It includes -ffast-math).
I didn't mean to imply that they result in incorrect results.
> they make a more aggressive space/speed tradeoff...
Right...so "better" becomes subjective, depends on the use case, so it doesn't make sense to choose -O3 blindly unless you understand the trade-offs and want that side of them for the particular builds you're doing. Things that everyone wants would be in -O2. That's all I'm saying.
Please don't fulminate.