Creating invariant floating-point accumulators (opens in new tab)

(solidpixel.github.io)

40 pointsPathogen-David1y ago30 comments

30 comments

This seems to keep coming up, and I see confusion in the comments. There is a standard: IEEE 754-2008. There are additional things people add like approximate reciprocals and approximate sqrt. But if you don't use those, and you don't make an association error, you get consistent results.

The question here with association for summation is what you want to match. OP chose to match the scalar for-loop equivalent. You can just as easily make an 8-wide or 16-wide "virtual vector" and use that instead.

I suspect that an 8-wide virtual vector is the right default for people currently, since systems since Haswell support it, all recent AMD, and if you're using vectorization, you can afford to pay some overhead on Arm with a double-width virtual vector. You don't often gain enough from AVX512 to make the default 16-wide, but if you wanted to focus on Skylake+ (really Cascadelake+) or Genoa+ systems, it would be a fine choice.

LegionMammal9781y ago

> OP chose to match the scalar for-loop equivalent.

Isn't it the other way around? The scalar for-loop was changed to match the vector loop's associativity. "To solve this problem for astcenc I decided to change our reference no-SIMD implementation to use 4-wide vectors."

subharmonicon1y ago

The latest standard is from 2019: https://ieeexplore.ieee.org/document/8766229

There is still some flexibility in implementation, for example how and whether FMAs are formed for a given compiler when FP_CONTRACT is ON, and in the standard itself in things like when tininess is detected.

But to your point if you stick to the basic operations in the standard, and don’t enable FP_CONTRACT and FENV_ACCESS in C/C++, have a bug free compiler and don’t use fast-math, you’re good to go.

[edit to add a caveat about compile-time constant folding which is a whole can of worms]

[edit again to point out that the C/C++ standards allow for implementations to compute intermediate results at higher precision, so a compliant implementation can use all 80 bits on x87 when computing expressions]

kardos1y ago

Exact floating point accumulating is more or less solved with xsum [1] -- would it work in this context?

[1] https://gitlab.com/radfordneal/xsum

lifthrasiir1y ago

Because it would be slower when the exact calculation is not necessary. The xsum paper does have performance numbers, but all of them came from at least decade-old processors and almost every result indicates that superaccumulators are still 4--8x slower than the simple sum (but faster than the traditional Kahan summation). Superaccumulators require extensive scatter-gather operations due to its large memory footprint and I think the gap should have been even widen today as they would be harder to vectorize efficiently.

kardos1y ago

My takeaway from the linked post is that the author is more concerned with floating point invariance across platforms than speed (although improvements in speed are of course welcome).

If the data confined to a certain range of exponents, one could reduce the size of the accumulator, perhaps significantly.

Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

lifthrasiir1y ago

The author wants the best possible performance as long as its result is reproducible. I think this is evident from the fact that the non-SIMD code uses 4-wide vectors, so the author is definitely willing to trade accuracy if higher performance with a reproducible result is possible.

> Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

I don't know where you did take that number, because xsum-paper.pdf clearly indicates larger performance difference. I'm specifically looking at the ratio between the minimum of any superaccumulator results and the minimum of any simple sum results, and among results I think relevant today (x86-64, no earlier than 2012), AMD Opteron 6348 is the only case where the actual difference is only about 1.5x and everything else hovers much higher.

waynecochran1y ago

Invariance w floating point arithmetic seems like a fool's errand. If the numbers one is working with are roughly on the same order of magnitude than I would consider integer / fixed point instead. You get the same results in this case (as long as you are careful).

someguydave1y ago

Seems crazy to try to paper over hardware implementation differences in software. Some org should be standardizing floating point intrinsics

eternityforest1y ago

I wonder if we'll eventually just have all the common platforms switch to RISCV?

baq1y ago

see also streflop (2006)

https://nicolas.brodu.net/en/programmation/streflop/

modulovalue1y ago

I'm still wondering if there could exist an alternative world where efficient addition over decimal numbers that we developers use on a day to day basis is associative. Is that even possible or is there perhaps some fundamental limit that forces us to trade associativity for performance?

It seems to me that non associative floating point operations force us into a local maximum. The operation itself might be efficient on modern machines, but could it be preventing us from applying other important high level optimizations to our programs due to its lack of associativity? A richer algebraic structure should always be amenable to a richer set of potential optimizations.

---

I've asked a question that is very much related to that topic on the programming language subreddit:

"Could numerical operations be optimized by using algebraic properties that are not present in floating point operations but in numbers that have infinite precision?"

https://www.reddit.com/r/ProgrammingLanguages/comments/145kp...

The responses there might be interesting to some people here.

bee_rider1y ago

I’m not 100% clear on what you are asking, but integer addition is associative, right? (With quibbles about over/underflow of course). If you have some limited range of decimal numbers that you care about, you can always just use integers and account for the shifted decimal places as needed.

Floats are mostly for when you need that dynamic range.

daemin1y ago

It could also be that a lot of languages don't actually have integer types and just use a 64bit floating point number instead - ala JavaScript etc.

baq1y ago

worth mentioning - js has had a built-in bigint type for quite a while now:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

the problem is it's marginally useful, since everyone expects number

bee_rider1y ago

Really? That seems nuts. How do they index an array?

3 more replies

hansvm1y ago

> I'm still wondering if there could exist an alternative world where efficient addition over decimal numbers that we developers use on a day to day basis is associative. Is that even possible or is there perhaps some fundamental limit that forces us to trade associativity for performance?

You're always welcome to use a weaker notion of associativity than bitwise equality (e.g., -ffast-math pretends many operations are associative to reorder them for speed, and that only gives approximately correct results on well-conditioned problems).

In general though, yes, such a limit does exist. Imagine, for the sake of argument, an xxx.yyy fixed-point system. What's the result of 100 * 0.01 * 0.01? You either get 0.01 or 0, depending on where you place the parentheses.

The general problem is in throwing away information. Trashing bits doesn't necessarily mean your operations won't be associative (imagine as a counter-example the infix operator x+y==1 for all x,y). It doesn't take many extra conditions to violate associativity though, and trashed bits for addition and multiplication are going to fit that description.

How do you gain associativity then? At a minimum, you can't throw information away. Your fast machine operations use an unbounded amount of RAM and don't fit in registers. Being floating-point vs fixed-point only affects that conclusion in extremely specialized cases (like only doing addition without overflow -- which sometimes applies to the financial industry, but even then you need to think twice about the machine representation of what you're doing).

modulovalue1y ago

> How do you gain associativity then? At a minimum, you can't throw information away.

That's an interesting perspective that I haven't considered before, thank you.

Now I'm wondering, could we throw away some information in just the right way and still maintain associativity? That is, it doesn't seem like throwing information away is fundamentally what's preventing us from having an associative operation, since we can throw information away and still maintain associativity by, for example, converting each summand to a 0 and adding them, and that operation would be associative. However, we would have thrown all information away, which is not useful, but we would have an associative operation.

zokier1y ago

Herbie, a fp optimizer, might be of interest to you

https://herbie.uwplse.org/

tialaramex1y ago

There's a link in there to future directions for Herbie which talks about the intriguing idea of out-sourcing the translation of a high level "I want to do this real number math" to the lower level "Here's some floating point arithmetic" via Herbie.

That is, the physicist writes the two line equation they want for electromagnetic force into their program, the same way they'd write a for-each style loop in the program if that's what they needed.

Obviously the CPU doesn't understand how to compute the appropriate approximation for this electromagnetic force equation, but nor does it understand how to iterate over each item in a container. Tools convert the for-each loop into machine code, why shouldn't other, smart, tools convert the physicist's equation into the FP instructions ?

Today the for-each loop thing just works, loads of programming languages do that, if a language can't do it (e.g. C) that's because it is old or only intended for experts or both.

But every popular language insists that physicist should laboriously convert the equation into the code to compute an approximation, which isn't really their skill set, so why not automate that problem?

jcranmer1y ago

> Could numerical operations be optimized by using algebraic properties that are not present in floating point operations but in numbers that have infinite precision?

... are you not aware of -ffast-math? There are several fast-math optimizations that are basically "assume FP operations have this algebraic property, even though they don't" (chiefly, -fassociative-math assumes associative and distributive laws hold).

j / k navigate · click thread line to collapse

30 comments

boulos1y ago

LegionMammal9781y ago

> OP chose to match the scalar for-loop equivalent.

subharmonicon1y ago

The latest standard is from 2019: https://ieeexplore.ieee.org/document/8766229

[edit to add a caveat about compile-time constant folding which is a whole can of worms]

kardos1y ago

Exact floating point accumulating is more or less solved with xsum [1] -- would it work in this context?

[1] https://gitlab.com/radfordneal/xsum

lifthrasiir1y ago

kardos1y ago

My takeaway from the linked post is that the author is more concerned with floating point invariance across platforms than speed (although improvements in speed are of course welcome).

If the data confined to a certain range of exponents, one could reduce the size of the accumulator, perhaps significantly.

Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

lifthrasiir1y ago

> Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

waynecochran1y ago

someguydave1y ago

Seems crazy to try to paper over hardware implementation differences in software. Some org should be standardizing floating point intrinsics

eternityforest1y ago

I wonder if we'll eventually just have all the common platforms switch to RISCV?

baq1y ago