Why is 2 * (i * i) faster than 2 * i * i in Java?

(stackoverflow.com)

424 points · trequartista 7y ago · 100 comments

userbinator 7y ago

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In my experience, loop unrolling should basically never be done except in extremely degenerate cases; I remember not long ago someone I know who also optimises Asm remarking "it should've died along with the RISC fad". The original goal was to reduce per-iteration overhead associated with checking for end-of-loop, but any superscalar/OoO/speculative processor can "execute past" those instructions anyway; all that unrolling will do is bloat the code and work against caching. Memory bandwidth is often the bottleneck, not the core.

pcwalton 7y ago

> In my experience, loop unrolling should basically never be done except in extremely degenerate cases

Not true. Like many such optimizations, loop unrolling can be useful because it makes downstream loads constant.

For example:

    float identity[4][4];
    for (unsigned y = 0; y < 4; y++)
        for (unsigned x = 0; x < 4; x++)
            identity[y][x] = y == x ? 1 : 0;
    ... do some matrix math ...

In this case, the compiler probably wants to unroll the loops so that it can straightforwardly forward the constant matrix entries directly to the matrix arithmetic. It'll likely be able to eliminate lots of operations that way.

(You might ask "who would write this code?" As Schemers say: "macros do.")

See LLVM's heuristics: http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d7...
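
As a rough illustration of the constant-forwarding point, here is a minimal Java sketch. The class and method names are invented for this example, and it uses `int` rather than `float` so that the folded form is exactly equivalent (with floats, NaN, infinities and signed zero limit what a compiler may fold). It only shows the before and after shapes; it does not claim that any particular compiler produces them.

```java
// Hypothetical illustration; class and method names are invented for this sketch.
final class IdentityUnrollSketch {
    // As written: build a 4x4 identity with loops, then multiply a vector by it.
    static int[] timesIdentityLooped(int[] v) {
        int[][] identity = new int[4][4];
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                identity[y][x] = (y == x) ? 1 : 0;

        int[] out = new int[4];
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                out[y] += identity[y][x] * v[x];
        return out;
    }

    // What full unrolling plus constant propagation can reduce it to:
    // every identity[y][x] is a known 0 or 1, so the products fold away.
    static int[] timesIdentityFolded(int[] v) {
        return new int[] { v[0], v[1], v[2], v[3] };
    }

    public static void main(String[] args) {
        int[] v = { 3, -7, 0, 42 };
        System.out.println(java.util.Arrays.toString(timesIdentityLooped(v)));
        System.out.println(java.util.Arrays.toString(timesIdentityFolded(v)));
    }
}
```

Both methods return the same array for any 4-element input; the second is only reachable by an optimizer once the loops are unrolled and each matrix entry is a known constant.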

kannanvijayan 7y ago

> (You might ask "who would write this code?" As Schemers say: "macros do.")

To expand on this point - in the more prosaic world of C++ - this sort of code comes about all the time in templated code. For example, the above loop you posted might have been found in something like:

```
template <unsigned N, unsigned M>
class Matrix {
public:
    static Matrix Identity() { ... }
};

auto m = Matrix<4, 4>::Identity();
```

The other major source of these sorts of constants leading to DCE opportunities is inlining. Consider a more classical matrix implementation that is not templated and doesn't lift its dimensions into the type:

```
class Matrix {
    unsigned n;
    unsigned m;
public:
    static Matrix Identity(unsigned n, unsigned m) { ... }
};

// Somewhere else
Matrix m = Matrix::Identity(4, 4);
```

Here, the inlining of the call to `Identity` at the call-site will turn the `n` and `m` in the body of `Identity` into the constant 4.

If I had to make an educated guess, inlining is what most often creates these situations (i.e. partial evaluation, constant folding, and DCE) in compilers. An incredible amount of information can flow from caller to callee when you specialize the callee for that call-site.

bjoli 7y ago

I didn't understand dead code elimination until I wrote enough macros. It is a lot easier to generate code and have the optimizer fix it than to make sure to always generate efficient code.

This is also how compilers do things; it is just that we Schemers can see the intermediate result much more easily using simple source->source transformations.

stochastic_monk 7y ago

I had a case where I was counting occurrences of 8-bit integer values. Manual loop unrolling provided a 33% speed up.
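
The comment does not say how the loop was unrolled, so the following Java sketch shows only one plausible form: a 4-way unrolled byte histogram that writes to four separate count tables, so that repeated values in a row do not all update the same counter. All names are hypothetical, and whether this is faster depends on the JIT and the hardware.

```java
// Hypothetical illustration; not the commenter's actual code.
final class ByteHistogramSketch {
    // Baseline: one table, one increment per input byte.
    static int[] countSimple(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) counts[b & 0xFF]++;
        return counts;
    }

    // 4-way unrolled variant with independent tables, merged at the end.
    static int[] countUnrolled(byte[] data) {
        int[] c0 = new int[256], c1 = new int[256], c2 = new int[256], c3 = new int[256];
        int i = 0;
        int limit = data.length - 3;          // last index where i + 3 is still in range
        for (; i < limit; i += 4) {
            c0[data[i] & 0xFF]++;
            c1[data[i + 1] & 0xFF]++;
            c2[data[i + 2] & 0xFF]++;
            c3[data[i + 3] & 0xFF]++;
        }
        for (; i < data.length; i++) c0[data[i] & 0xFF]++;   // leftover 0..3 bytes
        for (int v = 0; v < 256; v++) c0[v] += c1[v] + c2[v] + c3[v];
        return c0;
    }
}
```

Both methods return identical counts; only a measurement on the target JVM can say whether the unrolled one pays off.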

tomsmeding 7y ago

In addition to the optimisations already mentioned, loop unrolling also typically enables vectorisation in compilers. You might argue that for vectorisation it is not exactly necessary to have the relevant operations next to each other in a continuous instruction stream, but it makes the vectorisation pass a lot nicer and simpler (if it can be called that to begin with).

Sharlin 7y ago

Assuming you unroll just enough to fill a SIMD lane. As mentioned, in this case aggressive (16-fold) unrolling actually appears to have prevented vectorization. (A smart enough vectorizer could of course handle this but unrolling just to ”re-roll” in a later pass doesn’t sound very smart.)

nonsince 7y ago

Loop unrolling is useful only when it enables other optimisations. The most common is constant folding.

dragontamer 7y ago

Fully agree.

In most cases on modern systems, small loops should remain as compact as possible, to stay in the uop cache. The "for" loop overhead (the inc, cmp, and jmp instructions) effectively executes in parallel. Modern systems are highly out-of-order and the for-loop overhead is virtually nil.

BeeOnRope 7y ago

Actually unrolling is often very important. In some cases it is even more important with modern high speed out-of-order cores. For example, you might need several accumulators to handle instruction or memory latency.

For small loops, unrolling is the most important of all, since loop carried dependency chains are dense, and the loop overhead is a high fraction of the overall work.

It is easy to get a 2x speedup by unrolling a small loop, and even larger speedups are not uncommon.

So this "unrolling rarely helps" idea is just as much of a myth as "unrolling never helps". The main problem with unrolling is that compilers usually don't do it intelligently: loops are usually unrolled if some kind of threshold is met, depending on compiler options, but this always happens in a feed-forward way rather than a feed-back way, which would involve unrolling the loop and analyzing the benefits and costs after further optimization passes.
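
The "several accumulators" idea mentioned above can be sketched in Java as follows. The names are hypothetical and no speedup is claimed here; the point is only the shape of the transformation, where four independent partial sums replace one loop-carried dependency chain.

```java
// Hypothetical illustration of unrolling with multiple accumulators.
final class MultiAccumulatorSketch {
    // One accumulator: every addition depends on the previous one.
    static long sumSingle(int[] a) {
        long s = 0;
        for (int x : a) s += x;
        return s;
    }

    // Four accumulators: four independent chains that can overlap in flight.
    static long sumFourAccumulators(int[] a) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        int limit = a.length - 3;             // last index where i + 3 is still in range
        for (; i < limit; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < a.length; i++) s0 += a[i]; // leftover 0..3 elements
        return s0 + s1 + s2 + s3;
    }
}
```

With integer addition the two methods always agree, because the operation is associative. With `float` or `double` the regrouping can change rounding, which is one reason a compiler may not apply it on its own.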

nwmcsween 7y ago

Well, I've seen the opposite: try a naive string function of some sort (strlen, etc.), then manually unroll it.

astn 7y ago

Yeah, sometimes the compiler unrolls too much and an innocent-looking one-liner can be compiled into a monstrosity like this:

https://godbolt.org/z/aKtko5

londons_explore 7y ago

Are there no compilers which attempt to look at that code, decide 'that looks like 1<<(num-2) when n>=2', and replace the code entirely?

There must be so many examples of bubble sort where quicksort would be better, and other code patterns which can be identified and replaced with something orders of magnitude faster.

pertymcpert 7y ago

Is that more performant?

jepler 7y ago

You should translate your program to C++ and build with clang; it turns the loop into a single constant load. https://godbolt.org/z/slznbU

cryptonector 7y ago

Did you read TFA? The author did that (though using GCC), and the reason the optimizer does what you see is undefined behavior due to signed integer overflow.

kccqzy 7y ago

Did you understand the comment? The author used GCC, and GCC is only able to vectorize the loop. But clang on the other hand, essentially turned this O(n) algorithm to calculate a particular sum into an O(1) result.

> the reason the optimizer does what you see is undefined behavior due to signed integer overflow

Yes undefined behavior gives the optimizer the right in this case to transform the code into anything, including a nonsense answer, or a trap instruction. But the optimizer did not; it produced the right answer under 2's complement arithmetic.

xeeeeeeeeeeenu 7y ago

There is no undefined behaviour, '#pragma GCC optimize("wrapv")' takes care of that.

EDIT: It seems that clang doesn't support #pragma GCC optimize, so it's a no-op in that snippet. It doesn't change the result though. If you pass -fwrapv flag to clang, it will be optimized in exactly the same way.

geezerjay 7y ago

> does what you see is undefined behavior

Just to be clear, undefined behavior means the standard allows implementations to do what they feel is the right thing to do under that scenario, and the outcome will still comply with the standard.

utopcell 7y ago

GCC applies the same optimization and compiles to a single constant if overflow does not occur. To see this, you can change '1000000000' to '1000'.

jepler 7y ago

Updated version that specifies -fwrapv on the commandline to turn the integer overflow into defined behavior: https://godbolt.org/z/K-Ijl0

ychen306 7y ago

It's usually a good idea to turn the loop bound into a variable when benchmarking a compiler, lest it optimize the whole thing away like in this case.

archgoon 7y ago

Nope; doesn't work for clang. Clang actually detects and compiles the algebraic closed form sum(i^2, n) for a bound n.
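
To make the closed form concrete, here is a Java sketch of the algebra only. It is not how clang implements the transformation, and the class and method names are invented. It relies on the identity sum of 2*i*i for i in [0, n) = (n-1)*n*(2n-1)/3, and uses `BigInteger` so the exact value can be reduced to its low 32 bits, which is what Java's wrapping `int` loop produces.

```java
import java.math.BigInteger;

// Hypothetical illustration of replacing an O(n) summation loop with a closed form.
final class ClosedFormSketch {
    // Reference: the loop from the question, with Java's wrapping int arithmetic.
    static int sumLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += 2 * i * i;
        return sum;
    }

    // Closed form: (n-1) * n * (2n-1) / 3, computed exactly, then truncated to 32 bits.
    static int sumClosedForm(int n) {
        if (n <= 0) return 0;
        BigInteger bigN = BigInteger.valueOf(n);
        BigInteger exact = bigN.subtract(BigInteger.ONE)
                .multiply(bigN)
                .multiply(bigN.shiftLeft(1).subtract(BigInteger.ONE))
                .divide(BigInteger.valueOf(3));   // always divides evenly
        return exact.intValue();                  // keeps the low-order 32 bits
    }

    public static void main(String[] args) {
        System.out.println(sumLoop(1000));        // 665667000
        System.out.println(sumClosedForm(1000));  // 665667000
    }
}
```

The two methods also agree when the sum overflows, for example at n = 100000, because both results are the exact sum reduced modulo 2^32.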

Too 7y ago

So if the compiler is too good you want to trick it into producing less optimal code so you can benchmark it fairly? Isn't it part of the benchmark to allow the compiler to reduce the whole expression to a compile-time constant?

beeforpork 7y ago

With all the optimisations being implemented in compilers today, it is impressive to see how this opportunity to optimise is missed. Put differently, compiler writers bother about optimisations that gain 0.1% performance in some special cases, but others that could gain 20% performance are not implemented.

Why? Is this optimisation particularly difficult to implement? Or is it just missed low-hanging fruit? It sure looks easy (like: rearrange expressions to keep the expression tree shallow and left-branching to avoid stack operations).

acdha 7y ago

Compiler developers have tons of benchmarks which they run. I’d bet that this is as simple as not being significant in their test suite, with a good chance that either it’s not as simple as it might seem or there are impacts on more complicated code which is in their benchmark suite or a big customer’s app.

DannyBee 7y ago

The truth is that the HotSpot compiler is pretty old at this point and never really implemented a lot of good, robust, and thorough optimizations (I've read the source every year or two). It does some stuff and hopes for the best.

This is why there is a real commercial JVM market with Azul.

yifanl 7y ago

It's possible that they're working in the frame of mind that there aren't any low-hanging fruit left after so many years of compiler optimizations and forget to even try.

techopoly 7y ago

That just might be the most dedicated answer I've ever seen on Stack Overflow.

azhenley 7y ago

It is a good answer, but my favorite by far is an answer about branch prediction to explain why processing a sorted array is faster than unsorted: https://stackoverflow.com/q/11227809/938695

fma 7y ago

I find it interesting that there are developers out there who know to look at these nuances when responding to Stack Overflow questions. I've been developing professionally for 10 years and probably went over branch prediction in my computer architecture class in college (I'm guessing I did; if I didn't, then I never encountered it at all!).

The person who answered the multiplication question dove into bytecode...but also answered questions on Angular.

I am unworthy...and this is what impostor syndrome looks like.

garmaine 7y ago

Would be awesome if that answer was updated to explain Spectre (it’s 85% of the way there).

dopamean 7y ago

Wow that was a great read.

foobaw 7y ago

truly full-stack!

falcor84 7y ago

It really was, but as others mentioned, there's a lot of really good stuff on Stack Overflow and Stack Exchange in general. This is my favorite:

https://codegolf.stackexchange.com/questions/11880/build-a-w...

pmarreck 7y ago

Oh my god.

https://copy.sh/life/?pattern=TetrisOTCAMP.mc

OH MY GOD!

My eyes are watering and I can’t stop deeply chuckling at the sheer collaborative esoteric audacity

Veedrac 7y ago

Then you'll love https://stackoverflow.com/questions/37361145/deoptimizing-a-...

pmarreck 7y ago

Takeaway phrases from this I love:

“Pessimization”

“Diabolical incompetence”

dreamcompiler 7y ago

I thought at first this was because integer squaring is potentially faster than general integer multiplication and the compiler wasn't seeing the square operation in the second case, but that's not the explanation here.

garmaine 7y ago

There isn’t an integer square opcode on any major processor architecture though, right?

dreamcompiler 7y ago

Not that I know of. It's not really worth it for short integers (64 bits or less). But it's helpful with bignums.

ww520 7y ago

I'm surprised it's not doing a left shift for the x2.

jcdavis 7y ago

It is in the first example (the sal instruction)

DannyBee 7y ago

However, if you look at the second, you won't see any left shifts, which is also interesting
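
For reference, the multiply-by-two and the left shift discussed here are interchangeable for Java `int` values, including when the result overflows. A small sketch with invented names:

```java
// Hypothetical illustration: 2 * x and x << 1 produce the same int for every x.
final class ShiftVsMultiplySketch {
    static int timesTwoMultiply(int x) {
        return 2 * x;      // keeps the low 32 bits of the mathematical product
    }

    static int timesTwoShift(int x) {
        return x << 1;     // shifts the same 32 bits left by one position
    }
}
```

Since the two are equivalent, a JIT is free to pick either encoding; which one it emits says nothing about the source expression.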

podsnap 7y ago

The graal behavior is a lot more sane:

    graal:
    [info] SoFlow.square_i_two   10000  avgt   10  5338.492 ± 36.624  ns/op   // 2 *\sum i * i
    [info] SoFlow.two_i_         10000  avgt   10  6421.343 ± 34.836  ns/op   // \sum 2 * i * i
    [info] SoFlow.two_square_i   10000  avgt   10  6367.139 ± 34.575  ns/op   // \sum 2 * (i * i)
    regular 1.8:
    [info] SoFlow.square_i_two   10000  avgt   10  6393.422 ± 27.679  ns/op
    [info] SoFlow.two_i_         10000  avgt   10  8870.908 ± 35.715  ns/op
    [info] SoFlow.two_square_i   10000  avgt   10  6221.205 ± 42.408  ns/op

The graal-generated assembly for `two_i_` and `two_square_i` is nearly identical, featuring unrolled repetitions of sequences like

    [info]   0x000000011433ec03: mov    %r8d,%ecx
    [info]   0x000000011433ec06: shl    %ecx               ;*imul {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@15 (line 41)
    [info]   0x000000011433ec08: imul   %r8d,%ecx          ;*imul {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@17 (line 41)
    [info]   0x000000011433ec0c: add    %ecx,%r9d          ;*iadd {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@18 (line 41)
    [info]   0x000000011433ec0f: lea    0x5(%r11),%r8d     ;*iinc {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@20 (line 40)

while `square_i_two` does a single shl at the end.

    [info]   0x000000010e2918bb: imul   %r8d,%r8d          ;*imul {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_square_i_two@15 (line 32)
    [info]   0x000000010e2918bf: add    %r8d,%ecx          ;*iadd {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_square_i_two@16 (line 32)
    [info]   0x000000010e2918c2: lea    0x3(%r11),%r8d     ;*iinc {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_square_i_two@18 (line 31)

Both graal and C2 inline, but as usual the graal output is a lot more comprehensible.
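
The benchmark source is not shown in the comment, so the following is a guessed reconstruction of the three loop bodies from the trailing comments in the timing table (`2 * sum i * i`, `sum 2 * i * i`, `sum 2 * (i * i)`). The class name and camel-cased method names are assumptions, and the JMH harness is left out so the sketch runs on a plain JDK.

```java
// Hypothetical reconstruction of the three variants being timed; not the original benchmark code.
final class SoFlowSketch {
    // "2 * sum i * i": square inside the loop, double once at the end.
    static int squareITwo(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += i * i;
        return 2 * sum;
    }

    // "sum 2 * i * i": evaluated left to right as (2 * i) * i.
    static int twoI(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += 2 * i * i;
        return sum;
    }

    // "sum 2 * (i * i)": square first, then double, on every iteration.
    static int twoSquareI(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += 2 * (i * i);
        return sum;
    }
}
```

All three return the same `int` for every `n`, even after overflow, so any timing difference comes from the generated code rather than from the result.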

bnegreve 7y ago

I don't see how generating different code for the same mathematical expression can be a good thing.

The compiler should detect that the two expressions are strictly equivalent and generate whatever code it believes is the fastest.

Any idea why it is this way?

gnuvince 7y ago

Because of integer overflows and floating-point operations, the notion of equivalent mathematical expressions is tricky.

    fn main() {
        let a: i8 = 125;
        let b: i8 = 3;
        let c: i8 = (a + b) / 2;
        let d: i8 = b + ((a - b) / 2);
        println!("{} {}", c, d);
    }

This program outputs `-64 64` although the computations of `c` and `d` are equivalent.
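
The same effect can be reproduced in Java with `int`, where overflow is defined to wrap. This is a sketch with invented names, using operands chosen so that `a + b` exceeds `Integer.MAX_VALUE`:

```java
// Hypothetical illustration: two "equivalent" midpoint formulas diverge under overflow.
final class MidpointOverflowSketch {
    static int midpointNaive(int a, int b) {
        return (a + b) / 2;          // a + b may wrap around before the division
    }

    static int midpointSafe(int a, int b) {
        return b + ((a - b) / 2);    // stays in range when a >= b >= 0
    }

    public static void main(String[] args) {
        int a = Integer.MAX_VALUE - 2;   // 2147483645
        int b = 3;
        System.out.println(midpointNaive(a, b));  // -1073741824
        System.out.println(midpointSafe(a, b));   // 1073741824
    }
}
```

A compiler that rewrote one form into the other would change the program's observable output, so it is not allowed to treat them as the same expression.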

Here's another example using floating point numbers:

    fn main() {
        let mut total1: f32 = 0.0;
        let mut total2: f32 = 0.0;
        let mut counter1: f32 = 0.0;
        let mut counter2: f32 = 100.0;

        for _ in 0 .. 10001 {
            total1 += counter1;
            total2 += counter2;
            counter1 += 0.01;
            counter2 -= 0.01;
        }
        println!("{} {}", total1, total2);
    }

The output of this program is `500041.16 500012.16`, a difference of 29 for a program that computes the same result (unless I made a mistake).

bnegreve 7y ago

Right! thanks

liftbigweights 7y ago

The difference is that with fp ops, it's part of the design and understood that you should never directly compare the equality of fp numbers since they are estimates. You should check for equality of fp numbers by checking their difference according to your needs.

Whereas for int ops, equality works within the limits of the design.

In short equality means something different in fp by design. For int, it means what we think it means within its limits. When we overflow, then things get screwy.

fragmede 7y ago

Ideologically, yes, the compiler should generate the fastest code possible for the same math expression.

However, the compiler ('s optimization step) is not magic and produces suboptimal code sometimes. Back when C was young, this was frequently the case (1970's and 1980's), so dropping into assembly to hand-code performance critical sections is just what people did, in order to get software to run smoothly.

Thankfully this is largely no longer the case, however it does still happen.

In Java's case, the JVM runs on top of multiple different architectures which makes optimization even more complicated.

Low-level instruction generation and optimization is just one topic under the umbrella of compiler design, which is a huge (and fascinating!) discipline to get into.

pjmlp 7y ago

While true, and I still remember those days, when many C code bases were full of functions whose body was a big asm { ... } block, there were also compilers that were much better at optimization than those C compilers were.

"An overview of the PL.8 compiler"

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.453...

Notice the architecture, quite similar to the layers and compiler phases used in modern compiler toolchains like LLVM.

The secret sauce, if one can call it as such, was that PL.8 had a richer type system, and the System/370 was a bit beefier than most platforms adopting C compilers.

bnegreve 7y ago

> However, the compiler ('s optimization step) is not magic and produces suboptimal code sometimes.

I agree that compilers are not always perfect, but in this particular case the two expressions are trivially equivalent from the associativity of the multiplication so the distinction had to be intentional.

But as gnuvince pointed out, the two expressions are not equivalent when you consider integer overflow.

amelius 7y ago

Because it's more work for the compiler to reduce the expression to something canonical (and it might even be impossible).

Also what good will it bring? What if the canonical expression triggers the slow path? Now you have no means to change it into the fast version.

Further, in the case of floating point operations, operation order matters for rounding. And with integer operations, the actual form used can be important for preventing overflow (of intermediate results).
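
A small Java sketch of the rounding point, with invented names. The operands are chosen so that one grouping loses the `1f` entirely: near 1e8 adjacent `float` values are 8 apart, so `-1e8f + 1f` rounds back to `-1e8f`.

```java
// Hypothetical illustration: float addition is not associative.
final class FloatOrderSketch {
    static float leftFirst(float a, float b, float c) {
        return (a + b) + c;
    }

    static float rightFirst(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        System.out.println(leftFirst(1e8f, -1e8f, 1f));   // 1.0
        System.out.println(rightFirst(1e8f, -1e8f, 1f));  // 0.0
    }
}
```

Because the two groupings give different answers, a compiler that must preserve floating-point results cannot reassociate the additions on its own.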

crb002 7y ago

TIL about printing ASM from debug JVMs.

pjmlp 7y ago

If you use Oracle Studio you can even see it in the IDE.

https://www.youtube.com/watch?v=_cFwDnKvgfw

There are also other tools like JITWatch.

https://github.com/AdoptOpenJDK/jitwatch/wiki/Videos-and-Sli...

https://vimeo.com/181925278

alkonaut 7y ago

Is Overflow UB so the compiler can choose to ignore the fact that 2x(i x i) could overflow differently from 2 x i x i?

I’m not sure it does overflow differently but I would expect overflow to behave consistently as written, and not be dependent on optimization, is that not the case?

BeeOnRope 7y ago

Nothing you can do in pure Java code is UB in the C/C++ sense.

alkonaut 7y ago

Without UB it must be very hard for the compiler to optimize arithmetic. Even obvious things like (2 x A) x B vs 2 x (A x B) are only equivalent without overflow. I guess it can be specified as being up to the jitter to decide - so not UB but not known from looking at the source either? Would be interesting to know what .NET and Java specifications say on it
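
On the Java side of that question: `int` multiplication is specified to keep the low-order 32 bits of the mathematical product, which makes it arithmetic modulo 2^32. Multiplication modulo 2^32 is still associative, so for this particular pair of expressions the grouping does not change the result even when overflow occurs. A sketch with invented names that checks a few overflowing inputs:

```java
// Hypothetical illustration: with wrapping int arithmetic, (2 * i) * i == 2 * (i * i).
final class WrapAroundSketch {
    static int leftToRight(int i) {
        return 2 * i * i;      // parsed as (2 * i) * i
    }

    static int grouped(int i) {
        return 2 * (i * i);
    }

    public static void main(String[] args) {
        int[] samples = { 0, 1, -1, 46341, 1_000_000_000, Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int i : samples) {
            System.out.println(i + ": " + (leftToRight(i) == grouped(i)));
        }
    }
}
```

This holds for multiplication only. Mixed expressions involving division or comparison, such as the midpoint formulas earlier in the thread, do change meaning under overflow. Code that wants overflow to be an error rather than a wrap can use `Math.multiplyExact`, which throws `ArithmeticException`.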

Koshkin 7y ago

At first, I thought it was because i * i == -1.

microcolonel 7y ago

I guess they do not use value numbering, which is typically how you get equivalent results for cases like this.

qwerty456127 7y ago

IMHO some kind of logic preprocessor should take care of this before the actual compilation.

isbvhodnvemrwvn 7y ago

How? Java is compiled to bytecode; you don't know the architecture of the system the code is going to run on. It's one of the reasons javac only implements the simplest optimizations possible (constant folding and the like).
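
One way to observe javac's constant folding without reading bytecode is through string constants: a concatenation of compile-time constants is itself a compile-time constant and is interned, while a concatenation involving an ordinary variable is built at run time as a new object. The class and method names below are invented for this sketch, and the `==` comparisons are deliberate identity checks, not something to imitate in real code.

```java
// Hypothetical illustration: constant folding made visible through String identity.
final class ConstantFoldingSketch {
    static boolean foldedAtCompileTime() {
        // "a" + "b" is a constant expression, so it refers to the same
        // interned String object as the literal "ab".
        return ("a" + "b") == "ab";
    }

    static boolean concatenatedAtRuntime() {
        String a = "a";               // not final, so not a constant variable
        // a + "b" is evaluated at run time and yields a newly created String.
        return (a + "b") == "ab";
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime());    // true
        System.out.println(concatenatedAtRuntime());  // false
    }
}
```

Anything beyond this kind of folding, including the loop transformations discussed in the rest of the thread, is left to the JIT at run time.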

pjmlp 7y ago

Compiling to bytecode is just one of the possibilities.

Since the early days of Java, OEM vendors targeting embedded targets do support AOT compilation, with possible PGO feedback.

Some vendors like IBM, also provide similar capabilities on their regular Java toolchains.

And Maxine finally graduated as Graal/Substrate, which is also another way of compiling Java.

But all in all, everyone is transitioning to the benefits of bytecode as intermediate executable format.

Even some cool LLVM optimizations, like ThinLTO, are only possible thanks to using bytecode.

polskibus 7y ago

I wonder if the same applies to .net (fx/core).

pjmlp 7y ago

Depends on the runtime.

You have the old JIT, replaced by RyuJIT on .NET 4.6 and .NET Core.

Then .NET Native, which does AOT compilation via the same backend as Visual C++.

Followed by Mono's JIT/AOT implementation.

Windows/Windows Phone 8.x used a Bartok derived compiler for the MDIL format.

Same applies to Java though, as the answer only goes through what Hotspot does, but there are many other JIT/AOT compilers for Java as well.

networkimprov 7y ago

Has anyone tried this with Go?

saagarjha 7y ago

Tried what specifically? This particular example, or something similar where the compiler generates code with different speeds for seemingly equivalent code?

pmarreck 7y ago

No, because come back when you’re a real language with a runtime error handler

networkimprov 7y ago

Working on it!

Requirements to Consider for Go 2 Error Handling

https://gist.github.com/networkimprov/961c9caa2631ad3b95413f...

sabujp 7y ago

thank you for this!

JohnL4 7y ago

The database is fast enough for a few extra trips to it, so this is definitely what we should be focusing on.

(My cup of bitterness doth overflow.)

j / k navigate · click thread line to collapse

100 comments

userbinator7y ago

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

pcwalton7y ago

> In my experience, loop unrolling should basically never be done except in extremely degenerate cases

Not true. Like many such optimizations, loop unrolling can be useful because it makes downstream loads constant.

For example:

    float identity[4][4];
    for (unsigned y = 0; y < 4; y++)
        for (unsigned x = 0; x < 4; x++)
            identity[y][x] = y == x ? 1 : 0;
    ... do some matrix math ...

(You might ask "who would write this code?" As Schemers say: "macros do.")

See LLVM's heuristics: http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d7...

kannanvijayan7y ago

> (You might ask "who would write this code?" As Schemers say: "macros do.")

``` template <unsigned N, unsigned M> class Matrix { static Matrix Identity() { ... } }

    auto m = Matrix<4, 4>::Identity();

```

``` class Matrix { unsigned n; unsigned m; static Matrix Identity(unsigned n, unsigned m) { ... } }

    // Somewhere else
    Matrix m = Matrix::Identity(4, 4);

```

Here, the inlining of the call to `Identity` at the call-site will turn the `n` and `m` in the body of `Identity` into the constant 4.

bjoli7y ago

I didn't understand dead elimination until I wrote enough macros. It is a lot easier to generate code and have the optimizer fix it than to make sure to always generate efficient code.

This is also how compilers do things, but it is only that we schemers can see the intermediate result much easier using simple source->source transformations.

1 more reply

stochastic_monk7y ago

I had a case where I was counting occurrences of 8-bit integer values. Manual loop unrolling provided a 33% speed up.

tomsmeding7y ago

Sharlin7y ago

nonsince7y ago

Loop unrolling is useful only when it enables other optimisations. The most common is constant folding.

dragontamer7y ago

Fully agree.

BeeOnRope7y ago

For small loops, unrolling is the most important of all, since loop carried dependency chains are dense, and the loop overhead is a high fraction of the overall work.

It is easy to get a 2x speedup by unrolling a small loop, and even larger speedups are not uncommon.

1 more reply

nwmcsween7y ago

Well I've seen the opposite, try a naive string function of some sort (strlen, etc) now manually unroll.

astn7y ago

Yeah, sometimes the compiler unrolls too much and innocent looking one-liner can be compiled into a monstrosity like this:

https://godbolt.org/z/aKtko5

londons_explore7y ago

Are there no compilers which attempt to look at that code, decide 'that looks like 1<<(num-2) when n>=2', and replace the code entirely?

There must be so many examples of bubble sort where quicksort would be better, and other code patterns which can be identified and replaced with something orders of magnitude faster.

pertymcpert7y ago

Is that more performant?

jepler7y ago

You should translate your program to C++ and build with clang ; it turns the loop into a single constant load. https://godbolt.org/z/slznbU

cryptonector7y ago

Did you read TFA? The author did that (though using GCC), and the reason the optimizer does what you see is undefined behavior due to signed integer overflow.

kccqzy7y ago

> the reason the optimizer does what you see is undefined behavior due to signed integer overflow

1 more reply

xeeeeeeeeeeenu7y ago

There is no undefined behaviour, '#pragma GCC optimize("wrapv")' takes care of that.

geezerjay7y ago

> does what you see is undefined behavior

3 more replies

utopcell7y ago

GCC applies the same optimization and compiles to a single constant if overflow does not occur. To see this, you can change '1000000000' to '1000'.

jepler7y ago

Updated version that specifies -fwrapv on the commandline to turn the integer overflow into defined behavior: https://godbolt.org/z/K-Ijl0

ychen3067y ago

It's usually a good idea to turn loop bound into a variable when benchmarking a compiler, lest it optimizes the whole thing away like in this case.

archgoon7y ago

Nope; doesn't work for clang. Clang actually detects and compiles the algebraic closed form sum(i^2, n) for a bound n.

1 more reply

Too7y ago

2 more replies

beeforpork7y ago

acdha7y ago

DannyBee7y ago

This is why there is a real commercial jvm market with azul.

yifanl7y ago

It's possible that they're working in the frame of mind that there aren't any low-hanging fruit left after so many years of compiler optimizations and forget to even try.

techopoly7y ago

That just might be the most dedicated answer I've ever seen on Stack Overflow.

azhenley7y ago

It is a good answer, but my favorite by far is an answer about branch prediction to explain why processing a sorted array is faster than unsorted: https://stackoverflow.com/q/11227809/938695

fma7y ago

The person who answered the multiple question dove into byte code...but also answered questions on Angular.

I am unworthy...and this is what impostor syndrome looks like.

3 more replies

garmaine7y ago

Would be awesome if that answer was updated to explain Spectre (it’s 85% of the way there).

dopamean7y ago

Wow that was a great read.

foobaw7y ago

truly full-stack!

falcor847y ago

It really was, but as others mentioned, there's a lot of really good stuff on Stack Overflow and Stack Exchange in general. This is my favorite:

https://codegolf.stackexchange.com/questions/11880/build-a-w...

pmarreck7y ago

Oh my god.

https://copy.sh/life/?pattern=TetrisOTCAMP.mc

OH MY GOD!

My eyes are watering and I can’t stop deeply chuckling at the sheer collaborative esoteric audacity

Veedrac7y ago

Then you'll love https://stackoverflow.com/questions/37361145/deoptimizing-a-...

pmarreck7y ago

Takeaway phrases from this I love:

“Pessimization”

“Diabolical incompetence”

dreamcompiler7y ago

garmaine7y ago

There isn’t an integer square opcode on any major processor architecture though, right?

dreamcompiler7y ago

Not that I know of. It's not really worth it for short integers (64 bits or less). But it's helpful with bignums.

ww5207y ago

I'm surprised it's not doing a left shift for the x2.

jcdavis7y ago

It is in the first example (the sal instruction)

DannyBee7y ago

However, if you look at the second, you won't see any left shifts, which is also interesting

1 more reply

podsnap7y ago

The graal behavior is a lot more sane:

    graal:
    [info] SoFlow.square_i_two   10000  avgt   10  5338.492 ± 36.624  ns/op   // 2 *\sum i * i
    [info] SoFlow.two_i_         10000  avgt   10  6421.343 ± 34.836  ns/op   // \sum 2 * i * i
    [info] SoFlow.two_square_i   10000  avgt   10  6367.139 ± 34.575  ns/op   // \sum 2 * (i * i)
    regular 1.8:
    [info] SoFlow.square_i_two   10000  avgt   10  6393.422 ± 27.679  ns/op
    [info] SoFlow.two_i_         10000  avgt   10  8870.908 ± 35.715  ns/op
    [info] SoFlow.two_square_i   10000  avgt   10  6221.205 ± 42.408  ns/op

The graal-generated assembly for the first two cases is nearly identical, featuring unrolled repetitions of sequences like

    [info]   0x000000011433ec03: mov    %r8d,%ecx
    [info]   0x000000011433ec06: shl    %ecx               ;*imul {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@15 (line 41)
    [info]   0x000000011433ec08: imul   %r8d,%ecx          ;*imul {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@17 (line 41)
    [info]   0x000000011433ec0c: add    %ecx,%r9d          ;*iadd {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@18 (line 41)
    [info]   0x000000011433ec0f: lea    0x5(%r11),%r8d     ;*iinc {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_two_i_@20 (line 40)

while the third case does a single shl at the end.

    [info]   0x000000010e2918bb: imul   %r8d,%r8d          ;*imul {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_square_i_two@15 (line 32)
    [info]   0x000000010e2918bf: add    %r8d,%ecx          ;*iadd {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_square_i_two@16 (line 32)
    [info]   0x000000010e2918c2: lea    0x3(%r11),%r8d     ;*iinc {reexecute=0 rethrow=0 return_oop=0}
    [info]                                                 ; - add.SoFlow::test_square_i_two@18 (line 31)

Both graal and C2 inline, but as usual the graal output is a lot more comprehensible.

bnegreve7y ago

I don't see how generating different code for the same mathematical expression can be a good thing.

The compiler should detect that the two expressions are strictly equivalent and generate whatever code it believes is the fastest.

Any idea why it is this way?

gnuvince7y ago

Because of integer overflows and floating-point operations, the notion of equivalent mathematical expressions is tricky.

    fn main() {
        let a: i8 = 125;
        let b: i8 = 3;
        let c: i8 = (a + b) / 2;
        let d: i8 = b + ((a - b) / 2);
        println!("{} {}", c, d);
    }

This program outputs `-64 64` although the computations of `c` and `d` are equivalent.

Here's another example using floating point numbers:

    fn main() {
        let mut total1: f32 = 0.0;
        let mut total2: f32 = 0.0;
        let mut counter1: f32 = 0.0;
        let mut counter2: f32 = 100.0;

        for _ in 0 .. 10001 {
            total1 += counter1;
            total2 += counter2;
            counter1 += 0.01;
            counter2 -= 0.01;
        }
        println!("{} {}", total1, total2);
    }

The output of this program is `500041.16 500012.16`, a difference of 29 between two sums that are mathematically the same (unless I made a mistake).
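
The same order sensitivity exists in Java, which is one reason a JIT cannot freely reassociate floating-point expressions. A minimal sketch, with made-up class and method names and values chosen so the rounding is easy to follow by hand:

```java
// Hypothetical demo class; the names are illustrative only.
final class FloatReassociation {
    // Adds the small term to the large one first, so it is rounded away.
    static double largeFirst(double large, double small) {
        return (large + small) - large;
    }

    // Cancels the large terms first, so the small term survives.
    static double cancelFirst(double large, double small) {
        return (large - large) + small;
    }

    public static void main(String[] args) {
        double large = 1e17; // adjacent doubles near 1e17 are 16 apart
        double small = 1.0;
        System.out.println(largeFirst(large, small));  // prints 0.0
        System.out.println(cancelFirst(large, small)); // prints 1.0
    }
}
```

Both methods compute `large + small - large` on paper, yet only the grouping differs and the results disagree.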

bnegreve7y ago

Right! thanks

liftbigweights7y ago

Whereas for int ops, equality works within the limits of the design.

In short, equality means something different in floating point by design. For ints, it means what we think it means within their limits; once we overflow, things get screwy.

fragmede7y ago

Ideally, yes, the compiler should generate the fastest possible code for a given math expression.

Thankfully this is largely no longer the case, however it does still happen.

In Java's case, the JVM runs on top of multiple different architectures which makes optimization even more complicated.

Low-level instruction generation and optimization is just one topic under the umbrella of compiler design, which is a huge (and fascinating!) discipline to get into.

pjmlp7y ago

"An overview of the PL.8 compiler"

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.453...

Notice the architecture, quite similar to the layers and compiler phases used in modern compiler toolchains like LLVM.

The secret sauce, if one can call it that, was that PL.8 had a richer type system, and the System/370 was a bit beefier than most platforms adopting C compilers.

bnegreve7y ago

> However, the compiler ('s optimization step) is not magic and produces suboptimal code sometimes.

But as gnuvince pointed out, the two expressions are not equivalent when you consider integer overflow.

amelius7y ago

Because it's more work for the compiler to reduce the expression to something canonical (and it might even be impossible).

Also what good will it bring? What if the canonical expression triggers the slow path? Now you have no means to change it into the fast version.

crb0027y ago

TIL about printing ASM from debug JVMs.
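
For anyone wanting to reproduce such listings, here is a minimal sketch of a harness. The class name, method names, and iteration count are assumptions, not the code from the linked answer. On HotSpot, the diagnostic flags shown in the comment print the JIT-compiled code, provided the hsdis disassembler plugin is available to the JVM.

```java
// Hypothetical harness. After compiling with javac, run with:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly JitDemo
// (disassembly needs the hsdis plugin on the JVM's library path).
final class JitDemo {
    static int sumTwoTimesSquare(int limit) {
        int n = 0;
        for (int i = 0; i < limit; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

    static int sumTwoITimesI(int limit) {
        int n = 0;
        for (int i = 0; i < limit; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    public static void main(String[] args) {
        int limit = 100_000_000; // long enough that HotSpot normally JIT-compiles the loops
        long t0 = System.nanoTime();
        int a = sumTwoTimesSquare(limit);
        long t1 = System.nanoTime();
        int b = sumTwoITimesI(limit);
        long t2 = System.nanoTime();
        System.out.println("2 * (i * i): " + (t1 - t0) / 1e6 + " ms, n = " + a);
        System.out.println("2 * i * i:   " + (t2 - t1) / 1e6 + " ms, n = " + b);
    }
}
```

Timings from a single run like this are noisy; a benchmark framework such as JMH is the more reliable way to compare the two loops.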

pjmlp7y ago

If you use Oracle Studio you can even see it in the IDE.

https://www.youtube.com/watch?v=_cFwDnKvgfw

There are also other tools like JITWatch.

https://github.com/AdoptOpenJDK/jitwatch/wiki/Videos-and-Sli...

https://vimeo.com/181925278

alkonaut7y ago

Is overflow UB, so that the compiler can choose to ignore the fact that 2 * (i * i) could overflow differently from 2 * i * i?

I’m not sure it does overflow differently, but I would expect overflow to behave consistently as written and not depend on optimization. Is that not the case?

BeeOnRope7y ago

Nothing you can do in pure Java code is UB in the C/C++ sense.
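
For the two expressions in the question specifically, Java defines `int` overflow as two's-complement wraparound, and multiplication modulo 2^32 is associative, so `2 * (i * i)` and `2 * i * i` always produce the same value; only the generated machine code differs. A small sketch to check this on overflowing inputs (class and method names are hypothetical):

```java
// Hypothetical check; the names are illustrative only.
final class WrapCheck {
    static int twoTimesSquare(int i) {
        return 2 * (i * i);
    }

    static int twoITimesI(int i) {
        return 2 * i * i; // parsed as (2 * i) * i
    }

    static boolean agreeOn(int i) {
        return twoTimesSquare(i) == twoITimesI(i);
    }

    public static void main(String[] args) {
        // 46341 is the smallest positive int whose square overflows int.
        int[] samples = {0, 1, 46341, 1_000_000, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(i + ": agree = " + agreeOn(i));
        }
    }
}
```

The earlier midpoint example behaves differently because it involves division, which does not commute with wraparound the way multiplication and addition do.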

Koshkin7y ago

At first, I thought it was because i * i == -1.

microcolonel7y ago

I guess they do not use value numbering, which is typically how you get equivalent results for cases like this.
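
As a rough illustration of the idea (a toy sketch with invented names, not a description of how any real JIT is implemented), value numbering gives two expressions the same number when they apply the same operator to operands that already share numbers:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal local value numbering sketch; all names here are hypothetical.
final class ValueNumbering {
    private final Map<String, Integer> table = new HashMap<>();
    private int next = 0;

    // Value number for a leaf: a variable name or a constant literal.
    int leaf(String name) {
        return table.computeIfAbsent("leaf:" + name, k -> next++);
    }

    // Value number for a multiplication. Operand numbers are sorted so that
    // a * b and b * a share a key (this handles commutativity only).
    int mul(int left, int right) {
        int lo = Math.min(left, right);
        int hi = Math.max(left, right);
        return table.computeIfAbsent("mul:" + lo + "," + hi, k -> next++);
    }

    public static void main(String[] args) {
        ValueNumbering vn = new ValueNumbering();
        int two = vn.leaf("2");
        int i = vn.leaf("i");
        System.out.println(vn.mul(two, i) == vn.mul(i, two)); // prints true
        int a = vn.mul(two, vn.mul(i, i)); // 2 * (i * i)
        int b = vn.mul(vn.mul(two, i), i); // (2 * i) * i
        System.out.println(a == b);        // prints false
    }
}
```

The second result shows the limit of this sketch: recognizing `2 * (i * i)` and `(2 * i) * i` as the same value additionally requires reassociating the multiplications into a canonical order before numbering them.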

qwerty4561277y ago

IMHO some kind of logic preprocessor should take care of this before the actual compilation.

pjmlp7y ago

Compiling to bytecode is just one of the possibilities.

Since the early days of Java, OEM vendors targeting embedded platforms have supported AOT compilation, optionally with PGO feedback.

Some vendors, like IBM, also provide similar capabilities in their regular Java toolchains.

And Maxine finally graduated as Graal/Substrate, which is another way of compiling Java.

But all in all, everyone is transitioning to the benefits of bytecode as intermediate executable format.

Even some cool LLVM optimizations, like ThinLTO, are only possible thanks to using bytecode.

polskibus7y ago

I wonder if the same applies to .net (fx/core).

pjmlp7y ago

Depends on the runtime.

You have the old JIT, replaced by RyuJIT on .NET 4.6 and .NET Core.

Then .NET Native, which does AOT compilation via the same backend as Visual C++.

Followed by Mono's JIT/AOT implementation.

Windows/Windows Phone 8.x used a Bartok derived compiler for the MDIL format.

Same applies to Java though, as the answer only goes through what Hotspot does, but there are many other JIT/AOT compilers for Java as well.

networkimprov7y ago

Has anyone tried this with Go?

saagarjha7y ago

Tried what specifically? This particular example, or something similar where the compiler generates code with different speeds for seemingly equivalent code?

pmarreck7y ago

No, because come back when you’re a real language with a runtime error handler

networkimprov7y ago

Working on it!

Requirements to Consider for Go 2 Error Handling

https://gist.github.com/networkimprov/961c9caa2631ad3b95413f...

sabujp7y ago

thank you for this!

JohnL47y ago

The database is fast enough for a few extra trips to it, so this is definitely what we should be focusing on.

(My cup of bitterness doth overflow.)
