In my experience, loop unrolling should basically never be done except in extremely degenerate cases; I remember not long ago someone I know who also optimises Asm remarking "it should've died along with the RISC fad". The original goal was to reduce per-iteration overhead associated with checking for end-of-loop, but any superscalar/OoO/speculative processor can "execute past" those instructions anyway; all that unrolling will do is bloat the code and work against caching. Memory bandwidth is often the bottleneck, not the core.
Not true. Like many such optimizations, loop unrolling can be useful because it makes downstream loads constant.
For example:
float identity[4][4];
for (unsigned y = 0; y < 4; y++)
for (unsigned x = 0; x < 4; x++)
identity[y][x] = y == x ? 1 : 0;
... do some matrix math ...
In this case, the compiler probably wants to unroll the loops so that it can straightforwardly forward the constant matrix entries directly to the matrix arithmetic. It'll likely be able to eliminate lots of operations that way.(You might ask "who would write this code?" As Schemers say: "macros do.")
See LLVM's heuristics: http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d7...
To expand on this point - in the more prosaic world of C++ - this sort of code comes about all the time in templated code. For example, the above loop you posted might have been found in something like:
``` template <unsigned N, unsigned M> class Matrix { static Matrix Identity() { ... } }
auto m = Matrix<4, 4>::Identity();
```The other major source of these sorts of constants leading to DCE oppportunities is inlining. Consider a more classical, matrix implementation that is not templated and doesn't lift its dimensions into the type:
``` class Matrix { unsigned n; unsigned m; static Matrix Identity(unsigned n, unsigned m) { ... } }
// Somewhere else
Matrix m = Matrix::Identity(4, 4);
```Here, the inlining of the call to `Identity` at the call-site will turn the `n` and `m` in the body of `Identity` into the constant 4.
If I had to make an educated guess - inlining typically generates these (i.e. partial evaluation, constant folding, an DCE) situations most often in compilers. An incredible amount of information can flow from caller to callee when you specialize the callee for that call-site.
This is also how compilers do things, but it is only that we schemers can see the intermediate result much easier using simple source->source transformations.
In most cases on modern systems, small loops should remain compact as possible, to stay in the uop cache. The "for" loop overhead (the inc, cmp, and jmp instructions) effectively execute in parallel. Modern systems are highly out-of-order and the for-loop overhead is virtually nil.
For small loops, unrolling is the most important of all, since loop carried dependency chains are dense, and the loop overhead is a high fraction of the overall work.
It is easy to get a 2x speedup by unrolling a small loop, and even larger speedups are not uncommon.
So this "unrolling rarely helps" idea is just as much of a myth as "unrolling never helps". The main problem with unrolling is that the compilers usually don't do it intelligently - usually loops are unrolling if some kind of threshold is met, depending on compiler options - but this always happens in kind of a feed-forward way, rather thank a feed-back way, which would involve unrolling the loop and analyzing the benefit and costs after further optimization passes.
There must be so many examples of bubble sort where quicksort would be better, and other code patterns which can be identified and replaced with something orders of magnitude faster.
> the reason the optimizer does what you see is undefined behavior due to signed integer overflow
Yes undefined behavior gives the optimizer the right in this case to transform the code into anything, including a nonsense answer, or a trap instruction. But the optimizer did not; it produced the right answer under 2's complement arithmetic.
EDIT: It seems that clang doesn't support #pragma GCC optimize, so it's a no-op in that snippet. It doesn't change the result though. If you pass -fwrapv flag to clang, it will be optimized in exactly the same way.
Just to be clear, undefined behavior means the standard allows implementations to do what they they feel is the right thing to do under that scenario, and the outcome will still comply with the standard.
Why? Is this optimisation particularly difficult to implement? Or is it just missed low-hanging fruit? It sure looks easy (like: rearrange expressions to keep the expression tree shallow and left-branching to avoid stack operations).
This is why there is a real commercial jvm market with azul.
The person who answered the multiple question dove into byte code...but also answered questions on Angular.
I am unworthy...and this is what impostor syndrome looks like.
https://codegolf.stackexchange.com/questions/11880/build-a-w...
https://copy.sh/life/?pattern=TetrisOTCAMP.mc
OH MY GOD!
My eyes are watering and I can’t stop deeply chuckling at the sheer collaborative esoteric audacity
“Pessimization”
“Diabolical incompetence”
graal:
[info] SoFlow.square_i_two 10000 avgt 10 5338.492 ± 36.624 ns/op // 2 *\sum i * i
[info] SoFlow.two_i_ 10000 avgt 10 6421.343 ± 34.836 ns/op // \sum 2 * i * i
[info] SoFlow.two_square_i 10000 avgt 10 6367.139 ± 34.575 ns/op // \sum 2 * (i * i)
regular 1.8:
[info] SoFlow.square_i_two 10000 avgt 10 6393.422 ± 27.679 ns/op
[info] SoFlow.two_i_ 10000 avgt 10 8870.908 ± 35.715 ns/op
[info] SoFlow.two_square_i 10000 avgt 10 6221.205 ± 42.408 ns/op
The graal-generated assembly for the first two cases is nearly identical, featuring unrolled repetitions of sequences like [info] 0x000000011433ec03: mov %r8d,%ecx
[info] 0x000000011433ec06: shl %ecx ;*imul {reexecute=0 rethrow=0 return_oop=0}
[info] ; - add.SoFlow::test_two_i_@15 (line 41)
[info] 0x000000011433ec08: imul %r8d,%ecx ;*imul {reexecute=0 rethrow=0 return_oop=0}
[info] ; - add.SoFlow::test_two_i_@17 (line 41)
[info] 0x000000011433ec0c: add %ecx,%r9d ;*iadd {reexecute=0 rethrow=0 return_oop=0}
[info] ; - add.SoFlow::test_two_i_@18 (line 41)
[info] 0x000000011433ec0f: lea 0x5(%r11),%r8d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
[info] ; - add.SoFlow::test_two_i_@20 (line 40)
while the third case does a single shl at the end. [info] 0x000000010e2918bb: imul %r8d,%r8d ;*imul {reexecute=0 rethrow=0 return_oop=0}
[info] ; - add.SoFlow::test_square_i_two@15 (line 32)
[info] 0x000000010e2918bf: add %r8d,%ecx ;*iadd {reexecute=0 rethrow=0 return_oop=0}
[info] ; - add.SoFlow::test_square_i_two@16 (line 32)
[info] 0x000000010e2918c2: lea 0x3(%r11),%r8d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
[info] ; - add.SoFlow::test_square_i_two@18 (line 31)
Both graal and C2 inline, but as usual the graal output is a lot more comprehensible.The compiler should detect that the two expressions are strictly equivalent and generate whatever code it believes is the fastest.
Any idea why it is this way?
fn main() {
let a: i8 = 125;
let b: i8 = 3;
let c: i8 = (a + b) / 2;
let d: i8 = b + ((a - b) / 2);
println!("{} {}", c, d);
}
This program outputs `-64 64` although the computations of `c` and `d` are equivalent.Here's another example using floating point numbers:
fn main() {
let mut total1: f32 = 0.0;
let mut total2: f32 = 0.0;
let mut counter1: f32 = 0.0;
let mut counter2: f32 = 100.0;
for _ in 0 .. 10001 {
total1 += counter1;
total2 += counter2;
counter1 += 0.01;
counter2 -= 0.01;
}
println!("{} {}", total1, total2);
}
The output of this program is `500041.16 500012.16`, a difference of 25 for a program that computes the same result (unless I made a mistake).Whereas for int ops, equality works within the limits of the design.
In short equality means something different in fp by design. For int, it means what we think it means within its limits. When we overflow, then things get screwy.
However, the compiler ('s optimization step) is not magic and produces suboptimal code sometimes. Back when C was young, this was frequently the case (1970's and 1980's), so dropping into assembly to hand-code performance critical sections is just what people did, in order to get software to run smoothly.
Thankfully this is largely no longer the case, however it does still happen.
In Java's case, the JVM runs on top of multiple different architectures which makes optimization even more complicated.
Low-level instruction generation and optimization is just one topic under the umbrella of compiler design, which is a huge (and fascinating!) discipline to get into.
"An overview of the PL.8 compiler"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.453...
Notice the architecture, quite similar to the layers and compiler phases used in modern compiler toolchains like LLVM.
The secret sauce, if one can call it as such, was that PL.8 had a richer type system, and the System/370 was a bit beefier than most platforms adopting C compilers.
I agree that compilers are not always perfect, but in this particular case the two expressions are trivially equivalent from the associativity of the multiplication so the distinction had to be intentional.
But as gnuvince pointed out, the two expressions are not equivalent when you consider integer overflow.
Also what good will it bring? What if the canonical expression triggers the slow path? Now you have no means to change it into the fast version.
Further, in the case of floating point operations, operation order matters for rounding. And with integer operations, the actual form used can be important for preventing overflow (of intermediate results).
https://www.youtube.com/watch?v=_cFwDnKvgfw
There are also other tools like JITWatch.
https://github.com/AdoptOpenJDK/jitwatch/wiki/Videos-and-Sli...
I’m not sure it does overflow differently but I would expect overflow to behave consistently as written, and not be dependent on optimization, is that not the case?
Since the early days of Java, OEM vendors targeting embedded targets do support AOT compilation, with possible PGO feedback.
Some vendors like IBM, also provide similar capabilities on their regular Java toolchains.
And Maxime finally graduated as Graal/Substrate, which is also another way of compiling Java.
But all in all, everyone is transitioning to the benefits of bytecode as intermediate executable format.
Even some cool LLVM optimizations, like ThinLTO, are only possible thanks to using bytecode.
You have the old JIT, replaced by RyuJIT on .NET 4.6 and .NET Core.
Then .NET Native, which does AOT compilation via the same backend as Visual C++.
Followed by Mono's JIT/AOT implementation.
Windows/Windows Phone 8.x used a Bartok derived compiler for the MDIL format.
Same applies to Java though, as the answer only goes through what Hotspot does, but there are many other JIT/AOT compilers for Java as well.
Requirements to Consider for Go 2 Error Handling
https://gist.github.com/networkimprov/961c9caa2631ad3b95413f...
(My cup of bitterness doth overflow.)