Will It Optimize? See how well you can anticipate gcc's optimizer (2010) (opens in new tab)

(ridiculousfish.com)

98 pointsalexbowe13y ago32 comments

32 comments

Err, "GCC does this optimization, because strlen is a "built-in function:"

No. I wrote the optimization pass that does this (GVN PRE with SCC based value numbering).

It does it to any pure/const function. Const ones no matter what, and pure ones if it can prove the global memory state does not otherwise change between calls in a way that impacts that pure call (IE there are times it can prove the calls that happen in between don't matter, and will still do the elimination). You don't usually have to mark the functions const/pure if they are visible to GCC, it does interprocedural analysis to prove they are pure/const and mark them for you.

It will even do it through function pointers if the value numbering or alias analysis can prove what they point to, or that they don't change in between the calls.

  double cos (double) __attribute__ ((const));
  double sin (double) __attribute__ ((const));
  double f(double a)
  {
    double b;
    double c,d;
    double (*fp) (double) __attribute__ ((const));
    /* Partially redundant call */
    if (a < 2.0)
      {
        fp = sin;
        c = fp (a);
      }
    else
      {
        c = 1.0;
        fp = cos;
      }
    d = fp (a);
    return d + c;
  }

In this example, it will eliminate the unconditional fp(a) call at the end by reusing the value of c in the first if block, and storing the result of calling cos(a) in the else block. IE:

  double cos (double) __attribute__ ((const));
  double sin (double) __attribute__ ((const));
  double f(double a)
  {
    double b;
    double c,d;
    double (*fp) (double) __attribute__ ((const));
    /* Partially redundant call */
    if (a < 2.0)
      {
        fp = sin;
        c = fp (a);
        temp = c
      }
    else
      {
        c = 1.0;
        fp = cos;
        temp = cos(a)
      }
    d = temp;
    return d + c;
  }

(attribute pure is ignored on function pointers, so you can't just s/const/pure/ and expect it to work)

Loop code hoisting is actually a special case of this partial redundancy elimination.

  double sin (double) __attribute__ ((const));
  double f (double a)
  {
    int i;
    double c, d;
    double (*fp) (double) __attribute__ ((const));
    fp = sin;
    for (i = 0; i < 50; i++)
      {
        c = fp (a);
      }
    d = fp (a);
    return d + c;
  }

The call to fp(a) in the loop will be hoisted above the loop through redundancy elimination, the call d = fp(a) will be eliminated in favor of that hoisted value.

Basically, the optimization is a lot more powerful than he describes.

ridiculous_fish13y ago

OP here. Thanks for the great description of loop hoisting of invariant code, and it's awesome to hear from the guy who wrote it.

However, what I wrote was not wrong. strlen is not marked as pure or const on OS X, and yet it is hoisted. glibc does happen to mark strlen as pure in string.h, but the optimization occurs even if you do not include that header - even if there is no declaration for strlen in scope at all!

So this optimization really does proceed because strlen is a builtin function, and does not depend on strlen being marked as pure or const.

Under the hood, it may be that strlen is rewritten as __builtin_strlen, which is marked as pure and therefore hoisted like any pure expression. Or it may be due to some of the other magic, like the pass that optimizes strlen("foo") to 3. I don't know which optimization pass is responsible, and it didn't seem important, so I kept my mouth shut on that aspect.

DannyBee13y ago

"So this optimization really does proceed because strlen is a builtin function, and does not depend on strlen being marked as pure or const."

No. It's exactly the other way around. We could not give a crap whether it is a built in function, only whether it is pure or const. We do not special case built-in functions anywhere near this optimization.

  ~/sources/gcc/gcc (git)-[master]- :) $ grep BUILT_IN tree-ssa-sccvn.c
      && DECL_BUILT_IN (TREE_OPERAND (op->op0, 0))
      && gimple_call_builtin_p (def_stmt, BUILT_IN_MEMSET)
	   && (gimple_call_builtin_p (def_stmt, BUILT_IN_MEMCPY)
	       || gimple_call_builtin_p (def_stmt, BUILT_IN_MEMPCPY)
	       || gimple_call_builtin_p (def_stmt, BUILT_IN_MEMMOVE))

  ~/sources/gcc/gcc (git)-[master]- :) $ grep BUILT_IN tree-ssa-pre.c
  ~/sources/gcc/gcc (git)-[master]- :( $

The special casing you see in the first part is trying to constant fold a few built-in calls in a utility function, and trying to see through memcpys for memory state.

The reason the optimization proceeds is because strlen gets marked as pure by the compiler if there is no non-pure definition that overrides it.

Basically, the compiler defines a function named "strlen" that is pure and nothrow behind your back, but you can override it by providing your own definition. This is unrelated to whether it is a builtin (because the builtin version is __builtin_strlen)

2 more replies

LeafStorm13y ago

This reminds me of something my assembly instructor said in class once (paraphrased):

"Once, I was interested in finding out how a C compiler would handle the div instruction. So, I opened up my editor, and wrote this C program:

    void main () {
        printf("%f", 14.0 / 3.0);
    }

I compiled it, and took a look at the generated assembly. How many div instructions do you think it used?

[Various guesses, mostly "one."]

The correct answer is "zero." The compiler figured it out at compile time, and put a constant in instead. So, I decided to put it in a function:

    float divide (float a, float b) {
        return a / b;
    }
    
    void main () {
        printf("%f", divide(a, b));
    }

Guess what happened next.

[Various guesses.]

The f--king compiler optimized it out again!"

SilasX13y ago

It didn't throw an "undefined variable" error?

mitchi13y ago

I've done this many times. This is special case that happens if you test with values inside the code. The compiler that I use, Visual C++, is smart enough to know that the result can be known at compile time, so it does just that in (Release). What you need to do is to read the arguments from a file or standard input.

JshWright13y ago

I don't spend a lot of time looking at complier outputs, but the optimization is quite obvious in the first case. I don't have a clue how it would optimize it out in the second case though...

praptak13y ago

I believe the second case is quite obvious too, with function inlining (assuming that GP post just forgot to define a and b as constants.)

Symmetry13y ago

In the second case the divide function isn't being exported from the compilation unit and the compiler can tell[1], so the compiler is free to do whatever it wants with the calling convention and in this case it will choose to inline. And once it's inlined finding the correct answer at compile time is still obvious.

[1] Normally you have to declare a function as 'static' to tell the compiler that the function isn't being exported form the compilation unit, since the compiler isn't supposed to be able to tell the difference between function declarations in the .c file itself and function declarations that were included from .h files. But in this case the function is just defined and never declared, so I presume that's the same as a static declaration.

2 more replies

nemetroid13y ago

I think LeafStorm meant to put some constants in the divide call.

1 more reply

groovy2shoes13y ago

Lasher is a badass.

praptak13y ago

I did some similar experiments myself a long time ago. Something mildly surprising for me at that time was that this got optimized to a single roll instruction:

    uint32_t roll(uint32_t const x) {
        return (x << 15) | (x >> 17);
    }

I haven't managed to make GCC generate a non-constant roll (i.e. (x << c) | (x >> (32-c)) ), probably because it couldn't infer that c is in the [0,31] range.

Jabbles13y ago

If c were outside that range, the result would be undefined, and the compiler would not need to consider that possibility.

1 more reply

colanderman13y ago

What version of GCC are you using? The non-constant case compiles to "roll" as far back as GCC 4.4.

praptak13y ago

It was a really long time ago - around the time of the NIST contest for the AES, so around 2001. And thanks for the "roll" update :)

ISL13y ago

Question 4 points at a huge surprise. With a scientific computing project, I got (if I recall correctly), correct and expected output at no optimization and -O1, and patently incorrect output at -O2 and -O3.

Cost: one sleepless night in college."What do you mean, optimization doesn't just make it faster?!?"

haberman13y ago

It's possible that both were "correct" but that -O2 and -O3 gave different results due to x86 extended precision. For some interesting reading, see:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323 (a centi-bug of back and forth between "it's broken!" and "no it's not, and we're not going to change the compiler.")

http://gcc.gnu.org/ml/gcc/2003-08/msg01183.html (a really informative thread about the issue)

widdershins13y ago

Really? Wow, I thought optimisations had to provably avoid affecting the correctness of your program. Hence:

"This need to avoid imprecise intermediate results often prevents compilers from exploiting mathematical properties like associativity when optimizing FP arithmetic."

from the explanation under Q4.

octo_t13y ago

I am surprised at number 3, I would have thought that it would be optimised to x << 1, rather than the addition?

Edit: No. 5 explains why.

a1k0n13y ago

Not really; there's no rounding problem with x << 1. x * 2, x+x, x<<1 are all equivalent. The real reason it doesn't use the shift instruction on x86 is because it takes 3 bytes to encode the instruction whereas adding a register to itself takes two. But like it says, for powerpc it does use a shift as there's no difference in instruction lengths on a RISC.

random aside: gcc usually uses the LEA instruction for this sort of thing, which lets you compute any combination of {1,2,4,8} * x + y + n in one cycle where x and y are registers and n is a constant number. So even with the LEA instruction you can use either lea eax, [eax*2] or lea eax, [eax+eax] to compute this. And so on. I think add eax, eax is most likely what it does though.

acqq13y ago

Compilers for x86 prefer adds to shifts because there are processors (x86 exists for so long) on which shift costs more cycles. LEAs are always fast as they are needed for always used addressings.

gsg13y ago

x86 has encodings for shifts by 1 that are two bytes, so there's no code size advantage.

1 more reply

simarpreet00713y ago

What an interesting read. Thanks OP. I got stumbled on the floating point question. I remember a few weeks ago in class our instructor taught us about floating point precision loss and I thought the compiler might take care into that and not optimize it, but I guess I was wrong.

Would you explain a bit more on that? I'm really interested! Thanks!

gus_massa13y ago

Previous submission: https://news.ycombinator.com/item?id=1540567 (335 points, 989 days ago, 61 comments)

j / k navigate · click thread line to collapse

32 comments

DannyBee13y ago

Err, "GCC does this optimization, because strlen is a "built-in function:"

No. I wrote the optimization pass that does this (GVN PRE with SCC based value numbering).

It will even do it through function pointers if the value numbering or alias analysis can prove what they point to, or that they don't change in between the calls.

  double cos (double) __attribute__ ((const));
  double sin (double) __attribute__ ((const));
  double f(double a)
  {
    double b;
    double c,d;
    double (*fp) (double) __attribute__ ((const));
    /* Partially redundant call */
    if (a < 2.0)
      {
        fp = sin;
        c = fp (a);
      }
    else
      {
        c = 1.0;
        fp = cos;
      }
    d = fp (a);
    return d + c;
  }

In this example, it will eliminate the unconditional fp(a) call at the end by reusing the value of c in the first if block, and storing the result of calling cos(a) in the else block. IE:

  double cos (double) __attribute__ ((const));
  double sin (double) __attribute__ ((const));
  double f(double a)
  {
    double b;
    double c,d;
    double (*fp) (double) __attribute__ ((const));
    /* Partially redundant call */
    if (a < 2.0)
      {
        fp = sin;
        c = fp (a);
        temp = c
      }
    else
      {
        c = 1.0;
        fp = cos;
        temp = cos(a)
      }
    d = temp;
    return d + c;
  }

(attribute pure is ignored on function pointers, so you can't just s/const/pure/ and expect it to work)

Loop code hoisting is actually a special case of this partial redundancy elimination.

  double sin (double) __attribute__ ((const));
  double f (double a)
  {
    int i;
    double c, d;
    double (*fp) (double) __attribute__ ((const));
    fp = sin;
    for (i = 0; i < 50; i++)
      {
        c = fp (a);
      }
    d = fp (a);
    return d + c;
  }

The call to fp(a) in the loop will be hoisted above the loop through redundancy elimination, the call d = fp(a) will be eliminated in favor of that hoisted value.

Basically, the optimization is a lot more powerful than he describes.

ridiculous_fish13y ago

OP here. Thanks for the great description of loop hoisting of invariant code, and it's awesome to hear from the guy who wrote it.

So this optimization really does proceed because strlen is a builtin function, and does not depend on strlen being marked as pure or const.

DannyBee13y ago

"So this optimization really does proceed because strlen is a builtin function, and does not depend on strlen being marked as pure or const."

  ~/sources/gcc/gcc (git)-[master]- :) $ grep BUILT_IN tree-ssa-sccvn.c
      && DECL_BUILT_IN (TREE_OPERAND (op->op0, 0))
      && gimple_call_builtin_p (def_stmt, BUILT_IN_MEMSET)
	   && (gimple_call_builtin_p (def_stmt, BUILT_IN_MEMCPY)
	       || gimple_call_builtin_p (def_stmt, BUILT_IN_MEMPCPY)
	       || gimple_call_builtin_p (def_stmt, BUILT_IN_MEMMOVE))

  ~/sources/gcc/gcc (git)-[master]- :) $ grep BUILT_IN tree-ssa-pre.c
  ~/sources/gcc/gcc (git)-[master]- :( $

The special casing you see in the first part is trying to constant fold a few built-in calls in a utility function, and trying to see through memcpys for memory state.

The reason the optimization proceeds is because strlen gets marked as pure by the compiler if there is no non-pure definition that overrides it.

2 more replies

LeafStorm13y ago

This reminds me of something my assembly instructor said in class once (paraphrased):

"Once, I was interested in finding out how a C compiler would handle the div instruction. So, I opened up my editor, and wrote this C program:

    void main () {
        printf("%f", 14.0 / 3.0);
    }

I compiled it, and took a look at the generated assembly. How many div instructions do you think it used?

[Various guesses, mostly "one."]

The correct answer is "zero." The compiler figured it out at compile time, and put a constant in instead. So, I decided to put it in a function:

    float divide (float a, float b) {
        return a / b;
    }
    
    void main () {
        printf("%f", divide(a, b));
    }

Guess what happened next.

[Various guesses.]

The f--king compiler optimized it out again!"

SilasX13y ago

It didn't throw an "undefined variable" error?

mitchi13y ago

JshWright13y ago

I don't spend a lot of time looking at complier outputs, but the optimization is quite obvious in the first case. I don't have a clue how it would optimize it out in the second case though...

praptak13y ago

I believe the second case is quite obvious too, with function inlining (assuming that GP post just forgot to define a and b as constants.)

Symmetry13y ago

2 more replies

nemetroid13y ago

I think LeafStorm meant to put some constants in the divide call.

1 more reply

groovy2shoes13y ago

Lasher is a badass.

praptak13y ago

I did some similar experiments myself a long time ago. Something mildly surprising for me at that time was that this got optimized to a single roll instruction:

    uint32_t roll(uint32_t const x) {
        return (x << 15) | (x >> 17);
    }

I haven't managed to make GCC generate a non-constant roll (i.e. (x << c) | (x >> (32-c)) ), probably because it couldn't infer that c is in the [0,31] range.

Jabbles13y ago

If c were outside that range, the result would be undefined, and the compiler would not need to consider that possibility.

1 more reply

colanderman13y ago

What version of GCC are you using? The non-constant case compiles to "roll" as far back as GCC 4.4.

praptak13y ago

It was a really long time ago - around the time of the NIST contest for the AES, so around 2001. And thanks for the "roll" update :)

ISL13y ago

Cost: one sleepless night in college."What do you mean, optimization doesn't just make it faster?!?"

haberman13y ago

It's possible that both were "correct" but that -O2 and -O3 gave different results due to x86 extended precision. For some interesting reading, see:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323 (a centi-bug of back and forth between "it's broken!" and "no it's not, and we're not going to change the compiler.")

http://gcc.gnu.org/ml/gcc/2003-08/msg01183.html (a really informative thread about the issue)

widdershins13y ago

Really? Wow, I thought optimisations had to provably avoid affecting the correctness of your program. Hence:

"This need to avoid imprecise intermediate results often prevents compilers from exploiting mathematical properties like associativity when optimizing FP arithmetic."

from the explanation under Q4.

octo_t13y ago

I am surprised at number 3, I would have thought that it would be optimised to x << 1, rather than the addition?

Edit: No. 5 explains why.

a1k0n13y ago

acqq13y ago

Compilers for x86 prefer adds to shifts because there are processors (x86 exists for so long) on which shift costs more cycles. LEAs are always fast as they are needed for always used addressings.

gsg13y ago

x86 has encodings for shifts by 1 that are two bytes, so there's no code size advantage.

1 more reply

simarpreet00713y ago

Would you explain a bit more on that? I'm really interested! Thanks!

gus_massa13y ago

Previous submission: https://news.ycombinator.com/item?id=1540567 (335 points, 989 days ago, 61 comments)

j / k navigate · click thread line to collapse