Optimizations in C++ Compilers (opens in new tab)

(queue.acm.org)

102 pointsguodong5y ago20 comments

20 comments

> I went home that evening and created Compiler Explorer.

Nice try. You can’t escape being known as a verb now.

Everyone knows the tool as godbolt.

Yep, it honestly never occurred to me that the name of the tool wasn't godbolt. I just went to the site and noticed "Compiler Explorer" for the first time, which seems so generic that I never thought of it as a name.

2bluesc5y ago

Interesting tool I've never used, for the uninitiated:

https://godbolt.org/

dr_zoidberg5y ago

Interesting, but broken in Firefox. Could not get it to run and then I thought "maybe in Chrome..." and then got it to work.

1 more reply

projektfu5y ago

He pulled an ebay. The product was called AuctionWeb but the URL was http://ebay.com/aw/

gpderetta5y ago

Yes, it is somehow also an awesome name for a tool.

guodongOP5y ago

That's my first impression, too, until I'm curious about what 'godbolt' means, and I googled it...

usefulcat5y ago

Compiler Explorer (aka godbolt) is awesome; I use it at least weekly.

It's amazing how many more code generation questions occur to me now that there's so much less friction in getting the answers.

beached_whale5y ago

I host it locally for myself too and got the CLion plugin, it is amazing. Seeing actual codegen in projets with multiple files is really something and having the same scrolling with ASM or source too.

jedbrown5y ago

The floating point comment leaves out that one can use

    #pragma omp simd reduction(+:res)

as a more precise way to achieve vectorization in the reduction (compile with -fopenmp-simd to only use it for SIMD without linking an OpenMP library): https://godbolt.org/z/17oTz1

Unfortunately, the pragma is not supported with the new-style class iterators in a released compiler, though it works in clang-trunk: https://godbolt.org/z/hbP11W Note that Clang disables floating point contraction by default (so no vfmadd instructions), despite them being more accurate. One usually wants this globally (-ffp-contract=fast) except when trying to bitwise reproduce software compiled for pre-Haswell.

manch235y ago

> I hope that some of these optimizations are a pleasant surprise and will factor in your decisions to write clear, intention-revealing code and leave it to the compiler to do the right thing.

This was my key takeaway from this article. Writing clear code that is easier to maintain will have good enough performance most of the time. I was particularly impressed with the devirtualization optimizations and will be less likely to shy away from using polymorphism in future due to performance concerns.

fluffything5y ago

> Tail call removal. A recursive function that ends in a call to itself can often be rewritten as a loop, reducing call overhead and reducing the chance of stack overflow.

Most important: this optimization enables pipelined execution.

When people talk about a CPU executing an integer add instruction in ~1 cycle, what they actually mean is that the add has this latency when the CPU pipelines are full.

If you have an 11 stage pipeline... the add can often have a latency of ~11 cycles... if you write the _right_ code for it.

gpderetta5y ago

FWIW, at least on intel cpu call instructions are fully pipelined and effectively zero latency (although, like all jump instructions there is a limit of one call every other cycle).

ladberg5y ago

Interestingly, I've seen infinite recursions get optimized to infinite tail calls. This makes debugging harder because instead of a stack overflow you just have an infinite loop and have to manually go kill the process and get a breakpoint.

Then, looking at the code it's not obvious where the infinite loop occurs.

tomp5y ago

Tail call removal is much broader than that. Compilers will remove even indirect tail calls, which is useful when building a fast interpreter. I tried this with a small example (using Godbolt, of course), now let’s see if it works for a full interpreter.

amelius5y ago

That's a cool bag of tricks. But I'm impressed when compilers start optimizing programs in the big-O sense.

stephenbennyhat5y ago

Did you see the sumToX() example in the paper?

mrlonglong5y ago

Godbolt sounds like a top quality brand of quidditch broom :-D

j / k navigate · click thread line to collapse

20 comments

orangepanda5y ago

> I went home that evening and created Compiler Explorer.

Nice try. You can’t escape being known as a verb now.

Everyone knows the tool as godbolt.

ladberg5y ago

2bluesc5y ago

Interesting tool I've never used, for the uninitiated:

https://godbolt.org/

dr_zoidberg5y ago

Interesting, but broken in Firefox. Could not get it to run and then I thought "maybe in Chrome..." and then got it to work.

1 more reply

projektfu5y ago

He pulled an ebay. The product was called AuctionWeb but the URL was http://ebay.com/aw/

gpderetta5y ago

Yes, it is somehow also an awesome name for a tool.

guodongOP5y ago

That's my first impression, too, until I'm curious about what 'godbolt' means, and I googled it...

usefulcat5y ago

Compiler Explorer (aka godbolt) is awesome; I use it at least weekly.

It's amazing how many more code generation questions occur to me now that there's so much less friction in getting the answers.

beached_whale5y ago

jedbrown5y ago

The floating point comment leaves out that one can use

    #pragma omp simd reduction(+:res)

as a more precise way to achieve vectorization in the reduction (compile with -fopenmp-simd to only use it for SIMD without linking an OpenMP library): https://godbolt.org/z/17oTz1

manch235y ago

> I hope that some of these optimizations are a pleasant surprise and will factor in your decisions to write clear, intention-revealing code and leave it to the compiler to do the right thing.

fluffything5y ago

> Tail call removal. A recursive function that ends in a call to itself can often be rewritten as a loop, reducing call overhead and reducing the chance of stack overflow.

Most important: this optimization enables pipelined execution.

When people talk about a CPU executing an integer add instruction in ~1 cycle, what they actually mean is that the add has this latency when the CPU pipelines are full.

If you have an 11 stage pipeline... the add can often have a latency of ~11 cycles... if you write the _right_ code for it.

gpderetta5y ago

FWIW, at least on intel cpu call instructions are fully pipelined and effectively zero latency (although, like all jump instructions there is a limit of one call every other cycle).

ladberg5y ago

Then, looking at the code it's not obvious where the infinite loop occurs.

tomp5y ago

amelius5y ago

That's a cool bag of tricks. But I'm impressed when compilers start optimizing programs in the big-O sense.

stephenbennyhat5y ago

Did you see the sumToX() example in the paper?

mrlonglong5y ago

Godbolt sounds like a top quality brand of quidditch broom :-D

j / k navigate · click thread line to collapse