[1] https://github.com/plasma-umass/coz/blob/master/README.md
I suspect it's cheaper to start with callgrind to get an idea of the hot spots in the code base and find the low-hanging fruit. Then switch to coz if you really need to squeeze out the last bit of performance.
In a flame graph, the width of a stack frame corresponds to the percentage of CPU time spent in that frame, and the y-axis shows the call stack it belongs to.
This means that you can quickly tell which functions, and from which call sites, are the most expensive.
The only visualization I know of that matches this ability to quickly zero in on things while maintaining context is a call graph with frames colored by cumulative CPU time. But that has its own issues: laying out the graph is hard, and seeing everything at once is difficult.
https://web.archive.org/web/20160718172225/http://gernotklin...
27 SIGPROF terminate process profiling timer alarm (see setitimer(2))
Its use is specifically what that signal is for. Additionally, it seems this was already fixed in ZeroMQ for some cases back in 2016, so I doubt it's still a valid issue.
The number of different ways it can be used, and the ease with which it lets you dig into and evaluate performance and other characteristics, is truly awesome.
It can't do everything. But it can do a lot.
So I use "perf stat" now. :)
gcc -fsanitize=address ...
gcc -fsanitize=thread ...
They perform better than Valgrind. Sanitizers and memcheck are unrelated to the profiling discussion here.