Tracking down a memory leak in Ruby's EventMachine (opens in new tab)

(blog.nelhage.com)

158 pointsnelhage13y ago31 comments

31 comments

I found a bug in eventmachine during the time I spent writing em-zeromq, the eventmachine binding for zeromq. The important thing to understand here is that ZMQ is, in-essence, a userland socket. Normal sockets are efficiently monitored using the epoll system call (or one of its older variants, say select, or poll). However, as a user-land program

ZMQ 'sockets' aren't compatible with those calls, they use a userland equivalent of those kernel level edge triggered pollers. Integrating ZMQ with a traditional event library (like eventmachine) presents a problem at this point, as software like EM or Node typically require IO to be across real file descriptors from real sockets, something a userland library can't provide. The ZMQ devs however realized this was a hotly requested feature and so devised a way around this limitation.

The compatibility layer in ZMQ takes the form of performing some internal communication across traditional unix IPC, in the case using a pipe IIRC. In other words, for some of its internal messaging rather than simply use a function call, ZMQ will push data across a pipe. This pipe can then be exposed as a proxy for a ZMQ socket.

The downside of this strategy is that exposing FDs across software requires extreme care. Generally, it is assumed that one piece of software will have responsibility for an FD.

The actual issue in my case was that any ruby exception would cause the entire process to crash with an error about closing an already closed FD. What was happening was that given an exception both ZMQ and EM were trying to shut down all the FDs they knew about. Closing an FD that's already closed causes ZMQ to assert and crash instantly. It sounds simple once you're in the right frame of mind, but it took a good number of evenings to track down to that cause. It turned out the the EM option to not shut-down FDs was non-functional in the end. A one character patch provided the fix.

twoodfin13y ago

Like danso, I admire the detective-work here. I would like to point out, though, that XCode's Instruments utility has a fantastically useful "Leaks" mode that will identify leaked allocations, including a stack trace. It can attach to a running process and has a non-disastrous impact on performance, though like most such tools it's voracious for memory.

Other platforms likely have similar tools, though I have yet to stumble across one as easy to use.

nelhageOP13y ago

(author here) Yeah, there are a lot of such tools, and they can be invaluable. In my experience, though, if you're working with a large, complex system, including pieces like Ruby and OpenSSL which do some grody tricks internally, they'll flag lots of false positives and noise, and it takes a lot of work to get a development environment where they work well.

Also, we weren't yet sure whether this bug was reproducible in development/QA, or was only triggerable in production, so using techniques that worked directly on the running image was attractive.

FuzzyDunlop13y ago

I actually think there's something more remarkable about figuring this out by hand, in the command line. And writing about it, just the fact that you can capture the exact steps taken with a simple copy and paste blows the GUI approach well out of the water.

This was the work of someone who knew what they were doing, of course. Had it been, say, me, then the effect and utility of the piece would have been considerably lacking.

kevingadd13y ago

On Windows, there's kernel-mode support for similar allocation instrumentation, where every heap allocation can be tagged with a stacktrace.

I wrote a frontend: https://github.com/kevingadd/HeapProfiler but you can also just invoke it manually from the shell and then capture the state of the heap at your leisure.

gingerlime13y ago

"If you’ve stared at too many Linux coredumps, as I have, that number looks suspicious. Interpreted in little-endian, that is 0x00007f1b5358a800, which points near the top of the userspace portion of the address space on an amd64 Linux machine.
In fewer words: It’s most likely a pointer."

As someone who has not stared at any core dump for more than about 2 seconds, I admire this level of skill.

pc13y ago

I wish there were more posts like this.

Legion13y ago

This is what I think of as the ideal HN - posts that are "How I Made X" or "How I Debugged Y" (or even "How Someone Else Made X", like the Amazon menu post, though the author ended up making their own implementation too).

No product announcement links, no Valley gossip rag links, no endless picking apart of every tiny Apple and Google thing.

What makes HN special are posts about making things. Everything else is just the same chatter all the other tech sites have.

gruseom13y ago

Me too. I wonder what would make for more posts like this.

grey-area13y ago

Go here every day and upvote interesting articles after reading them.

http://news.ycombinator.com/newest

1 more reply

doktrin13y ago

Who says there aren't already? Note that what's on the front page is just up-voted content. Tons of stuff (some quite good) slips through the cracks every day.

1 more reply

lkrubner13y ago

This is slightly off-topic, but I worked on a Ruby project where we did something just like this:

"It was easy enough to work around the leak by adding monitoring and restarting the process whenever memory usage grew too large"

I was surprised, because I can not think of any other language and/or framework where "just restart the process" is done so often. I mean, this is not a common attitude among Java programmers, I don't think it is common among C programmers, and I don't think it is common among Python programmers. But it does seem to be fairly standard in the Ruby community. David Heinemeier Hansson admitted this used to happen with Basecamp:

http://david.heinemeierhansson.com/posts/31-myth-2-rails-is-...

Can anyone else tell me of a community where this is done so commonly?

mintplant13y ago

Hacker News itself does this.

> We restart HN every 5 or 6 days, or it gets slow (memory leaks). [1]

pg went into more depth about this somewhere, but I don't have the link on hand at the moment. Essentially, the software running it is riddled with memory leaks, but it's more time-efficient to simply reboot it every so often than it is to actually go in and fix it.

[1] https://news.ycombinator.com/item?id=4730251

koenigdavidmj13y ago

That might work for HN, but it doesn't really matter if HN is down for a while. For something like Basecamp (the original 'restart it every few minutes' example, it does not seem nearly as professional.

kernel_sanders13y ago

This isn't uncommon in the python community for long running processes - especially those that frequently that create large amounts of small objects. An application server sitting in front of mongodb does exactly this. It forwards queries to mongo, then pymongo json loads the result for you, the application server does what it wants with the objects, then json dumps them to serve to clients. The json loads/dumps calls create tons of objects and memory bloat. The GC cleans them all up, yet memory isn't returned to the OS. See here:

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory...

In my tests of python 2.6.6, 2.7.3, and 3.3, python 2.7.3 was by far the worst in hanging on to memory. Yet the performance increase (due to integration of simplejson) is worth the memory penalty. We use gunicorn with sync workers to serve our WSGI app and a memory watchdog to signal the worker to gracefully retire after it handles a query large enough to leave it with a large memory footprint.

Performance has been awesome and we're very happy with the result.

dfischer13y ago

That was five years ago. That's not really relevant anymore. It used to be a quick fix but many things have changed since then.

spoiler13y ago

Windows. Turning it off and on again always fixes the BSOD.

danso13y ago

There are two things I really like about this walkthrough:

1. It shows how bugs can be something quite conceptually simple

2. It shows the value of logical, detective-like thinking in tracking these bugs.

Even as a programmer, I still think of bug-hunting as something requiring an encyclopedia knowledge of the trivial and arcane. Obviously, it looks easier in hindsight, but the OP does a great job of demonstrating how you can discover a much-overlooked flaw with the right deductive thinking (and experience with profiling tools)

icambron13y ago

Deduction is great, but what I read here was how much experience matters in debugging. You have to know what to look and test for to provide fodder for your reasoning capabilities, or else you're just shooting in the dark. The OP here had a plan informed by what I'm guessing are years of experience solving this kind of thing ("It's probably small objects. I should dump the core and look for repeating memory patterns."). Then there's a lot of accumulated wisdom that reprioritizes your search as you go along: "That looks like a flag because flags tend to look like that.", etc.

I have no doubt that the OP is good at reasoning logically, but the take-home lesson here is, if you want to be good at debugging, do a lot of debugging.

danso13y ago

Yes, this is true, so as I said, debugging looks easier in hindsight, especially someone else's hindsight :)

Summarizing the reasoning process:

1. The program's object space doesn't contain an absurd number of small objects, so inspect the core dump

2. 95% of the core dump is leaked objects, so a random sample should contain clues to the composition of the leaked objects.

3. A repeated pattern in every leaked object indicates a common pointer, i.e. a single type of object.

4. The signature helps find what file in the program is being referenced, which indicates that the pointer's object type is a BIO struct

5. This kind of leak isn't possible in straight Ruby. So between OpenSSL and EventMachine's C/C++ code, the latter is more likely to have something awry.

6. Search the EM code for BIO constructors

7. Check each constructor call to see if the BIO instance is ultimately freed

---

Steps 3 and 4 are the most arcane to me...I am pretty sure I do not know enough about memory to look at a hex signature and realize where in the address space it refers to, or even the significance of it.

2 more replies

gravitronic13y ago

THANK YOU - after looking at the heap perspective, and valgrind perspective, I've exhausted options looking at a memory leak in our production environment. This is another avenue. AWESOME.

pimeys13y ago

I noticed a same kind of leak in one of our apps and this post was more than worth of gold on how to hunt it down.

rurounijones13y ago

Now if there were an online course where you could learn this kind of stuff I would be signed up so fast physicists would be re-evaluating general relativity.

We need some online courses dedicated to not-beginners :)

vskr13y ago

Amazing post. How much time did it take to debug and identify the issue

spoiler13y ago

They said the overall process took 4 to 6 hours.

j / k navigate · click thread line to collapse

31 comments

andrewvc13y ago

The downside of this strategy is that exposing FDs across software requires extreme care. Generally, it is assumed that one piece of software will have responsibility for an FD.

twoodfin13y ago

Other platforms likely have similar tools, though I have yet to stumble across one as easy to use.

nelhageOP13y ago

Also, we weren't yet sure whether this bug was reproducible in development/QA, or was only triggerable in production, so using techniques that worked directly on the running image was attractive.

FuzzyDunlop13y ago

This was the work of someone who knew what they were doing, of course. Had it been, say, me, then the effect and utility of the piece would have been considerably lacking.

kevingadd13y ago

On Windows, there's kernel-mode support for similar allocation instrumentation, where every heap allocation can be tagged with a stacktrace.

I wrote a frontend: https://github.com/kevingadd/HeapProfiler but you can also just invoke it manually from the shell and then capture the state of the heap at your leisure.

gingerlime13y ago

As someone who has not stared at any core dump for more than about 2 seconds, I admire this level of skill.

pc13y ago

I wish there were more posts like this.

Legion13y ago

No product announcement links, no Valley gossip rag links, no endless picking apart of every tiny Apple and Google thing.

What makes HN special are posts about making things. Everything else is just the same chatter all the other tech sites have.

gruseom13y ago

Me too. I wonder what would make for more posts like this.

grey-area13y ago

Go here every day and upvote interesting articles after reading them.

http://news.ycombinator.com/newest

1 more reply

doktrin13y ago

Who says there aren't already? Note that what's on the front page is just up-voted content. Tons of stuff (some quite good) slips through the cracks every day.

1 more reply

lkrubner13y ago

This is slightly off-topic, but I worked on a Ruby project where we did something just like this:

"It was easy enough to work around the leak by adding monitoring and restarting the process whenever memory usage grew too large"

http://david.heinemeierhansson.com/posts/31-myth-2-rails-is-...

Can anyone else tell me of a community where this is done so commonly?

mintplant13y ago

Hacker News itself does this.

> We restart HN every 5 or 6 days, or it gets slow (memory leaks). [1]

[1] https://news.ycombinator.com/item?id=4730251

koenigdavidmj13y ago

kernel_sanders13y ago

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory...

Performance has been awesome and we're very happy with the result.

dfischer13y ago

That was five years ago. That's not really relevant anymore. It used to be a quick fix but many things have changed since then.

spoiler13y ago

Windows. Turning it off and on again always fixes the BSOD.

danso13y ago

There are two things I really like about this walkthrough:

1. It shows how bugs can be something quite conceptually simple

2. It shows the value of logical, detective-like thinking in tracking these bugs.

icambron13y ago

I have no doubt that the OP is good at reasoning logically, but the take-home lesson here is, if you want to be good at debugging, do a lot of debugging.

danso13y ago

Yes, this is true, so as I said, debugging looks easier in hindsight, especially someone else's hindsight :)

Summarizing the reasoning process:

1. The program's object space doesn't contain an absurd number of small objects, so inspect the core dump

2. 95% of the core dump is leaked objects, so a random sample should contain clues to the composition of the leaked objects.

3. A repeated pattern in every leaked object indicates a common pointer, i.e. a single type of object.

4. The signature helps find what file in the program is being referenced, which indicates that the pointer's object type is a BIO struct

5. This kind of leak isn't possible in straight Ruby. So between OpenSSL and EventMachine's C/C++ code, the latter is more likely to have something awry.

6. Search the EM code for BIO constructors

7. Check each constructor call to see if the BIO instance is ultimately freed

---

2 more replies

gravitronic13y ago

THANK YOU - after looking at the heap perspective, and valgrind perspective, I've exhausted options looking at a memory leak in our production environment. This is another avenue. AWESOME.

pimeys13y ago

I noticed a same kind of leak in one of our apps and this post was more than worth of gold on how to hunt it down.

rurounijones13y ago

Now if there were an online course where you could learn this kind of stuff I would be signed up so fast physicists would be re-evaluating general relativity.

We need some online courses dedicated to not-beginners :)

vskr13y ago

Amazing post. How much time did it take to debug and identify the issue

spoiler13y ago

They said the overall process took 4 to 6 hours.

j / k navigate · click thread line to collapse