I did read the original article. The problem is that their stack
is not concurrent.
In a non-concurrent web application stack (like Rails), one request is processed at a time and further requests to the same node are queued. So if one request takes five seconds to answer, everybody queued behind it on that node has to wait until it is fulfilled. That's the behavior they're seeing.
In a multithreaded or reactive web stack, other requests get processed alongside the long request and, guess what, the problem doesn't happen: the short requests run on the other workers while the long one is in flight. It only shows up when all worker threads are busy with long requests at once.
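Here's a toy discrete-event sketch of that queueing difference (the numbers and the model are made up for illustration; this is not Heroku's actual router or dyno behavior):

```python
import heapq

def total_wait(num_workers, jobs):
    """Toy discrete-event model: jobs is a list of (arrival, duration) in
    seconds. Each job starts when a worker frees up; returns the total
    time jobs spent waiting in the queue."""
    free_at = [0.0] * num_workers  # times at which each worker is next free
    heapq.heapify(free_at)
    waited = 0.0
    for arrival, duration in sorted(jobs):
        start = max(arrival, heapq.heappop(free_at))
        waited += start - arrival
        heapq.heappush(free_at, start + duration)
    return waited

# One 5-second request, then ten 50 ms requests arriving right behind it.
jobs = [(0.0, 5.0)] + [(0.01 * i, 0.05) for i in range(1, 11)]
print(total_wait(1, jobs))   # concurrency 1: every short request queues behind the long one
print(total_wait(20, jobs))  # 20 workers: short requests run alongside it, no waiting
```

With one worker, the ten short requests collectively wait over 50 seconds; with 20 workers, they wait zero.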
Assuming your stack has, say, 20 worker threads, the probability of your random load balancer dumping, say, 60 long requests onto one node is small given a large enough pool, assuming long requests are a small fraction of your load. If your concurrency level is 1, a single long request is enough to back the node up, so it gets overwhelmed far more often.
You can see it this way: if your stack can only process one request at a time, that node getting backlogged is like getting three heads in a row on a fair coin. If you have twenty request processors, the node is only backlogged when every one of them is tied up, like getting three heads in a row on all twenty coins simultaneously. Much less likely to happen.
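To put rough numbers on it, here's a quick Monte Carlo sketch. All parameters here (50 nodes, 500 in-flight requests, 5% long) are invented for illustration, and the model is a purely random balancer, nothing Heroku-specific:

```python
import random

def p_overloaded(nodes, workers_per_node, requests, p_long, trials=5000):
    """Estimate the probability that at least one node ends up with more
    simultaneous long requests than it has workers, when every request is
    routed to a uniformly random node. Toy model: all requests are in
    flight at once, and only the long requests matter."""
    hits = 0
    for _ in range(trials):
        long_on = [0] * nodes
        for _ in range(requests):
            if random.random() < p_long:  # biased coin: is this request long?
                long_on[random.randrange(nodes)] += 1
        if max(long_on) > workers_per_node:
            hits += 1
    return hits / trials

random.seed(1)
print(p_overloaded(50, workers_per_node=1, requests=500, p_long=0.05))   # close to 1
print(p_overloaded(50, workers_per_node=20, requests=500, p_long=0.05))  # essentially 0
```

Same traffic, same random routing: with concurrency 1 some node is almost always backlogged, with 20 workers per node it practically never is.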
They were told to run Unicorn, which, from my understanding, preforks several Ruby worker processes so that requests are handled in parallel. They decided not to (or were unable to).
They decided instead to whine about the problem and ask Heroku to build some magic load balancer that would solve all their problems. Even if they did have a load balancer doing least-conns, Heroku's traffic doesn't all go through a single balancer, so separate balancers could, through bad luck, still route their requests to the same unfortunate node. [1]
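The footnoted simulation explores this properly, but basic birthday-problem math already shows collisions between uncoordinated balancers are common. Suppose (made-up numbers) 10 independent balancers each route one long request to a random node out of 100:

```python
from math import prod

def p_same_node(balancers, nodes):
    """Probability that at least two of `balancers` independent, uniformly
    random routing decisions hit the same node (the birthday problem)."""
    return 1 - prod((nodes - i) / nodes for i in range(balancers))

print(p_same_node(10, 100))  # roughly 0.37: same-node collisions are common, not rare
```

Even with only 10 uncoordinated decisions across 100 nodes, doubling up on some node happens over a third of the time.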
What they did is amateurish: instead of looking at the problem and fixing it by either multithreading their code or switching away from RoR, they blamed their vendors, just like beginning programmers blame their bugs on the compiler or the libraries they use. When Twitter needed to scale, they moved some of their stuff away from Rails to Scala; Facebook wrote HipHop, their PHP-to-C++ transpiler; and so on.
Was Heroku completely in the clear? No. Their documentation was misleading and I believe they've admitted that. Was it a problem that New Relic didn't show all the metrics needed to isolate the performance issue? Yes.
We'll see how this whole story unfolds, but from my perspective, the more of a stink RapGenius raises, the more amateurish they look.
[1] http://aphyr.com/posts/277-timelike-a-network-simulator