I did read the original article. The problem is that their stack
is not concurrent.
In a non-concurrent web application stack (like Rails), one request is processed at a time and further requests to the same node are queued. So if one request takes five seconds to answer, everybody queued behind it on that node has to wait until it is fulfilled. That's the behavior they're seeing.
In a multithreaded or reactive web stack, other requests get processed alongside the long request and, guess what, the problem doesn't happen: the short requests run on the other workers while the long one is in flight. It only shows up when all worker threads are busy with long requests at once.
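Here's a toy discrete-event sketch of that queueing difference (the numbers and the model are made up for illustration; this is not Heroku's actual router or dyno behavior):

```python
import heapq

def total_wait(num_workers, jobs):
    """Toy discrete-event model: jobs is a list of (arrival, duration) in
    seconds. Each job starts when a worker frees up; returns the total
    time jobs spent waiting in the queue."""
    free_at = [0.0] * num_workers  # times at which each worker is next free
    heapq.heapify(free_at)
    waited = 0.0
    for arrival, duration in sorted(jobs):
        start = max(arrival, heapq.heappop(free_at))
        waited += start - arrival
        heapq.heappush(free_at, start + duration)
    return waited

# One 5-second request, then ten 50 ms requests arriving right behind it.
jobs = [(0.0, 5.0)] + [(0.01 * i, 0.05) for i in range(1, 11)]
print(total_wait(1, jobs))   # concurrency 1: every short request queues behind the long one
print(total_wait(20, jobs))  # 20 workers: short requests run alongside it, no waiting
```

With one worker, the ten short requests collectively wait over 50 seconds; with 20 workers, they wait zero.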
Assuming your stack has, say, 20 worker threads, the probability of your random load balancer dumping, say, 60 long requests onto one node is small given a large enough pool, assuming long requests are a small fraction of your load. If your concurrency level is 1, a single long request is enough to back the node up, so it gets overwhelmed far more often.
You can see it this way: if your stack can only process one request at a time, that node getting backlogged is like getting three heads in a row on a fair coin. If you have twenty request processors, the node is only backlogged when every one of them is tied up, like getting three heads in a row on all twenty coins simultaneously. Much less likely to happen.
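To put rough numbers on it, here's a quick Monte Carlo sketch. All parameters here (50 nodes, 500 in-flight requests, 5% long) are invented for illustration, and the model is a purely random balancer, nothing Heroku-specific:

```python
import random

def p_overloaded(nodes, workers_per_node, requests, p_long, trials=5000):
    """Estimate the probability that at least one node ends up with more
    simultaneous long requests than it has workers, when every request is
    routed to a uniformly random node. Toy model: all requests are in
    flight at once, and only the long requests matter."""
    hits = 0
    for _ in range(trials):
        long_on = [0] * nodes
        for _ in range(requests):
            if random.random() < p_long:  # biased coin: is this request long?
                long_on[random.randrange(nodes)] += 1
        if max(long_on) > workers_per_node:
            hits += 1
    return hits / trials

random.seed(1)
print(p_overloaded(50, workers_per_node=1, requests=500, p_long=0.05))   # close to 1
print(p_overloaded(50, workers_per_node=20, requests=500, p_long=0.05))  # essentially 0
```

Same traffic, same random routing: with concurrency 1 some node is almost always backlogged, with 20 workers per node it practically never is.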
They were told to run Unicorn, which, from my understanding, preforks several Ruby worker processes so that requests are handled in parallel. They decided not to (or were unable to).
They decided instead to whine about the problem and ask Heroku to build some magic load balancer that would solve all their problems. Even if they did have a load balancer doing least-conns, Heroku's traffic doesn't all go through a single balancer, so separate balancers could, through bad luck, still route their requests to the same unfortunate node. [1]
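The footnoted simulation explores this properly, but basic birthday-problem math already shows collisions between uncoordinated balancers are common. Suppose (made-up numbers) 10 independent balancers each route one long request to a random node out of 100:

```python
from math import prod

def p_same_node(balancers, nodes):
    """Probability that at least two of `balancers` independent, uniformly
    random routing decisions hit the same node (the birthday problem)."""
    return 1 - prod((nodes - i) / nodes for i in range(balancers))

print(p_same_node(10, 100))  # roughly 0.37: same-node collisions are common, not rare
```

Even with only 10 uncoordinated decisions across 100 nodes, doubling up on some node happens over a third of the time.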
What they did is amateurish: instead of looking at the problem and fixing it by either multithreading their code or switching away from RoR, they blamed their vendors, just like beginning programmers blame their bugs on the compiler or the libraries they use. When Twitter needed to scale, they moved some of their stuff away from Rails to Scala; Facebook wrote HipHop, their PHP-to-C++ transpiler; and so on.
Was Heroku completely in the clear? No. Their documentation was misleading and I believe they've admitted that. Was it a problem that New Relic didn't show all the metrics needed to isolate the performance issue? Yes.
We'll see how this whole story unfolds, but from my perspective, the more of a stink RapGenius raises, the more amateurish they look.
[1] http://aphyr.com/posts/277-timelike-a-network-simulator