2. They wrote several blog posts explaining what happened and what is going to happen now (fixing) and in the future (more fixing)
3. They fixed their documentation
4. They helped a third party service to adapt their offering to better help their customers (NewRelic)
5. They offered their advice for better solutions for affected customers (Unicorn)
This sounds a lot like fixing to me.
And judging from what they've done until now, this probably won't be the end of it. So why not just talk to them directly and see whether it's enough for you, and if not, just go somewhere else?
In no way have they solved the actual issue (a poor queuing strategy). And so even if you now know that you're getting awful performance due to queuing and you even try to get a multi-threaded strategy going per their suggestion, you will see the exact same issue at scale. That is not a fix.
Their stance on actually implementing a strategy that removes the root issue has been one of silence. Suggesting that "this probably won't be the end of it" isn't useful if you're running a business that relies on Heroku. If that isn't the end of it, then they should be far more communicative about the steps they're taking. Given their blog posts, we have no evidence that further solutions to this problem are being worked on or that they even acknowledge it's something they should fix.
So no, I do not agree with you that that is a lot of "fixing".
Actually, more threads of execution do solve the problem. The difference from just doubling the number of dynos is that on a single dyno, requests can be routed intelligently. The reason random routing hurts is that request processing times have a fat-tailed distribution: there is a small but still significant chance that a request takes a really long time. If that request is routed to a random single-threaded dyno, then all further requests routed to that dyno have to wait a very long time before they can be processed. If, however, the dyno had multiple threads of execution, the other requests would simply go to another thread. Now there is blocking only if a single dyno receives N really long requests at roughly the same time, where N is the number of concurrent threads the dyno is running. The probability of getting N expensive requests to the same dyno at approximately the same time decreases very fast with increasing N.
Hand waving ahead! Let's say the probability of an expensive request blocking a dyno is p = 2%. If you double the number of dynos, the probability of blocking a dyno is now p/2 = 1%. If instead you have two execution threads on each dyno, the probability of blocking a dyno is now p^2 = 0.04%. With 10 execution threads it is p^10, which is very small indeed.
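The back-of-the-envelope numbers above are easy to check with a quick Monte Carlo sketch. It makes the same simplifying assumption as the hand-waving (each of a dyno's threads is independently tied up by an expensive request with probability p, and the dyno blocks only when all of them are):

```python
import random

def blocking_probability(p, threads, trials=200_000, seed=42):
    """Estimate the chance a dyno is fully blocked, i.e. every one of its
    `threads` execution threads is simultaneously stuck on an expensive
    request (each independently with probability p)."""
    rng = random.Random(seed)
    blocked = 0
    for _ in range(trials):
        if all(rng.random() < p for _ in range(threads)):
            blocked += 1
    return blocked / trials

p = 0.02
print(f"1 thread : {blocking_probability(p, 1):.4%}")  # close to p   = 2%
print(f"2 threads: {blocking_probability(p, 2):.4%}")  # close to p^2 = 0.04%
```

Independence is of course a rough model of real traffic, but it shows why the improvement is multiplicative in the thread count rather than linear in the dyno count.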
Here is a paper about it which makes that intuition precise and shows that even N=2 is a massive improvement over N=1: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
The problem is that this only works if each concurrent process of your application doesn't use too much memory, since the available memory on one dyno is quite low. For many applications you can't easily run multiple threads of execution on one dyno. The real solution is some form of intelligent routing. As the hand waving and the paper above show, you can make groups of dynos: the main router routes to a random group, and within each group requests are routed intelligently. You can take the group size to be a small constant, say 10 dynos, so there shouldn't be any scalability problems with this routing approach. If you take the group size small enough, you could even run each group of dynos on a single physical machine, which would make intelligent routing among them even simpler.
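A minimal sketch of that two-level scheme, under my own assumptions (the class and method names are invented, and "intelligent" is taken to mean least-busy; a real router would also handle failures, draining, and request completion ordering):

```python
import random

class TwoLevelRouter:
    """Two-level routing: the front router picks a *group* of dynos
    uniformly at random (cheap and scalable), then within that small
    group the request goes to the least-loaded dyno."""

    def __init__(self, n_dynos, group_size=10, seed=0):
        self.rng = random.Random(seed)
        self.groups = [list(range(i, min(i + group_size, n_dynos)))
                       for i in range(0, n_dynos, group_size)]
        self.load = [0] * n_dynos  # outstanding requests per dyno

    def route(self):
        group = self.rng.choice(self.groups)            # random: O(1)
        dyno = min(group, key=lambda d: self.load[d])   # least-busy in group
        self.load[dyno] += 1
        return dyno

    def finish(self, dyno):
        self.load[dyno] -= 1
```

Because the least-busy choice only ever scans a constant-size group, the per-request cost stays bounded no matter how many dynos you add, while within each group no dyno gets a second request before its neighbours get their first.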
I never expected them to completely rebuild their service because some customers (a very small minority, I assume) aren't totally happy and satisfied with their product. That clearly sucks for the affected people.
It's a reason for them to leave the product and platform and go somewhere else, where the problem is not an integral part of the product. But it's not a reason to be a dick.
Unless I misunderstand the situation, NewRelic's Heroku reporting isn't some one-sided third-party service but rather something that at least seems to be jointly produced by Heroku and NewRelic.
NewRelic can't report something that isn't offered up, and it would seem to me that Heroku needs to deliberately expose metrics to the NewRelic plugin for it to be able to pick them up.
As these queue times apparently weren't reported anywhere developer-accessible, it also stands to reason that they weren't exposed to NewRelic.
So no, Heroku didn't fix some third-party service; they fixed their own service (in this regard).
So yeah, probably Heroku fixed their part and made sure NewRelic reflected that.