The point is that one company promised a level of service with their product that they did not deliver, and the difference was significant and persistent. The fact that the consumer could have used the product more efficiently is immaterial to that fact.
Other things that don't matter:
-that RG could/should move to another provider. That is of course their choice now, but it does not change the money they've spent and wasted with Heroku.
-that the routing problem is hard. If anything this makes it worse - it's a hard problem so people would pay a lot of money for a solution. What matters is that Heroku claimed to solve it and did not.
-that other consumers of the product managed to figure this out before RG. Heroku was still advertising through their documentation that they offered a routing solution, and they did not make clear to their customers that a significant feature of their product was now different.
Furthermore, Heroku appeared to obfuscate this fact and shift blame to the customer during the time RG was trying to diagnose their issues.
Now, by attacking RG's tone, Heroku have employed argument-level DH2 [1], which at least according to pg is not even worth considering. They have at least acknowledged their mistake, but to me that means that by extension they have sold something that they did not deliver on. The only honest way to move forward is for Heroku to offer some kind of compensation to the customers that were affected.
Isn't one of the great things about the Software startup scene that we can decide freely on what tools to use? Except for very niche markets we always have alternatives, even if it means a bit more work on our sides.
I haven't seen this to such a high degree with any other programming language/platform/technology community. Yes, there are developers in these communities who do prefer certain tools, but they're generally reasonable when it comes to criticism of these tools, or the suggestion of using alternatives. It's much rarer to see this when dealing with Ruby developers.
On more than one occasion, I've witnessed several different Ruby developers yell and scream in meetings when told they can't use a particular library or framework. I've never seen this kind of reaction from the many Java, C#, C, C++, Fortran, COBOL, Ada, Perl or Python developers I've worked with over the years, for instance.
I used to feel this way about Heroku, and I might again in the future, but I don’t right now.
I have a hard time understanding why, for all the money Rap Genius pays Heroku, they don't simply set up their own instances on EC2 and run the app there themselves. It seems like for a few days' work with Puppet or Chef you could automate getting your code onto dozens of EC2 instances and installing the necessary tools/server processes, plus you wouldn't have to complain anymore about how you can't run Unicorn.
Yes I get that there is a certain amount of value in being able to pay someone else to do all these things for you and saving time - but if you aren't happy with the result and the value given the money you are paying (and RG is not), then at a certain point it's time to just bite the bullet and fix things yourselves instead of continuing to be hamstrung by problems that the hosting provider won't/can't fix. There comes a point where you get large enough, and you are paying enough to Heroku, that it would be worth it to do things yourself and eliminate the problems.
This is why I always tell people that Heroku is actually NOT a good solution if you truly need scale. They're good for staging, launch, and an early traffic emergency or two. After that, ONCE YOU NEED TO SCALE, it's cheaper just to run your own servers, because the problem that Heroku is solving for you becomes a smaller and smaller percentage of your overall operations budget.
Who says they won't do that now?
Obviously when they started, they had no idea they'd have these problems or that they'd spend so much time diagnosing them, because Heroku told them that they wouldn't have these problems to begin with.
Ultimately RG's devs are responsible for their choice to leave all the admin work up to Heroku.
Is this really accurate? 512MB is barely adequate for serving a single request at a time? I'm not a Rails developer, but that sounds terrible. I'm all for trading off some performance for rapid development, but that seems a bit extreme.
I'm currently running twelve Django apps on one 512MB Rackspace VM. It's a bit tight, and I don't get a lot of traffic on them, but it's basically fine. And that's with Apache worker mpm + mod_wsgi (with an Nginx reverse proxy in front) which probably isn't even the lightest approach. And having been writing apps in Erlang and Go recently, I'm starting to feel like Python/Django are unforgivably bloated in comparison.
If I were to toss down an average, seems like ~100MB is what I see most of the time for non-trivial Rails apps.
512MB for a single application sounds incredibly high to me.
EDIT: After looking at the docs, it seems like 512MB isn't even a hard limit: https://devcenter.heroku.com/articles/dynos#memory-behavior
We also recommend a minimum 1 GB RAM to host Discourse,
though it may work with slightly less.
~ http://www.discourse.org/faq/

Guess it depends.
What are (edit:) Rails developers getting in exchange for these enormous penalties that makes them worth accepting?
Without knowing the specifics it's hard to say for sure, but I really think RG should try a comparison with running their own real VM (not a web worker on heroku) and see how well it runs. If they'd done that they'd probably find and fix the reasons that their processes are taking so long to respond and taking up such a huge amount of memory, because they'd feel more ownership of those problems, instead of playing a blame game with heroku.
This is not rocket science but it is a series of trade offs and heroku seem to have optimised for short running processes which don't take up lots of memory - many web apps run that way and would be happiest with random routing. Yes heroku could do better but at some point you have to take responsibility for your own ops instead of expecting some service to abstract away all the hard stuff, particularly if you're seeing performance issues and have a busy site. The amount they're paying heroku would easily pay for far more vps than they need.
So in summary, Heroku is not for everyone, and Rails isn't really the problem here, so there are no enormous penalties for using it, just the sort of problems you see running any web app.
I remember doing CGI development in Perl back in the 1990s. We were lucky if our web servers had 32 MB of physical RAM, yet we could easily handle many requests per second to our CGI scripts with a single server. I don't think that the apps then were all that different from what we have today. They still had to interact with databases, perform string manipulation and other logic, and generate and emit HTML.
So it just seems really bad to me that Ruby on Rails requires so much more memory for doing basically the same task. Something is seriously wrong.
Usually a larger Rails app can do 2-3 requests per dyno. Just configure that many Unicorn workers and set it in your Procfile. This has been known since 2011 (a week after Cedar was announced).
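For reference, the setup described above looks something like this. This is an illustrative sketch, not RG's actual config; the worker count of 3 assumes each Rails process stays under roughly 150MB on a 512MB dyno, which you'd need to verify against your own app's footprint.

```ruby
# config/unicorn.rb -- illustrative Unicorn config for a 512MB dyno.
# worker_processes is the number of concurrent requests one dyno can
# serve; 3 is an assumption, not a universal answer.
worker_processes 3
timeout 30
preload_app true  # load the app once, then fork workers (copy-on-write savings)
```

The matching Procfile entry would then be something like `web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb`.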
Did you bother to read the article before spouting off half-cocked? RG apparently can't even run Unicorn because they don't have enough available memory.
They were literally ignoring our repeated customer service tickets pleading for assistance or a phone call or something. We were paying them hundreds of dollars per month at the time.
When we finally got through, the only people we could get ahold of were salesmen. Essentially we were made to believe that only with a $1000/mo support contract would we receive customer support.
FWIW, our issue was frequent network timeouts to other EC2 services. They did eventually resolve those after months, but at no point did they assist us.
Heroku's platform is a significant accelerator of development for a startup. Using the platform has enabled us to do things faster and better than we'd otherwise be able to do them for the money and time we've invested.
That being said, I look forward to the day they have a true/viable competitor and are forced to compete on service. I'm extremely bitter towards them at the moment as a result of my customer-support torture experience.
I've always wondered whether the cut-off is time- or success-based. Maybe pg should write a Boolean return function for that. :P
Big props to Rap Genius for explaining the problem so plainly in the article. Unfortunately, many people of prominence in tech aren't even capable of talking about what they do to laymen.
Apparently, the fact that requests can be queued at the dyno level was common public knowledge back in 2011! Here's a quote from a Stack Overflow answer:
"Your best indication if you need more dynos (aka processes on Cedar) is your heroku logs. Make sure you upgrade to expanded logging (it's free) so that you can tail your log.
You are looking for the heroku.router entries and the value you are most interested is the queue value - if this is constantly more than 0 then it's a good sign you need to add more dynos. Essentially this means than there are more requests coming in than your process can handle so they are being queued. If they are queued too long without returning any data they will be timed out."
Source: http://stackoverflow.com/a/8428998/276328
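The check the answer describes can be sketched in a few lines. This is a hypothetical helper, not an official tool; the sample line below just follows the key=value style of Heroku's router log entries:

```ruby
# Scan a heroku.router log line and pull out the queue value; a
# persistently nonzero queue is the signal to add more dynos.
SAMPLE = 'heroku[router]: at=info method=GET path=/ dyno=web.1 queue=7 wait=30ms connect=1ms service=120ms status=200'

def router_queue(line)
  line[/\bqueue=(\d+)/, 1]&.to_i  # nil if the line has no queue field
end

q = router_queue(SAMPLE)
puts "consider adding dynos (queue=#{q})" if q && q > 0
```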
When you use a PaaS, it doesn't mean you can stop being serious about it and completely forget about all the technical aspects. Granted, queue time should have been surfaced in New Relic from day one, but that hardly justifies such a direct and persistent attack on Heroku.
Their logs are STILL incorrect. Here’s a sample line:
2013-03-02T15:41:24+00:00 heroku[router]: at=info method=GET path=/Asap-rocky-pretty-flacko-lyrics host=rapgenius.com fwd="157.55.33.98" dyno=web.234 queue=0 wait=0ms connect=3ms service=366ms status=200 bytes=25582
Those queue and wait parameters will always read 0, even if the actual value is 20000ms. And this has been the case for years.

I tend to read (and trust) official documentation before Stack Overflow. Stack Overflow is a great tool and all, but it can be really hit or miss. It doesn't cover every corner of every tech, and unless the answer is available somewhere on the internet, or the person answering has first-hand experience, it can lead to misleading, wishy-washy answers.
Ultimately, pointing to SO lowers the expectations of a paid service from "the documentation reflects the product" all the way down to "users should read everything googleable about the product they're using, and trust that over the official docs, including mailing list posts from 2011 and a Stack Overflow question that asks a different question than you're asking."
There's a concept called the Bus Factor. Basically, it's the number of people who, if hit by a bus and otherwise put out of action, would completely derail your business.
With $60k spent on a single sysadmin and an army of EC2 instances, that's a pretty effing small bus factor: 1. So... that one guy gets taken out of action, and they're more or less toast? Yeah, no. Heroku gives them a massive bus factor for perhaps a little bit more money than it would take to do it on the cheap themselves. It's a cheap way to avert risk.
They're probably at the size now where they could handle taking it in-house, but you've still then got to factor in hiring, developing the procedures for ops inhouse etc., and migrating. It's not easy to just flip the switch.
In any case, Heroku's behaviour is pretty shoddy. Though, knowing how much of a pain documentation is, I'm not surprised. I don't think they realised just how bad the change from intelligent to random routing actually was - and didn't treat it as such. This is giving them benefit of doubt though, because the other option is that they didn't publicise it precisely because they knew how bad it is. Scary thought.
$ ab -n 1000 -c 20 https://*****-staging.herokuapp.com/**********
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking *****-staging.herokuapp.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: *****-staging.herokuapp.com
Server Port: 443
SSL/TLS Protocol: TLSv1/SSLv3,AES256-SHA,2048,256
Document Path: /**********
Document Length: 9670 bytes
Concurrency Level: 20
Time taken for tests: 7.130 seconds
Complete requests: 1000
Failed requests: 0
Write errors: 0
Total transferred: 10034000 bytes
HTML transferred: 9670000 bytes
Requests per second: 140.25 [#/sec] (mean)
Time per request: 142.606 [ms] (mean)
Time per request: 7.130 [ms] (mean, across all concurrent requests)
Transfer rate: 1374.25 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 55 59 31.7 58 1057
Processing: 37 82 43.8 66 308
Waiting: 35 74 42.7 57 298
Total: 92 141 53.4 124 1096
Percentage of the requests served within a certain time (ms)
50% 124
66% 138
75% 153
80% 166
90% 199
95% 239
98% 282
99% 301
100% 1096 (longest request)
Edit: formatting.

If you read the original article[0], you would know that this is a problem that only affects apps with a large number of dynos.
I have not done queuing theory in a long time, but my initial sense is that the math on this one will be a generalization of the birthday problem [1], which is Wiki-notable on the sole basis that the probability of sharing a birthday (or in our case, the probability of queueing a request) is far, far higher than ordinary people anticipate for N above 23. Assuming I've captured the essence of the problem correctly, you would see a sharp drop in performance when you start to saturate at about 20-30 dynos.
Given that there's an entire Wikipedia article existing on the sole basis that the behavior of these mathematical functions is nonintuitive, I think it is pretty fair to give RapGenius a pass for being surprised by the math as well.
[0] http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics [1] http://en.wikipedia.org/wiki/Birthday_problem
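The birthday math referenced above is easy to compute directly. To be clear, this is only the classic birthday calculation, a loose analogy to the routing behavior rather than a model of Heroku's actual queueing:

```ruby
# Probability that at least two of n items land in the same one of d
# slots when placed uniformly at random. With d = 365 days and n
# people, this is the classic birthday problem; the analogy here is
# n simultaneous requests randomly routed across d dynos.
def collision_probability(n, d)
  p_no_collision = (0...n).inject(1.0) { |acc, k| acc * (d - k) / d }
  1.0 - p_no_collision
end

# At n = 23 the probability already crosses 50%, which is the
# counterintuitive jump the Wikipedia article is about.
puts collision_probability(23, 365).round(3)  # => 0.507
```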
But my sympathy is going to them, because what I see coming from Rap Genius looks like a classic blame game. So a vendor's documentation was unclear and your server sucked publicly for some time? Shameful. You didn't know about it because you expected your vendor to give you extra hand-holding? That's really rough. But instead of fixing the issues and moving on, you make it the one thing that everyone thinks about when your company is mentioned... that might not be in your best long-term interests.
After this, I would be hesitant to enter into any sort of relations with Rap Genius, and I'm not that sure of what they do or what their product is.
RG is paying for PaaS from Heroku based on documentation, sales pitches, etc. They're also paying good money for the tools necessary to make business decisions based on data collected from that PaaS. Just given the realm of customer service, why wouldn't you expect "hand-holding" from your vendor? Why is it unreasonable to have that expectation? Why is it acceptable for your vendor to have a fall down response in "optimize your web-stack"? How do you expect them to "fix" this problem without the vendors involvement? What did you expect them to do, change platforms? How are they supposed to "move on" when the issue hasn't been resolved?
Have we gotten so far away from customer service with the likes of Google, that we don't even know what that means anymore? Are we to settle for mediocrity from any PaaS because our expectations are just too high?
And if you're going to go public with your complaints, it's best to do so in an understated, fact-based manner. In this case, Rap Genius comes off like a guy screaming at the waiter in a fancy restaurant. They may be displeased, they may even be right about the choucroutes en sel being salted cabbage; but lots of people around them think they're making an ass of themselves.
This has taken on the patina of a really huge fight between operations and engineering, with nobody to step in and say, "Hey, we both want to make progress here, let's see what we can do." There is no common point of contact here, sadly.
What is the end goal? One of these companies being out of business? It's pretty clear that Heroku doesn't have any ideas on how to implement routing the way Rap Genius believed it worked; they even said as much. So what is the next step?
2. They wrote several blog posts explaining what happened and what is going to happen now (fixing) and in the future (more fixing)
3. They fixed their documentation
4. They helped a third party service to adapt their offering to better help their customers (NewRelic)
5. They offered their advice for better solutions for affected customers (Unicorn)
This sounds a lot like fixing to me.
And judging from what they've done until now, this probably won't be the end of it. So why not just talk to them directly and see if it's enough for you, and if not, just go somewhere else?
As a very happy NewRelic customer, I can say they did exactly what they advertise: Help monitor the application performance in the server (!). The queueing that now seems to be a problem of Heroku doesn't happen in the server that is processing the request, so by default it can't show the time needed.
Actually, I'm quite sure that using one part of New Relic, RUM (real user monitoring), should have shown the problem quite obviously. It shows how long a user had to wait for the request to complete, including DNS lookup and network time. So if users waited longer for answers to their requests than the backend time would indicate, every developer should have taken this as a hint to investigate further.
Well, even just using the application should have been enough to know that something was wrong when NR reports a 250ms backend time but pages need at least 1200ms to return the first byte to the customer.
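That discrepancy check can be made mechanical. This is a hypothetical helper using the numbers from the comment above; the 200ms network allowance is an arbitrary assumption, not a measured value:

```ruby
# Estimate time unaccounted for by the APM: user-perceived
# time-to-first-byte minus the reported backend time, minus a rough
# allowance for DNS/network overhead. A large positive gap suggests
# time spent queued somewhere the backend never sees.
def hidden_queue_ms(backend_ms, ttfb_ms, network_allowance_ms = 200)
  gap = ttfb_ms - backend_ms - network_allowance_ms
  gap > 0 ? gap : 0
end

puts hidden_queue_ms(250, 1200)  # => 750, i.e. ~750ms unexplained
```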
Thing is, somebody had to take Heroku to task over this, and until they fix the problem somebody has to keep taking them to task.
I worked in the office beside Tom's for a year (pre-Rap Genius). He's a sharp guy. More importantly, he's right. I don't think being nice has any relevance to Rap Genius' bottom line.
But if I'm a customer of an admired former startup to whom I pay hundreds of thousands of dollars a month, I'm not allowed to go public with my complaints when I--and maybe hundreds of others--have been deceived and have suffered intentionally worse service than what I was promised?
I find the "enforced positiveness/optimism" of the startup community very disheartening. The essence of engineering is honesty (preferably quantified) about capabilities and limitations of systems. In this case, a former startup owned by a public company deceived their customers and then papered over (my impression) a valid, quantitatively-documented customer complaint once it became public.
Tom should be commended for speaking out. If he's right, dozens of startups have spent far more of their precious and limited capital on excess dynos and monitoring tools that could have been better spent elsewhere. I can't imagine a better service to the startup community than making this sort of thing public.