xaranke 7y ago

Was there a way to fix the services so that you wouldn't get woken up in the middle of the night?

zasz 7y ago

Not entirely. Some issues were fixable, like moving our RabbitMQ cluster away from RHEL to AWS. But others weren't. An upstream service we depended on went down, and that caused a cascading failure. It was the company's core product, a massive Java program running on bare metal that frequently got our service OOM-killed. Even though it was the big money-maker, no team owned it and nobody understood how it worked. I don't remember why our service had to share a host with this monster, but there was a good reason and it just couldn't be worked around.
