I've been in various IT administration, development and DevOps positions for the last 20 years with differing "on call" responsibilities, and have never had anything as intrusive as this.
Getting to the point - my current manager says that getting paged every day of your primary support shift is "normal in the industry for operations". While this definitely doesn't match my personal experience - I'm curious: do any of you in technical support roles with "on call" responsibilities get paged this frequently? If not, what does a "normal" shift look like for you?
Thanks kindly for any feedback!
If you're constantly fixing the things causing you to get pages, why are you still getting many more than one per day? Is it just prioritisation of other work over fixes?
We have a similar system, though we have one person on after hours support doing normal work during the day, and one person during the day who doesn't do normal work. That person works on remediating the issues that cause people to get paged. Leads to a pretty low number of pages.
When on primary on-call, we're also generally not expected to make progress on project work, although we don't review all our incidents after our shift (generally just major ones). I think there's definitely room for improvement here.
After-hours calls should come infrequently, or in situations where someone’s personal involvement (for example, as the engineer with primary responsibility for a particular component or its maintenance) is indispensable.
In my experience, things that need a lot of unplanned attention are more likely to fail, if they haven’t already, in ways that have other unacceptable consequences. Fixing them should be a priority for this reason, too.
You haven’t mentioned why you keep getting paged. Is it the same problem repeatedly, or lots of different problems? Is there any hope of addressing the underlying causes?
It's decently common to have engineering teams oncall for their own services, with a regular PagerDuty shift as part of the job. In that case 5-7 alerts per week is pretty healthy. It sucks that you need to keep your work laptop with you and stay sober / within cell coverage, but even then it's pretty rare to catch an actual outage that requires significant attention.
There are actions being taken to fix both the number of customer support cases and the systems issues - but progress is slow, and our appetite for implementing every customer-requested change ends up adding lots of new problems.
In the first, I was on an on-call rotation for one weekend a month for two years and got called twice.
You were paid 1h of work if you weren't called and 4h if the phone rang; if you worked for more than 4h, it went straight to a full day.
Currently I'm on call and only get paid if called, but my manager only calls me in critical situations - I've been called twice in a year and seven months. If I get called, I get half a day of work paid.
We've invested a load of time reducing the frequency of paging incidents over the years; the entire technology organisation recognises the importance of fixing said incidents and how disruptive they are to people's lives, sleep, etc.
At a previous company I was on call every second week and would receive a call maybe once every few months. That was with many hundreds of servers.
At another company I'm on call one week per month and get called once or twice. That's with just a few hundred servers.
In the first case all time was reimbursed in lieu. In the second case my salary more than makes up for any inconvenience.
However in both cases I was very proactive in defining what is on call - critical production issues only. If it’s not critical or not production then I won’t log on to look at it.
And in both cases I had a LOT of false alarms from bad alerts when starting. I had all false alarms disabled.
You’ll get push back but I didn’t care - you can’t have an alarm waking up people every night on the off chance that one in a hundred will actually be an error. And hilariously, if you started including your boss on the call, they’d quickly agree it’s not acceptable. The human cost isn’t worth it.
While there's often tonnes of room for improvement in monitoring and alerting (root cause analysis etc.) that others have mentioned - in my experience most of the metrics and alarms are garbage anyway, and can and should be done away with. If it came from a boxed product, nearly all of it should be turned off from the get-go. That crap is always pointless.
Oh no a server CPU usage has increased and memory is low because - it’s doing what it’s meant to? What junk.
The key criteria for me and paging are:
1. Was the page actionable? Did I need to do something to restore the system to functioning or prevent it from going down?
2. Can I prevent this page in the future and most importantly am I empowered by leadership to do that? If your app is paging me because it’s poorly made and I am not authorized to change it that’s a leadership problem that’s extremely common.
3. Are we auditing the pages? Often alerts in technology are designed in response to a particular problem and then never removed. Paging is, to me, a very serious action for a system to take. It means it is impossible for the system to naturally recover and all automation has failed. So every time we page someone we should as a team review those pages to ensure they’re actionable and actually impossible to naturally recover from.
These criteria have served me well for years and caused me to turn off the vast majority of the alerts of my services.
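Criterion 3 - auditing the pages - is easy to automate. Here's a minimal sketch of that kind of audit; the `INCIDENTS` list and its `alert`/`actionable` fields are hypothetical stand-ins for whatever export your paging tool provides:

```python
from collections import Counter

# Hypothetical export of a week's pages: each page records the alert name
# and whether the responder actually had to do anything.
INCIDENTS = [
    {"alert": "disk_full", "actionable": True},
    {"alert": "cpu_high", "actionable": False},
    {"alert": "cpu_high", "actionable": False},
    {"alert": "api_5xx_rate", "actionable": True},
    {"alert": "cpu_high", "actionable": False},
]

def audit_pages(incidents):
    """Group pages by alert name and count how many needed action.

    Returns a dict: alert name -> (total pages, actionable pages).
    Alerts with zero actionable pages are candidates for removal.
    """
    totals = Counter(i["alert"] for i in incidents)
    actionable = Counter(i["alert"] for i in incidents if i["actionable"])
    return {name: (totals[name], actionable.get(name, 0)) for name in totals}

report = audit_pages(INCIDENTS)
noisy = [name for name, (total, acted) in report.items() if acted == 0]
print(report)  # per-alert (total, actionable) counts
print(noisy)   # alerts that only ever produced noise
```

Reviewing a table like this as a team after each shift makes the "turn that alert off" conversation a data question instead of an argument.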
But you seem to have a culture that accepts this as normal and tbh these rarely change. Just know that it isn’t normal and it’s not acceptable.
There is effort to try and resolve the underlying problems, and we do make some headway here - we just keep adding changes to satisfy customers which end up causing new issues. We're being told this will get better over time, but it's certainly not happening fast enough IMHO.
Again, thanks for the feedback and insight!
Are the people in charge of fixing the underlying issues themselves on call? How about the people producing the changes that cause new issues?
If those two groups aren't themselves being woken up when there's a problem, you can reasonably expect that this won't change until the support calls start to directly affect the company's bottom line.
My employer makes use of Pagerduty and I’ve spent a lot of time setting up “auto-resolve” of alerts. I even hook into AWS autoscaling lifecycle events and send mock “OK” actions when something gets terminated that had thrown an alarm. I still get paged but most issues solve themselves if I wait one more monitoring interval.
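The lifecycle-hook trick above can be sketched roughly like this - note the dedup-key scheme (`host-alarm/<instance-id>`) and the routing key are my own assumptions, not details from the post; PagerDuty's Events API v2 matches a "resolve" event to an open alert by its `dedup_key`:

```python
import json

# Sketch of auto-resolving alarms for terminated instances: when an EC2
# autoscaling lifecycle notification says an instance is going away, build
# a PagerDuty Events API v2 "resolve" event so nobody gets paged about a
# box that no longer exists.
#
# Assumptions (hypothetical): alerts were originally triggered with a
# dedup_key of "host-alarm/<instance-id>", and ROUTING_KEY is the
# integration key of the relevant PagerDuty service.
ROUTING_KEY = "YOUR-INTEGRATION-KEY"

def resolve_payload_for(lifecycle_event):
    """Build an Events API v2 'resolve' body from an autoscaling
    lifecycle notification, or return None if it isn't a termination."""
    if lifecycle_event.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return None
    instance_id = lifecycle_event["EC2InstanceId"]
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "resolve",
        "dedup_key": f"host-alarm/{instance_id}",
    }

# In a Lambda you'd POST this body to https://events.pagerduty.com/v2/enqueue;
# here we just print the payload that would be sent.
event = {
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "EC2InstanceId": "i-0abc123",
}
print(json.dumps(resolve_payload_for(event)))
```

The actual HTTP POST is left out to keep the sketch self-contained; the interesting part is deriving a stable `dedup_key` from the instance ID so the resolve event lands on the right open alert.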
I've also used being on call as an excuse to leave early - to ensure I'm home and able to respond to calls when everyone else leaves the office; there's not much I can do if I'm stuck in traffic, or in a tunnel, etc.
The question you should be asking is: why am I being paged so often?
Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again. If anyone gets a page, we make it a high priority to fix whatever caused it. We are a team of 7, and we dedicate one person a week to field questions relating to our platform as well as to fix up these issues that wake us up.
If they're not legitimate things that you need to be woken up for, why are you being woken up? If this is the case, you need to make sure everyone is on the same page regarding what constitutes something you need to be paged for after hours.
> The question you should be asking is: why am I being paged so often? Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again.
This is mostly due to not having anyone else around to handle customer issues (which currently require manual intervention), though system issues are pretty frequent here as well. Management is working on prioritizing the automation of the customer issues so that there are fewer of them in total, but system issues will likely be harder to resolve (we try to fix them as they come up if possible, but many stem from systemic technical debt).
So yes - I'm only including the events that are actionable and require breaking out the laptop - these generally vary from 15 minutes to 3 hours of support.
So basically, I don't work at companies that make their employees carry a pager etc. Life is too short for that shit.
I worked briefly at a startup shortly after its acquisition by a FAANG. The startup's code was trash - while on call, I acknowledged that I didn't exactly know what was going on after digging for a while, asked for help, and was then reprimanded for "not knowing the code well enough" - basically because I asked for help. I left about a month after that. Again, life is too short for that shit.
Normal shift is like every other day. Just go to work, do my job. Come home, eat, chill a bit and go to sleep.
It used to be more. The company started with three people three years ago (myself included). Now we're over 50. We have enough resources to fix and solve problems before they become real problems.