Production support alone is not that much of a problem. What the author skipped (conveniently? or forgot to mention?) is - it's really the "on call" phenomenon that's the problem.
The "typical" on-call - where when you are on-call you are magically on-call 24x7. Yes, during your sleeping hours as well; as if that's less important and the company can avoid spending money to hire dedicated support for those hours and instead make you suffer (yes, it's just that - there's no other name for it like "satisfaction", "learning", "growing" or any of those buzzwords).
You want engineers to do production support? Well, let them do it during normal office hours and only a few times a month. Or heck, let them do it for weeks but let them punch in and punch out at normal office hours. Let them choose to do only one half of the day and have someone else willing to do the other half.
There's no excuse for burning out engineers (esp. unsuspecting youngsters) by pushing them into ungodly hours of work ruining their health among other things while trying to constantly tell them - "do you even realise what a service to humanity you are doing!".
It's just exploitation.
If a company thinks an application is important enough to run 24x7 then it should staff for 24x7 support. Stealing wages from workers by expecting them to be available 24x7 (on-call) is an absolute abuse.
It also leads to burnout, poor performance during the day (how is a dev's development ability when they were up at 2:30am on an incident call the night before?), and clouded thinking that causes mistakes or slows recovery during incidents.
And where it really matters, they do. My team and I build and manage a large Emergency Services telecommunication network. We have Tier 1/2 operators on shift work 24/7. Tier 3 staff (programmers, system integrators and administrators) are their escalation point for critical issues outside of business hours.
> Stealing wages from workers by expecting them to be available 24x7 (on-call)
The Tier 3s who are on-call in our environment are on a rotating roster and are compensated nicely for being prepared to answer the phone outside business hours. Frequently they don't get called during their week at all and it's free money.
> how is a dev's development ability when they were up at 2:30am on an incident call the night before?
Easy, as well as the financial compensation, we give them time in lieu. Two hours callout in the middle of the night, two (paid) hours given back on their next working day, or whenever they prefer, subject to availability of other staff.
There are simple solutions to these problems, and where they matter, they are applied. Granted things are very black and white for us as lives are potentially at stake, but any company that wants to have 24/7 engineers available needs to pay for that kind of support.
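As a rough illustration of the hour-for-hour time-in-lieu rule described above, here is a minimal Python sketch. The `Callout` type and the `time_in_lieu` function are hypothetical names invented for this example, and the one-to-one ratio is simply the "two hours callout, two hours given back" rule from the comment; a real roster system would likely need more (caps, approvals, staff availability).

```python
from dataclasses import dataclass


@dataclass
class Callout:
    """One out-of-hours callout (hypothetical record for this sketch)."""
    engineer: str
    hours: float  # how long the engineer was engaged on the callout


def time_in_lieu(callouts):
    """Return paid hours owed back per engineer, hour for hour."""
    owed = {}
    for callout in callouts:
        owed[callout.engineer] = owed.get(callout.engineer, 0.0) + callout.hours
    return owed
```

For example, `time_in_lieu([Callout("sam", 2.0)])` credits "sam" with two paid hours to take back later.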
I would not trust someone who I just woke up at 2am to do something. He/she is mid-sleep. They will be prone to errors, they will be super tired, and I just ruined their next 1.5 days that it will take them to recover from that.
This is not a job where you lift boxes and intellect is not needed as much (and even strength and stamina will be affected by a mid-night alarm). You want your folks to be at 100%, otherwise they may make things worse.
"We need [insert thing manager asks for here] immediately" has consequences.
You are less of a person and more of a means to an end. A tool to achieve something, and some tools are disposable. It can be of career advantage to a manager to burn out engineers. Maybe instead of spreading 24x7 on-call across 3 teams in three timezones, you put it on 1 team in 1 timezone. By doing so a manager can achieve a lot with fewer resources, and hopefully secure their own elevation up the corporate ladder before the cost of their strategy becomes evident.
The cost of burnout, I think, remains hidden. In technology there's a constant flux of staff anyway, teams being created and dissolved, and in all the noise a few people being exhausted and bailing from the company is hardly noticed. Perhaps they said something before they left, but it's best for everyone in middle management if the burnt-out individual is labeled the problem: they were a bad culture fit you see, a grumbler who didn't have what it took.
I'm happy to hold the pager if I've also got the right to block/rollback deploys until the system is stable - my current job has had two out-of-hours pages in the past year, and we're in the alexa top 10k so it's not like there's no traffic.
I've been at this for 35 years at many companies and working with many teams and it's always the same: if you want good software then make the team creating it also support it. In every case I've experienced it leads to software requiring little to no support, easy to maintain, and easy to extend. Why? Because nobody wants to get up in the middle of the night or work weekends and moreover, they'd rather be adding features than limping along with existing features.
I've worked at too many places that had no SWEs on-call for the on-call alerts that I get, which the vast majority of the time involves throwing a bandaid (as in redirecting traffic, etc) in front of an internal bug that I hope eventually gets fixed once the RFO/etc has been submitted before it hits my NEXT on-call rotation or my poor coworkers.
Without SWEs on my rotation they don't understand the immediacy. They aren't the ones getting their Christmas week interrupted every 4 hours while ops keeps the house running. In Ops having your entire day ruined by various on-call alerts usually feels like you're working without any breaks and nobody even cares.
Anyone want a bad golang developer, wannabe ex-ops person who knows a lot about platform reliability and o11y and wants to focus on the golang end finally? I'll make your team's automation and o11y purr no matter where it is (bm, cloud, global pops, serverless...).
It creates an actual market for on-call work where engineers can simply say no to the extra cash if they don't like work taking up their nights and weekends. If the company is having trouble finding engineers who want to be on call, the pay is simply too low and needs to be increased. It's a job like any other and should be compensated as such.
In the end I honestly believe it will be beneficial for the company not having engineers burn out so quickly. Compensation also clearly sets the expectations — if you're being paid to do it you'll take it more seriously.
Just my 2 cents
Where I work we are not on-call. Nevertheless, I try to help the ops team when they encounter issues. This does make you improve logging and error handling since you know it takes a lot more time when it's difficult to filter logs for the interesting events.
Engineers not exposed to production issues and customers will never understand why you need these extra measures.
I don’t know who needs to hear this besides me twenty years ago, but if you want to do charity then go home at 6pm and volunteer at a real charity. Don’t do it for a wannabe robber baron who will not share with you. Don’t do it for someone where even an emotional payoff is years away or may never come.
Find something else you care about and help some people just because. Not because you’re getting under-paid and over-guilted to do it.
On the other hand I had a friend at a very large and well known company. He got a job offer and was hired into one department, but he wanted to take a little time off between jobs before he started. They somehow convinced him to start right away, saying there was a holiday coming up and he could take the time off then.
And as soon as he came in he started getting calls at 1am, 2am, 3am, etc...
So he left.
And they cajoled him back saying things were different and he finally bought it and went back.
Same thing happened again, and he quit for a final time.
One part of the problem was that he was a US citizen working with a bunch of H-1B visa folks and the company could get away with that sort of stuff. H-1B folks will say yes sir, no sir because their dreams of living in the US are tied to keeping their job at all costs. And then the bad work culture festers.
I ran the Engineering org for a startup and we had a small, 3 person Ops team that handled initial triage of events. About 75% of these issues were Engineering-related. My solution was to a) create an on-call rotation for Engineering and b) allow the Engineers to prioritize reliability work.
It sounds like a no-brainer, but I had to fight with the rest of the exec team to allow b) to happen, since it came at the expense of the product roadmap. I eventually won the fight and our nightly on-call volume went from 1-2 incidents per day to 1-2 every few months.
Vindication came about a year later when we were acquired by a large company. As part of the due diligence process (including 18 hours with me going over technical details in front of 30 senior folks from the acquiring company) we got major kudos for having a level of reliability that far exceeded what they typically saw for a company our size.
When they ask it’s usually after a giant hole has been dug, and patterns have been set. If you knew from the outset that you would be the support team then you’d have prioritized some other tickets. You’d have increased the estimates on others. You would have refused to work on these three, you would have argued vigorously about these four decisions, and you would have insisted your boss fire “That Guy” months ago because his code is garbage and his only real skill is articulate deflection.
This group of folks wants several somethings for nothing. One of them is labor, another is somewhere to assign blame. They are grooming you for failure and we all deserve better.
In the US, this is happening across the board and not just in tech. The expectation to always be available [often without monetary compensation] is sadly the new normal. Without strong labor laws in place, this implicit form of exploitation will never cease.
seriously
I get that you like unions but just because you have a hammer doesn't make every problem a nail.
If it's a serious issue they can't handle they might wake up one of us programmers, but usually they can find some temporary fix or workaround until the next morning.
They had to quit to get out of support.
However, production support teams don’t have a real understanding of our application and how it’s built. So most of the time you have engineers on a call with production support, telling them how to debug the problem and come up with relevant logs.
It’s incredibly infuriating and time consuming, and I absolutely hate doing it this way.
90% of the time you also get incredibly vague bug reports with irrelevant logs, and a description of what they think the problem is. Most of the time you need to spend another day finding the correct logs and somehow debugging it. Most teams log every single request with all parameters and payloads so that they can just replicate the problem locally instead of relying on production support.
We’ve long advocated for either having dedicated support or having engineers on some sort of schedule so that they can do support.
Twenty years ago, on many systems, devs could do whatever they wanted in production. There were insider trading scandals and, combined with SOX, regulators cracked down on it, so now devs have lost at least write access. If you have an old system that relied on knowledgeable devs to fix stuff, it's a terrible situation where people just quit and no one can support it.
You're right that production support has access to those systems, and could potentially make changes and install different binaries, but the number of people that can do that is extremely limited. Every change also requires a change request that needs several approvals, and to request data you need another data request.
They can, they just can’t have direct access to live systems due to separation of duties. But there are methods for dealing with this, like centralised logging so a developer never needs to see the original log file on the problematic box.
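To make the centralised logging point concrete, here is a minimal Python sketch using only the standard `logging` module. Everything named here is an assumption for illustration: the `payments` logger name, the `request_id` field, and the `find_events` helper are hypothetical, and an in-memory stream stands in for whatever central log store a real setup would ship to (for example via a syslog or HTTP handler). The idea is only that structured, searchable log lines let a developer filter for the interesting events without touching the production box.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so it is easy to filter."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id is a hypothetical field passed via `extra=`
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger writing JSON lines to `stream` (stand-in for a central store)."""
    logger = logging.getLogger("payments")  # hypothetical application name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def find_events(log_text, request_id):
    """Filter collected log lines down to a single request."""
    events = [json.loads(line) for line in log_text.splitlines() if line]
    return [event for event in events if event["request_id"] == request_id]


def demo():
    """Log two requests, then pull out only the failing one."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("charge accepted", extra={"request_id": "r-1"})
    log.error("charge failed", extra={"request_id": "r-2"})
    return find_events(buffer.getvalue(), "r-2")
```

In this sketch the developer only ever queries the collected text; access to the original host is never needed, which is the separation-of-duties property the comment describes.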
"We don't trust" a dev. The change management processes demand the existence of 1) Dev, 2) Librarian (we used to call them that)(that would review and transfer the code, or review and compile the code), 3) the prod sys admin.
Some orgs may have a slightly different setup, but in some form or another these general rules apply.
Today, with tools like CyberArk, it is easier to grant temporary privileged access to a dev for production support. We also have the tools to trace, monitor and record access, which makes the process auditor-friendly.
To be fair, being great wasn't enough; their job was only possible because the company had unified tooling: a single deployment solution that was deploying nearly 1M tasks a day in the company, allowing all employees to look up what is running where and see logs.
This made me appreciate just how useful it is to have both dedicated support AND unified tooling. The average company couldn't benefit from having folks on rota because it's impossible to figure out where anything is running.
The thing is, this all is pushed down from management. In my previous project, we tried to automate as much as possible, but at the end of the day, our production support still wanted to deploy manually. Our business still wanted to see manual end-to-end tests with screenshots.
Then there's also different regulations in certain countries where you need to host your application and database in the country itself, so that's another solution.
Working in finance can be a real eye opener sometimes.
This reeks of bad documentation to me (which finance is notorious for). If a dev has to be on to support normal prod ops, that's largely due to errors in documentation and often to poor tooling. Sometimes those errors aren't so much the dev's fault because of management decisions, usually related to understaffing, but I hate how prod support gets shit on so often for failing to fix an issue when it's not really their fault.
> This reeks of bad documentation to me
Not necessarily, you can document your entire application, but production support only looks at the logs, and does a data extract based on what they see. It would be far more beneficial if you had someone who has a clear understanding of the application so that they can help with debugging and actually solving the problem.
At the end of the day, production support are teams who help with 10-20 applications; it's impossible for them to truly understand specific applications. They receive a bug report from the business, investigate and extract logs, then pass it to the relevant development teams. If you need extra info, well, tough luck: you can reply to the ticket and wait for it to be picked up again. It's no surprise companies like this move so slowly.
On the off chance that a dev has the unique knowledge to solve a problem, they may get the firefighter/temporary elevated access needed, but the reason and the dev's actions will have to be documented very, very well, because both internal and external auditors will zero in on that.
My code will crash sooner or later. I already know that. I don't write 100% bug-free code. But I cannot accept giving 100% of my time one week per month or so to a company in exchange for money. I just don't understand why people can't understand that I can be a professional only during 8 hours per day, but not more.
On-call: "Hey devs, I'm being woken up at 3AM because your app sucks. Please fix it." Devs: "Sure, no problem."
4 months go by
On-call: "These alerts are still coming in at 3AM. Did you fix the issue?" Dev: "We have a lot of work, we can't dedicate all our time to some minor problems, we have a deadline."
Next week, Devs are put on-call.
The alerts are fixed in two weeks. Site reliability goes up. Apps suddenly become more resilient to failure.
Honestly, the whole attitude of not wanting to work more than 8 hours is privilege. Most of the rest of the world works long hours. As a dev, you get a good salary and a job you don't have to break your body to do. The least you can do is be completely responsible for your own code.
And it helps you as an engineer. Like the article points out, it creates empathy for the users and product support engineers, it helps you improve architecture and app design, and it helps you understand different failure domains. You won't learn all that on your own time, especially without the scale of production.
> Honestly, the whole attitude of not wanting to work more than 8 hours is privilege. Most of the rest of the world works long hours. As a dev, you get a good salary and a job you don't have to break your body to do. The least you can do is be completely responsible for your own code.
Unless I signed a contract that states I will do on-call, I'm not gonna do on-call. It doesn't matter how long the rest of the world works.
Well, that's another problem (the dev not being able to solve a bug that reappears at 3AM).
> The alerts are fixed in two weeks. Site reliability goes up. Apps suddenly become more resilient to failure.
I always wondered why DevOps has the "Dev" in its title. At least, in most of the companies I have worked at, it was the DevOps people who were on call (paid), but they were very picky regarding what they could touch/work on (they almost never touched application code... we should call them "Ops" then, no?).
> Honestly, the whole attitude of not wanting to work more than 8 hours is privilege.
And it's a privilege I'm thankful for. What's wrong with that?
> As a dev, you get a good salary and a job you don't have to break your body to do.
We do break our body to do software engineering (our brains, to be more specific). If you think physical work >>> brain work, well, that's relative. Every person is different, and for me, brain work is as taxing as physical work.
> And it helps you as an engineer. Like the article points out, it creates empathy for the users and product support engineers, it helps you improve architecture and app design, and it helps you understand different failure domains
I know I can become better by working harder and smarter (it's obvious), but I just want to be the best version of myself by putting in at most 40h/week. Isn't that something honourable in itself? Or does that make me a "bad engineer"?
Most of the world works labor jobs, and studies have shown that the body can work longer than the mind without burnout.
I would argue all developers should be required to do some support work.
Too often I see BigCorp development teams seeming blatantly oblivious to where their pain points are, and it's because they aren't forcing their developers to do support. They're pushing code, but they aren't pushing code that solves real problems for people.
No one expects customer support people to write code. Why? Because they don't have the skillset.
Yet people who make this argument seem to think any moron can do support.
The skillset of an engineer is not a superset of that of a customer support person.
Have your engineers sit in on support, by all means, but actually making them DO support will result in unhappy engineers and sub-par support.
Do not undervalue a good support person. They have a whole suite of skills engineers often don't have.
You aren't looking for the hardest problems, you're looking for the problems your users hit the most that an engineer could reduce in the product.
I'm not saying you wouldn't learn from working on production, but whether it's worth the stress is another question. In terms of software development, it's hard to think of a worse feeling than when you do a production deploy, you hit refresh on the website or whatever it is, and it shows a fatal error, then there's a mad scramble to roll back the change and figure out quickly what went wrong before the consequences grow too great. Most of the time bosses + coworkers aren't that understanding about it either and get into finger-pointing.
They're never offered extra work though. Companies are always willing to wait for Monday when they are asked to put money on the table.
Production support is customer support: responding to chat messages or communications from users.
An on-call rotation, on the other hand, involves responding to production incidents and mounting a proper incident response.
The Google SRE workbook has a great chapter on the subject: https://landing.google.com/sre/workbook/chapters/on-call/
Or to stagnate, depending on how you look at it
I have always had mixed feelings about "on call". I dread my turn on the rotation because the imminent threat of a prod issue has a psychological impact on my entire week, even off hours, and usually for a day or two after.
If everybody on the team feels that way, maybe it can act as a forcing function for product quality. I've seen this work on teams that already cultivate a strong sense of ownership.
On the flip side, it really stresses me out, and I sometimes resent that I'm not getting paid overtime for 24hr on call days. Maybe that's just baked into an engineer's salary these days, though...
What I want is to run an engineering organization as if you should never have to call us. And if you do you either get chewed out for making a frivolous call, or we’re falling all over ourselves because that thing that is happening should definitely not be happening and we’ll be looking at how to keep that from ever happening again, again.
I've also seen folks spotted extra time off for really gnarly oncall shifts. Folks should push to have such accommodations standardized.
A good structure is to have first line support be relatively generic ops people. They can handle problems related to infrastructure, e.g. hardware failures, network problems, or issues that can be handled by adding resources. The deployment process should be consistent enough across applications that they can e.g. roll back to a previous release.
This covers the majority of production problems. After that, it's time to bring in someone who understands the details of how the application works. If the dev team is geographically distributed, then someone is available during working hours. Otherwise, we have to get someone out of bed.
If the dev team has done their job right, this should be a rare occasion. Making the dev team fully responsible for the reliability of the application means that they are motivated to make it reliable. Otherwise there is a tendency to have an underclass of ops people who get abused.
A fundamental mindset here is taking responsibility for the user experience, including reliability. If this is not owned by the product development team, then who?
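The tiered structure described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the `Tier` names, the `INFRA_CATEGORIES` set, and the `route` function are hypothetical and not taken from any real paging tool; the categories simply mirror the examples in the comment (hardware failures, network problems, adding resources, rolling back a release).

```python
from enum import Enum


class Tier(Enum):
    """Who handles the incident (hypothetical labels for this sketch)."""
    OPS = "first-line ops"
    DEV_WORKING_HOURS = "developer currently in working hours"
    DEV_PAGED = "developer paged out of hours"


# Problems generic ops people can handle without application knowledge.
INFRA_CATEGORIES = {"hardware", "network", "capacity", "rollback"}


def route(category, dev_in_working_hours):
    """Pick a tier: ops first, then a dev, waking someone up only as a last resort."""
    if category in INFRA_CATEGORIES:
        return Tier.OPS
    if dev_in_working_hours:
        # A geographically distributed dev team has someone at their desk.
        return Tier.DEV_WORKING_HOURS
    return Tier.DEV_PAGED
```

For instance, `route("network", False)` stays with ops, while `route("application_bug", False)` is the rare case where someone gets out of bed.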
> I no longer work at Gojek