Excellent article. I can relate to a lot of it. The sad part is that we can't even control the quality of the systems we're oncall for. We're pushed by management for new features, not for robustness of the tools. Also some systems have no clear ownership, so nobody has an incentive to fix them. It'll be the next oncall's business. Oncall is really the worst part of my job. I can stand long hours but this is something else.
One of the sidebars mentions that: "The production system in question is almost certainly a schizophrenic box of compromises brought about through poor decision-making, unaddressed technical debt, design-by-committee, and impossible timelines and budgets. This is not a system that any single rational human being on the team would’ve chosen to build if permitted to do so alone. Trying to assert ownership over an environment like that is just begging to get your shit rocked."
That’s basically every company and every system ever. Things are always in a state of flux, constantly being worked on. People come and go, priorities change and technologies evolve.
> Also some systems have no clear ownership, so nobody has an incentive to fix them.
It’s even worse when the system isn’t business-critical: a reporting service, a manual intervention tool, something that quietly supports a process. When it fails, everyone is affected, but no one is accountable.
Ironically, these are often legacy systems that have been rock-solid for years — so reliable they’re forgotten… until they break.
I feel for you. I’ve also suffered through this a lot over the years, and I’m finally at a stage of career and wisdom where I can start pushing back on quality problems that I can’t control and ensuring that others are equally accountable for their mess.
On one particular occasion, once we took blame out of the equation (at least within the engineering team) and started doing Post Incident Reports, the incentives finally became clear to the business. We were able to compile a list of recurrent issues for every incident, calculate the financial loss, and present it for inspection each and every time they either began a witch hunt over downtime or refused to allocate time to the backlog. Small wins.
I've been on call for almost 20 years. If a system is crashing often and disturbing your sleep but is not being prioritized to get fixed, then stop answering the calls. If it was important last night then it's still important this morning.
Be vocal and say "I will no longer respond when system XYZ goes down unless serious efforts are made to fix it."
If you get pushback, explain that each time it pages you will also call the person telling you it's not important enough to fix, and be willing to do so. What can they say?
So your solution is just "Get fired"?
If your case is valid, you won't get fired for standing up for yourself. You might take some political damage, but the guy willing to waste your sleep time was going to lay you off/betray you anyway.
> You won't get fired for standing up for yourself.
Yes you will, or can. "At will" employment in the US.