Item 43498874

slt2021 • 4 days ago

being oncall forces the quality of software to improve.

if you want fewer incidents: ensure better QA, monitoring, smaller rollouts

usually developers start becoming more conservative after they do a few oncall shifts and suddenly prioritize important reliability improvements, instead of shiny new features nobody will use

geoffpado • 4 days ago

Being on-call forces the *desire* for the quality of software to improve. Shitty management can and will override that. We don't have time for QA or to waste an engineer adding monitoring, we gotta ship ship ship.

denkmoon • 4 days ago

Only a manager could have such a distorted view. I'd love to work on robustness but product management has 5 years' worth of feature JIRAs lined up for me.

slt2021 • 4 days ago

need to bake some refactoring time into regular tickets. PM should only care about features, while software devs should provide reliable estimates on the velocity of sustainable software development

geoffpado • 4 days ago

Ah, but when Alex can do 4 tickets a week baking refactoring, maintenance, tests, and observability into their work, and Blake can do 8 tickets a week focusing only on features, who do you think is going to get promoted?

These incentives then quickly devolve into a classic prisoner's dilemma. There are huge incentives to "defect" by producing quick-but-dirty work. You get the benefit of looking like you're producing rapidly, but you've made the collective experience a little bit worse.

slt2021 • 3 days ago

hm... if the team is agile, then everyone does refactoring and it is the team lead's job to assign tickets and evaluate. The team lead should have enough context to compare apples to apples.

if your work improving the codebase is not valued, then it's probably time to change jobs or just stop caring about code sustainability - let the business accrue technical debt, which is sometimes a viable strategy if your runway and planning horizon are limited

srhtftw • 4 days ago

> being oncall forces the quality of software to improve.

Only when it's the managers that are on call.

applecrazy • 4 days ago

are they usually not? is there no industry standard concept of an escalation manager?

srhtftw • 4 days ago

I believe that too is as the author wrote – like a disheartening number of things in the tech industry, there are no real standards around what on-call responsibilities look like. Each organization is free to set things up in whichever way suits their tastes, and the resulting practices vary widely as a result.

MisterBastahrd • 4 days ago

Yes, makes perfect sense. I know when I want my horse to go faster, I don't entice it with more carrots, I just try to find better sticks to beat it with.

darioush • 4 days ago

this doesn't always work. many things can go wrong in distributed systems and you cannot test for all of them. also you have no control over your dependencies, like when AWS networking degrades or a 3rd party API provider changes their APIs without letting you know.

nhumrich • 4 days ago

True, but these things happen very, very rarely. Also: is there anything you can do about it? No? Remove the alert and replace it with a "we are down, sorry" message. Yes? Then automate that thing.

Rinse and repeat after every incident and you will eventually get paged rarely.

toast0 • 4 days ago

I think if you have a reasonable environment, where on-call feeds back to development (which is what OP is suggesting, more or less), you will absolutely get woken up for networking problems, because there's not really an alternative. Maybe some thresholding to allow for minor problems without alerting, but you know. If it's a big enough problem, someone has to fix it, and it doesn't matter if it's your problem or your dependency's problem, it breaks your service so it's your problem. If it happens a lot, you look for another network to run on.

For 3rd party APIs, if they're not critical, you start to develop kill switches. So yeah, someone has to wake up and handle it, but all they have to do is set the kill switch and go to sleep.
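To make the kill switch idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `KILL_<NAME>` environment variable convention, the `fetch_recommendations` feature, and the `call_api` callable are hypothetical, and a real setup would more likely read the flag from a config service so it can be flipped without a redeploy.

```python
import os


def kill_switch_on(name: str, env=os.environ) -> bool:
    """True when the operator has flipped the (hypothetical) switch for `name`."""
    return env.get(f"KILL_{name.upper()}", "0") == "1"


def fetch_recommendations(user_id: int, call_api, env=os.environ) -> list:
    # Non-critical feature: degrade to an empty result instead of
    # calling a vendor that is currently misbehaving.
    if kill_switch_on("recs", env):
        return []
    return call_api(user_id)
```

With something like this in place, the page at 3am turns into "set `KILL_RECS=1`, confirm the errors stop, go back to sleep".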

Personally, I did dev and on-call for SMS/Voice verification codes. Most of the time, that's in the nasty corner where it's super critical to the application (users can't use the product if they can't get a verification code) and it depends on 3rd parties that have three nines in a good year. In my case, I got tired of dealing with the disruption and developed automated routing that could manage most providers taking an outage without needing me to take action. Results could be better if a human took notice and action, but it was good enough most of the time, and partial outages were much easier for the automation to handle.
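A rough sketch of what automated routing across providers could look like, not the actual system described above. The `ProviderRouter` class, the callable-per-provider interface, and the 20-attempt window are all assumptions made up for the example.

```python
from collections import deque


class ProviderRouter:
    """Send via the provider with the best recent success rate, falling back on failure."""

    def __init__(self, providers, window=20):
        # providers: dict of name -> callable(message) that raises on failure
        # (hypothetical interface, for illustration only)
        self.providers = providers
        self.history = {name: deque(maxlen=window) for name in providers}

    def success_rate(self, name):
        attempts = self.history[name]
        # No data yet: assume the provider is healthy.
        return sum(attempts) / len(attempts) if attempts else 1.0

    def send(self, message):
        # sorted() is stable, so providers with equal rates keep their configured order.
        order = sorted(self.providers, key=self.success_rate, reverse=True)
        for name in order:
            try:
                self.providers[name](message)
            except Exception:
                self.history[name].append(0)
                continue
            self.history[name].append(1)
            return name
        raise RuntimeError("all providers failed")
```

One known gap in this sketch: a demoted provider is only retried when the better-ranked ones fail, so its score never recovers on its own. A real version would need some periodic probing of demoted providers.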

Even if there's no way to do something like that, at least automation can take care of the case where a 3rd party API is failing hard: mostly return errors quickly without trying the API, and only let a small fraction of requests go through to sample whether the service has come back online. That can keep your servers from getting overwhelmed, as well as drive the alert that helps you wake up, yell at the vendor, and decide if you can go back to sleep while the system takes care of itself.
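This is roughly the circuit breaker pattern with a sampling probe. A minimal sketch, where the `SamplingBreaker` name, the failure threshold of 5, and the 5% sample rate are arbitrary assumptions rather than recommended values:

```python
import random


class SamplingBreaker:
    """Fail fast while a dependency is down, letting a small sample of calls probe it."""

    def __init__(self, failure_threshold=5, sample_rate=0.05, rng=random.random):
        self.failure_threshold = failure_threshold
        self.sample_rate = sample_rate
        self.rng = rng  # injectable so the behavior can be tested deterministically
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn, *args):
        # While open, reject most requests immediately without touching the API.
        if self.is_open and self.rng() >= self.sample_rate:
            raise RuntimeError("breaker open: failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.consecutive_failures += 1
            raise
        # Any success (including a sampled probe) closes the breaker.
        self.consecutive_failures = 0
        return result
```

The `is_open` flag is also a natural thing to alert on, since it flips only after sustained failure rather than on a single blip.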

When on-call is disconnected from development, that's when it gets really miserable. If you can swing a shift-work/follow the sun operator job, that's certainly better than on-call where incidents are common and there's no feedback loop to reduce things. It may well be better even if there is a feedback loop, but the feedback loop in that case requires explicit communication and effort; if I'm on call for my own work, I don't have to tell me to not push shit code right before I leave for the day or go to sleep, I'll get that message from myself right away. If someone else is cleaning up after my messes and doesn't communicate the effects to me, I might never know.

slt2021 • 4 days ago

you are right, if your software is suffering regularly from the same issues - you go and fix them at the architecture level.

network issues? run from a second DC, or use some HA SDN setup.

3rd party API issues? Change vendor, or send stuff to a queue to reprocess later. All of these issues could and should be solved, and that's the job of the developer
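A minimal sketch of the "queue to reprocess later" idea. The `RetryQueue` class and its `send` callable are hypothetical, and the in-memory deque stands in for a durable queue, which is what you would actually want so parked work survives a restart.

```python
from collections import deque


class RetryQueue:
    """Park work that failed against a 3rd party so it can be replayed later."""

    def __init__(self, send):
        self.send = send  # hypothetical callable(item) that raises on failure
        self.pending = deque()

    def submit(self, item):
        try:
            self.send(item)
        except Exception:
            # Park the item instead of dropping it or paging someone.
            self.pending.append(item)

    def drain(self):
        """Retry parked items in order, stopping at the first failure.

        Returns the number of items delivered.
        """
        delivered = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except Exception:
                break  # vendor still down; try again on the next drain
            self.pending.popleft()
            delivered += 1
        return delivered
```

Calling `drain()` on a timer, or when the vendor's health check recovers, replays the backlog in the original order.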