Item 43499585

nhumrich • 4 days ago

The difference with dev oncall vs doctor on call is that it is self inflicted. Why are you getting paged? Because you built the system. Either your system isn't resilient enough, or you have noisy alerts. Both are problems you should be motivated to fix.

I have been on call 24/7/52 in SRE roles most of my career. It has either sucked hard, or not at all. And the time it sucked the most was because every single practice was bad. And now, I build better things because of it. Paying me more for on call wouldn't have changed how much it sucked. It wouldn't have made any material impact on my actual quality of life. But it would have done two things: 1) made me feel like I can't complain 2) given me less motivation to fix it

Paying for on call doesn't seem like a win. I want happy employees, not disgruntled but silent ones.

srhtftw • 4 days ago

> Why are you getting paged? Because you built the system.

There are at least two problems with this thinking. The main problem is it's not generally true. The system is created by the entire organization. The people who raise money and allocate capital, the people who set development policies and priorities, the people who design and assemble the components, the people who sell it to customers and negotiate service levels and the people who operate and maintain it all collectively built the system.

Another problem is that it encourages moral hazards. Not paying fair on-call compensation allows unethical managers and sales staff to reap short-term rewards and bonuses by oversubscribing customers, promising more than can be delivered and rushing things to market before they're ready.

If you want happy employees, treat them fairly.

chronid • 4 days ago

I guess what you are saying is the problem is that the company culture - from a technical operations point of view at least - sucks. And no one wants to or can put the effort into fixing it.

In oncall threads I normally see people complaining "I got paged by an alert because of another system X" - but at least in a big enough organization this should not happen, and it's an organizational failure. There should be a 24-hour operations center able to triage, escalate and evaluate, ideally not staffed only with L1 techs and given enough freedom to actually improve and automate. I know there are places where that is not true, and I ran away screaming from some in my career once I understood tech leadership had no understanding of why it was needed.

But you would be surprised how much of the oncall pain is actually self inflicted by application teams themselves (some examples I encountered in the last year: TCP connect timeouts in the minutes and with no retries, no retry policies in general and things that should be idempotent that are not, no circuit breaker strategies, connection pools churning as they're shared between 10+ remote endpoints, wrong expectations about transaction isolation levels and how to handle conflicts at least in simple scenarios).
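To make a few of those items concrete, here is a minimal sketch of a short connect timeout, a bounded retry with jittered exponential backoff, and a simple circuit breaker. It is only an illustration under assumptions: the names `connect`, `retry`, `CircuitBreaker` and `CircuitOpenError`, and all of the default values, are hypothetical and not taken from any particular library or from the systems described above. Retrying like this is only safe when the wrapped operation is idempotent.

```python
import random
import socket
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the breaker again
        return result


def retry(fn, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fn up to `attempts` times with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


def connect(host, port, timeout=2.0):
    """Open a TCP connection with a connect timeout in seconds, not minutes."""
    return socket.create_connection((host, port), timeout=timeout)
```

A caller might combine them as `retry(lambda: breaker.call(lambda: connect(host, port)))`, with one `CircuitBreaker` instance per remote endpoint so that a single failing dependency does not trip the breaker for the others.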

srhtftw • 4 days ago

> I guess what you are saying is the problem is the company culture ... sucks.

I believe the problem is the way devops is often practiced. I've worked as a developer, a manager and an operator and I've occasionally carried a pager. I think there is value in rotating between those roles at different times since it enables engineers to gain knowledge and insight they often won't get any other way. But assigning engineers to after-hours on-call duties when they're simultaneously responsible for product development "because they built the system" is just a stupid, unethical and unsustainable practice that needs to end.

Good companies hire and train engineers to develop, manage and operate systems sustainably.

croshan • 4 days ago

This works if you're on call for your systems. In many situations (ranging from small startups to big tech), you're also on-call for the systems of sister teams.

Not that there aren't other ways to fix that. But fixing the erroring service isn't practical in all cases.

crossroadsguy • 4 days ago

24x7 on-call must not be a reality imho. I know it is the reality but it should not be propagated and we should not even begin to try and somehow normalise it.

Can't this be simpler? If your system needs to be working at night, and it pays (if it doesn't then what are we doing at night?), then you need to hire someone to look after it specifically at night (if possible from a geography where it is not night when it is others' nighttime, and so on), i.e. follow the sun.

eestrada • 4 days ago

I think paid oncall could work if oncall is voluntary. The more oncall sucks, the less likely team members are to volunteer because the pay isn't worth it, then the company/team needs to pay more as an incentive to get people to volunteer for oncall. Eventually the price is so high that it becomes cheaper to just build the system correctly and stop shoehorning features in with no regard for stability.

If oncall burden is light, then everyone volunteers because it is an easy way to make a bit of money.

However, it is a huge systemic change to move towards a voluntary model. Not sure how feasible this really is.

hx8 • 4 days ago

You will get PIPed for not signing up for being on call before they raise the on-call bonus.

juliansimioni • 4 days ago

"we need to talk about your on call volunteering"

"Really? I volunteered for 15 hours of on call"

"Well, 15 is the minimum ok? Now it's up to you if you just want to do the bare minimum, or uh..well look at Brian for example, he volunteered for 37 hours of on call"

eestrada • 4 days ago

Good point. That sounds about right.

Jean-Papoulos • 4 days ago

>Why are you getting paged? Because you built the system.

I want to know which company you've worked at where that's even remotely true. There are always financial and time constraints that force you into trading off system resiliency for actually putting out a product.

cempaka • 4 days ago

We have integrations with dozens of external vendor APIs for which it's essentially impossible to disambiguate ahead of time whether any given error might be on their end or ours.