staplung 4 days ago

One of the biggest problems with on-call rotations is that you're actually incentivized to do it poorly. Every minute you spend doing on-call work is time that you can't spend on the things you've actually been assigned to do. You're never going to put your on-call work into your performance reviews; doing so might actively work against you. "I see that you spent time tuning alerts and updating the runbook. That's time you didn't spend on the actual tasks that were assigned to you."

If it's best to spend as little time as possible on on-call work, then the logical conclusion is to snooze as many alerts as possible until they either go away on their own or roll over past your rotation. Fixing the underlying problem might be worthwhile if it's something you can fix fairly easily. But if the on-call rotation has more than 2 people, the underlying problem is mathematically unlikely to be of your making, and is it really a good idea to make a habit of fixing other people's broken code?

What's crazy is that I've never seen anyone with on-call duties act in this worst-case, bad-faith manner. Companies basically abuse the work ethic of their employees because it's the cheapest possible way to check that box.

chronid 4 days ago

> Every minute you spend doing on-call work is time that you can't spend on the things you've actually been assigned to do.

In my experience, at least, if you're on-call during a sprint you have less work assigned to you than otherwise (2-week sprint and 1 week on-call? 50% allocation), as the expectation is that you will spend that week responding to alerts, investigating issues, or even improving alerting and dashboards and fixing bugs. If this does not happen, devs don't push for it, and management is completely blind to it, you have an organizational issue. If leadership does not care about the problem, it's time to jump ship ASAP.

But I've seen people stubbornly defend an alert on >60% CPU usage for their Kubernetes pods with 1 CPU allocated, where there was no impact on p99.9 latency (which was measured, and was the actual metric that mattered, as agreed with the rest of the business and the internal customers of the service). Or alert on every single pod restart. That is self-inflicted pain.
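
To make the distinction concrete, here is a rough sketch of paging on the agreed latency metric instead of on CPU. Everything in it is an assumption for illustration: the function names, the 250 ms threshold, and the nearest-rank percentile method are hypothetical, not how any particular monitoring stack does it.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def should_page(latencies_ms, cpu_utilization, slo_p999_ms=250.0):
    """Page only when the user-facing metric breaches the agreed threshold.

    cpu_utilization is accepted but deliberately ignored: a busy pod with
    healthy tail latency is not an incident. slo_p999_ms is a hypothetical
    threshold, not a recommended value.
    """
    return percentile(latencies_ms, 99.9) > slo_p999_ms


# Hot CPU, healthy latency: nobody gets woken up.
quiet = should_page([40.0] * 1000, cpu_utilization=0.85)

# Idle CPU, bad tail latency: this is the case worth paging on.
noisy = should_page([40.0] * 995 + [900.0] * 5, cpu_utilization=0.20)
```

With these made-up numbers, the pod at 85% CPU does not page and the mostly idle one with slow tail requests does, which is the opposite of what a CPU-threshold alert would do.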

xigency 4 days ago

It's just more hazing so the underclass of engineers don't realize they actually build everything big tech is selling and more.