A recurring pattern I notice is that outages tend to occur during the week of major holidays in the US.
- MS 365/Teams/Exchange had a blip in the morning
- Fly.io with a complete outage
- then a handful of sites and services impacted due to those outages
I usually advocate against “change freezes”, but I think a change freeze around major holidays makes sense. Give all teams a recharge/pause/whatever.
Don’t put too much pressure on the B-squads that were unfortunate enough to draw the short straw.
Bad code rarely causes outages at this scale. The culprit is always configuration changes.
Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
You cannot plan your way out of operational challenges, regardless of what time of year it is.
> Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
Reading this, I see two routine operational issues, one security issue and one hardware issue.
You can’t plan your way around security issues or hardware failures, but operational issues you both can and should plan around. Holiday schedules like this are fixed points in time, so there’s absolutely no reason why you can’t plan all routine work to be completed either a week before or a week after the holiday period.
Certificates don’t need to be near the point of expiry to be renewed. Capacity doesn’t need to be at critical levels to be expanded. Ultimately, this is a risk management question (as a sibling has also commented). Is the organisation willing to take on increased risk in exchange for deferring operational expenses?
If the operational expense is inevitable (the certificate will need renewing), that seems like an easy answer when it comes to risk management over holidays.
If the operational expense is not inevitable (will we really need to expand capacity?), it then becomes a game of probabilities and financials - likelihood of expense being incurred, amount of expense incurred if done ahead of time, impact to business if something goes wrong during a holiday.
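As a concrete illustration of planning routine work around a fixed freeze window, here is a minimal sketch (hostnames and freeze dates are hypothetical assumptions) that flags any certificate due to expire before the freeze ends plus a buffer, so it can be renewed a week ahead rather than during the holidays:

```python
# Pre-freeze sweep: flag TLS certificates that expire before the freeze ends
# (plus a safety buffer) so renewals land a week ahead of the holiday period.
# Hostnames and freeze dates below are illustrative assumptions.
import socket
import ssl
from datetime import datetime, timedelta, timezone

FREEZE_START = datetime(2024, 12, 20, tzinfo=timezone.utc)
FREEZE_END = datetime(2025, 1, 2, tzinfo=timezone.utc)
BUFFER = timedelta(days=7)
HOSTS = ["example.com", "api.example.com"]  # hypothetical inventory

def cert_expiry(host: str, port: int = 443) -> datetime:
    """Return the notAfter timestamp of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]  # e.g. "Jun  1 12:00:00 2025 GMT"
    return datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)

for host in HOSTS:
    expires = cert_expiry(host)
    if expires <= FREEZE_END + BUFFER:
        print(f"renew before {FREEZE_START:%Y-%m-%d}: {host} cert expires {expires:%Y-%m-%d}")
```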
We'll have a postmortem in next week's infra log update, but here it was a particularly ambitious customer app pushing our state sync service into a corner case; it's one we knew about, but the solution (federating regional state sharing clusters rather than running one globally) is taking time to roll out.
I think a good way of looking at it is risk: is the change (whether it’s code, configuration, etc.) worth the risk it brings?
For example, if it's a small feature, it probably makes sense to wait and keep things stable. But if it's something whose absence poses a larger imminent danger, like a security patch or a disk running out of space, then it's worth taking on the risk of the change to mitigate the risk of not making it.
At the end of the day no system is perfect and these end up being judgement calls, but I think viewing it as a risk tradeoff is a helpful way to frame it.
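A back-of-the-envelope way to make that tradeoff explicit is to compare the expected cost of shipping the change during the freeze against the expected cost of deferring it. Every number below is a made-up illustration, not data from the thread:

```python
# Toy expected-cost comparison: ship a risky change during the freeze vs.
# defer it. All probabilities and dollar figures are illustrative assumptions.
p_change_breaks_prod = 0.02         # chance the change itself causes an outage
cost_holiday_outage = 500_000       # impact of an outage during peak traffic

p_deferred_issue_bites = 0.10       # chance the unaddressed issue (cert, disk) bites
cost_deferred_incident = 200_000    # impact if it does

expected_cost_ship = p_change_breaks_prod * cost_holiday_outage         # 10,000
expected_cost_defer = p_deferred_issue_bites * cost_deferred_incident   # 20,000

print("ship it" if expected_cost_ship < expected_cost_defer else "defer it")
```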
This is a good observation. Do you have any resources I can read up on to make this safer?
I think you can't avoid the fact that these holiday weeks are different from regular weeks. If you "change freeze" then you also freeze out the little fixes and perf tuning that usually happens across these systems, because they're not "critical".
And then inevitably it turns out that there's a special marketing/product push, with special pricing logic that needs new code, and new UI widgets, causing a huge traffic/load surge, and it needs to go out NOW during the freeze, and this is revenue, so it is critical to the business leaders. Most of eng, and all of infra, didn't know about it, because the product team was cramming until the last minute, and it was kinda secret. So it turns out you can freeze the high-quality little fixes, but you can't really freeze the flaky brand-new features ...
It's just a struggle, and I still advise to forget the freeze, and try to be reasonable and not rush things (before, during, or after the freeze).
Any big tech company with large peak periods disagrees with you. It's absolutely worth freezing non-critical changes.
Urgent business change needs to go through? Sure, be prepared to defend to a VP/exec why it needs to go in now.
Urgent security fix? Yep, the same VP will approve it.
It's a no-brainer to stop typical changes that aren't needed for a couple of weeks. By the way, it doesn't mean your whole pipeline needs to stop: you can still have stuff queued up and ready to go to prod or pre-prod after the freeze.
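One way this plays out in practice is a simple gate in the deploy pipeline: pre-prod keeps flowing, and prod deploys during the freeze require an explicit, exec-approved override. A minimal sketch, with the freeze dates, environment variables, and label name all assumed for illustration:

```python
# Hypothetical deploy gate: pre-prod stays open during the freeze; prod
# deploys are blocked unless an exec-approved override label is present.
import os
import sys
from datetime import date

FREEZE_START, FREEZE_END = date(2024, 12, 20), date(2025, 1, 2)  # assumed window
OVERRIDE_LABEL = "freeze-override-approved"                      # assumed label name

def deploy_allowed(target: str, labels: set[str], today: date) -> bool:
    if target != "prod":
        return True                               # pre-prod keeps flowing
    if not (FREEZE_START <= today <= FREEZE_END):
        return True                               # outside the freeze window
    return OVERRIDE_LABEL in labels               # VP/exec signed off on an exception

if __name__ == "__main__":
    target = os.environ.get("DEPLOY_TARGET", "prod")
    labels = {l for l in os.environ.get("PR_LABELS", "").split(",") if l}
    if not deploy_allowed(target, labels, date.today()):
        sys.exit("Change freeze in effect: prod deploys need an approved override.")
```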
Some shops conduct game days as the freeze approaches.
https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-2... / https://archive.md/uaJlR
Blip? 365 has had an ongoing incident since yesterday morning, European time. The reason I know is that I use their compliance tools to secure information in a rather large bankruptcy.
Then you just get devs rushing out changes before the freeze…
As a developer I don't see why I would rush out a change before the freeze when I could just wait until after. Maybe a stakeholder that really wants it would press for it to get out but personally I'd rather wait until after so I'm not fixing a bug during my holiday.
and stampeding changes in after the thaw, also leading to downtime. So it depends on the org, but doing a freeze is still a reasonable policy. Downtime on December 15th is less expensive than on Black Friday or Cyber Monday for most retailers, so it's just a business decision at that point.
What do "Freezes" mean? Like, do you stop renewing your certificates? Do you stop taking in security updates for your software?
Sure maybe "unnecessary" changes, but the line gets very gray very fast.
It's not very grey: treat prod as if you told everyone but your ops team to go home and then sent the ops team on a cruise with pagers. If it's not important enough to merit interrupting their vacation, you don't do it.
Certs shouldn't still be done by hand at this point; if another Heartbleed comes out in the next 7 days, then the risk can be examined, escalated, and the CISO can overrule the freeze. If it's a patch for remote root via Bluetooth drivers on a server that has no Bluetooth hardware, it's gonna wait.
You're right that there's a grey line, but crossing that line involves waking up several people, and the on-call person makes a judgement call. If it's not important enough to wake up several people over, then things stay frozen.
There's still a lot of situations where automatic certificate enrollment and renewal is not possible. TLS is not the only use of X.509 certificates, and even then, public facing HTTPS is not the only use of TLS.
It needs to get better but it's not there yet.
Right, that's basically what I mean. There are a lot of automated changes happening in the background for services. I guess the whole thing I'm saying is that not every breakage is happening because of a code change.