Bad code rarely causes outages at this scale; the culprit is almost always a configuration change.
Sure, you can try to reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers needs to be reprovisioned? What if a hard disk is running out of space?
You cannot plan your way out of operational challenges, regardless of what time of year it is.
> Sure, you can try to reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers needs to be reprovisioned? What if a hard disk is running out of space?
Reading this, I see two routine operational issues, one security issue and one hardware issue.
You can’t plan your way around security issues or hardware failures, but operational issues you both can and should plan around. Holiday schedules like this are fixed points in time, so there’s no reason you can’t plan for all routine work to be completed either a week before or a week after the holiday period.
Certificates don’t need to be near the point of expiry to be renewed. Capacity doesn’t need to be at critical levels to be expanded. Ultimately, this is a risk management question (as a sibling has also commented). Is the organisation willing to take on increased risk in exchange for deferring operational expenses?
If the operational expense is inevitable (the certificate will need renewing), that seems like an easy answer when it comes to risk management over holidays.
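To make the "renew before the freeze" point concrete, here is a minimal sketch that flags certificates due to expire during or shortly after an upcoming change freeze. The host list and freeze end date are hypothetical placeholders, not anything from the thread.

```python
# Sketch: flag TLS certificates that will expire before the end of an
# upcoming change freeze, so they can be renewed ahead of time.
import socket
import ssl
from datetime import datetime, timezone

HOSTS = ["example.com", "api.example.com"]            # hypothetical inventory
FREEZE_END = datetime(2025, 1, 6, tzinfo=timezone.utc)  # assumed end of the holiday freeze


def cert_expiry(host: str, port: int = 443) -> datetime:
    """Return the notAfter timestamp of the certificate served by host."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


for host in HOSTS:
    expiry = cert_expiry(host)
    if expiry <= FREEZE_END:
        print(f"{host}: expires {expiry:%Y-%m-%d}, renew BEFORE the freeze")
    else:
        print(f"{host}: ok until {expiry:%Y-%m-%d}")
```

Run on a schedule a few weeks out, this turns "the certificate will need renewing" from a holiday surprise into a line item on the pre-freeze checklist.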
If the operational expense is not inevitable (will we really need to expand capacity?), it becomes a game of probabilities and financials: the likelihood of the expense being incurred, the cost of doing the work ahead of time, and the impact to the business if something goes wrong during a holiday.
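As a back-of-the-envelope illustration of that comparison, the numbers below are entirely made up; the point is only the expected-value framing.

```python
# Illustrative expected-value comparison for the "not inevitable" case.
p_needed = 0.3          # assumed probability the capacity expansion is actually needed
cost_ahead = 2_000      # assumed cost of expanding a week before the holidays
cost_incident = 50_000  # assumed business impact of running out during the freeze

expected_cost_of_deferring = p_needed * cost_incident
print(f"Expand ahead of time: {cost_ahead}")
print(f"Defer and hope:       {expected_cost_of_deferring:.0f} (expected)")
# With these made-up numbers, 0.3 * 50,000 = 15,000 > 2,000,
# so doing the work early is the cheaper bet.
```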
We'll have a postmortem in next week's infra log update, but in this case it was a particularly ambitious customer app pushing our state sync service into a corner case. It's one we knew about, but the fix (federating regional state-sharing clusters rather than running one globally) is taking time to roll out.
I think a good way of looking at it is risk: is the change (whether it's code, configuration, etc.) worth the risk it brings?
For example, if it's a small feature, it probably makes sense to wait and keep things stable. But if it addresses a larger, more imminent danger, like a missing security patch or a disk running out of space, then it's worth taking on the risk of the change to mitigate the risk of not making it.
At the end of the day no system is perfect and it comes down to judgement calls, but I think viewing it as a risk tradeoff helps frame the decision.
This is a good observation. Do you have any resources I can read up on to make this safer?