Item 42254285

karmarepellent • 2 months ago

In theory: absolutely. This is just anecdata and you are welcome to challenge me on it, but I have never had a problem upgrading Kubernetes itself. As long as you trail one version behind the latest to ensure critical bugs are fixed before you risk to run into them yourself, I think you are good.

Edit: To expand on it a little bit. I think there is always a real, theoretical risk that must be taken into account when you design your infrastructure. But when experience tells you that accounting for this potential risk may not be worth it in practice, you might get away with discarding it and keeping your infra lean. (Yes, I am starting to sweat just writing this).

mst • 2 months ago

"I am cutting this corner because I absolutely cannot make a business case I believe in for doing it the hard (but more correct) way but believe me I am still going to be low key paranoid about it indefinitely" is an experience that I think a lot of us can relate to.

I've actually asked for a task to be reassigned to somebody else before now on the grounds that I knew it deserved to be done the simple way but could not for the life of me bring myself to implement that.

(the trick is to find a colleague with a task you *can* do that they hate more and arrange a mutually beneficial swap)

1 reply

karmarepellent • 2 months ago

Actually I think the trick is to change ones own perspective on these things. Regardless of how many redundancies and how many 9's of availability your system theoretically achieves, there is always stuff that can go wrong for a variety of reasons. If things go wrong, I am faster at fixing a not-so-complex system than the more complex system that should, in theory, be more robust.

Also I have yet to experience that an outage of any kind had any negative consequences for me personally. As long as you stand by the decisions you made in the past and show a path forward, people (even the higher-ups) are going to respect that.

Anticipating every possible issue that might or might not occur during the lifetime of an application just leads to over-engineering.

I think rationalizing it a little bit may also help with the paranoia.

pickle-wizard • 2 months ago

At my last job we had a Kubernetes upgrade go so wrong we ended up having to blow away the cluster and redeploy everything. Even a restore of the etcd backup didn't work. I couldn't tell you exactly what went wrong, as I wasn't the one that did the upgrade. I wasn't around of the RCA on this one. As the fallout was straw that broke the camels back, I ended up quitting to take a sabbatical.