karmarepellent 1 day ago

Do you mind sharing what these operations were? I can think of a few things that might very well brick your control plane, but as far as I know existing workloads continue to function in that case. Same with e.g. misconfigured network policies: those might cause downtime, but at least you can roll them back easily. This was some time ago though, so there may be more footguns now. Curious to know how you bricked your cluster, if you don't mind.

I agree that k8s offers many features that most users probably don't need and may not even know about. I found that I liked k8s best when we used only a few stable features (only daemonsets and deployments for workloads, no statefulsets) and simple Helm charts, although we could probably have ditched Helm altogether.

thephyber 1 day ago

You can’t roll back an AWS EKS control plane version upgrade. “Measure twice, cut once” kinda thing.
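For what it’s worth, a cheap pre-flight check before committing to an upgrade is confirming exactly which version the control plane is on. A minimal boto3 sketch, where the cluster name is just a placeholder:

```python
# "Measure twice" sketch: confirm the current EKS control-plane version before
# kicking off an upgrade, since there is no way to downgrade it afterwards.
# Assumes boto3 credentials/region are already configured; "my-cluster" is a placeholder.
import boto3

eks = boto3.client("eks")

def current_version(cluster_name: str) -> str:
    return eks.describe_cluster(name=cluster_name)["cluster"]["version"]

if __name__ == "__main__":
    print("control plane version:", current_version("my-cluster"))
    # The actual upgrade is a one-way door, so only run it once everything checks out:
    # eks.update_cluster_version(name="my-cluster", version="<target version>")
```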

And operators/helm charts/CRDs use APIs which can be and do get deprecated, which can cause outages. It pays to make sure your infrastructure is automated with GitOps, CI/CD, and thorough testing so you can identify the potential hurdles before your cluster upgrade causes unplanned service downtime.
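As a rough illustration, a pre-upgrade check along these lines can surface deprecated-API usage before you schedule the upgrade. It leans on the kube-apiserver’s `apiserver_requested_deprecated_apis` metric and assumes `kubectl` access to the cluster; the parsing is deliberately minimal:

```python
# Sketch: list deprecated-API usage reported by the kube-apiserver's
# `apiserver_requested_deprecated_apis` metric (available since Kubernetes 1.19).
# Assumes `kubectl` is on PATH and the kubeconfig points at the right cluster.
import subprocess

def deprecated_api_requests() -> list[str]:
    raw = subprocess.run(
        ["kubectl", "get", "--raw", "/metrics"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        line for line in raw.splitlines()
        if line.startswith("apiserver_requested_deprecated_apis")
    ]

if __name__ == "__main__":
    for line in deprecated_api_requests():
        print(line)  # labels include the group/version/resource and the release it disappears in
```

Anything that shows up here is a workload or operator still calling an API that will be removed in an upcoming release, which is exactly what you want CI to flag before the upgrade window.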

It is a huge effort just to “run in place” with the current EKS LTS versions if your company has lots of 3rd-party tooling (like K8s operators) installed and there isn’t sufficient CI/CD and testing to validate potential upgrades soon after they are released.

3rd-party tooling is frequently maintained by open source teams, so they don’t always have the resources or the desire/alignment to stay compatible with the newest version of K8s. Also, when a project goes idle, disbands, or fractures into rival projects, it can cost infra/ops teams a lot of time to evaluate which replacement or substitute project is going to be the better solution going forward. We recently ran into this with the operator we had originally installed to run Cassandra.

tinco 1 day ago

In my case, it was the ingress running out of subdomains: each staging environment would get its own subdomain, and our system had a bug that caused them not to be cleaned up. So the CI/CD was leaking subdomains, and eventually the list became so long that it bumped the production domain off the list.
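For illustration, a cleanup job along these lines would have kept the list bounded. It’s a minimal sketch using the official `kubernetes` Python client, and it assumes a hypothetical `env=staging` label on the CI-created ingresses plus a made-up 7-day TTL:

```python
# Sketch of a cleanup job that deletes stale staging Ingresses so leaked
# subdomains never pile up. The `env=staging` label and the 7-day TTL are
# assumptions, not anything from the original setup.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

TTL = timedelta(days=7)

def prune_stale_staging_ingresses() -> None:
    config.load_kube_config()  # use config.load_incluster_config() when run in-cluster
    net = client.NetworkingV1Api()
    now = datetime.now(timezone.utc)
    for ing in net.list_ingress_for_all_namespaces(label_selector="env=staging").items:
        age = now - ing.metadata.creation_timestamp
        if age > TTL:
            print(f"deleting {ing.metadata.namespace}/{ing.metadata.name} (age {age})")
            net.delete_namespaced_ingress(ing.metadata.name, ing.metadata.namespace)

if __name__ == "__main__":
    prune_stale_staging_ingresses()
```

Run as a CronJob (or as a step in the same CI/CD pipeline), something like this would have caught the leak long before the production domain fell off the list.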