Why exactly did they have 47 clusters? One thing I noticed (maybe because I’m not at that scale) is that companies are running 1+ clusters per application. Isn’t the point of kubernetes that you can run your entire infra in a single cluster, and at most you’d need a second cluster for redundancy, and you can spread nodes across regions and AZs and even clouds?
I think the bottleneck is networking and how much crosstalk your control nodes can take, but that’s your architecture team’s job?
> Isn’t the point of kubernetes that you can run your entire infra in a single cluster
I've never seen that, but yes, 47 seems like a lot. Often you'd need production, staging, test, development, something like that. Then you'd add an additional cluster for running auxiliary services, i.e. services that have special network access or aren't related to your "main product". Maybe a few of these. Still, that's a long way from 47.
Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access. It could also be as simple as not all staff being allowed to access the same cluster, due to regulatory concerns. Also you might not want internal tooling to run on the public facing production cluster.
You also don't want one service, either due to misconfiguration or design flaws, taking down everything because you placed all your infrastructure in one cluster. I've seen Kubernetes crash because some service spun out of control and caused the networking pods to crash, taking out the entire cluster. You don't really want that.
Kubernetes doesn't really provide the same type of isolation as something like VMware, or at least it's not trusted to the same extent.
>Often you'd need production, staging, test, development, something like that.
Normally in K8s, segregating environments is done via namespaces, not clusters (unless there are some very specific resource constraints).
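Roughly what that looks like, as a minimal sketch (the namespace name and quota numbers are made up): each environment gets its own namespace with a ResourceQuota so one environment can't starve the others on the shared cluster.

```yaml
# Hypothetical "staging" namespace with its own resource quota.
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
```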
Which in many cases would break SOC2 compliance (co-mingling of development and customer resources), and even goes against the basic advice offered in the K8s manual. Beyond that, this limits your ability to test Control Plane upgrades against your stack, though that has generally been very stable in my experience.
To be clear I'm not defending the 47 Cluster setup of the OP, just the practice of separating Development/Production.
Why would you commingle development and customer resources? A k8s cluster is just a control plane that controls where things are running, and if you specify they can't share resources, that's the end of that.
If you say sharing the same control plane is commingling… then what do you think a cloud console is? And if you are using different accounts there… then I hope you are using dedicated resources for absolutely everything in prod (can't imagine what you'd pay for dedicated S3, SQS), because god forbid those two accounts end up on the same machine. Heh, you are probably violating compliance and didn't even know it!
Sigh. I digress.
The frustrating thing with SOC2, or pretty much most compliance requirements, is that they are less about what’s “technically true”, and more about minimizing raised eyebrows.
It does make some sense though. People are not perfect, especially in large organizations, so there is value in just following the masses rather than doing everything your own way.
Yes. But it also isn’t a regulation. It is pretty much whatever you say it is.
I would want to have at least dev + prod clusters, sometimes people want to test controllers or they have badly behaved workloads that k8s doesn't isolate well (making lots of giant etcd objects). You can also test k8s version upgrades in non-prod.
That said it sounds like these people just made a cluster per service which adds a ton of complexity and loses all the benefits of k8s.
In this case, I use a script to spin up another production cluster, perform my changes, and send some traffic to it. If everything looks good, we shift over all traffic to the new cluster and shutdown the old one. Easy peasy. Have you turned your pets into cattle only to create a pet ranch?
You always want lots of very specific resource constraints between those.
Indeed. My previous company did this due to regulatory concerns.
One cluster per country in prod, one cluster per team in staging, plus individual clusters for some important customers.
A DevOps engineer famously pointed out that it was stupid since they could access everything with the same SSO user anyway, and the CISO demanded individual accounts with separate passwords and separate SSO keys.
> Then you'd add an additional cluster for running auxiliary services, i.e. services that have special network access or aren't related to your "main product". Maybe a few of these. Still, that's a long way from 47.
Why couldn't you do that with a dedicated node pool, namespaces, taints and affinities? This is how we run our simulators and analytics within the same k8s cluster.
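As a rough sketch of that setup (the pool label, taint key and image are all made up): the dedicated nodes carry a taint and a label, and the workload both tolerates the taint and requires the label, so it only lands on that pool.

```yaml
# Hypothetical: nodes in the dedicated pool are labelled pool=analytics and
# tainted with dedicated=analytics:NoSchedule. This pod tolerates the taint
# and requires the label, so it is scheduled onto that pool only.
apiVersion: v1
kind: Pod
metadata:
  name: simulator
  namespace: analytics
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: analytics
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: pool
                operator: In
                values: ["analytics"]
  containers:
    - name: simulator
      image: example.com/simulator:latest  # placeholder image
```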
You could do a dedicated node pool and limit egress to those nodes, but having a separate cluster seems simpler, in the sense that someone is less likely to provision something incorrectly.
In my experience companies do not trust Kubernetes to the same extent as they'd trust VLANs and VMs. That's probably not entirely fair, but as you can see from many of the other comments, people find managing Kubernetes extremely difficult to get right.
For some special cases you also have regulatory requirements that maybe could be fulfilled by some Kubernetes combination of node pools, namespacing and so on, but it's not really worth the risk.
From dealing with clients wanting hosted Kubernetes, I can only say that 100% of them have been running multiple clusters. Sometimes for good reason, other times because hosting costs were per project and it's just easier to price out a cluster than to buy X% of the capacity on an existing cluster.
One customer I've worked with even ran an entire cluster for a single container, but that was done because no one told the developers to not use that service as an excuse to play with Kubernetes. That was its own kind of stupid.
What you just described with one bad actor bringing the entire cluster down is yet another really good reason I’ll never put any serious app on that platform.
K8s requires a flat plane addressability model across all containers, meaning anyone can see and call anyone else?
I can see security teams getting uppity about that.
Also budgetary and org boundaries, cloud providers, disaster recovery/hot spares/redundancy/AB hotswap, avoid single tank point of failure.
Addressability is not accessibility. It's easy to control how services talk to each other through NetworkPolicy.
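For example, something like this (namespace, labels and port are made up) locks ingress to the api pods down to the frontend pods only; once a policy selects the pods, everything not explicitly allowed is denied.

```yaml
# Hypothetical: only pods labelled app=frontend in the same namespace may
# reach the app=api pods, and only on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```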
This… sounds remarkably like the problems kubernetes solves.
single tank point of failure should be
single YAML point of failure
mobile autocorrect is super "helpful"
I have completely tanked a kubernetes cluster before. Everything kept working. The only problem was that we couldn’t spin up new containers and if any of the running ones stopped, dns/networking wouldn’t get updated. So for a few hours while we figured out how to fix what I broke, not many issues happened.
So sure, I can kinda see your point, but it feels rather moot. Within the cluster, there isn't much that is a single point of failure that wouldn't also be a point of failure across multiple clusters.
> Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access.
Network Policies have solved that at least for ingress traffic.
Egress traffic is another beast: you can't allow egress traffic to a Service, only to pods or IP ranges.
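A sketch of what egress policies do give you (labels and CIDR are made up); note that in practice you'd also need an explicit rule allowing DNS egress, which is omitted here.

```yaml
# Hypothetical: egress from app=worker pods is limited to pods labelled
# app=db and to one external CIDR. There is no "allow egress to Service X"
# primitive, only pod selectors and IP blocks.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: worker-egress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: worker
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: db
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24  # example external range
```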
It's just a matter of time before someone releases an orchestration layer for k8s clusters so the absurd Rube Goldberg machine that is modern devops stacks can grow even more complex.
karmada?
Because they have k8s engineers each of whom wants to put it on their resume that they designed and implemented a working cluster in prod.
Resume Driven Development.
Wouldn't that give you a prod mix of Kubernetes, Docker, Fargate, Rancher, etc?
Maybe they had 47 different Kubernetes consultants coming in sequentially and each one found something to do different from the last one, but none of them got any time to update their predecessor's stuff.
There are genuine reasons for running multiple clusters. It sometimes helps to keep stateful workloads (databases, generally) on one cluster and have another for stateless workloads, etc. Sometimes customers demand complete isolation so they get their own cluster (although somehow it's OK that the nodes are still VMs probably running on shared physical hosts… these requirements can be arbitrary sometimes).
> It helps to sometimes keep stateful (databases generally) workloads on one cluster, have another for stateless workloads etc. Sometimes customers demand complete isolation
Are these not covered by taint/toleration? I guess maybe isolation depending on what exactly they're demanding but even then I'd think it could work.
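A sketch of the stateful/stateless split done with taints rather than clusters (the taint key/value and the StatefulSet are made up): the database node pool carries the taint, and only workloads that tolerate it land there.

```yaml
# Hypothetical: the database node pool is tainted with
# workload=stateful:NoSchedule, so only pods that tolerate it (like this
# StatefulSet) are scheduled there; stateless workloads stay off those nodes.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: databases
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      tolerations:
        - key: workload
          operator: Equal
          value: stateful
          effect: NoSchedule
      containers:
        - name: postgres
          image: postgres:16
```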
Yes, a few, maybe even 10, 12, but 47? It's also a prime number, so it's not something like running each thing three times for dev, stage and prod.
The most important reason is the ridiculous etcd limit of 8 GB. That alone is the reason for most k8s cluster splits.
I hate etcd probably more than most, but that 8 GB seems to just be a warning, unless you have information otherwise https://etcd.io/docs/v3.5/dev-guide/limit/#storage-size-limi...
I'll take this opportunity to once again bitch and moan that Kubernetes just fucking refuses to allow the KV store to be pluggable, unlike damn near everything else in their world, because they think that's funny or something
It isn't a mere warning. It is strongly recommended as the upper limit.
https://www.perfectscale.io/blog/etcd-8gb
https://github.com/etcd-io/etcd/issues/9771
And yes, I agree not allowing a pluggable replacement is really stupid.
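For reference, the quota is tunable; a sketch of the relevant etcd config-file keys (values are illustrative, and the 8 GB figure is the docs' suggested ceiling rather than a hard cap — the default backend quota is 2 GiB):

```yaml
# Hypothetical etcd.conf.yml snippet (flag form: --quota-backend-bytes).
quota-backend-bytes: 8589934592      # 8 GiB backend quota
auto-compaction-mode: periodic       # compact old revisions periodically
auto-compaction-retention: "1h"      # keep roughly the last hour of history
```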
> https://github.com/etcd-io/etcd/issues/9771
> stale bot marked this as completed (by fucking closing it)
Ah, yes, what would a Kubernetes-adjacent project be without a fucking stale bot to close issues willy nilly
This doesn't make sense to me, and I feel like this is "holding it wrong"
But this is also a caveat with "managing it yourself", you get a lot of people having ideas and shooting themselves in the foot with it
Honest question - why would you want stateful workloads in a separate cluster from stateless? Why not just use a namespace?
We have 1 cluster per region (3 regions: Asia, US, EU), multiplied by redundancy clusters = 6 clusters.
Then we have test environments, around 20 of them: 20 clusters.
Then there are 10s of clusters installed on customers' infra.
So 47 clusters isn't really a huge/strange number.
I'm wondering the same. Either they are quite a big company, so such infrastructure comes naturally from many products/teams or their use case is to be in the clusters business (provisioning and managing k8s clusters for other companies). In both cases I'd say there should be a dedicated devops team that knows their way around k8s.
Other than that, the experience I have is that using a managed solution like EKS and one cluster per env (dev, staging, prod) with namespaces to isolate projects takes you a long way. Having used k8s for years now, I'm probably biased, but in general I disagree with many of the k8s-related posts that are frequently upvoted on the front page. I find it gives me freedom, I can iterate fast on services, change structure easily without worrying too much about isolation, networking and resources. In general I feel more nimble than I used to before k8s.
Don’t know about the writer of the article, but there are some legit reasons to use multiple K8s clusters. Single-tenant environments, segregation of resources into different cloud accounts, redundancy (although there are probably lots of ways to do most of this within a single cluster), 46 different developer / QA / CI clusters, etc.
Yeah, it's not very smart. I'm at a company with a $50B+ market cap and we run prod on one cluster, staging on another, then it's tooling clusters like dev spaces, ML, etc. I think in total we have 6 or 7 for ~1000 devs and thousands of nodes.
It makes sense that getting off k8s helped them if they were using it incorrectly.
I had a client who had a single K8s cluster, which was too much ops for the team, so their idea was to transfer that responsibility to each product dev team, and thus was born one K8s cluster per product. They had at least a few hundred products.
In my company (large corporate) we only have 3 clusters: dev/acc/prod. Everything runs on it. I love it.
Isn't one of the strategies also to run one or two backup clusters for any production cluster? Which can take over the workloads if the primary cluster fails for some reason?
In a cloud environment the backup cluster can be scaled up quickly if it has to take over, so while it's idling it only requires a few smaller nodes.
You might run a cluster per region, but the whole point of Kubernetes is that it's highly available. What specific piece are you worried will go down in one cluster such that you need two production clusters all the time? Upgrades are a special case where I could see spinning up a backup cluster.
A lot of things can break (hardware, networking, ...). Spanning the workload over multiple clusters in different regions is already satisfying the "backup cluster" recommendation.
Many workloads don't need to be multi-region as a requirement. So they might run on just one cluster with the option to fail over to another region in case of an emergency. Running a workload on one cluster at a time (even with some downtime for a manual failover) makes a lot of things much easier. Many workloads don't need 99.99% availability, and nothing awful happens if they are down for a few hours.
To answer your question directly: yes, that's the point. You may have different clusters for different logical purposes but, yes: fewer clusters, more node groups is a better practice.
We ran only two (very small) clusters for some time in the past and even then it introduced some unnecessary overhead on the ops side and some headaches on the dev side. Maybe they were just growing pains, but if I have to run Kubernetes again I will definitely opt for a single large cluster.
After all, Kubernetes provides all the primitives you need to enforce separation. You wouldn't create separate VMware production and test clusters either unless you had a good reason.
You need a separate cluster for production because there are operations you'd do in your staging/QA environments that might accidentally knock out your cluster. I did that once and it was not fun.
I completely agree with keeping everything as simple as possible though. No extra clusters if not absolutely necessary, and also no extra namespaces if not absolutely necessary.
The thing with Kubernetes is that it was designed to support every complex situation imaginable. All these features make you feel as though you should make use of them, but you shouldn't. This complexity leaked into systems like Helm, which is why in my opinion it's better to roll your own deployment scripts than to use Helm.
Do you mind sharing what these operations were? I can think of a few things that may very well brick your control plane. But at the very least existing workloads continue to function in this case as far as I know. Same with e.g. misconfigured network policies. Those might cause downtimes, but at least you can roll them back easily. This was some time ago though. There may be more footguns now. Curious to know how you bricked your cluster, if you don't mind.
I agree that k8s offers many features that most users probably don't need and may not even know of. I found that I liked k8s best when we used only a few, stable features (only daemonsets and deployments for workloads, no statefulsets) and simple helm charts. Although we could have probably ditched helm altogether.
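For what it's worth, the "few stable features" approach mostly means boring manifests along these lines (names, image and numbers are made up): a plain Deployment with requests and limits, no operators, no CRDs, no hooks.

```yaml
# Hypothetical minimal Deployment: a fixed replica count and resource
# requests/limits, and nothing else.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3  # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
```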
You can’t roll back an AWS EKS control plane version upgrade. “Measure twice, cut once” kinda thing.
And operators/Helm charts/CRDs use APIs which can be and are deprecated, which can cause outages. It pays to make sure your infrastructure is automated with GitOps, CI/CD, and thorough testing so you can identify the potential hurdles before your cluster upgrade causes unplanned service downtime.
It is a huge effort just to “run in place” with the current EKS LTS versions if your company has lots of 3rd party tooling (like K8s operators) installed and there isn’t sufficient CICD+testing to validate potential upgrades as soon after they are released.
3rd party tooling is frequently run by open source teams, so they don’t always have resources or desire/alignment to stay compatible with the newest version of K8s. Also, when the project goes idle/disbands/fractures into rival projects, that can cause infra/ops teams time to evaluate the replacement/ substitute projects which are going to be a better solution going forward. We recently ran into this with the operator we had originally installed to run Cassandra.
In my case, it was the ingress running out of subdomains because each staging environment would get its own subdomain, and our system had a bug that caused them to not be cleaned up. So the CI/CD was leaking subdomains, eventually the list became too long and it bumped the production domain off the list.
Kubernetes upgrades? Don't those risk bricking everything with just 1 environment?
In theory: absolutely. This is just anecdata and you are welcome to challenge me on it, but I have never had a problem upgrading Kubernetes itself. As long as you trail one version behind the latest to ensure critical bugs are fixed before you risk to run into them yourself, I think you are good.
Edit: To expand on it a little bit. I think there is always a real, theoretical risk that must be taken into account when you design your infrastructure. But when experience tells you that accounting for this potential risk may not be worth it in practice, you might get away with discarding it and keeping your infra lean. (Yes, I am starting to sweat just writing this).
"I am cutting this corner because I absolutely cannot make a business case I believe in for doing it the hard (but more correct) way but believe me I am still going to be low key paranoid about it indefinitely" is an experience that I think a lot of us can relate to.
I've actually asked for a task to be reassigned to somebody else before now on the grounds that I knew it deserved to be done the simple way but could not for the life of me bring myself to implement that.
(the trick is to find a colleague with a task you *can* do that they hate more and arrange a mutually beneficial swap)
Actually I think the trick is to change one's own perspective on these things. Regardless of how many redundancies and how many 9's of availability your system theoretically achieves, there is always stuff that can go wrong for a variety of reasons. If things go wrong, I am faster at fixing a not-so-complex system than the more complex system that should, in theory, be more robust.
Also I have yet to experience that an outage of any kind had any negative consequences for me personally. As long as you stand by the decisions you made in the past and show a path forward, people (even the higher-ups) are going to respect that.
Anticipating every possible issue that might or might not occur during the lifetime of an application just leads to over-engineering.
I think rationalizing it a little bit may also help with the paranoia.
At my last job we had a Kubernetes upgrade go so wrong we ended up having to blow away the cluster and redeploy everything. Even a restore of the etcd backup didn't work. I couldn't tell you exactly what went wrong, as I wasn't the one who did the upgrade, and I wasn't around for the RCA on this one. As the fallout was the straw that broke the camel's back, I ended up quitting to take a sabbatical.
Why would those brick everything? You update nodes one by one and take it slow, so issues will become apparent after each upgrade and you have time to solve them; that's the whole point of having clusters comprised of many redundant nodes.
I think it depends on the definition of "bricking the cluster". When you start to upgrade your control plane, your control plane pods restart one by one, and not only those on the specific control plane node. So at this point your control plane might not respond anymore if you happen to run into a bug or some other issue. You might call it "bricking the cluster", since it is not possible to interact with the control plane for some time. Personally I would not call it "bricked", since your production workloads on worker nodes continue to function.
Edit: And even when you "brick" it and cannot roll back, there is still a way to bring your control plane back by using an etcd backup, right?
Not sure if this has changed, but there have been companies admitting to simply nuking Kubernetes clusters if they fail, because it does happen. The argument, which I completely believe, is that it's faster to build a brand new cluster than to debug a failed one.
Exactly. You might want to have one cluster per environment so you can test your deployments and rollback plans.
I work for a large corp and we have three for apps (dev, integrated testing, prod) plus I think two or three more for the platform team that I don't interact with. 47 seems horrendously excessive