> Isn’t the point of kubernetes that you can run your entire infra in a single cluster
I've never seen that, but yes, 47 seems like a lot. Often you'd need production, staging, test, development, something like that. Then you'd add an additional cluster for running auxiliary services, that is, services that have special network access or aren't related to your "main product". Maybe a few of those. Still, that's a long way from 47.
Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access. It could also be as simple as not all staff being allowed to access the same cluster, due to regulatory concerns. Also, you might not want internal tooling running on the public-facing production cluster.
You also don't want one service, whether due to misconfiguration or design flaws, taking down everything because you placed all your infrastructure in one cluster. I've seen Kubernetes crash because some service spun out of control and caused the networking pods to crash, taking out the entire cluster. You don't really want that.
Kubernetes doesn't really provide the same type of isolation as something like VMware, or at least it's not trusted to the same extent.
>Often you'd need production, staging, test, development, something like that.
Normally in K8s, segregating environments is done via namespaces, not clusters (unless there are some very specific resource constraints).
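As a rough sketch (the names and quota numbers are just placeholders), that usually means one namespace per environment, with a ResourceQuota so one environment can't starve the others:

    # Illustrative only: a "staging" namespace in the shared cluster,
    # capped with a ResourceQuota so it can't crowd out production.
    apiVersion: v1
    kind: Namespace
    metadata:
      name: staging
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: staging-quota
      namespace: staging
    spec:
      hard:
        requests.cpu: "8"
        requests.memory: 16Gi
        pods: "50"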
Which in many cases would break SOC2 compliance (co-mingling of development and customer resources), and even goes against the basic advice offered in the K8s manual. Beyond that, this limits your ability to test Control Plane upgrades against your stack, though that has generally been very stable in my experience.
To be clear, I'm not defending the OP's 47-cluster setup, just the practice of separating Development/Production.
Why would you commingle development and customer resources? A k8s cluster is just a control plane that decides where things run, and if you specify that they can't share resources, that's the end of that.
If you say that sharing the same control plane is commingling… then what do you think a cloud console is? And if you are using different accounts there… then I hope you are using dedicated resources for absolutely everything in prod (can't imagine what you'd pay for dedicated S3 and SQS), because god forbid those two accounts end up on the same machine. Heh, you are probably violating compliance and didn't even know it!
Sigh. I digress.
The frustrating thing with SOC2, or pretty much most compliance requirements, is that they are less about what’s “technically true”, and more about minimizing raised eyebrows.
It does make some sense though. People are not perfect, especially in large organizations, so there is value in just following the masses rather than doing everything your own way.
Yes. But it also isn’t a regulation. It is pretty much whatever you say it is.
I would want to have at least dev + prod clusters, sometimes people want to test controllers or they have badly behaved workloads that k8s doesn't isolate well (making lots of giant etcd objects). You can also test k8s version upgrades in non-prod.
That said, it sounds like these people just made a cluster per service, which adds a ton of complexity and loses all the benefits of k8s.
In this case, I use a script to spin up another production cluster, perform my changes, and send some traffic to it. If everything looks good, we shift all traffic over to the new cluster and shut down the old one. Easy peasy. Have you turned your pets into cattle only to create a pet ranch?
You always want lots of very specific resource constraints between those.
Indeed. My previous company did this due to regulatory concerns.
One cluster per country in prod, one cluster per team in staging, plus individual clusters for some important customers.
A DevOps engineer famously pointed out that it was stupid since they could access everything with the same SSO user anyway, and the CISO demanded individual accounts with separate passwords and separate SSO keys.
> Then you'd add an additional cluster for running auxiliary services, that is, services that have special network access or aren't related to your "main product". Maybe a few of those. Still, that's a long way from 47.
Why couldn't you do that with a dedicated node pool, namespaces, taints and affinities? This is how we run our simulators and analytics within the same k8s cluster.
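Roughly like this, as a sketch (the taint key, node label and image are made up): the workload tolerates the pool's taint and requires the pool's label, so it only lands on the dedicated nodes and nothing else schedules there.

    # Sketch only: pin an "auxiliary" workload to its own tainted node pool.
    apiVersion: v1
    kind: Pod
    metadata:
      name: analytics-job
      namespace: analytics
    spec:
      tolerations:
        - key: workload
          operator: Equal
          value: auxiliary
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-pool
                    operator: In
                    values: ["auxiliary"]
      containers:
        - name: analytics
          image: example.com/analytics:latest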
You could do a dedicated node pool and limit the egress to those nodes, but having a separate cluster seems simpler, as in someone is less likely to provision something incorrectly.
In my experience companies do not trust Kubernetes to the same extent as they'd trust VLANs and VMs. That's probably not entirely fair, but as you can see from many of the other comments, people find managing Kubernetes extremely difficult to get right.
For some special cases you also have regulatory requirements that maybe could be fulfilled by some Kubernetes combination of node pools, namespacing and so on, but it's not really worth the risk.
From dealing with clients wanting hosted Kubernetes, I can only say that 100% of them have been running multiple clusters. Sometimes for good reason, other times because hosting costs were per project and it's just easier to price out a cluster, compared to buying X% of the capacity on an existing cluster.
One customer I've worked with even ran an entire cluster for a single container, but that was done because no one told the developers to not use that service as an excuse to play with Kubernetes. That was its own kind of stupid.
What you just described with one bad actor bringing the entire cluster down is yet another really good reason I’ll never put any serious app on that platform.
K8s requires a flat network addressability model across all containers, meaning anyone can see and call anyone else?
I can see security teams getting uppity about that.
Also budgetary and org boundaries, cloud providers, disaster recovery/hot spares/redundancy/AB hotswap, avoid single tank point of failure.
Addressability is not accessibility. It's easy to control how services talk to each other through NetworkPolicy.
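A minimal sketch (namespace and labels are made up): once a policy selects the api pods, only the peers it allows can reach them.

    # Illustrative only: allow ingress to app=api pods from app=frontend pods;
    # all other ingress to the selected pods is denied.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: api-allow-frontend
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          app: api
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: frontend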
This… sounds remarkably like the problems kubernetes solves.
single tank point of failure should be
single YAML point of failure
mobile autocorrect is super "helpful"
I have completely tanked a kubernetes cluster before. Everything kept working. The only problem was that we couldn’t spin up new containers and if any of the running ones stopped, dns/networking wouldn’t get updated. So for a few hours while we figured out how to fix what I broke, not many issues happened.
So sure, I can kinda see your point, but it feels rather moot. In the cluster, there isn't much that is a single point of failure that also wouldn't be a point of failure in multiple clusters.
> Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access.
Network Policies have solved that, at least for ingress traffic.
Egress traffic is another beast: you can't allow egress traffic to a service, only to pods or IP ranges.
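For illustration (the labels and CIDR are made up), egress rules can only target pod selectors or ipBlocks, not a Service:

    # Sketch only: egress from app=api is limited to app=database pods and one
    # CIDR; there is no way to say "to: this Service" here.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: api-egress
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          app: api
      policyTypes:
        - Egress
      egress:
        - to:
            - podSelector:
                matchLabels:
                  app: database
        - to:
            - ipBlock:
                cidr: 203.0.113.0/24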