rdsubhas 2 months ago

I wouldn't trust the management of this team for anything. They appear totally incompetent in both management and basic half-brain analytical skills. Who in the heck creates a cluster per service per cloud provider, duplicates all the supporting services around it, burns money and sanity in a pit, and blames the tool?

Literally every single decision they listed was to use the given tools in the absolute worst, most incompetent way possible. I wouldn't trust them with a Lego toy set given this track record.

The people who quit didn't quit merely out of burnout. They quit the stupidity of the managers running this s##tshow.

jeffwask 2 months ago

They lost me at "And for the first time in two years, our DevOps team took uninterrupted vacations", which is an abject failure of leadership.

karmajunkie 2 months ago

that tends to be the take on most “k8s is too complex” articles, at least the ones i’ve seen.

yes, it's complex, but it's simpler than running true high availability setups without something like it to standardize the processes and components needed. what i want to see is a before-and-after postmortem from teams that dropped it, comparing numbers like outages, to get at the whole truth of their experience.

InDubioProRubio 2 months ago

Complexity is a puzzle and attracts a certain kind of easily bored dev, who also has that rockstar flair, selling it to management - then quitting (cause bored), leaving a group of wizard-prophet-worshippers to pray to the k8s goddess at the monolith circle at night. And you cannot admit, as management, that you went all in on a guru and a cult.

withinboredom 2 months ago

Then they hire a different cult leader, one that can clean up the mess and simplify it for the cult that was left behind. The old cult will question their every motive, hinder them with questions about how they could ever make it simpler. Eventually, once the last old architecture is turned off, they will see the error of their ways. This new rock star heads off to the next complicated project.

A new leader arrives and says, “we could optimize it by…”

notnmeyer 2 months ago

…i’m sorry, what? k8s can be as simple or complex as you make it.

figassis 2 months ago

Why exactly did they have 47 clusters? One thing I noticed (maybe because I’m not at that scale) is that companies are running 1+ clusters per application. Isn’t the point of kubernetes that you can run your entire infra in a single cluster, and at most you’d need a second cluster for redundancy, and you can spread nodes across regions and AZs and even clouds?

I think the bottleneck is networking and how much crosstalk your control nodes can take, but that’s your architecture team’s job?

mrweasel 2 months ago

> Isn’t the point of kubernetes that you can run your entire infra in a single cluster

I've never seen that, but yes, 47 seems like a lot. Often you'd need production, staging, test, development, something like that. Then you'd add an additional cluster for running auxiliary services, that is, services that have special network access or are not related to your "main product". Maybe a few of these. Still, that's a long way from 47.

Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access. It could also be as simple as not all staff being allowed to access the same cluster, due to regulatory concerns. Also you might not want internal tooling to run on the public facing production cluster.

You also don't want one service, either due to misconfiguration or design flaws, taking down everything because you placed all your infrastructure in one cluster. I've seen Kubernetes crash because some service spun out of control and caused the networking pods to crash, taking out the entire cluster. You don't really want that.

Kubernetes doesn't really provide the same type of isolation as something like VMware, or at least it's not trusted to the same extent.

anthonybsd 2 months ago

>Often you'd need production, staging, test, development, something like that.

Normally in K8s, segregating environments is done via namespaces, not clusters (unless there are some very specific resource constraints).
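
A rough sketch of what that usually looks like (environment names and quota numbers made up) - each environment gets its own namespace, optionally with a ResourceQuota so one environment can't starve the others:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: staging              # one namespace per environment (staging, prod, ...)
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: staging-quota
      namespace: staging
    spec:
      hard:
        requests.cpu: "10"       # cap what this environment can request
        requests.memory: 20Gi
        pods: "50"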

Elidrake24 2 months ago

Which in many cases would break SOC2 compliance (co-mingling of development and customer resources), and even goes against the basic advice offered in the K8s manual. Beyond that, this limits your ability to test Control Plane upgrades against your stack, though that has generally been very stable in my experience.

To be clear I'm not defending the 47 Cluster setup of the OP, just the practice of separating Development/Production.

withinboredom 2 months ago

Why would you commingle development and customer resources? A k8s cluster is just a control plane that controls where things are running, and if you specify that they can't share resources, that's the end of that.

If you say that sharing the same control plane is commingling… then what do you think a cloud console is? And if you are using different accounts there… then I hope you are using dedicated resources for absolutely everything in prod (can't imagine what you'd pay for dedicated S3, SQS) because god forbid those two accounts end up on the same machine. Heh, you are probably violating compliance and didn't even know it!

Sigh. I digress.

_hl_ 2 months ago

The frustrating thing with SOC2, or pretty much most compliance requirements, is that they are less about what’s “technically true”, and more about minimizing raised eyebrows.

It does make some sense though. People are not perfect, especially in large organizations, so there is value in just following the masses rather than doing everything your own way.

withinboredom 2 months ago

Yes. But it also isn’t a regulation. It is pretty much whatever you say it is.

bigfatkitten 2 months ago

The problem is you need to be able to convince the auditor that your controls meet the requirement. That's a much easier discussion to have with robust logical or physical separation.

bigfatkitten 2 months ago

> And if you are using different accounts there

Which for separating dev and prod, you absolutely should be.

(Separate accounts for AWS; separate projects would suffice for GCP.)

bdndndndbve 2 months ago

I would want to have at least dev + prod clusters, sometimes people want to test controllers or they have badly behaved workloads that k8s doesn't isolate well (making lots of giant etcd objects). You can also test k8s version upgrades in non-prod.

That said it sounds like these people just made a cluster per service which adds a ton of complexity and loses all the benefits of k8s.

withinboredom 2 months ago

In this case, I use a script to spin up another production cluster, perform my changes, and send some traffic to it. If everything looks good, we shift over all traffic to the new cluster and shut down the old one. Easy peasy. Have you turned your pets into cattle only to create a pet ranch?

mmcnl 2 months ago

Sometimes there are requirements to separate clusters on the network level.

marcosdumay 2 months ago

You always want lots of very specific resource constraints between those.

nwatson 2 months ago

The constraint often would be regulatory. Even if isolation is technically possible, management won't risk SOC2 or GDPR non-compliance.

Zambyte 2 months ago

SOC2 is voluntary, not regulatory.

bigfatkitten 2 months ago

It's not voluntary if your customers have signed contracts with you on the basis that you gain and maintain that certification. And if they haven't, you shouldn't have wasted your money.

AtlasBarfed 2 months ago

K8s requires a flat network addressability model across all containers, meaning anyone can see and call anyone else?

I can see security teams getting uppity about that.

Also budgetary and org boundaries, cloud providers, disaster recovery/hot spares/redundancy/AB hotswap, avoid single tank point of failure.

wbl 2 months ago

Addressability is not accessibility. It's easy to control how services talk to each other through NetworkPolicy.
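
A minimal sketch (namespace and labels invented): only pods labeled app=frontend may reach the api pods, everything else gets dropped:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: api-allow-frontend
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          app: api               # the pods this policy protects
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: frontend  # only frontend pods may connect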

withinboredom 2 months ago

This… sounds remarkably like the problems kubernetes solves.

AtlasBarfed 2 months ago

single tank point of failure should be

single YAML point of failure

mobile autocorrect is super "helpful"

withinboredom 2 months ago

I have completely tanked a kubernetes cluster before. Everything kept working. The only problem was that we couldn’t spin up new containers and if any of the running ones stopped, dns/networking wouldn’t get updated. So for a few hours while we figured out how to fix what I broke, not many issues happened.

So sure, I can kinda see your point, but it feels rather moot. In the cluster, there isn't much that is a single point of failure that also wouldn't be a point of failure in multiple clusters.

rootlocus 2 months ago

> Then you'd add an additional cluster for running auxiliary services, that is, services that have special network access or are not related to your "main product". Maybe a few of these. Still, that's a long way from 47.

Why couldn't you do that with a dedicated node pool, namespaces, taints and affinities? This is how we run our simulators and analytics within the same k8s cluster.
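
Roughly like this (just a sketch - the pool label, taint and image are made up): taint the dedicated nodes, then have only the auxiliary workloads tolerate and select them:

    # taint the dedicated nodes first, e.g.:
    #   kubectl taint nodes <node> workload=aux:NoSchedule
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: simulator
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: simulator
      template:
        metadata:
          labels:
            app: simulator
        spec:
          nodeSelector:
            pool: aux            # land only on the dedicated pool
          tolerations:
            - key: workload
              value: aux
              effect: NoSchedule # tolerate that pool's taint
          containers:
            - name: simulator
              image: example/simulator:latest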

mrweasel 2 months ago

You could do a dedicated node pool and limit the egress to those nodes, but it seems simpler - as in, someone is less likely to provision something incorrectly - to have a separate cluster.

In my experience companies do not trust Kubernetes to the same extent as they'd trust VLANs and VMs. That's probably not entirely fair, but as you can see from many of the other comments, people find managing Kubernetes extremely difficult to get right.

For some special cases you also have regulatory requirements that maybe could be fulfilled by some Kubernetes combination of node pools, namespacing and so on, but it's not really worth the risk.

From dealing with clients wanting hosted Kubernetes, I can only say that 100% of them have been running multiple clusters. Sometimes for good reason, other times because hosting costs were per project and it's just easier to price out a cluster, compared to buying X% of the capacity on an existing cluster.

One customer I've worked with even ran an entire cluster for a single container, but that was done because no one told the developers to not use that service as an excuse to play with Kubernetes. That was its own kind of stupid.

whstl 2 months ago

Indeed. My previous company did this due to regulatory concerns.

One cluster per country in prod, one cluster per team in staging, plus individual clusters for some important customers.

A DevOps engineer famously pointed out that it was stupid since they could access everything with the same SSO user anyway, and the CISO demanded individual accounts with separate passwords and separate SSO keys.

supersixirene 2 months ago

What you just described with one bad actor bringing the entire cluster down is yet another really good reason I’ll never put any serious app on that platform.

mschuster91 2 months ago

> Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access.

Network Policies have solved that at least for ingress traffic.

Egress traffic is another beast: you can't allow egress traffic to a service, only to pods or IP ranges.
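
So an egress rule ends up looking something like this (sketch, CIDR and labels invented) - you can point it at an ipBlock or a podSelector, but never at a Service by name:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: billing-egress
    spec:
      podSelector:
        matchLabels:
          app: billing
      policyTypes:
        - Egress
      egress:
        - to:
            - ipBlock:
                cidr: 10.20.0.0/16        # an IP range, not a Service name
        - to:
            - podSelector:
                matchLabels:
                  app: payments-gateway   # or other pods, by label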

mrweasel 2 months ago

I was thinking egress, but you're correct on ingress.

this_user 2 months ago

It's just a matter of time before someone releases an orchestration layer for k8s clusters so the absurd Rube Goldberg machine that is modern devops stacks can grow even more complex.

teeray 2 months ago

We can call it k8tamari

englishspot 2 months ago

UltraSane 2 months ago

Is it ever a good idea to actually use this?

processunknown 2 months ago

Istio is not a k8s orchestration layer though

jhklkjj 2 months ago

This has already happened. There are orgs managing hundreds to thousands of k8s clusters.

mad_vill 2 months ago

cluster-api in a nutshell

torginus 2 months ago

Because they have k8s engineers each of whom wants to put it on their resume that they designed and implemented a working cluster in prod.

astura 2 months ago

Resume Driven Development.

AstroJetson 2 months ago

Wouldn't that give you a prod mix of Kubernetes, Docker, Fargate, Rancher, etc?

zelphirkalt 2 months ago

Maybe they had 47 different Kubernetes consultants coming in sequentially and each one found something to do different from the last one, but none of them got any time to update their predecessor's stuff.

pm90 2 months ago

There are genuine reasons for running multiple clusters. It helps to sometimes keep stateful (databases generally) workloads on one cluster, have another for stateless workloads etc. Sometimes customers demand complete isolation so they get their own cluster (although somehow it's OK that the nodes are still VMs that are probably running on shared hosts… these requirements can be arbitrary sometimes).

raverbashing 2 months ago

This doesn't make sense to me, and I feel like this is "holding it wrong"

But this is also a caveat of "managing it yourself": you get a lot of people having ideas and shooting themselves in the foot with it.

oofbey 2 months ago

Honest question - why would you want stateful workloads in a separate cluster from stateless? Why not just use a namespace?

pm90 2 months ago

because namespaces aren’t a failure boundary.

If your api gets hosed, you can create a new cluster and tear down the old one and call it a day.

With a stateful cluster, you can't do that. As such, you put in a lot more care with e.g. k8s upgrades, or with introducing new controllers or admission/mutating webhooks.

yjftsjthsd-h 2 months ago

> It helps to sometimes keep stateful (databases generally) workloads on one cluster, have another for stateless workloads etc. Sometimes customers demand complete isolation

Are these not covered by taint/toleration? I guess maybe isolation depending on what exactly they're demanding but even then I'd think it could work.

pm90 2 months ago

Yes, but only at the node (data plane) level, not at the API (control plane) level.

lenkite 2 months ago

The most important reason is the ridiculous etcd limit of 8 GB. That alone is the reason for most k8s cluster splits.

mdaniel 2 months ago

I hate etcd probably more than most, but that 8 GB seems to just be a warning, unless you have information otherwise: https://etcd.io/docs/v3.5/dev-guide/limit/#storage-size-limi...
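
(For reference, the limit in question is etcd's backend quota, which is configurable - a sketch of raising it via the etcd config file; 2 GB is the default and ~8 GB the suggested ceiling:)

    # etcd.conf.yml
    quota-backend-bytes: 8589934592   # 8 GiB; 0 means the 2 GiB default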

I'll take this opportunity to once again bitch and moan that Kubernetes just fucking refuses to allow the KV store to be pluggable, unlike damn near everything else in their world, because they think that's funny or something

lenkite 2 months ago

It isn't a mere warning. It is strongly recommended as the upper limit.

https://www.perfectscale.io/blog/etcd-8gb

https://github.com/etcd-io/etcd/issues/9771

And yes, I agree not allowing a pluggable replacement is really stupid.

mdaniel 2 months ago

> https://github.com/etcd-io/etcd/issues/9771

> stale bot marked this as completed (by fucking closing it)

Ah, yes, what would a Kubernetes-adjacent project be without a fucking stale bot to close issues willy nilly

wasmitnetzen 2 months ago

Yes, a few, maybe even 10, 12, but 47? It's also a prime number, so it's not something like running each thing three times for dev, stage and prod.

pm90 2 months ago

yeahhhh 47 seems insane

rvense 2 months ago

> It helps

Can you elaborate on that? What does it do?

kgeist 2 months ago

We have 1 cluster per region (3 regions: Asia, US, EU), multiplied by redundancy clusters = 6 clusters.

Then we have test environments, around 20 of them: 20 clusters.

Then there are 10s of clusters installed on customers' infra.

So 47 clusters isn't really a huge/strange number.

gitaarik 2 months ago

Why 20 test clusters? Do you have more developers than users?

fragmede 2 months ago

Curious how the 20 test clusters differ.

jan_g 2 months ago

I'm wondering the same. Either they are quite a big company, so such infrastructure comes naturally from many products/teams or their use case is to be in the clusters business (provisioning and managing k8s clusters for other companies). In both cases I'd say there should be a dedicated devops team that knows their way around k8s.

Other than that, the experience I have is that using a managed solution like EKS and one cluster per env (dev, staging, prod) with namespaces to isolate projects takes you a long way. Having used k8s for years now, I'm probably biased, but in general I disagree with many of the k8s-related posts that are frequently upvoted on the front page. I find it gives me freedom, I can iterate fast on services, change structure easily without worrying too much about isolation, networking and resources. In general I feel more nimble than I used to before k8s.

m00x 2 months ago

Yeah, it's not very smart. I'm at a company with a $50B+ MC and we run prod on one cluster, staging on another, and then there are tooling clusters for things like dev spaces, ML, etc. I think in total we have 6 or 7 for ~1000 devs and thousands of nodes.

It makes sense that getting off k8s helped them if they were using it incorrectly.

thephyber 2 months ago

Don’t know about the writer of the article, but there are some legit reasons to use multiple K8s clusters. Single-tenant environments, segregation of resources into different cloud accounts, redundancy (although there are probably lots of ways to do most of this within a single cluster), 46 different developer / QA / CI clusters, etc.

Foobar8568 2 months ago

I had a client who had a single K8s cluster; it was too much ops for the team, so their idea was to transfer that to each product dev team, and thus was born the one-K8s-cluster-per-product setup. They had at least a few hundred products.

andix 2 months ago

Isn't one of the strategies also to run one or two backup clusters for any production cluster? Which can take over the workloads if the primary cluster fails for some reason?

In a cloud environment the backup cluster can be scaled up quickly if it has to take over, so while it's idling it only requires a few smaller nodes.

fragmede 2 months ago

You might run a cluster per region, but the whole point of Kubernetes is that it's highly available. What specific piece are you worried will go down in one cluster such that you need two production clusters all the time? Upgrades are a special case where I could see spinning up a backup cluster.

andix 2 months ago

A lot of things can break (hardware, networking, ...). Spanning the workload over multiple clusters in different regions is already satisfying the "backup cluster" recommendation.

Many workloads don't need to be multi-region as a requirement. So they might run just on one cluster with the option to fail over to another region in case of an emergency. Running a workload on one cluster at a time (even with some downtime for a manual failover) makes a lot of things much easier. Many workloads don't need 99.99% availability, and nothing awful happens if they are down for a few hours.

mmcnl 2 months ago

In my company (large corporate) we only have 3 clusters: dev/acc/prod. Everything runs on it. I love it.

tonfreed 2 months ago

I work for a large corp and we have three for apps (dev, integrated testing, prod) plus I think two or three more for the platform team that I don't interact with. 47 seems horrendously excessive

levifig 2 months ago

To answer your question directly: yes, that's the point. You may have different clusters for different logical purposes but, yes: fewer clusters, more node groups is a better practice.

karmarepellent 2 months ago

We ran only two (very small) clusters for some time in the past and even then it introduced some unnecessary overhead on the ops side and some headaches on the dev side. Maybe they were just growing pains, but if I have to run Kubernetes again I will definitely opt for a single large cluster.

After all, Kubernetes provides all the primitives you need to enforce separation. You wouldn't create separate VMware production and test clusters either unless you had a good reason.

tinco 2 months ago

You need a separate cluster for production because there are operations you'd do on your staging/QA environments that might accidentally knock out your cluster. I did that once and it was not fun.

I completely agree with keeping everything as simple as possible though. No extra clusters if not absolutely necessary, and also no extra namespaces if not absolutely necessary.

The thing with Kubernetes is that it was designed to support every complex situation imaginable. All these features make you feel as though you should make use of them, but you shouldn't. This complexity leaked into systems like Helm, which is why, in my opinion, it's better to roll your own deployment scripts rather than use Helm.

karmarepellent 2 months ago

Do you mind sharing what these operations were? I can think of a few things that may very well brick your control plane. But at the very least existing workloads continue to function in this case as far as I know. Same with e.g. misconfigured network policies. Those might cause downtimes, but at least you can roll them back easily. This was some time ago though. There may be more footguns now. Curious to know how you bricked your cluster, if you don't mind.

I agree that k8s offers many features that most users probably don't need and may not even know of. I found that I liked k8s best when we used only a few, stable features (only daemonsets and deployments for workloads, no statefulsets) and simple helm charts. Although we could have probably ditched helm altogether.

thephyber 2 months ago

You can’t roll back an AWS EKS control plane version upgrade. “Measure twice, cut once” kinda thing.

And operators/helm charts/CRDs use APIs which can and are deprecated, which can cause outages. It pays to make sure your infrastructure is automated with Got apps, CICD, and thorough testing so you can identify the potential hurdles before your cluster upgrade causes unplanned service downtime.

It is a huge effort just to “run in place” with the current EKS LTS versions if your company has lots of 3rd party tooling (like K8s operators) installed and there isn’t sufficient CICD+testing to validate potential upgrades as soon after they are released.

3rd party tooling is frequently run by open source teams, so they don't always have the resources or desire/alignment to stay compatible with the newest version of K8s. Also, when a project goes idle/disbands/fractures into rival projects, that can cost infra/ops teams time evaluating the replacement/substitute projects that are going to be a better solution going forward. We recently ran into this with the operator we had originally installed to run Cassandra.

thephyber 2 months ago

`s/Got apps/GitOps/`

tinco 2 months ago

In my case, it was the ingress running out of subdomains, because each staging environment would get its own subdomain and our system had a bug that caused them not to be cleaned up. So the CI/CD was leaking subdomains; eventually the list became too long and it bumped the production domain off the list.

oblio 2 months ago

Kubernetes upgrades? Don't those risk bricking everything with just 1 environment?

karmarepellent 2 months ago

In theory: absolutely. This is just anecdata and you are welcome to challenge me on it, but I have never had a problem upgrading Kubernetes itself. As long as you trail one version behind the latest, to ensure critical bugs are fixed before you risk running into them yourself, I think you are good.

Edit: To expand on it a little bit. I think there is always a real, theoretical risk that must be taken into account when you design your infrastructure. But when experience tells you that accounting for this potential risk may not be worth it in practice, you might get away with discarding it and keeping your infra lean. (Yes, I am starting to sweat just writing this).

mst 2 months ago

"I am cutting this corner because I absolutely cannot make a business case I believe in for doing it the hard (but more correct) way but believe me I am still going to be low key paranoid about it indefinitely" is an experience that I think a lot of us can relate to.

I've actually asked for a task to be reassigned to somebody else before now on the grounds that I knew it deserved to be done the simple way but could not for the life of me bring myself to implement that.

(the trick is to find a colleague with a task you *can* do that they hate more and arrange a mutually beneficial swap)

karmarepellent 2 months ago

Actually I think the trick is to change one's own perspective on these things. Regardless of how many redundancies and how many 9's of availability your system theoretically achieves, there is always stuff that can go wrong for a variety of reasons. If things go wrong, I am faster at fixing a not-so-complex system than the more complex system that should, in theory, be more robust.

Also, I have yet to experience an outage of any kind that had negative consequences for me personally. As long as you stand by the decisions you made in the past and show a path forward, people (even the higher-ups) are going to respect that.

Anticipating every possible issue that might or might not occur during the lifetime of an application just leads to over-engineering.

I think rationalizing it a little bit may also help with the paranoia.

pickle-wizard 2 months ago

At my last job we had a Kubernetes upgrade go so wrong we ended up having to blow away the cluster and redeploy everything. Even a restore of the etcd backup didn't work. I couldn't tell you exactly what went wrong, as I wasn't the one who did the upgrade and I wasn't around for the RCA on this one. As the fallout was the straw that broke the camel's back, I ended up quitting to take a sabbatical.

merpkz 2 months ago

Why would those brick everything? You update nodes one by one and take it slow, so issues will become apparent after each upgrade and you have time to solve them - that's the whole point of having clusters comprised of many redundant nodes.

karmarepellent 2 months ago

I think it depends on the definition of "bricking the cluster". When you start to upgrade your control plane, your control plane pods restart one after another, and not only those on the specific control plane node. So at this point your control plane might not respond anymore if you happen to run into a bug or some other issue. You might call it "bricking the cluster", since it is not possible to interact with the control plane for some time. Personally I would not call it "bricked", since your production workloads on worker nodes continue to function.

Edit: And even when you "brick" it and cannot roll back, there is still a way to bring your control plane back by using an etcd backup, right?

mrweasel 2 months ago

Not sure if this has changed, but there have been companies admitting to simply nuking Kubernetes clusters if they fail, because it does happen. The argument, which I completely believe, is that it's faster to build a brand new cluster than to debug a failed one.

nasmorn 2 months ago

I had this happen on a small scale and it scared me a lot. It felt like your executable suddenly falling apart and you now need to fix it in assembly. My takeaway was that the k8s abstraction is way leakier than it is made out to be

motbus3 2 months ago

Exactly. You might want to have one cluster per environment so you can test your deployments and rollback plans

timhigins 2 months ago

If you have 200 YAML files for a single service and 46 clusters I think you're using k8s wrong. And 5 + 3 different monitoring and logging tools could be a symptom of chaos in the organization.

k8s, the Go runtime, and the network stack have been heavily optimized by armies of engineers at Google and big tech, so I am very suspicious of these claims without evidence. Show me the resource usage from k8s component overhead, and the 15-minute-to-3-minute deploys, and then I'll believe you. And the 200-file YAML or Helm charts, so I can understand why in god's name you're doing it that way.

This post just needs a lot more details. What are the typical services/workloads running on k8s? What's the end user application?

I taught myself k8s in the first month of my first job, and it felt like having super powers. The core concepts are very beautiful, like processes on Linux or JSON APIs over HTTP. And it's not too hard to build a CustomResourceDefinition or dive into the various high performance disk and network IO components if you need to.

I do hate Helm to some degree but there are alternatives like Kustomize, Jsonnet/Tanka, or Cue https://github.com/cue-labs/cue-by-example/tree/main/003_kub.... You can even manage k8s resources via Terraform or Pulumi
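
For example, a bare-bones Kustomize layout (file names invented) instead of a templated chart:

    # base/kustomization.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - deployment.yaml
      - service.yaml

    # overlays/prod/kustomization.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - ../../base
    patches:
      - path: replica-count.yaml   # e.g. just bump replicas for prod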

namaria 2 months ago

> I do hate Helm to some degree

I feel you. I learned K8s with an employer where some well-intentioned but misguided back end developers decided that their YAML deployments should ALL be templated and moved into Helm charts. It was bittersweet to say the least, learning all the power of K8s but having to argue and feel like an alien for saying that templating everything was definitely not going to make everything easier in the long term.

Then again they had like 6 developers and 30 services deployed and wanted to go "micro front end" on top of it. So they clearly had misunderstood the whole thing. CTO had a whole spiel on how "microservices" were a silver bullet and all.

I didn't last long there, but they paid me to learn some amazing stuff. In retrospect, they also taught me a bunch of lessons on how not to do things.

brainzap 2 months ago

feel the same about helm, it enables versioning and simpler deploys but misses core features

m1keil 2 months ago

How to save 1M off your cloud infra? Start from a 2M bill.

That's how I see most of these projects. You create a massively expensive infra because webscale, then 3 years down the road you (or someone else) gets to rebuild it 10x cheaper. You get to write two blog posts, one for using $tech and one for migrating off $tech. A line in the cv and a promotion.

But kudos to them for managing to stop the snowball and actually reverting course. Most places wouldn't dare because of sunk costs.

saylisteins 2 months ago

I don't think that's necessarily a problem. When starting a new product, time to market as well as identifying user needs feature-wise is way more important than being able to scale "infinitely".

It makes sense to use whatever $tech helps you get an MVP out asap and iterate on it. Once you're sure you've found gold, then it makes sense to optimize for scale. The only thing I guess one has to worry about when developing something like that is to make sure good scalability is possible with some tinkering and effort, and not totally impossible.

m1keil 2 months ago

I agree with you. I'm not advocating for hyper cost optimization at an early stage startup. You probably don't need k8s to get your mvp out of the door either.

marcinzm 2 months ago

The article says they spend around $150k/year on infra. Given they have 8 DevOps engineers I assume a team of 50+ engineers. Assuming $100k/engineer that's $5 million/year in salary. Those are all low-end estimates.

They saved $100k in the move or 2% of their engineering costs. And they're still very much on the cloud.

If you tell most organizations that they need to change everything to save 2% they'll tell you to go away. This place benefited because their previous system was badly designed and not because it's cloud.

m1keil 2 months ago

I'm not making an argument against the cloud here. Not saying you should move out. The reason why I call out cloud infrastructure specifically is because of how easy it is to let the costs get away from you in the cloud. This is a common thread in every company that uses the cloud. There is a huge amount of waste. And this company's story isn't different.

By the way, 8 DevOps engineers with a $150k/year cloud bill deserves to be highlighted here. This is a very high amount of staff dedicated to a relatively small infrastructure setup in an industry that keeps saying "cloud will manage that for you."

_bare_metal 2 months ago

To expand on this: I run BareMetalSavings.com[0], and the most common reason people stay with the cloud is that it's very hard for them to maintain their own K8s cluster(s), which they want to keep because they're great for any non-ops developer.

So those savings are possible only if your devs are willing to leave the lock-in of comfort.

[0]: https://BareMetalSavings.com

marcinzm 2 months ago

Cloud isn't about comfort lock in but dev efficiency.

_bare_metal 2 months ago

Not every use case is more efficient on the cloud

Havoc 2 months ago

How do you end up with 200 YAML files of "basic deployments" without anyone looking up from their keyboard and muttering "guys, what are we doing"?

Honestly, they could have picked any stack as the next one, because the key win here was starting from scratch.

StressedDev 2 months ago

This is not that surprising. First, it depends on how big the YAML files were and what was in them. If you have 200 services, I could easily see 200 YAML files. Second, there are non-service reasons to have YAML files. You might have custom roles, ingresses, volumes, etc. If you do not use something like Helm, you might also have 1 YAML file per environment (not the best idea but it happens).

My suspicion is the original environment (47 Kubernetes clusters, 200 YAML files, unreliable deployments, using 3 clouds, etc.) was not planned out. You probably had multiple teams provisioning infrastructure, half completed projects, and even dead clusters (clusters which were used once but were not destroyed when they were no longer used).

I give the DevOps team in the article a lot of credit for increasing reliability, reducing costs, and increasing efficiency. They did good work.

riffraff 2 months ago

> If you have 200 services, I could easily see 200 YAML files.

Out of curiosity, in what case would you _not_ see 200 files for 200 services? Even with Helm, you'd write a chart per app wouldn't you?

zo1 2 months ago

I've seen much-lauded "Devops" or "platform" teams spend two months writing 500+ files for 3 simple python services, 5 if you include two databases.

We could have spent a tiny fraction of that 10-dev-months to deploy something to production on a bare VM on any cloud platform in a secure and probably very-scalable way.

These days I cringe and shudder every time I hear someone mention writing "helm charts", or using the word "workloads".

ozim 2 months ago

Every guy that joins or starts a new project - instead of reading and getting familiar with what is available - does his own stuff.

I see this happening all the time, and unless you really have DevOps or SysAdmins who are willing to act like 'assholes' enforcing rules, it is going to be like that.

Of course 'assholes' is in quotes because they have to be firm and deny a lot of crap to keep the setup clean - but then they will also be assholes to some who "just want to do stuff".

torton 2 months ago

“We want to use one standard Helm chart for all applications but then we need it to support all possible variations and use cases across the whole company”

Havoc 2 months ago

Can’t fix organisational problems with yaml ;)

fireflash38 2 months ago

But can you fix them with templated yaml?

marcosdumay 2 months ago

Just go fully functional document generation languages, make your world there, and forget about the organizational problems.

Havoc 2 months ago

Only if you sprinkle some office politics on top of the template

risson 2 months ago

So they made bad architecture decisions, blamed it on Kubernetes for some reason, and then decided to rebuild everything from scratch. Solid. The takeaway being what? Don't make bad decisions?

whynotmaybe 2 months ago

My personal takeaway: when it fails and you can't blame anyone for any reason, blame a tool.

You'll get the whole team helping you when replacing the tool and setting up a better solution.

If you blame anyone, people will start to be extra cautious and won't take any initiative.

But don't overuse it; if you always blame the tool, you'll end up like my ex-colleague "Steve", for whom every failure was Microsoft's fault.

mst 2 months ago

I've always been fond of blaming myself and asking everybody else to help make sure I don't cock it up a second time - when it works out I get lots of help, lots of useful feedback, and everybody else feels good about putting the effort in.

This does require management who won't punish you for recording it as your fault, though. I've been fairly lucky in that regard.

Aperocky 2 months ago

Your ex-colleague may not be factually correct, but I agree with him in spirit.

whynotmaybe 2 months ago

If you use PowerShell, what's your reaction when you delete some stuff using the "-Force" parameter and it gets deleted?

Steve usually said that Microsoft should ask for a confirmation before deleting anything, even with the "-Force" parameter.

It was Microsoft's fault when the whole test environment, that he spent two days setting up, was deleted with the "-Force" parameter. He said something along the lines of "Microsoft shouldn't let me do this".

StressedDev 2 months ago

I think the takeaway was Kubernetes did not work for their team. Kubernetes was probably not the root problem but it sounds like they simplified their infrastructure greatly by standardizing on a small set of technologies.

Kubernetes is not an easy to use technology and sometimes its documentation is lacking. My gut feeling is Kubernetes is great if you have team members who are willing to learn how to use it, and you have a LOT of different containers to run. It probably is not the best solution for small to medium sized teams because of its complexity and cost.

ali_piccioni 2 months ago

It highlights a classic management failure that I see again and again and again: Executing a project without identifying the prerequisite domain expertise and ensuring you have the right people.

namaria 2 months ago

Well, understanding the problem and finding competent people is hard; riding on tool marketing and hiring bootcampers to do as you say is easy.

danjl 2 months ago

The third version works

edude03 2 months ago

Like most tech stories, this had pretty much nothing to do with the tool itself and everything to do with the people/organization. The entire article can be summarized with this one quote:

> In short, organizational decisions and an overly cautious approach to resource isolation led to an unsustainable number of clusters.

And while I empathize with how they could end up in this situation, it feels like a lot of words were spent blaming the tool choice vs. being a cautionary tale about, for example, planning and communication.

marcinzm 2 months ago

In my experience, organizations that end up this way have anything but a blame-free culture. It can be driven by founders who lack technical skills and management experience but have a type-A personality. As a result no one wants to point out a bad decision because the person who made it will get reprimanded heavily. So they go down a path that is clearly wrong until they find a way to blame something external to reset. Usually that's someone who recently left the company or some tool choice.

mst 2 months ago

The article reads to me as pretty explicitly saying that the only real takeaway wrt k8s itself is "it was the wrong choice for us and then we compounded that wrong choice by making more wrong choices in how we implemented it."

edude03 2 months ago

Maybe I'm reading it with rose coloured glasses - but I feel like the only thing kubernetes "did wrong" is allowing them to host multiple control planes. Yes, you need 3+ CP instances for HA, but the expectation is you'd have 3 CP instances for X (say 10) workers for Y (say 100) apps. Their implied ratio was insane in comparison.

Since you can't run the Fargate control plane yourself, that indirectly solved the problem for them.

millerm 2 months ago

Thank you. I abhor Medium. I need to read this clusterf** though, as I read some comments and I have to witness this ineptitude myself.

deskr 2 months ago

So many astonishing things were done ...

> As the number of microservices grew, each service often got its own dedicated cluster.

Wow. Just wow.

jerf 2 months ago

That's not a microservice; that's a macroservice.

misswaterfairy 2 months ago

Distributed cloud-native monolith?

paxys 2 months ago

Are the people who decided to spin up a separate kubernetes cluster for each microservice still employed at your organization? If so, I don't have high hopes for your new solution either.

m00x 2 months ago

I feel like OP would've been much better off if they just reworked their cluster into something sensible instead of abandoning K8s completely.

I've worked on both ECS and K8s, and K8s is much better. All of the problems they listed were poor design decisions, not k8s limitations.

- 47 Clusters: This is insane. They ack it in the post, but they could've reworked this.

- Multi-cloud: it's no longer possible with ECS, but they could've limited complexity with just single-cloud k8s.

paulddraper 2 months ago

47 clusters is major "we didn't know what we were doing"

grepfru_it 2 months ago

At my last gig we were in the process of sunsetting 200+ clusters. We allowed app teams to request and provision their own cluster. That 3 year experiment ended with a migration down to 24ish clusters (~2 clusters per datacenter)

nasmorn 2 months ago

How is K8S better for simple workloads? ECS works fine and has way fewer knobs to turn.

rootlocus 2 months ago

> As the number of microservices grew, each service often got its own dedicated cluster

mdaniel 2 months ago

Be forewarned that one is flagged, so contributing comments to it may be a losing proposition

I thought for sure you were going to link to https://news.ycombinator.com/item?id=42226005 (Dear friend, you have built a Kubernetes; Nov 24, 2024; 267 comments)

mercurialuser 2 months ago

We have 3 clusters, prod, dev, test, with a few pods each.

Each cluster is wasting tons of CPU and I/O bandwidth just to sit idle. I was told that it is etcd doing thousands of I/O operations per second and that this is normal.

For a few monoliths.

huksley 2 months ago

47 clusters? Is that per developer? You could manage small, disposable VPSes for every developer/environment, etc., and only have a Kubernetes cluster for the production environment...

phendrenad2 2 months ago

Too bad the author and company are anonymous. I'd like to confirm my assumption that the author has zero business using k8s at all.

Infrastructure is a lost art. Nobody knows what they're doing. We've entered an evolutionary spandrel where "more tools = better", meaning the candidate for an IT role who swears by 10 k8s tools is always better than the one who can fix your infra but will also remove k8s because it's not helping you at all.

xenospn 2 months ago

I’ve been building software for 25 years across startups and giant megacorps alike, and I still don’t know what Kubernetes is.

smitelli 2 months ago

Kubernetes, in a nutshell, is a dozen `while true` loops all fighting with each other.

executesorder66 2 months ago

And if in the next 25 years you figure out how to use a search engine, you might find out! :p

xenospn 2 months ago

Maybe one day I’ll have a reason to!

bklw 2 months ago

The leaps in this writing pain me. There are other aspects, but they’ve been mentioned enough.

Vendor lock-in does not come about by relying on only one cloud, but by adopting non-standard technology and interfaces. I do agree that running on multiple providers is the best way of checking whether there is lock-in.

Lowering the level of sharing further by running per-service and per-stage clusters, as mentioned in the piece, was likewise at best an uninformed decision.

Naturally moving to AWS and letting dedicated teams handle workload orchestration at much higher scale will yield better efficiencies. Ideally without giving up vendor-agnostic deployments by continuing the use of IaC.

ribadeo 2 months ago

Sensible. Kubernetes is an anti-pattern, along with containerized production applications in general.

- replicates OS services poorly

- the OS is likely already running on a hypervisor divvying up hardware resources into VPSs

- wastes RAM and CPU cycles

- forces kubectl onto everything

- destroys the integrity of basic kernel networking principles

- takes advantage of developer ignorance of the OS and reinforces it

I get it, it's a handy hack for non-production services or one-off installs, but then it's basically just a glorified VM.

Jean-Papoulos 2 months ago

>$25,000/month just for control planes

To get to this point, someone must have fucked up way earlier by not knowing what they were doing. Don't do k8s, kids!

cies 2 months ago

I looked into K8s some years back and found so many new concepts that I thought: is our team big enough for so much "new"?

Then I read someone saying that K8s should never be used by teams of fewer than 20 FTE, and that it will require 3 people to learn it for redundancy (in case it's used to self-host a SaaS product). This seemed like really good advice.

Our team is smaller than 20FTE, so we use AWS/Fargate now. Works like a charm.

teekert 2 months ago

I can’t post a translated Dutch website on HN without it being shadow blocked, yet one can post stuff like this. Still love HN of course!

dddw 2 months ago

432 error: kaaskop not allowed

woile 2 months ago

What else is out there? I'm running Docker Swarm and it's extremely hard to make it work with IPv6. I'm running my software on a 1 GB RAM cloud instance for which I pay 4 EUR/month, and k8s requires at least 1 GB of RAM.

As of now, it seems like my only alternative is to run k8s on a system with 2 GB of RAM, so I'm considering moving to Hetzner just to run k3s or k0s.

justinclift 2 months ago

Are you not using docker swarm behind a reverse proxy?

ie nginx has the public facing ipv4 & ipv6 address(es), with docker swarm behind that communicating to the nginx proxies over ipv4

woile 2 months ago

I have Traefik as part of the docker-compose file. Installing nginx on the host seems less reproducible, though it could fix my problem. I guess I would choose something like Caddy (I'm not that happy with Traefik).

justinclift 2 months ago

I've not used Caddy before personally, but if it does reverse proxying then it'd probably work fine rather than Nginx. :)

rane 2 months ago

I run k3s on a single hetzner ARM64 VPS and it works pretty well for personal projects. 2 vcpu and 4GB ram for 4€/mo.

nisa 2 months ago

I've read this article multiple times now and I'm still not sure if it's just good satire, or if it's real and they can burn money like crazy, or if it's some subtle ad for AWS managed cloud services :)

catdog 2 months ago

Those kind of articles often read like an ad for managed cloud services. "We got rid of that complicated, complicated Kubernetes beast by cobbling together 20 bespoke managed services from provider X which is somehow so much easier".

bdangubic 2 months ago

I would be sooooooo embarrassed to write this article and publish it

scirob 2 months ago

agree, feels like content marketing by AWS

zeroc8 2 months ago

It's the same story over and over again. Nobody gets fired for choosing AWS or Azure. Clueless managers and resume driven developers, a terrible combination. The good thing is that this leaves a lot of room for improvement for small companies, who can out compete larger ones by just not making those dumb choices.

alexyz12 2 months ago

But does improving really help these small companies in a way that matters? If the cost of the infrastructure is apparently not important to the needs of the business...

miyuru 2 months ago

Is this article written by AI? The author's other posts are unrelated, and at the end it promotes a tool that writes AI-powered blog posts.

sontek 2 months ago

I'm interested in reading this but don't want to have to login to medium to do so. Is there an alternate link?

mdaniel 2 months ago

https://scribe.rip/i-stopped-using-kubernetes-our-devops-tea... works for any Medium backed blog, which is almost always given away by the URL ending in a hex digits slug

https://news.ycombinator.com/item?id=28838053 is where I learned about that, with the top comment showing a bazillion bookmarklet fixes. I'd bet dollars to donuts someone has made a scribe.rip extension for your browser, too

b99andla 2 months ago

Medium, so can't read.

ExoticPearTree 2 months ago

Is there a non-paid version of this? The title is a little clickbaity, but reading the comments here, it seems like this is a story of a team that jumped on the k8s bandwagon, made a lot of terrible decisions along the way, and is now blaming k8s for everything.

andrewstuart 2 months ago

When they pulled apart all those kubernetes clusters they probably found a single fat computer would run their entire workload.

“Hey, look under all that DevOps cloud infrastructure! There’s some business logic! It’s been squashed flat by the weight of all the containers and orchestration and serverless functions and observability and IAM.”

StressedDev 2 months ago

I don't think this is a fair comment. This happens sometimes, but usually people go with Kubernetes because they really do need its power and scalability. Also, any technology can be misused and abused. It's not the technology's fault. It's the fault of the people who misuse the technology (*).

(*) Even here, a lot of it comes from ignorance and human nature. 99.9% of people do not set out to fail. They usually do the best they can and sometimes they make mistakes.

andrewstuart 2 months ago

The point is most organisations reach for container orchestration before they attempt to build fast software on a single machine.

rvense 2 months ago

My company has more microservices than paying customers.

FridgeSeal 2 months ago

Given that seemingly half the devs and orgs out there are averse to writing performant software or optimising anything, I somewhat doubt that's going to happen anytime soon. As much as I'd like it to.

Performance isn’t the only reason to use container orchestration tools: they’re handy for just application lifecycle management.

chris_wot 2 months ago

Would have loved to know more about this, but I sure as heck am not going to pay to find out.

bdcravens 2 months ago

Basically they went AWS-native, with ECS being the biggest part of that. I'm currently trying to move our own stack into a simpler architecture, but I can wholeheartedly recommend ECS as a Kubernetes alternative, giving you 80% of the functionality for 20% of the effort.

threePointFive 2 months ago

Does anyone have access to the full article? I'm curious what their alternative was. Terraform? Directly using cloud apis? VMs and Ansible?

phendrenad2 2 months ago

> We selected tools that matched specific workloads:

> Stateless services → Moved to AWS ECS/Fargate.

> Stateful services → Deployed on EC2 instances with Docker, reducing the need for complex orchestration.

> Batch jobs → Handled with AWS Batch.

> Event-driven workflows → Shifted to AWS Lambda.
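
For a sense of what that migration target looks like, a stateless service on Fargate boils down to roughly this kind of task definition (a CloudFormation sketch with invented names; IAM roles, the ECS service and load balancer wiring are omitted, and this is not the article's actual setup):

    Resources:
      ApiTaskDefinition:
        Type: AWS::ECS::TaskDefinition
        Properties:
          Family: api
          RequiresCompatibilities: [FARGATE]
          NetworkMode: awsvpc        # required for Fargate
          Cpu: "256"
          Memory: "512"
          ContainerDefinitions:
            - Name: api
              Image: example/api:latest
              Essential: true
              PortMappings:
                - ContainerPort: 8080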

marcosdumay 2 months ago

So... They outsourced their OPS?

jumpoddly 2 months ago

> 2 team members quit citing burnout

And I would have gotten away with it too if only someone would rid me of that turbulent meddling cluster orchestration tooling!

TheSwordsman 2 months ago

Yeah, I'd bet their burnout risks aren't going to go away by just replacing k8s with ECS.

I also hope they have a good support contract with AWS, otherwise they are going to be in for a fun surprise when ECS has some weird behavior or bug.

nrvn 2 months ago

Kubernetes is not a one-size-fits-all solution, but even the bullet points in the article raise a number of questions. I have been working with Kubernetes since 2016 and try to stay pragmatic about tech. We currently support 20+ clusters with a team of 5 people across 2 clouds plus on-prem. If Kubernetes is fine for a given company/project/business use case/architecture, we'll use it. Otherwise we'll consider whatever fits best for the specific target requirements.

Smelly points from the article:

- "147 false positive alerts" - alert and monitoring hygiene helps. Anything will have a low signal-to-noise ratio if not properly taken care of. Been there, done that.

- "$25,000/month just for control planes / 47 clusters across 3 cloud providers" - multiple questions here. Why so many clusters? Were they provider-managed(EKS, GKE, AKS, etc.) or self-managed? $500 per control plane per month is too much. Cost breakdown would be great.

- "23 emergency deployments / 4 major outages" - what was the nature of emergency and outages? Post mortem RCA summary? lessons learnt?..

- "40% of our nodes running Kubernetes components" - a potential indicator of a huge number of small worker nodes. Cluster autoscaler been used? descheduler been used? what were those components?

- "3x redundancy for high availability" - depends on your SLO, risk appetite and budget. it is fine to have 2x with 3 redundancy zones and stay lean on resource and budget usage, and it is not mandatory for *everything* to be highly available 24/7/365.

- "60% of DevOps time spent on maintenance" - https://sre.google/workbook/eliminating-toil/

- "30% increase in on-call incidents" - Postmortems, RCA, lessons learnt? on-call incidents do not increase just because of the specific tool or technology being used.

- "200+ YAML files for basic deployments" - There are multiple ways to organise and optimise configuration management. How was it done in the first place?

- "5 different monitoring tools / 3 separate logging solutions" - should be at most one for each case. 3 different cloud providers? So come up with a cloud-agnostic solution.

- "Constant version compatibility issues" - if due diligence is not properly done. Also, Kubernetes API is fairly stable(existing APIs preserve backwards compatibility) and predictable in terms of deprecating existing APIs.

That being said, glad to know the team has benefited from ditching Kubernetes. Just keep in mind that this "you don't need ${TECHNOLOGY_NAME} and here is why" is oftentimes an emotional generalisation of someone's particular experience and cannot be applied as the universal rule.

mrkeen 2 months ago

> DevOps Team Is Happier Than Ever

Of course they are. The original value proposition of cloud providers managing your infra (and more so with k8s) was that you could fire your ops team (now called "DevOps" because the whole idea didn't pan out) and the developers could manage their services directly.

In any case, your DevOps team has job security now.

adamtulinius 2 months ago

It doesn't take any more competent people to self-host a modern stack, than it does to babysit how a company uses something like Azure.

The original value proposition is false, and more and more are realising this.

karmarepellent 2 months ago

I think the value proposition holds when you are just getting started with your company and you happen to employ people that know their way around the hyperscaler cloud ecosystems.

But I agree that moving your own infra or outsourcing operations when you have managed to do it on your own for a while is most likely misguided. Speaking from experience, it introduces costs that cannot possibly be calculated before the fact, and thus it always ends up more complicated and costlier than the suits imagined.

In the past, when similar decisions were made, I always thought to myself: you could have just hired one more person bringing their own fresh perspective on what we are doing in order to improve our ops game.

wyclif 2 months ago

Oh, I've seen this before and it's true in an anecdotal sense for me. One reason why is that they always think of hiring an additional developer as a cost, never as savings.

flumpcakes 2 months ago

Hammer, meet nail. I currently work in a more traditional "ops" team, with our cloud infrastructure dictated by development (through contract hires at first, and now a new internal DevOps team). It's mind-boggling how poorly they run things. It goes so deep it's almost an issue at the product design stage. There's now a big project to move the responsibility back into our team because it's not fit for purpose.

I think an operations background gives you a strong ability to smell nonsense and insecurity. The DevOps team seems to be people who want to be 'developers' rather than people who care about 'ops'. YAML slinging without thinking about what the YAML actually means.

ninininino 2 months ago

A cluster per service sounds a bit like having an aircraft carrier for each aircraft.

scirob 2 months ago

The price comparison doesn't make sense if they used to have a multi-cloud system and now it's just AWS. Makes me fear this is just content paid for by AWS. Actually getting multi-cloud to work is a huge achievement, and I would be super interested to hear of another tech standard that would make that easier.

Also: post-paywall mirror https://archive.is/x9tB6

rob_c 2 months ago

How did your managers ever _ever_ sign off on something that cost an extra $0.5M?

Either you're pre-profit or some other bogus entity, or your company streamlined by moving to k8s and then streamlined further by cutting away things you don't need.

I'm frankly just alarmed at the thought of wasting that much revenue. I could bring up a fleet of in-house racks for that money!

dvektor 2 months ago

I feel like the Medium paywall saved me... as soon as I saw "47 clusters across 3 different cloud providers", I began to think that the tool used here might not actually be the real issue.

jimberlage 2 months ago

> We were managing 47 Kubernetes clusters across three cloud providers.

What a doozy of a start to this article. How do you even reach this point?

denysvitali 2 months ago

Skill issue

geuis 2 months ago

Oh boy. Please, please stop using Medium for anything. I have lost count of how many potentially interesting or informative articles are published behind the Medium sign-in wall. At least for me, if you aren't publishing blog articles in public then what's the point of me trying to read them?

throwaway2037 2 months ago

    > "Create an account to read the full story."
Why is this required?

throwaway92t2 2 months ago

They are also misleading users with that message. I signed up, but then I got a message that I have to upgrade to a paid account...

If they had told me that from the start, I would not have bothered creating an account at all. The message implies that a free account would be enough.

dspillett 2 months ago

This is why I always sign up for something the first time with fake data and a throw-away email address (I have a catch-all sub-domain, which is no longer off my main domain, for that). If it turns out the site is something I might care to return to, I might then sign up with real details or edit the initial account; if not, the email address given to them gets forwarded to /dev/null for when the inevitable spam starts arriving. I'm thinking of seeing if the wildcarding/catch-all can be applied at the subdomain level, so I can fully null-route dead addresses in DNS and not even have connection attempts related to the spam.

zigman1 2 months ago

Enjoy all the mail that you will receive now

suzzer99 2 months ago

Yeah that's a BS dark pattern that drops my opinion of the site to zero.

spiderfarmer 2 months ago

It's awkward to write that on this platform, but: venture capital.

klysm 2 months ago

Money

0xEF 2 months ago

Medium likes to sell data.

whereismyacc 2 months ago

I believe the blog poster has an option to toggle this. Maybe the default behaves like this?

takladev 2 months ago

Medium! Can't Readium

tmnvix 2 months ago

I often see articles posted simultaneously to Medium and the author's own site. I imagine it helps with visibility.

It would be nice if there was a common practice of including a link to the alternative version as the first line of a Medium article.

negatic 2 months ago

word

amne 2 months ago

Don't bother reading. This is just another garbage-in, garbage-out kind of article written by something that ends in "gpt". Information density approaches zero in this one.