I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
Examples include basically any PaaS, IaaS, or any company that provides a mission-critical service to another company (B2B SaaS).
If you run a basic B2C CRUD app, maybe it’s not a big deal if your service goes down for 5 minutes. Unfortunately there are quite a few categories of companies where downtime simply isn’t tolerated by customers. (I operate a company with a “zero downtime” expectation from customers - it’s no joke, and I would never use any infrastructure abstraction layer other than AWS, GCP or Azure - preferably AWS us-east-1 because, well, if you know the joke…)
> I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
I refuse to believe that this category still exists, when I need to keep my county's alternate number for 911 in my address book, because CenturyLink had a 6-hour outage in 2014 and a two-day outage in 2018. If the phone company can't manage to keep 911 running anymore, I'd be very surprised if anything has zero downtime over a ten-year period.
Personally, nine nines is too hard, so I shoot for eight eights.
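For a rough sense of scale, here is a minimal sketch of the arithmetic behind "N nines". It assumes a 365.25-day year, and the helper names `nines` and `downtime_seconds_per_year` are made up for illustration.

```python
# Rough downtime budget implied by an availability target.
# Assumption: a year is 365.25 days; helper names are illustrative only.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def nines(n: int) -> float:
    """Availability with n nines, e.g. nines(3) is 0.999."""
    return 1.0 - 10.0 ** (-n)


def downtime_seconds_per_year(availability: float) -> float:
    """Seconds per year a service may be down at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 9):
        budget = downtime_seconds_per_year(nines(n))
        print(f"{n} nines: about {budget:.3f} s of downtime per year")

    # "Eight eights" read literally as 88.888888% availability.
    eight_eights = 0.88888888
    days = downtime_seconds_per_year(eight_eights) / 86400
    print(f"eight eights: about {days:.1f} days of downtime per year")
```

Three nines works out to a bit under nine hours a year, nine nines to a few hundredths of a second, and eight eights to roughly forty days.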
My experience with very large scale B2B SaaS and PaaS has been that customers like to get money, if allowed by contract, by complaining about outages, but that overall, B2B SaaS is actually very forgiving.
Most B2B SaaS solutions have very long sales cycles and a high total cost to implement, so there is a lot of switching inertia that “a few annoying hours of downtime a year” isn’t going to overcome. Also, the metric that will drive churn isn’t actually zero downtime, it’s “nearest competitor’s downtime,” which is usually a very different number.
Every PaaS and IaaS I’ve ever used has had some amount of downtime, often considerably more than 5 minutes, and I’ve run production services on many of them. Plenty of random issues on major cloud providers as well. Certainly plenty of situations with dozens of Twitter posts happening but never any acknowledgement on the AWS status page. Nothing’s perfect.
Yeah, when running services where 5 minutes of downtime results in lots of support tickets, you learn to accept that the incident will happen and learn to manage the incident rather than relying on it never occurring.
If your app cannot go down ever, then you cannot use a cloud provider either (because even AWS and Azure do fail sometimes, just look up “Azure down” on HN).
But the truth is everybody can afford some level of outage, simply because nobody has the budget to provision an infra that can never fail.
I’ve seen a team try to be truly “multi-cloud” but end up with a Frankenstein architecture where, instead of being able to weather one cloud going down, their app would die if _any_ cloud had an issue. It was also surprisingly hard to convince people that it doesn’t matter how many globally distributed clusters you have if all your data is in us-east.
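The multi-cloud trap can be put in numbers with a minimal sketch. It assumes failures are independent (which real outages often are not) and uses a made-up 99.9% availability per provider; the function names are illustrative only.

```python
from math import prod

# Assumption: component failures are independent, which is optimistic.
# The 0.999 figures below are illustrative, not measured values.


def serial_availability(parts):
    """All parts must be up: the app dies if any one of them fails."""
    return prod(parts)


def redundant_availability(parts):
    """Any one part is enough: the app dies only if all of them fail."""
    return 1.0 - prod(1.0 - a for a in parts)


if __name__ == "__main__":
    clouds = [0.999, 0.999, 0.999]

    # "Frankenstein" multi-cloud: every cloud is a hard dependency.
    print("needs every cloud:", serial_availability(clouds))

    # Real failover: any single cloud can carry the load.
    print("needs any cloud:  ", redundant_availability(clouds))

    # Redundant compute in front of a single-region data store:
    # the data store caps the whole system.
    data_store = 0.999
    capped = serial_availability([redundant_availability(clouds), data_store])
    print("redundant compute, one data region:", capped)
```

Under these assumptions, depending on all three clouds is worse than using just one, and redundant clusters in front of a single-region data store end up no better than the data store itself.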
You realize all of those services you mention can't give you zero downtime; they would never even advertise that. They have quite good reliability, certainly, but on long enough time horizons absolutely no one has zero downtime.
All of your examples have had multiple cases of going down, some for multiple days (2011 AWS was the first really long one I think) - or potentially worse, just deleting all customer data permanently and irretrievably.
Meaning empirically, downtime seems to be tolerated by their customers up to some point?