I'm grateful to HN for keeping me well aware of Fly's issues. I'll never use them.
It's still 99.99+% SLA? Would you really pay 100% more for <0.01% more uptime?
No dog in this fight, all props to the Fly.io team for having the gumption to do what they are doing, I genuinely hope they are successful...
> It's still 99.99+% SLA
But this is simply not accurate. 99.99% uptime allows just under 53 minutes of downtime annually (about 52.6 minutes). They apparently blew well through that today; it looks like they burned through roughly four years' worth of a four-nines downtime budget this evening.
Four nines is so unforgiving that if people are required to be in the loop at any point during an incident, you will almost certainly blow the fourth nine for the whole year in that single incident.
Again, I know it's hard. I would not want to be in the space. That fourth nine is really difficult to earn.
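For anyone who wants to sanity-check the error-budget math, here's a quick back-of-the-envelope sketch (plain Python; assumes a 365-day year, and the ~3.5-hour outage length is just an assumption for illustration):

    # Allowed downtime per year at a given availability target (365-day year)
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    def downtime_budget_minutes(availability):
        return MINUTES_PER_YEAR * (1 - availability)

    for label, target in [("three nines", 0.999),
                          ("four nines", 0.9999),
                          ("five nines", 0.99999)]:
        print(f"{label}: {downtime_budget_minutes(target):.1f} min/year")
    # three nines: 525.6, four nines: 52.6, five nines: 5.3 min/year

    # Assuming (for illustration) an outage of roughly 3.5 hours:
    print(210 / downtime_budget_minutes(0.9999))  # ~4.0 years of four-nines budget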
In the meanwhile, <hugops> to the Fly team as they work to resolve this (and hopefully get some rest).
A 99.99+% SLA typically means you get some billing credits when availability drops below 99.99%. So technically you do get a "99.99+% SLA", but you don't necessarily get 99.99+% availability.
Other circles use "SLO" (where the O stands for objective).
(Anyone know what the details in fly.io SLA are?)
Answering myself, https://fly.io/legal/sla-uptime/ says you get some credits for under 99.9% uptime "provided that Customer reports to Fly.io such failure to meet the Uptime Commitment". So at least currently there's no talk of 99.99%.
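To make the SLA-vs-availability point concrete, here's a rough sketch of how this kind of credit scheme usually works; the 99.9% commitment mirrors the linked page, but the credit tiers and percentages below are made up for illustration and are not Fly.io's actual terms:

    # Illustrative only: hypothetical SLA credit calculation.
    MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

    def monthly_uptime(downtime_minutes):
        return 1 - downtime_minutes / MINUTES_PER_MONTH

    def credit_percent(uptime, reported_by_customer):
        # No report, no credit: the "provided that Customer reports" clause.
        if not reported_by_customer:
            return 0
        if uptime >= 0.999:   # commitment met, nothing owed
            return 0
        elif uptime >= 0.99:  # hypothetical tier
            return 10
        else:                 # hypothetical tier
            return 25

    uptime = monthly_uptime(240)  # ~4 hours of downtime in a month
    print(f"{uptime:.4%}", credit_percent(uptime, reported_by_customer=True))
    # -> 99.4444% 10, i.e. a partial bill credit, not the availability itself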
You are correct in the legal/technical sense!
Technically, anyone could offer five- or six-nines and just depend on most customers not to claim the credits :-D
Actually hitting/exceeding four nines is still tough.
My app didn't go down yesterday; the downtime was limited to the internal API and some specific regions.
You say that like it's their only issue.
Earlier in the year they had a catastrophic outage in LHR and we lost all our data. Yes, that's partly on me, I'm aware. Still, it's a hard nope from me; we migrated.
I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
Examples include basically any PaaS, IaaS, or any company that provides a mission-critical service to another company (B2B SaaS).
If you run a basic B2C CRUD app, maybe it’s not a big deal if your service goes down for 5 minutes. Unfortunately there are quite a few categories of companies where downtime simply isn’t tolerated by customers. (I operate a company with a “zero downtime” expectation from customers - it’s no joke, and I would never use any infrastructure abstraction layer other than AWS, GCP or Azure - preferably AWS us-east-1 because, well, if you know the joke…)
> I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
I refuse to believe that this category still exists, when I need to keep my county's alternate number for 911 in my address book because CenturyLink had a 6-hour outage in 2014 and a two-day outage in 2018. If the phone company can't manage to keep 911 running anymore, I'd be very surprised to learn what actually has zero downtime over a ten-year period.
Personally, nine nines is too hard, so I shoot for eight eights.
My experience with very large-scale B2B SaaS and PaaS has been that customers like to collect money, if the contract allows it, by complaining about outages, but that overall, B2B SaaS is actually very forgiving.
Most B2B SaaS solutions have very long sales cycles and a high total cost to implement, so there is a lot of inertia to switching that “a few annoying hours of downtime a year” isn’t going to cover. Also, the metric that will drive churn isn’t actually zero downtime, it’s “nearest competitor’s downtime,” which is usually a very different number.
Every PaaS and IaaS I’ve ever used has had some amount of downtime, often considerably more than 5 minutes, and I’ve run production services on many of them. Plenty of random issues on major cloud providers as well. Certainly plenty of situations with dozens of Twitter posts happening but never any acknowledgement on the AWS status page. Nothing’s perfect.
Yeah, when running services where 5 minutes of downtime results in lots of support tickets, you learn to accept that incidents will happen and to manage them, rather than relying on them never occurring.
If your app cannot ever go down, then you cannot use a cloud provider either (because even AWS and Azure do fail sometimes; just search HN for "Azure down").
But the truth is everybody can tolerate some level of outage, simply because nobody has the budget to provision infrastructure that can never fail.
I’ve seen a team try to be truly “multi-cloud” but end up with a Frankenstein architecture where, instead of being able to weather one cloud going down, their app would die if _any_ cloud had an issue. It was also surprisingly hard to convince people that it doesn’t matter how many globally distributed clusters you have if all your data is in us-east.
You realize that none of the services you mention can give you zero downtime; they would never even advertise that. They have quite good reliability, certainly, but on a long enough time horizon absolutely no one has zero downtime.
All of your examples have gone down multiple times, some for multiple days (the 2011 AWS outage was the first really long one, I think) - or, potentially worse, have deleted customer data permanently and irretrievably.
Meaning that, empirically, downtime seems to be tolerated by their customers up to a point?