No surprise. About a year ago, I looked at fly.io because of its low pricing, and I wondered where they were cutting corners to still make money. Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies. Not sure if that part is still in the official documentation.
In practice, that means if a server goes down, they have to load the instance's last snapshot from backup, push it onto a new server, update the network path, and pray that no more servers fail than there is spare capacity for. Otherwise you have to wait for a restore until the datacenter has mounted a few more boxes in the rack.
That explains quite a bit of the randomness in those outage reports, i.e. "my app is down but the other one is fine" and "mine came back in 5 minutes while the other took forever."
As a business on a budget, I think almost anything else, e.g. a small Civo cluster, serves you better.
Fly.io can migrate VM+volume now: https://fly.io/docs/reference/machine-migration/ / https://archive.md/rAK0V
> a fly instance is hardwired to one physical server and thus cannot fail over
I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
They mean the storage part. If your VM's storage (its state) is on one server and that server dies, you have to restore from backup. If your VM's storage is on remote shared storage mounted to that server and the server dies, your VM can be restarted on any other server that has access to that shared storage.
In AWS land it's the difference between instance store (local to a server) and EBS (remote, attached locally).
There's a tradeoff in that shared storage will be slightly slower due to having to traverse networking, and it's harder to manage properly; but the reliability gain is massive.
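To make the distinction concrete, here's a toy Python model (hypothetical names, not any cloud provider's real API) of what each storage mode can recover after the host dies:

```python
# Toy model of host failure with local vs. shared (network-attached) storage.
# Not a real cloud API; just illustrates what state survives the host.

class LocalStorageVM:
    """State lives on the host's own disk; losing the host loses live state."""
    def __init__(self, data, snapshot):
        self.data = data          # live state on the host's local disk
        self.snapshot = snapshot  # last backup, possibly stale

    def host_dies(self):
        # Live state is gone; all we can recover is the last snapshot.
        return LocalStorageVM(self.snapshot, self.snapshot)

class SharedStorageVM:
    """State lives on remote shared storage; a new host just reattaches it."""
    def __init__(self, volume):
        self.volume = volume      # e.g. an EBS-style network-attached volume

    def host_dies(self):
        # The volume survives the host; restart elsewhere with current state.
        return SharedStorageVM(self.volume)

local = LocalStorageVM(data={"orders": 120}, snapshot={"orders": 100})
shared = SharedStorageVM(volume={"orders": 120})

print(local.host_dies().data)    # stale: writes since the snapshot are lost
print(shared.host_dies().volume) # current: only downtime, no data loss
```

The shared-storage path pays for that durability on every write, since each one traverses the network before it is acknowledged.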
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node.
You will have downtime, but it will be limited.
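The scheduler idea can be sketched in a few lines of Python (hypothetical names; real orchestrators like Kubernetes or Nomad do this with health checks, capacity scoring, and reconciliation loops):

```python
# Minimal sketch of scheduler-driven failover: workloads assigned to dead
# nodes get respawned on a healthy node. The respawn is the brief downtime.

def reschedule(assignments, healthy_nodes):
    """assignments: workload -> node. Returns a new mapping with workloads
    moved off nodes that are no longer healthy."""
    new_assignments = {}
    for workload, node in assignments.items():
        if node in healthy_nodes:
            new_assignments[workload] = node        # node is fine; leave it
        elif healthy_nodes:
            new_assignments[workload] = healthy_nodes[0]  # naive: respawn on first healthy node
        # else: no capacity anywhere; workload stays down until nodes return
    return new_assignments

assignments = {"web": "node-a", "db": "node-b"}
# node-a has died; node-b and node-c are still up:
print(reschedule(assignments, ["node-b", "node-c"]))
# -> {'web': 'node-b', 'db': 'node-b'}
```

A real scheduler would pick the target node by free capacity rather than "first healthy node", but the failover mechanism is the same shape.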
> so if one goes down ... just spun up on another
On Fly, one can absolutely set this up. Multiple ways: https://fly.io/docs/apps/app-availability / https://archive.md/SJ32K
> Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies.
The majority of EC2 instance types did not have live migration until very recently. Some probably still don't (AWS doesn't really spell out how and when it's supposed to work). It's also not free: there's a noticeable brown-out when your VM gets migrated on GCP, for example.
Can you shed some more light on this "browning out" phenomenon?
Here's the GCP doc [1]. Other live migration products are similar.
Generally, you get worse performance while the VM is preparing to move, then an actual pause, then worse performance again as the move finishes up. Depending on the networking setup, some inbound packets may be lost or delayed.
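The brown-out comes from pre-copy migration: memory is copied in rounds while the guest keeps running (and keeps dirtying pages), and only the final small dirty set is copied during a stop-the-world pause. A toy model (made-up numbers, not GCP's actual algorithm):

```python
# Toy pre-copy live-migration model. Each round copies the current dirty
# pages while the VM runs (degraded performance = the "brown-out"); the
# guest re-dirties a fraction of them, so rounds shrink until the remainder
# is small enough to copy during a brief pause.

def precopy_rounds(total_pages, dirty_fraction, pause_threshold, max_rounds=10):
    """Return pages copied per round; the last entry is copied while paused."""
    rounds = []
    dirty = total_pages
    while dirty > pause_threshold and len(rounds) < max_rounds:
        rounds.append(dirty)                 # copied while the VM runs
        dirty = int(dirty * dirty_fraction)  # pages re-dirtied meanwhile
    rounds.append(dirty)                     # copied during the final pause
    return rounds

# 1000 pages, 20% re-dirtied each round, pause once <= 10 pages remain:
print(precopy_rounds(1000, 0.2, 10))  # -> [1000, 200, 40, 8]
```

The brown-out window is the sum of the running rounds; the pause only has to cover the last, small copy. A write-heavy guest (high `dirty_fraction`) converges slowly, which is why migration impact varies by workload.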
[1] https://cloud.google.com/compute/docs/instances/live-migrati...
If you want HA on Fly you need to deploy an app to multiple regions (multiple machines).
Fly might still go down completely if their proxy layer fails, but that's much less common.
The status page tells a story of a high-availability/clustering system failure, so I think in this case the problem is the complexity of the HA machinery hurting the system's availability, versus something like a simple VPS.