No surprise. About a year ago, I looked at fly.io because of its low pricing, and I wondered where they were cutting corners to still make money. Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies. Not sure if that part is still in the official documentation.
In practice, that means if a server goes down, they have to load the instance's last snapshot from backup, push it onto a new server, update the network path, and pray that no more servers fail than there is spare capacity for. Otherwise you have to wait for a restore until the datacenter has mounted a few more boxes in the rack.
That explains quite a bit of the randomness in those outage reports, i.e. "my app is down but the other one is fine" and "mine came back in 5 minutes while the other took forever."
As a business on a budget, I think almost anything else, e.g. a small Civo cluster, serves you better.
Fly.io can migrate VM+volume now: https://fly.io/docs/reference/machine-migration/ / https://archive.md/rAK0V
> a fly instance is hardwired to one physical server and thus cannot fail over
I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
They mean the storage part. If your VM's storage (its state) is on one server and that server dies, you have to restore from backup. If your VM's storage is on remote shared storage mounted to that server and the server dies, your VM can be restarted on any other server that has access to that shared storage.
In AWS land it's the difference between instance store (local to a server) and EBS (remote, attached locally).
There's a tradeoff in that shared storage will be slightly slower due to having to traverse networking, and it's harder to manage properly; but the reliability gain is massive.
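To make the distinction concrete, here's a toy Python model (hypothetical names, not any cloud provider's real API) of what each storage mode can recover after the host dies:

```python
# Toy model of host failure with local vs. shared (network-attached) storage.
# Not a real cloud API; just illustrates what state survives the host.

class LocalStorageVM:
    """State lives on the host's own disk; losing the host loses live state."""
    def __init__(self, data, snapshot):
        self.data = data          # live state on the host's local disk
        self.snapshot = snapshot  # last backup, possibly stale

    def host_dies(self):
        # Live state is gone; all we can recover is the last snapshot.
        return LocalStorageVM(self.snapshot, self.snapshot)

class SharedStorageVM:
    """State lives on remote shared storage; a new host just reattaches it."""
    def __init__(self, volume):
        self.volume = volume      # e.g. an EBS-style network-attached volume

    def host_dies(self):
        # The volume survives the host; restart elsewhere with current state.
        return SharedStorageVM(self.volume)

local = LocalStorageVM(data={"orders": 120}, snapshot={"orders": 100})
shared = SharedStorageVM(volume={"orders": 120})

print(local.host_dies().data)    # stale: writes since the snapshot are lost
print(shared.host_dies().volume) # current: only downtime, no data loss
```

The shared-storage path pays for that durability on every write, since each one traverses the network before it is acknowledged.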
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node.
You will have downtime, but it will be limited.
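The scheduler idea can be sketched in a few lines of Python (hypothetical names; real orchestrators like Kubernetes or Nomad do this with health checks, capacity scoring, and reconciliation loops):

```python
# Minimal sketch of scheduler-driven failover: workloads assigned to dead
# nodes get respawned on a healthy node. The respawn is the brief downtime.

def reschedule(assignments, healthy_nodes):
    """assignments: workload -> node. Returns a new mapping with workloads
    moved off nodes that are no longer healthy."""
    new_assignments = {}
    for workload, node in assignments.items():
        if node in healthy_nodes:
            new_assignments[workload] = node        # node is fine; leave it
        elif healthy_nodes:
            new_assignments[workload] = healthy_nodes[0]  # naive: respawn on first healthy node
        # else: no capacity anywhere; workload stays down until nodes return
    return new_assignments

assignments = {"web": "node-a", "db": "node-b"}
# node-a has died; node-b and node-c are still up:
print(reschedule(assignments, ["node-b", "node-c"]))
# -> {'web': 'node-b', 'db': 'node-b'}
```

A real scheduler would pick the target node by free capacity rather than "first healthy node", but the failover mechanism is the same shape.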
> so if one goes down ... just spun up on another
On Fly, one can absolutely set this up. Multiple ways: https://fly.io/docs/apps/app-availability / https://archive.md/SJ32K
> Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies.
The majority of EC2 instance types did not have live migration until very recently. Some probably still don't (AWS doesn't really spell out how and when it's supposed to work). It's also not free: there's a noticeable brown-out when your VM gets migrated on GCP, for example.
Can you shed some more light on this "browning out" phenomenon?
Here's the GCP doc [1]. Other live migration products are similar.
Generally, you get worse performance while the VM is preparing to move, then an actual pause, then worse performance again as the move finishes up. Depending on the networking setup, some inbound packets may be lost or delayed.
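The brown-out comes from pre-copy migration: memory is copied in rounds while the guest keeps running (and keeps dirtying pages), and only the final small dirty set is copied during a stop-the-world pause. A toy model (made-up numbers, not GCP's actual algorithm):

```python
# Toy pre-copy live-migration model. Each round copies the current dirty
# pages while the VM runs (degraded performance = the "brown-out"); the
# guest re-dirties a fraction of them, so rounds shrink until the remainder
# is small enough to copy during a brief pause.

def precopy_rounds(total_pages, dirty_fraction, pause_threshold, max_rounds=10):
    """Return pages copied per round; the last entry is copied while paused."""
    rounds = []
    dirty = total_pages
    while dirty > pause_threshold and len(rounds) < max_rounds:
        rounds.append(dirty)                 # copied while the VM runs
        dirty = int(dirty * dirty_fraction)  # pages re-dirtied meanwhile
    rounds.append(dirty)                     # copied during the final pause
    return rounds

# 1000 pages, 20% re-dirtied each round, pause once <= 10 pages remain:
print(precopy_rounds(1000, 0.2, 10))  # -> [1000, 200, 40, 8]
```

The brown-out window is the sum of the running rounds; the pause only has to cover the last, small copy. A write-heavy guest (high `dirty_fraction`) converges slowly, which is why migration impact varies by workload.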
[1] https://cloud.google.com/compute/docs/instances/live-migrati...
If you want HA on Fly you need to deploy an app to multiple regions (multiple machines).
Fly might still go down completely if their proxy layer fails, but that's much less common.
The status page tells a story of a high-availability/clustering system failure, so I think in this case the problem is the complexity of the HA machinery hurting the system's availability, versus something like a simple VPS.