2024-11-26

Fly.io outage – resolved

Nature and impact of the outage

The fly.io website, dashboard, and API became inaccessible; many users could not deploy, manage apps, or access databases.
Some apps stayed up; others saw 5–16 minutes or more of HTTP errors, and some reported complete unavailability.
A related outage affected Turso, whose login relies on the Fly API.
Fly staff in the thread describe this as a control-plane outage (API and deployments) rather than a full request-routing failure, though some users report that both their apps and the API were down.

Root causes and architecture

Discussion links this and earlier incidents to Fly’s custom global state systems (Consul → Corrosion) and complex HA/state-sync machinery.
A prior major incident involved an expired Consul root signing key and bidirectional TLS, forcing redeployment of certs fleet-wide and exposing other latent issues.
Some criticize rolling their own datastore/state system; others note this is driven by Fly’s unusual geo-distributed design.
There is debate over Fly’s machine/storage model. Initial docs suggested instances were tied to a single physical server, with backup-restore on failure; newer features support VM+volume migration and multi-region deployments, but the proxy and state layers remain potential single points of failure.

Reliability track record and expectations

Several users report this as one of many major Fly incidents, with some leaving after repeated outages or even data loss in a region.
Others say reliability has improved over the last year but that deployment/control-plane flakiness remains common.
There is pushback on claims of “99.99%” availability; one commenter notes that the official SLA only issues credits once availability falls below 99.9%, and that four-nines availability is extremely hard to achieve.
Some argue PaaS/IaaS “can’t go down” for certain B2B use cases and would only trust hyperscalers; others counter that all major clouds have multi-hour incidents and customers ultimately tolerate some downtime.
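
To put the 99.9% vs. 99.99% debate in concrete terms, here is a minimal sketch of the downtime budget each figure implies. The function name and the 30-day and 365-day periods are assumptions for illustration; they are not taken from Fly's SLA, which may define its measurement window differently.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime allowed in a period at a given availability percentage.

    Hypothetical helper for illustration only; period_days defaults to an
    assumed 30-day month.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        monthly = downtime_budget_minutes(pct)            # assumed 30-day month
        yearly = downtime_budget_minutes(pct, 365.0)      # assumed 365-day year
        print(f"{pct}%: {monthly:.2f} min/month, {yearly:.2f} min/year")
```

Under these assumptions, three nines allows roughly 43 minutes of downtime per 30-day month and four nines roughly 4.3 minutes, so a single incident with the 5–16 minutes of HTTP errors that some users reported would already exceed a monthly four-nines budget for the affected apps.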

Value proposition vs. alternatives

Pro-Fly points: very easy Docker-based deployment (“new Heroku” feel), built-in global distribution and autoscaling, small microVM pricing, and strong developer experience. Good fit for hobby projects and low-latency edge-style apps.
Skeptical views: edge compute is premature optimization for most projects; repeated outages and control-plane issues negate the platform’s appeal for production use.
Multiple people report migrating to DigitalOcean App Platform, Cloudflare Workers (and upcoming CF container platform), Railway, or plain VPS providers; DO and CF are praised for stability, Railway for responsive support (though some saw early control-panel issues).
Others note even those alternatives sit on underlying clouds and/or their own hardware, so no provider is free of outages.

Operational practices and communication

Some observe a pattern of outages near major holidays and advocate change freezes; others argue many failures come from config/infra changes that can’t be fully frozen.
Fly’s infra blog and detailed postmortems are widely noted and generally praised for transparency, though some see a tension between “no tech debt” rhetoric and recurring complex failures.
A Fly representative states openly that more deployment-blocking incidents will occur given the difficulty of what they’re building, and that users who must strictly maximize reliability should prefer hyperscalers.
