Fly.io outage – resolved

Nature and impact of the outage

  • The fly.io website, dashboard, and API became inaccessible; many users could not deploy, manage apps, or access databases.
  • Some apps stayed up; others saw 5–16 minutes or more of HTTP errors, and some reported complete unavailability.
  • A related outage affected Turso, whose login relies on the Fly API.
  • Fly staff in the thread describe this as a control-plane / API / deployments outage rather than a full request-routing failure, though other users report that both their apps and the API were down.

Root causes and architecture

  • The discussion links this and earlier incidents to Fly’s custom global state systems (Consul, later Corrosion) and complex HA/state-sync machinery.
  • A prior major incident involved an expired Consul root signing key and bidirectional TLS, forcing redeployment of certs fleet-wide and exposing other latent issues.
  • Some criticize Fly for rolling its own datastore/state system; others note this is driven by Fly’s unusual geo-distributed design.
  • There is debate over Fly’s machine/storage model: initial docs suggested instances tied to a single physical server with backup-restore on failure; newer features support VM+volume migration and multi-region deployments, but the proxy/state layers remain potential single points of failure.
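
The expired-signing-key incident above is the kind of failure an expiry check can flag ahead of time. The following is a minimal sketch, not a description of Fly's tooling: the function names and the 30-day warning window are hypothetical, and the `notAfter` string is assumed to be in the format Python's `ssl` module reports for certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the "Mon DD HH:MM:SS YYYY GMT" form that the ssl
    module reports, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400.0


def needs_rotation(not_after: str, now: datetime, warn_days: float = 30.0) -> bool:
    """True once the certificate is inside the (hypothetical) warning window."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = "Jan  5 09:34:43 2018 GMT"  # illustrative value only
    print(needs_rotation(example, now))
```

A check like this only helps if it covers root and intermediate signing keys as well as leaf certificates, since the incident described involved a root key.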

Reliability track record and expectations

  • Several users report this as one of many major Fly incidents, with some leaving after repeated outages or even data loss in a region.
  • Others say reliability has improved over the last year but that deployment/control-plane flakiness remains common.
  • There is pushback on claims of “99.99%” availability; one commenter notes that the official SLA credits only start once availability falls below 99.9%, and that four-nines availability is extremely hard to achieve.
  • Some argue PaaS/IaaS “can’t go down” for certain B2B use cases and would only trust hyperscalers; others counter that all major clouds have multi-hour incidents and customers ultimately tolerate some downtime.
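
To make the "four nines" debate concrete, the availability percentages above can be converted into a downtime budget. This is a rough sketch: the function name and the 30-day period are assumptions chosen for illustration, not terms taken from any provider's SLA.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime allowed in `period_days` at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        budget = downtime_budget_minutes(pct)
        print(f"{pct}% over 30 days allows about {budget:.2f} minutes of downtime")
```

Under these assumptions, 99.99% leaves roughly 4.3 minutes per 30 days and 99.9% roughly 43 minutes, so a single 16-minute run of HTTP errors like the one reported in this thread would exceed a four-nines budget for the month while staying within three nines.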

Value proposition vs. alternatives

  • Pro-Fly points: very easy Docker-based deployment (“new Heroku” feel), built-in global distribution and autoscaling, small microVM pricing, and strong developer experience. Supporters see it as a good fit for hobby projects and low-latency edge-style apps.
  • Skeptical views: edge compute is a premature optimization for most workloads; repeated outages and control-plane issues negate the platform’s appeal for production use.
  • Multiple people report migrating to DigitalOcean App Platform, Cloudflare Workers (and upcoming CF container platform), Railway, or plain VPS providers; DO and CF are praised for stability, Railway for responsive support (though some saw early control-panel issues).
  • Others note even those alternatives sit on underlying clouds and/or their own hardware, so no provider is free of outages.

Operational practices and communication

  • Some observe a pattern of outages near major holidays and advocate change freezes; others argue many failures come from config/infra changes that can’t be fully frozen.
  • Fly’s infra blog and detailed postmortems are widely noted and generally praised for transparency, though some see a tension between “no tech debt” rhetoric and recurring complex failures.
  • A Fly representative states openly that more deployment-blocking incidents will occur given the difficulty of what they’re building, and that users who must strictly maximize reliability should prefer hyperscalers.