Corrosion

Blog readability and presentation

  • Several readers report the article text rendering as gray and/or bold, calling it low-contrast or “unreadable” on some Safari/iOS setups.
  • Others say it looks normal, which suggests a variable-font or CSS-override issue; one commenter points to a .text-gray-600 rule that may go un-overridden when JS or CSS fails to load.
  • Suggestions include using browser reader/article mode. Some expect a “public cloud” vendor blog to be readable without such workarounds.
  • There’s also a small meta-thread about dates: people want the publication date clearly at the top, not only as a “last updated” line at the bottom.
  • Writing style divides readers: some love the vivid metaphors and vocabulary; others find it needlessly ornate.

Global state, routing, and consensus

  • A recurring theme is that “instant” global database-style consensus for routing state is not workable at Fly’s scale.
  • Commenters explore alternative patterns: Envoy xDS with etcd + watches, cluster-level health checking, gossip-based systems, and DNS/DNS++-style discovery.
  • There’s debate over whether DNS is fundamentally inadequate or actually a robust, cheap service-discovery layer with failures mainly in the control plane and reconciliation logic.
  • One key distinction: noticing an instance is down is easy; reliably notifying every proxy worldwide, within tight latency bounds, is not.
  • Suggestions to reduce blast radius via sharding, cells, or shuffle sharding conflict with Fly’s stated premise: any edge proxy must be able to route to any customer app globally.
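
For readers unfamiliar with the shuffle-sharding suggestion, here is a minimal sketch of the idea, not anything Fly or the commenters posted. Each customer is deterministically assigned a small subset of proxies, so a bad customer or bad state only touches that subset. The function name `shuffle_shard`, the use of integer proxy indices, and ranking by a hash of (customer, proxy) are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: pick `k` of `n` proxies for a customer by ranking
/// every proxy on hash(customer, proxy) and keeping the `k` lowest scores.
/// `DefaultHasher::new()` is deterministic within one build, but its
/// algorithm is not guaranteed stable across Rust releases; a real system
/// would pin a specific hash function.
fn shuffle_shard(customer: &str, n: usize, k: usize) -> Vec<usize> {
    let mut scored: Vec<(u64, usize)> = (0..n)
        .map(|proxy| {
            let mut h = DefaultHasher::new();
            customer.hash(&mut h);
            proxy.hash(&mut h);
            (h.finish(), proxy)
        })
        .collect();
    scored.sort();
    // Keep the k best-ranked proxies, returned in ascending index order.
    let mut shard: Vec<usize> = scored.into_iter().take(k).map(|(_, p)| p).collect();
    shard.sort();
    shard
}
```

Calling `shuffle_shard("app-a", 16, 4)` always returns the same four proxy indices for that customer. The tension raised in the thread is visible here: the other twelve proxies would never need that customer's state, which is exactly what an any-proxy-routes-to-any-app premise rules out.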

Corrosion, SWIM/gossip, and regionalization

  • Corrosion is described as a SWIM-like gossip layer plus cr-sqlite CRDT replication to maintain a distributed routing “working set.”
  • The current direction is regionalization: per-region Corrosion clusters with fine-grained machine state, and a lean global cluster that only maps apps to regions, mainly to limit blast radius of bugs.
  • Some discuss how far SWIM/gossip can scale: estimates run to millions of members or more, with practical limits set more by change frequency and blast radius than by raw node count.
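
The regionalized layout described above can be pictured as a two-step lookup: a lean global index answers "which regions run this app?", and each region's own state answers "which machines are healthy?". The sketch below illustrates that shape only. The type names, field names, and the `route_candidates` function are hypothetical and are not taken from Corrosion.

```rust
use std::collections::HashMap;

/// Fine-grained machine state, kept only inside a region (hypothetical shape).
struct Machine {
    id: String,
    healthy: bool,
}

/// Lean global cluster: maps an app to the regions where it runs.
struct GlobalIndex {
    regions_by_app: HashMap<String, Vec<String>>,
}

/// Per-region cluster: maps an app to its machines in that region.
struct RegionState {
    machines_by_app: HashMap<String, Vec<Machine>>,
}

/// Resolve an app to (region, machine id) pairs for healthy machines.
/// Unknown apps or regions simply yield no candidates.
fn route_candidates(
    global: &GlobalIndex,
    regions: &HashMap<String, RegionState>,
    app: &str,
) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for region in global.regions_by_app.get(app).into_iter().flatten() {
        if let Some(state) = regions.get(region) {
            for m in state.machines_by_app.get(app).into_iter().flatten() {
                if m.healthy {
                    out.push((region.clone(), m.id.clone()));
                }
            }
        }
    }
    out
}
```

In this arrangement a bug that corrupts one `RegionState` stays inside that region, while the global index changes only when an app gains or loses a region, which is the blast-radius argument summarized above.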

SQLite, CRDTs, and correctness concerns

  • Readers are intrigued to see cr-sqlite used in production and probe its behavior: the nullable-column “backfill” turns out to write clock metadata rather than row data, and is noted as an optimization opportunity.
  • A critical thread argues that cr-sqlite’s use of column versions isn’t a proper logical (Lamport) clock and that its conflict-resolution semantics are suspect.
  • Others dislike doing CRDTs inside SQL with a last-writer-wins bias, calling it overkill versus simpler Postgres patterns for this specific routing-state problem.
  • There’s also a side question whether Corrosion/cr-sqlite could act as a multi-writer alternative to tools like litestream (no clear answer in-thread).
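
To make the last-writer-wins discussion concrete, here is a generic per-column LWW merge keyed on a version counter. It is a sketch under stated assumptions, not cr-sqlite's actual merge rule: the `Cell` type, the `site_id` tie-break, and the assumption that each writer has a unique `site_id` are all invented for illustration.

```rust
/// One replicated column value plus its merge metadata (hypothetical shape).
#[derive(Clone, Debug, PartialEq)]
struct Cell {
    value: String,
    /// Per-column counter, bumped by a writer on each local update.
    col_version: u64,
    /// Assumed unique per writer; used only to break version ties.
    site_id: u64,
}

/// Last-writer-wins merge: the higher (col_version, site_id) pair wins.
/// Because the comparison is a total order over distinct writers, the
/// result is the same whichever replica's copy arrives first.
fn merge(a: &Cell, b: &Cell) -> Cell {
    if (a.col_version, a.site_id) >= (b.col_version, b.site_id) {
        a.clone()
    } else {
        b.clone()
    }
}
```

The sketch shows why such a merge converges, and also where the critique in the thread bites: the winner is whichever copy carries the larger counter, and a counter like this says nothing by itself about which write causally followed the other.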

Rust bug and language evolution

  • One outage involved an if let expression holding an RwLock guard longer than intended, causing a contagious deadlock when the else-branch assumed the lock had been released.
  • Commenters note that the Rust 2024 edition changes temporary lifetimes for if let, which would have prevented this specific pattern; there’s some back-and-forth on when that edition became available relative to the incident.
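
The pattern is easy to reproduce in miniature. In editions before 2024, a guard created in the scrutinee of `if let` lives until the end of the whole `if`/`else`, so taking a write lock in the `else` branch on the same `RwLock` can deadlock. The sketch below shows one edition-independent way to avoid that, by finishing the read in its own statement. It uses `std::sync::RwLock` and a hypothetical `get_or_insert` helper; it is not the code from the incident.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical helper: return the cached value for `key`, inserting
/// `default` if it is missing.
fn get_or_insert(cache: &RwLock<HashMap<String, u32>>, key: &str, default: u32) -> u32 {
    // Risky form in editions before 2024 (shown as a comment only):
    //   if let Some(v) = cache.read().unwrap().get(key) { .. } else { cache.write() .. }
    // The read guard is a temporary of the scrutinee and would still be
    // held when the else-branch asks for the write lock.

    // Safe in every edition: the read guard is a temporary of this `let`
    // statement and is dropped at its semicolon, before any branch runs.
    let existing = cache.read().unwrap().get(key).copied();

    if let Some(v) = existing {
        v
    } else {
        let mut map = cache.write().unwrap();
        // Another thread may have inserted between the read and the write,
        // so use the entry API rather than overwriting blindly.
        let v = *map.entry(key.to_string()).or_insert(default);
        v
    }
}
```

Copying the value out with `.copied()` is what lets the guard go away early; if the value were borrowed from the map instead, the guard would have to stay alive for as long as the borrow did.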

Product maturity, trust, and expectations

  • Some commenters argue that repeated issues around service discovery, certificate expiry, and distributed state suggest a “move fast and learn in production” mindset that’s risky for a public cloud.
  • They contend such a provider should enter the market only after having a robust, validated design for global routing and automation of basics like cert renewal.
  • Fly’s responses emphasize that global any-to-any routing is the core premise of the platform, not a bolt-on differentiator; the complexity and nonstandard solutions follow from that premise.

Other tools and side discussions

  • rqlite is raised as a possible way to achieve a fault-tolerant SQLite-based system; its creator chimes in with a couple of production references.
  • Some criticize the perceived “obsession” with SQLite/CRDT, suggesting a traditional Postgres deployment and even specialized networking hardware (FPGAs), though concrete benefits for this problem are not clearly articulated.
  • There are minor threads about name collisions with another project named “Corrosion” and general curiosity about how very large gossip systems are run in practice.