Corrosion
Blog readability and presentation
- Several readers report the article text rendering as gray and/or bold, calling it low-contrast or “unreadable” on some Safari/iOS setups.
- Others say it looks normal, suggesting a variable-font or CSS override issue; one points to a .text-gray-600 rule possibly not being overridden when JS/CSS fail.
- Suggestions include using browser reader/article mode. Some expect a “public cloud” vendor blog to be readable without such workarounds.
- There’s also a small meta-thread about dates: people want the publication date clearly at the top, not only as a “last updated” line at the bottom.
- Writing style divides readers: some love the vivid metaphors and vocabulary; others find it needlessly ornate.
Global state, routing, and consensus
- A recurring theme is that “instant” global database-style consensus for routing state is not workable at Fly’s scale.
- Commenters explore alternative patterns: Envoy xDS with etcd + watches, cluster-level health checking, gossip-based systems, and DNS/DNS++-style discovery.
- There’s debate over whether DNS is fundamentally inadequate or actually a robust, cheap service-discovery layer with failures mainly in the control plane and reconciliation logic.
- One key distinction: noticing an instance is down is easy; reliably notifying every proxy worldwide, within tight latency bounds, is not.
- Suggestions to reduce blast radius via sharding, cells, or shuffle sharding conflict with Fly’s stated premise: any edge proxy must be able to route to any customer app globally.
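For readers unfamiliar with the shuffle-sharding suggestion, here is a minimal sketch of the idea: each customer is deterministically assigned a small subset of nodes, so a bad customer or bad state only touches that subset. The function names, the node numbering, and the hash-ranking approach (a rendezvous-hashing style selection) are illustrative assumptions, not anything described in the thread or used by Fly. It also shows why the idea clashes with the any-to-any premise: a proxy outside the shard would hold no state for that customer.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: score a (customer, node) pair with a hash.
/// DefaultHasher::new() uses fixed keys, so results are stable within one build.
fn score(customer: &str, node: u32) -> u64 {
    let mut h = DefaultHasher::new();
    customer.hash(&mut h);
    node.hash(&mut h);
    h.finish()
}

/// Pick `shard_size` nodes out of `node_count` for a customer by ranking
/// all nodes by their per-customer score and keeping the lowest ones.
pub fn shuffle_shard(customer: &str, node_count: u32, shard_size: usize) -> Vec<u32> {
    let mut nodes: Vec<u32> = (0..node_count).collect();
    nodes.sort_by_key(|&n| score(customer, n));
    nodes.truncate(shard_size);
    nodes.sort_unstable(); // stable presentation order
    nodes
}
```

Calling shuffle_shard("customer-a", 8, 2) returns two node ids in 0..8, and the same two on every call in the same build. Different customers usually get different pairs, which is what limits how far one customer's failure can spread.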
Corrosion, SWIM/gossip, and regionalization
- Corrosion is described as a SWIM-like gossip layer plus cr-sqlite CRDT replication to maintain a distributed routing “working set.”
- The current direction is regionalization: per-region Corrosion clusters with fine-grained machine state, and a lean global cluster that only maps apps to regions, mainly to limit blast radius of bugs.
- Some discuss how far SWIM/gossip can scale: estimates range up to millions or more members, with practical limits driven more by change frequency and blast radius than raw node count.
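The regionalization split described above can be pictured as a two-step lookup: a lean global index answers "which regions run this app?", and only the per-region state knows about individual machines. The sketch below is a plain in-memory model under that assumption; the type names (GlobalIndex, RegionState, Machine) and the candidates function are hypothetical and do not reflect Corrosion's actual schema or API.

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Machine {
    pub addr: String,
    pub healthy: bool,
}

/// Lean global state (assumed shape): app name -> regions that run it.
#[derive(Default)]
pub struct GlobalIndex {
    pub app_regions: HashMap<String, BTreeSet<String>>,
}

/// Fine-grained per-region state (assumed shape): app name -> machines.
#[derive(Default)]
pub struct RegionState {
    pub machines: HashMap<String, Vec<Machine>>,
}

/// Resolve an app to (region, address) pairs for its healthy machines.
/// Regions whose state is not available locally are skipped.
pub fn candidates(
    global: &GlobalIndex,
    regions: &HashMap<String, RegionState>,
    app: &str,
) -> Vec<(String, String)> {
    let mut out = Vec::new();
    let Some(app_regions) = global.app_regions.get(app) else {
        return out;
    };
    for region in app_regions {
        let Some(state) = regions.get(region) else {
            continue;
        };
        if let Some(ms) = state.machines.get(app) {
            for m in ms.iter().filter(|m| m.healthy) {
                out.push((region.clone(), m.addr.clone()));
            }
        }
    }
    out
}
```

In this model the global index changes only when an app gains or loses a region, while machine-level churn stays inside one region's state, which is the blast-radius argument made in the thread.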
SQLite, CRDTs, and correctness concerns
- Readers are intrigued to see cr-sqlite used in production but probe its behavior: nullable column “backfill” is really clock metadata, not data writes, and is noted as an optimization opportunity.
- A critical thread argues that cr-sqlite’s use of column versions isn’t a proper logical (Lamport) clock and that its conflict-resolution semantics are suspect.
- Others dislike doing CRDTs inside SQL with a last-writer-wins bias, calling it overkill versus simpler Postgres patterns for this specific routing-state problem.
- There’s also a side question whether Corrosion/cr-sqlite could act as a multi-writer alternative to tools like litestream (no clear answer in-thread).
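To make the column-version debate concrete, here is a simplified last-writer-wins merge for a single column. It is a sketch of the general technique only, not cr-sqlite's actual algorithm: the Cell type, its field names, and the tie-break order (version, then site id, then value) are assumptions chosen for illustration. Note that col_version here is just a per-column counter, which is the kind of clock the critical thread says falls short of a proper Lamport clock.

```rust
/// One column value plus the metadata used to resolve conflicts (assumed shape).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Cell {
    pub value: String,
    pub col_version: u64, // per-column counter, bumped on each local write
    pub site_id: u64,     // id of the node that made the write
}

/// Last-writer-wins merge: the higher col_version wins; ties fall back to
/// site_id and then value so every replica picks the same winner.
pub fn merge(a: &Cell, b: &Cell) -> Cell {
    let key_a = (a.col_version, a.site_id, &a.value);
    let key_b = (b.col_version, b.site_id, &b.value);
    if key_b > key_a {
        b.clone()
    } else {
        a.clone()
    }
}
```

Because the winner is the maximum under a total order, the merge is commutative, associative, and idempotent, so replicas converge regardless of delivery order. What it cannot tell you is whether the "winning" write was causally later, which is where the concern about conflict-resolution semantics comes from.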
Rust bug and language evolution
- One outage involved an if let expression holding an RwLock guard longer than intended, causing a contagious deadlock when the else-branch assumed the lock had been released.
- Commenters note that the Rust 2024 edition changes temporary lifetimes for if let, which would have prevented this specific pattern; there’s some back-and-forth on when that edition became available relative to the incident.
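The lock-guard pattern discussed above can be sketched as follows. This is not the code from the incident: the Routes type and get_or_insert method are hypothetical, and std::sync::RwLock stands in for whatever lock was actually used. The risky shape appears only in a comment; the function itself uses a form that releases the read guard before the fallback path runs, whichever edition is in use.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical routing table guarded by a read-write lock.
pub struct Routes {
    inner: RwLock<HashMap<String, String>>,
}

impl Routes {
    pub fn new() -> Self {
        Routes { inner: RwLock::new(HashMap::new()) }
    }

    // Risky shape (editions before 2024), shown for comparison only:
    //
    //   if let Some(addr) = self.inner.read().unwrap().get(app).cloned() {
    //       addr
    //   } else {
    //       // The temporary read guard is still alive here, so taking
    //       // the write lock on the same thread can deadlock.
    //       self.inner.write().unwrap().insert(/* ... */);
    //   }

    /// Look up an app's address, inserting `default_addr` if it is missing.
    pub fn get_or_insert(&self, app: &str, default_addr: &str) -> String {
        // The read guard is a temporary of this `let` statement, so it is
        // dropped at the semicolon, before either match arm runs.
        let found = self.inner.read().unwrap().get(app).cloned();
        match found {
            Some(addr) => addr,
            None => {
                let mut map = self.inner.write().unwrap();
                // Another writer may have raced us; keep an existing entry.
                map.entry(app.to_string())
                    .or_insert_with(|| default_addr.to_string())
                    .clone()
            }
        }
    }
}
```

Binding the lookup result to a local makes the guard's lifetime explicit instead of depending on edition-specific temporary rules, at the cost of one extra line.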
Product maturity, trust, and expectations
- Some commenters argue that repeated issues around service discovery, certificate expiry, and distributed state suggest a “move fast and learn in production” mindset that’s risky for a public cloud.
- They contend such a provider should enter the market only after having a robust, validated design for global routing and automation of basics like cert renewal.
- Fly’s responses emphasize that global any-to-any routing is the core premise of the platform, not a bolt-on differentiator; the complexity and nonstandard solutions follow from that premise.
Other tools and side discussions
- rqlite is raised as a possible way to achieve a fault-tolerant SQLite-based system; its creator chimes in with a couple of production references.
- Some criticize the perceived “obsession” with SQLite/CRDT, suggesting a traditional Postgres deployment and even specialized networking hardware (FPGAs), though concrete benefits for this problem are not clearly articulated.
- There are minor threads about name collisions with another project named “Corrosion” and general curiosity about how very large gossip systems are run in practice.