2025-10-23

Corrosion

Blog readability and presentation

Several readers report the article text rendering as gray and/or bold, calling it low-contrast or “unreadable” on some Safari/iOS setups.
Others say it looks normal, suggesting a variable-font or CSS override issue; one points to a .text-gray-600 rule possibly not being overridden when JS/CSS fail.
Suggestions include using browser reader/article mode. Some expect a “public cloud” vendor blog to be readable without such workarounds.
There’s also a small meta-thread about dates: people want the publication date clearly at the top, not only as a “last updated” line at the bottom.
Writing style divides readers: some love the vivid metaphors and vocabulary; others find it needlessly ornate.

Global state, routing, and consensus

A recurring theme is that “instant” global database-style consensus for routing state is not workable at Fly’s scale.
Commenters explore alternative patterns: Envoy xDS with etcd + watches, cluster-level health checking, gossip-based systems, and DNS/DNS++-style discovery.
There’s debate over whether DNS is fundamentally inadequate or is actually a robust, cheap service-discovery layer whose failures arise mainly in the control plane and reconciliation logic around it.
One key distinction: noticing an instance is down is easy; reliably notifying every proxy worldwide, within tight latency bounds, is not.
Suggestions to reduce blast radius via sharding, cells, or shuffle sharding conflict with Fly’s stated premise: any edge proxy must be able to route to any customer app globally.
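
For readers unfamiliar with the term, shuffle sharding gives each customer a small pseudo-random subset of nodes, so that two customers rarely share their whole set. The following is only a minimal sketch of that idea, not anything Fly or the commenters posted; the function name `shuffle_shard` and its parameters are hypothetical, and hashing with the standard library's `DefaultHasher` is an assumption made to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick `shard_size` node indices (out of `nodes`) for one customer.
/// Hypothetical helper: each node is ranked by hash(customer, node)
/// and the lowest-ranked nodes form the customer's shard.
pub fn shuffle_shard(customer: &str, nodes: usize, shard_size: usize) -> Vec<usize> {
    let mut ranked: Vec<(u64, usize)> = (0..nodes)
        .map(|node| {
            let mut hasher = DefaultHasher::new();
            customer.hash(&mut hasher);
            node.hash(&mut hasher);
            (hasher.finish(), node)
        })
        .collect();
    // Sorting by hash gives a per-customer pseudo-random ordering of nodes.
    ranked.sort();
    let mut shard: Vec<usize> = ranked
        .into_iter()
        .take(shard_size)
        .map(|(_, node)| node)
        .collect();
    // Return indices in ascending order for readability.
    shard.sort();
    shard
}
```

The same customer always maps to the same shard within a given build, which is what limits blast radius. It is also exactly what the any-to-any premise rules out for routing state: every edge proxy still needs to know about every app.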

Corrosion, SWIM/gossip, and regionalization

Corrosion is described as a SWIM-like gossip layer plus cr-sqlite CRDT replication to maintain a distributed routing “working set.”
The current direction is regionalization: per-region Corrosion clusters with fine-grained machine state, and a lean global cluster that only maps apps to regions, mainly to limit blast radius of bugs.
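
The two-level layout described above can be pictured as a lookup in two steps: the global map answers "which regions host this app?", and each regional map answers "which machines?". This is a hedged sketch using plain in-memory maps; the function `resolve`, the map shapes, and the example region names are all illustrative assumptions, not Corrosion's schema.

```rust
use std::collections::HashMap;

/// Resolve an app to (region, machine) pairs.
/// `global`   : app name -> regions hosting it (the lean global cluster).
/// `regional` : region -> (app name -> machine addresses) (per-region clusters).
/// Both shapes are hypothetical stand-ins for replicated tables.
pub fn resolve<'a>(
    global: &'a HashMap<String, Vec<String>>,
    regional: &'a HashMap<String, HashMap<String, Vec<String>>>,
    app: &str,
) -> Vec<(&'a str, &'a str)> {
    let mut routes = Vec::new();
    if let Some(regions) = global.get(app) {
        for region in regions {
            // A region listed globally but unknown locally is simply skipped.
            if let Some(machines) = regional.get(region).and_then(|r| r.get(app)) {
                for machine in machines {
                    routes.push((region.as_str(), machine.as_str()));
                }
            }
        }
    }
    routes
}
```

In this shape, a bad update to fine-grained machine state can only corrupt one region's map, which is the blast-radius argument made in the thread.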
Some discuss how far SWIM/gossip can scale: estimates run to millions of members or more, with practical limits driven more by change frequency and blast radius than by raw node count.
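
A back-of-the-envelope calculation shows why raw node count is rarely the limit. In an idealized push-gossip model where every informed member tells `fanout` new members per round, the informed set grows by at most a factor of `fanout + 1` each round, so the round count grows only logarithmically. The sketch below computes that best-case lower bound; the function name and the model itself are simplifying assumptions, and real SWIM deployments lose rounds to duplicate deliveries and packet loss.

```rust
/// Best-case number of gossip rounds for one update to reach `members` nodes,
/// assuming each informed node informs `fanout` previously uninformed nodes
/// per round. Returns None when the update can never spread (fanout of zero).
pub fn min_gossip_rounds(members: u64, fanout: u64) -> Option<u32> {
    if members <= 1 {
        return Some(0);
    }
    if fanout == 0 {
        return None;
    }
    let mut informed: u64 = 1;
    let mut rounds: u32 = 0;
    while informed < members {
        // Saturating arithmetic avoids overflow for very large inputs.
        informed = informed.saturating_mul(fanout.saturating_add(1));
        rounds += 1;
    }
    Some(rounds)
}
```

With a fanout of 3, one million members are covered in 10 rounds under this model (4^10 = 1,048,576). That cost is paid per update, which is why the frequency of changes matters more than the size of the membership.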

SQLite, CRDTs, and correctness concerns

Readers are intrigued to see cr-sqlite used in production but probe its behavior: the nullable-column “backfill” is really clock metadata rather than data writes, and is noted as an optimization opportunity.
A critical thread argues that cr-sqlite’s use of column versions isn’t a proper logical (Lamport) clock and that its conflict-resolution semantics are suspect.
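
For reference, this is what a textbook Lamport clock looks like: a counter that ticks on every local event and jumps past any timestamp it receives. The sketch is the standard algorithm, not cr-sqlite's code, and the type and method names are illustrative; it is included only so the criticism has something concrete to be compared against.

```rust
/// Minimal Lamport logical clock (textbook form, hypothetical naming).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct LamportClock {
    time: u64,
}

impl LamportClock {
    pub fn new() -> Self {
        Self { time: 0 }
    }

    /// Advance for a local event or before sending a message.
    pub fn tick(&mut self) -> u64 {
        self.time += 1;
        self.time
    }

    /// Merge a timestamp received from a peer: take the max, then advance.
    pub fn observe(&mut self, remote: u64) -> u64 {
        self.time = self.time.max(remote) + 1;
        self.time
    }

    pub fn now(&self) -> u64 {
        self.time
    }
}
```

The property that matters is in `observe`: a node's clock moves past everything it has seen, so causally later events always carry larger timestamps. The critical thread's claim is that a per-column version counter does not by itself give that guarantee.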
Others dislike doing CRDTs inside SQL with a last-writer-wins bias, calling it overkill versus simpler Postgres patterns for this specific routing-state problem.
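
The “last-writer-wins bias” refers to a common CRDT building block, the LWW register: each value carries a version, and a merge keeps whichever side has the higher one. A minimal sketch follows, assuming a `(version, site)` pair as the ordering key with the site ID as tie-breaker; the struct, its fields, and the tie-break rule are illustrative and should not be read as cr-sqlite's actual merge rules.

```rust
/// One replicated cell with last-writer-wins semantics (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LwwCell {
    pub value: String,
    pub version: u64,
    pub site: u32,
}

impl LwwCell {
    /// Keep the copy with the larger (version, site) pair.
    /// Because the comparison is a total order, merging is commutative,
    /// associative, and idempotent, so replicas converge.
    pub fn merge(&mut self, other: &LwwCell) {
        if (other.version, other.site) > (self.version, self.site) {
            *self = other.clone();
        }
    }
}
```

Convergence comes at a price: when two sites write concurrently at the same version, one write is silently discarded. That is the behavior the skeptical commenters object to.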
There’s also a side question about whether Corrosion/cr-sqlite could act as a multi-writer alternative to tools like Litestream (no clear answer in-thread).

Rust bug and language evolution

One outage involved an if let expression holding an RwLock guard longer than intended, causing a contagious deadlock when the else-branch assumed the lock had been released.
Commenters note that the Rust 2024 edition changes temporary lifetimes for if let, which would have prevented this specific pattern; there’s some back-and-forth on when that edition became available relative to the incident.
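
The pattern is easier to see in code. The sketch below is not Fly's code: the `Routes` type and its method are hypothetical, and it uses the standard library's `RwLock`. The hazardous shape appears only in a comment; the function itself shows one edition-independent way to avoid it, by finishing the read in its own statement so the guard is dropped before the write lock is requested.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical routing cache guarded by a reader-writer lock.
pub struct Routes {
    inner: RwLock<HashMap<String, u32>>,
}

impl Routes {
    pub fn new() -> Self {
        Routes { inner: RwLock::new(HashMap::new()) }
    }

    // Hazardous shape (comment only). In editions before 2024, the read
    // guard created in the `if let` scrutinee lives until the end of the
    // whole if/else, so the else-branch requests the write lock while this
    // thread still holds the read lock:
    //
    //   if let Some(v) = self.inner.read().unwrap().get(key).copied() {
    //       v
    //   } else {
    //       self.inner.write().unwrap().insert(key.to_string(), default);
    //       default
    //   }
    pub fn get_or_insert(&self, key: &str, default: u32) -> u32 {
        // The temporary read guard is dropped at the end of this statement.
        let cached = self.inner.read().unwrap().get(key).copied();
        if let Some(value) = cached {
            value
        } else {
            // No read guard is held here, so taking the write lock is safe.
            let mut map = self.inner.write().unwrap();
            *map.entry(key.to_string()).or_insert(default)
        }
    }
}
```

Calling `get_or_insert("app", 7)` twice returns 7 both times, with the second call served from the read path. Under the Rust 2024 edition the commented-out form would also release the guard before the else-branch, but the explicit `let` keeps the lock scope visible regardless of edition.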

Product maturity, trust, and expectations

Some commenters argue that repeated issues around service discovery, certificate expiry, and distributed state suggest a “move fast and learn in production” mindset that’s risky for a public cloud.
They contend such a provider should enter the market only after having a robust, validated design for global routing and automation of basics like cert renewal.
Fly’s responses emphasize that global any-to-any routing is the core premise of the platform, not a bolt-on differentiator; the complexity and nonstandard solutions follow from that premise.

Other tools and side discussions

rqlite is raised as a possible way to achieve a fault-tolerant SQLite-based system; its creator chimes in with a couple of production references.
Some criticize the perceived “obsession” with SQLite/CRDT, suggesting a traditional Postgres deployment and even specialized networking hardware (FPGAs), though concrete benefits for this problem are not clearly articulated.
There are minor threads about name collisions with another project named “Corrosion” and general curiosity about how very large gossip systems are run in practice.
