2025-07-21

Let’s Encrypt Outage

Immediate impact of the outage

The outage affected many downstream services that depend on Let’s Encrypt (LE) for issuance, including platforms like Heroku; others, like Cloudflare, were reported as less affected because they don’t rely solely on LE.
For most sites with existing certs, this should be a non-event because renewal normally happens well before expiration; the main pain is in issuing new certs or replacing recently expired ones.
Some users hit the outage while spinning up new services or renewing already-expired certificates and had to scramble for workarounds.

Reliance on a single CA and redundancy

Several commenters worry that “encrypting the web” is effectively dependent on a single free CA.
Alternatives mentioned: ZeroSSL, Buypass, and cloud-provider CAs (Google, AWS) via ACME.
Some tooling (e.g., Caddy) supports automatic fallback to another ACME provider, but there are edge cases (like API-based configuration) where fallback failed.
People share configs and patterns for using multiple ACME authorities for resilience.
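The multi-authority pattern can be sketched roughly as follows. This is a minimal illustration, not any tool's actual implementation: `issue_with_fallback`, `IssuanceError`, and the `issue` callback are hypothetical names, and the directory URLs are the commonly published ones for Let's Encrypt and ZeroSSL, which should be verified against each CA's documentation (some CAs may also require extra account setup before they will issue).

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: commonly published ACME directory URLs; verify before use.
ACME_DIRECTORIES = [
    "https://acme-v02.api.letsencrypt.org/directory",
    "https://acme.zerossl.com/v2/DV90",
]


class IssuanceError(Exception):
    """Raised when every configured ACME authority failed."""


def issue_with_fallback(
    domain: str,
    directories: Sequence[str],
    issue: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each ACME directory in order and return the first success.

    `issue(directory_url, domain)` is a hypothetical callback that wraps
    whatever ACME client is in use and returns the certificate bytes.
    """
    errors: List[str] = []
    for directory in directories:
        try:
            return directory, issue(directory, domain)
        except Exception as exc:  # any failure moves on to the next CA
            errors.append(f"{directory}: {exc}")
    raise IssuanceError("all ACME authorities failed: " + "; ".join(errors))
```

The ordering of the list is the policy: the preferred CA goes first, and the fallback is only contacted when the preferred one errors out.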

Certificate lifetimes: short vs long

Debate centers on LE’s move toward very short-lived certs (as short as 6 days in future plans) and the broader ecosystem trend (an eventual 47-day maximum for public CAs).
Pro-short-lifetime arguments:
- Compensate for broken revocation; expiration is the only reliable form of revocation.
- Enable fast ecosystem-wide rotations (e.g., algorithm changes, compromises).
Anti-short-lifetime arguments:
- Increases operational fragility and automation complexity.
- Encourages weaker security practices (more keys exposed to automation, more cert warnings, more “alert fatigue”).
- Some feel it’s analogous to over-frequent password rotation and yields marginal real security benefit.
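The fragility argument can be made concrete with a little arithmetic. The sketch below assumes a client renews once two-thirds of the certificate lifetime has elapsed; that fraction is an assumption for illustration (real clients and CAs may use different thresholds), and `outage_slack_days` is a hypothetical helper name. The slack is how long a CA outage can last, starting at the moment renewal first becomes due, before the certificate expires.

```python
from fractions import Fraction


def outage_slack_days(lifetime_days: int, renew_at: Fraction = Fraction(2, 3)) -> float:
    """Days between the first renewal attempt and expiry.

    Assumes renewal is first attempted after `renew_at` of the lifetime
    has elapsed (an illustrative assumption, not a documented default).
    """
    return float(lifetime_days * (1 - renew_at))


# Lifetimes mentioned in the thread: today's 90 days, 47 days, and 6 days.
for lifetime in (90, 47, 6):
    print(f"{lifetime:>2}-day cert: {round(outage_slack_days(lifetime), 1)} days of slack")
```

Under that assumption, a 90-day cert tolerates roughly a month of failed renewals, while a 6-day cert tolerates about two days.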

Operations, automation, and monitoring

LE discontinued expiration reminder emails; some admins were caught out, with certs expiring the same day as the outage.
Strong sentiment that operators should rely on automatic renewal and independent monitoring, not vendor emails.
Suggestions: custom scripts, CT-log–based monitors, self-hosted tools (e.g., gatus, uptime-kuma), and Prometheus exporters.
Discussion of misconfigured certbot setups and “you’re holding it wrong” critiques when renewal isn’t automated.
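A custom expiry check of the kind suggested above might look like this, using only the Python standard library. It is a minimal sketch: `fetch_not_after`, `days_remaining`, `check`, and the 14-day warning threshold are illustrative choices, not something prescribed in the thread.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the served leaf certificate's notAfter string.

    With the default context, an already-expired or otherwise invalid
    certificate makes the handshake raise ssl.SSLCertVerificationError,
    which a monitor should itself treat as an alert.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry for a notAfter string like 'Jan  1 00:00:00 2025 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def check(host: str, warn_days: float = 14.0) -> bool:
    """Return True if the cert for `host` is valid and not close to expiry."""
    try:
        return days_remaining(fetch_not_after(host)) > warn_days
    except (ssl.SSLError, OSError):
        return False  # unreachable or invalid: alert either way
```

Running `check("example.com")` from cron or a monitoring agent, and alerting when it returns False, gives a signal that does not depend on the CA sending reminder emails.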

PKI, DANE, and centralization concerns

Calls for DANE and DNSSEC-based models to “cut out the middleman,” but skepticism that DNSSEC/DANE will be widely adopted; “that ship has sailed” is a recurring view.
Concern over centralized control of trust by browser vendors and a small club of CAs; some argue registrars should be the CAs for their own domains.
Broader critique that the Web PKI and X.509 stack is over-complex and structurally flawed; a few mention decentralized identifiers or token-based models as possible future directions, though details remain unclear and contested.

Outage cause and reliability history

LE attributed this incident to DNS; the thread is full of classic “it’s always DNS” humor and war stories.
Some recall previous multi-hour LE outages; others note LE generally learns and improves after incidents.
Concern about a “thundering herd” of renewals when service comes back, though LE has historically provisioned for very high throughput.
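On the client side, the usual way to avoid contributing to a thundering herd is to retry with exponential backoff and random jitter rather than on a fixed schedule. The following is a generic sketch of that technique, not a description of how any particular ACME client behaves; `backoff_delays` and its default `base` and `cap` values are hypothetical.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 60.0,
    cap: float = 3600.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that failed at the same moment spread out their retries.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A renewal loop would sleep for each delay in turn between failed attempts; because the delays are random, a fleet of machines that all noticed the outage at once does not all come back at the same second.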
