Let’s Encrypt Outage
Immediate impact of the outage
- Affected many downstream services that depend on Let’s Encrypt (LE) for issuance, including platforms like Heroku; others like Cloudflare were noted as less affected because they don’t rely solely on LE.
- For most sites with existing certs, this should be a non-event, because renewal normally happens well before expiration; the main pain is for issuing new certs or replacing recently expired ones.
- Some users hit the outage while spinning up new services or renewing already-expired certificates and had to scramble for workarounds.
Reliance on a single CA and redundancy
- Several comments worry about “encrypting the web” being effectively dependent on a single free CA.
- Alternatives mentioned: ZeroSSL, Buypass, and cloud-provider CAs (Google, AWS) via ACME.
- Some tooling (e.g., Caddy) supports automatic fallback to another ACME provider, but commenters reported edge cases (such as API-based configuration) where the fallback failed.
- People share configs and patterns for using multiple ACME authorities for resilience.
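The multi-authority pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not any particular tool’s implementation: `issue_with_fallback` and the `issue` callable are hypothetical names, and the two directory URLs are the ACME v2 endpoints published by Let’s Encrypt and ZeroSSL (verify them against each CA’s documentation; account setup differs between CAs).

```python
from typing import Callable, Sequence, Tuple

# ACME directory URLs, tried in order. Verify against each CA's docs before use.
ACME_DIRECTORIES = [
    "https://acme-v02.api.letsencrypt.org/directory",
    "https://acme.zerossl.com/v2/DV90",
]


class AllAuthoritiesFailed(Exception):
    """Raised when every configured ACME directory failed."""


def issue_with_fallback(
    domain: str,
    issue: Callable[[str, str], str],
    directories: Sequence[str] = ACME_DIRECTORIES,
) -> Tuple[str, str]:
    """Try each directory in order; return (directory_url, certificate).

    `issue` is a hypothetical callable that wraps a real ACME client
    and raises an exception when issuance fails.
    """
    failures = []
    for directory_url in directories:
        try:
            return directory_url, issue(directory_url, domain)
        except Exception as exc:  # a real client would catch narrower errors
            failures.append(f"{directory_url}: {exc}")
    raise AllAuthoritiesFailed("; ".join(failures))
```

Passing the issuing function in as a parameter keeps the fallback logic independent of whichever ACME client is actually used, and makes the ordering of authorities a plain configuration list.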
Certificate lifetimes: short vs long
- Debate around LE’s move toward very short-lived certs (down to 6 days in future plans) and broader ecosystem trends (eventual 47-day max for public CAs).
- Pro-short-lifetime arguments:
  - Compensate for broken revocation; expiration is the only reliable revocation.
  - Enable fast ecosystem-wide rotations (e.g., algorithm changes, compromises).
- Anti-short-lifetime arguments:
  - Increases operational fragility and automation complexity.
  - Encourages weaker security practices (more keys exposed to automation, more cert warnings, more “alert fatigue”).
  - Some feel it’s analogous to over-frequent password rotation and yields marginal real security benefit.
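The fragility concern can be made concrete with a little arithmetic: the shorter the lifetime, the less time a CA outage can last before certs start expiring. The sketch below assumes clients renew once two thirds of the lifetime has elapsed (a common convention, not something stated in the thread) and uses LE’s standard 90-day lifetime as a baseline; `outage_tolerance_days` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumption: renewal is attempted once 2/3 of the lifetime has elapsed.
RENEW_AT = Fraction(2, 3)


def outage_tolerance_days(lifetime_days: int, renew_at: Fraction = RENEW_AT) -> float:
    """Days between the first renewal attempt and expiry.

    This is roughly how long issuance can stay down before a cert
    that was due for renewal actually expires.
    """
    if not 0 < renew_at < 1:
        raise ValueError("renew_at must be strictly between 0 and 1")
    return float(lifetime_days * (1 - renew_at))


# 90 days is LE's standard lifetime; 47 and 6 are the lifetimes discussed above.
for lifetime in (90, 47, 6):
    slack = outage_tolerance_days(lifetime)
    print(f"{lifetime:>2}-day cert: ~{slack:.1f} days of slack")
```

Under that assumption the slack shrinks from about 30 days (90-day certs) to about 15.7 days (47-day certs) and 2 days (6-day certs), which is why a multi-hour outage matters more as lifetimes drop.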
Operations, automation, and monitoring
- LE discontinued expiration reminder emails; some admins were caught out, with certs expiring the same day as the outage.
- Strong sentiment that operators should rely on automatic renewal and independent monitoring, not vendor emails.
- Suggestions: custom scripts, CT-log–based monitors, self-hosted tools (e.g., gatus, uptime-kuma), and Prometheus exporters.
- Discussion of misconfigured certbot setups and “you’re holding it wrong” critiques when renewal isn’t automated.
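A custom expiry check of the kind suggested above can be quite small. The following is a minimal sketch using only Python’s standard library; the function names and any alert threshold are illustrative assumptions, not taken from a tool mentioned in the thread.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Convert a certificate's notAfter string into days remaining."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port and return the leaf certificate's notAfter field.

    With the default context, an already-expired certificate fails the
    handshake with ssl.SSLCertVerificationError; a monitor should treat
    that exception as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A cron job could call `days_until_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))` and alert when the result drops below the expected renewal margin, which catches a silently broken renewal job without depending on any vendor email.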
PKI, DANE, and centralization concerns
- Calls for DANE and DNSSEC-based models to “cut out the middleman,” but skepticism that DNSSEC/DANE will be widely adopted; “that ship has sailed” is a recurring view.
- Concern over centralized control of trust by browser vendors and a small club of CAs; some argue registrars should be the CAs for their own domains.
- Broader critique that the Web PKI and X.509 stack is over-complex and structurally flawed; a few mention decentralized identifiers or token-based models as possible future directions, though details remain unclear and contested.
Outage cause and reliability history
- LE attributed this incident to DNS; the thread is full of classic “it’s always DNS” humor and war stories.
- Some recall previous multi-hour LE outages; others note LE generally learns and improves after incidents.
- Concern about a “thundering herd” of renewals when service comes back, though LE has historically provisioned for very high throughput.
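On the client side, one common way to avoid contributing to a thundering herd is exponential backoff with “full jitter”: each retry waits a random time between zero and an exponentially growing ceiling. The sketch below is a generic illustration of that technique, not something the thread attributes to LE or to any specific ACME client; the function name and default values are assumptions.

```python
import random
from typing import List, Optional


def full_jitter_delays(
    attempts: int,
    base: float = 60.0,
    cap: float = 3600.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized delay (in seconds) per retry attempt.

    The ceiling doubles with each attempt, starting at `base` and never
    exceeding `cap`; the actual delay is drawn uniformly from [0, ceiling],
    so clients that failed at the same moment spread their retries out.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Accepting an optional `random.Random` instance makes the schedule reproducible in tests while leaving production use unseeded.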