Let’s Encrypt Outage

Immediate impact of the outage

  • Affected many downstream services that depend on Let’s Encrypt (LE) for issuance, including platforms like Heroku; others like Cloudflare were noted as less affected because they don’t rely solely on LE.
  • For most sites with existing certs, this should be a non-event because renewal normally happens well before expiration; the main pain is issuing new certs or replacing recently expired ones.
  • Some users hit the outage while spinning up new services or renewing already-expired certificates and had to scramble for workarounds.

Reliance on a single CA and redundancy

  • Several comments worry about “encrypting the web” being effectively dependent on a single free CA.
  • Alternatives mentioned: ZeroSSL, Buypass, and cloud-provider CAs (Google, AWS) via ACME.
  • Some tooling (e.g., Caddy) supports automatic fallback to another ACME provider, but there are edge cases (like API-based configuration) where fallback failed.
  • People share configs and patterns for using multiple ACME authorities for resilience.
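
The multi-authority pattern from the bullets above can be sketched roughly as follows. This is a minimal illustration, not any particular tool's implementation: `issue` is a hypothetical callable standing in for whatever ACME client is actually used, and the directory URLs are listed as assumptions that should be checked against each CA's documentation.

```python
from typing import Callable, Sequence, Tuple

# ACME directory endpoints, in order of preference.
# Assumption: these URLs are current; verify them against each CA's docs.
ACME_DIRECTORIES = [
    ("letsencrypt", "https://acme-v02.api.letsencrypt.org/directory"),
    ("zerossl", "https://acme.zerossl.com/v2/DV90"),
]


class IssuanceError(Exception):
    """Raised when every configured CA failed to issue."""


def issue_with_fallback(
    domain: str,
    issue: Callable[[str, str], str],  # hypothetical: (domain, directory_url) -> cert
    directories: Sequence[Tuple[str, str]] = ACME_DIRECTORIES,
) -> Tuple[str, str]:
    """Try each CA in order and return (ca_name, cert) from the first success."""
    errors = []
    for name, url in directories:
        try:
            return name, issue(domain, url)
        except Exception as exc:  # any failure moves on to the next CA
            errors.append(f"{name}: {exc}")
    raise IssuanceError("; ".join(errors))
```

Note that some alternative CAs require extra account setup (for example, external account binding credentials), so a fallback entry only helps if it has been registered and tested before the primary CA goes down.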

Certificate lifetimes: short vs long

  • Debate around LE’s move toward very short-lived certs (down to 6 days in future plans) and the broader ecosystem trend toward shorter lifetimes (an eventual 47-day maximum for public CAs).
  • Pro-short-lifetime arguments:
    • Compensate for broken revocation; expiration is the only reliable revocation.
    • Enable fast ecosystem-wide rotations (e.g., algorithm changes, compromises).
  • Anti-short-lifetime arguments:
    • Increases operational fragility and automation complexity.
    • Encourages weaker security practices (more keys exposed to automation, more cert warnings, more “alert fatigue”).
    • Some feel it’s analogous to over-frequent password rotation and yields marginal real security benefit.
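
To make the fragility argument concrete, here is a rough calculation of how much slack a CA outage leaves at different lifetimes. It assumes, purely for illustration, that a client first attempts renewal once two thirds of the lifetime has elapsed; real clients and CAs may use different schedules.

```python
def renewal_buffer_days(lifetime_days: float, renew_fraction: float = 2 / 3) -> float:
    """Days of validity left when renewal is first attempted.

    renew_fraction is an assumed policy: the share of the lifetime
    that elapses before the client starts trying to renew.
    """
    return lifetime_days * (1 - renew_fraction)


# Lifetimes mentioned in the discussion: today's 90 days, the 47-day cap, 6-day certs.
for lifetime in (90, 47, 6):
    slack = renewal_buffer_days(lifetime)
    print(f"{lifetime}-day cert: about {slack:.0f} days to ride out a CA outage")
```

Under that assumption the slack shrinks from roughly 30 days to roughly 16 and then 2, which is why a multi-hour outage is a non-event for 90-day certs but leaves much less margin at the shortest lifetimes.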

Operations, automation, and monitoring

  • LE discontinued expiration reminder emails; some admins were caught out, with certs expiring the same day as the outage.
  • Strong sentiment that operators should rely on automatic renewal and independent monitoring, not vendor emails.
  • Suggestions: custom scripts, CT-log–based monitors, self-hosted tools (e.g., gatus, uptime-kuma), and Prometheus exporters.
  • Discussion of misconfigured certbot setups and “you’re holding it wrong” critiques when renewal isn’t automated.
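
A custom expiry check of the kind suggested above can be quite small. The sketch below uses only the Python standard library; the 14-day threshold and the function names are illustrative choices, not something prescribed in the thread.

```python
import socket
import ssl
import time
from typing import Optional


def days_until(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and a cert's notAfter string."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the leaf certificate's notAfter field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str, warn_days: float = 14) -> bool:
    """Return True if the certificate has more than warn_days of validity left."""
    return days_until(fetch_not_after(host)) > warn_days
```

Run from cron or a monitoring agent, `check("example.com")` returning False would trigger an alert. An already-expired certificate fails verification during the handshake and raises an exception instead, so a caller should treat that as an alert too.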

PKI, DANE, and centralization concerns

  • Calls for DANE and DNSSEC-based models to “cut out the middleman,” but skepticism that DNSSEC/DANE will be widely adopted; “that ship has sailed” is a recurring view.
  • Concern over centralized control of trust by browser vendors and a small club of CAs; some argue registrars should be the CAs for their own domains.
  • Broader critique that the Web PKI and X.509 stack is over-complex and structurally flawed; a few mention decentralized identifiers or token-based models as possible future directions, though details remain unclear and contested.

Outage cause and reliability history

  • LE attributed this incident to DNS; the thread is full of classic “it’s always DNS” humor and war stories.
  • Some recall previous multi-hour LE outages; others note LE generally learns and improves after incidents.
  • Concern about a “thundering herd” of renewals when service comes back, though LE has historically provisioned for very high throughput.
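
On the client side, the usual way to avoid contributing to a thundering herd is to retry with randomized, growing delays rather than on a fixed schedule. The following is a generic sketch of exponential backoff with full jitter; the base and cap values are arbitrary assumptions, and it does not describe how any specific ACME client behaves.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 60.0,    # assumed starting window, in seconds
    cap: float = 3600.0,   # assumed upper bound on the window, in seconds
    rng: Optional[random.Random] = None,
) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    window = min(cap, base * 2 ** attempt)
    return source.uniform(0, window)


# Example: delays a client might sleep before retries 0 through 4.
delays = [backoff_delay(n) for n in range(5)]
```

Because each client draws its own random delay, retries spread across the window instead of all landing the moment the service recovers.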