The Dangers of SSL Certificates
Nature of the problem & “cliff-edge” failure mode
- Several commenters note that certificates fail as a hard cliff (everything breaks at once), but argue this is not unique: disks, databases, and keys/passwords also have binary failure modes.
- Others agree the “digital cliff” is real and operationally tricky because time, not a deploy, triggers the failure, so there’s no natural staging or partial rollout.
- Some see the issue as human/ops failure, not a flaw in certificates themselves: expiring certs are a “known problem with an established solution.”
Monitoring, alerting, and inventory
- Strong consensus that external monitoring of endpoints is essential: check which cert is actually being served, not just whether renewal jobs ran or Certificate Transparency (CT) logs show a new cert.
- Common patterns: Prometheus blackbox_exporter or ssl_exporter metrics, Uptime Kuma, custom scripts (e.g., alert when fewer than 14 to 30 days of validity remain), SaaS tools that mine CT logs, and healthchecks / deadman switches to “monitor the monitors.”
- People stress user-perspective checks (e.g., “remaining validity of the certificate offered by the service”) plus internal checks (ACME jobs, reloads, secret distribution).
- A recurring pain point is inventory: finding all certs across load balancers, containers, internal CAs, wildcards, and client/mTLS certs.
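The user-perspective check described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not any commenter's actual tooling: the function names (`fetch_not_after`, `days_remaining`, `classify`) and the 30/14-day thresholds are assumptions chosen to match the 14–30 day range mentioned in the thread.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter field of the certificate the endpoint actually serves.

    The default context verifies the chain, so an already-expired or otherwise
    invalid certificate raises ssl.SSLCertVerificationError during the
    handshake. A caller should treat that exception as an alert in itself.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: datetime) -> float:
    """Convert a notAfter string (e.g. 'Jan 11 00:00:00 2025 GMT') to days left."""
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expiry - now).total_seconds() / 86400


def classify(days_left: float, warn_days: float = 30, crit_days: float = 14) -> str:
    """Map remaining validity to an alert level (thresholds are illustrative)."""
    if days_left <= 0:
        return "expired"
    if days_left < crit_days:
        return "critical"
    if days_left < warn_days:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # "example.com" is a placeholder; substitute each endpoint from the inventory.
    not_after = fetch_not_after("example.com")
    left = days_remaining(not_after, datetime.now(timezone.utc))
    print(f"{left:.1f} days left: {classify(left)}")
```

Running a check like this from outside the serving infrastructure catches the cases the thread worries about, such as a renewal that succeeded on disk but was never reloaded into the server.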
Multiple certificates & shrinking lifetimes
- One thread proposes overlapping “backup” certificates; some infrastructure already supports serving multiple certs, though often only one per key type (e.g., one RSA and one ECDSA) rather than several of the same type.
- Others argue overlapping certs add complexity with little benefit versus simply renewing early and reliably.
- Multiple certs are seen as more useful for fast key-rotation after compromise than for normal expiry.
- Shortening maximum lifetimes (heading toward ~47 days) is contentious: some say this will force better automation and more practice; others see it as pain imposed by a small set of PKI players.
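To make the “renew early” argument concrete, the arithmetic can be sketched as below. The rule of renewing once a fixed fraction of the lifetime remains (one third here) and the one-retry-per-day schedule are assumptions for illustration, not something the thread prescribes; `renewal_time` and `retry_budget` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime,
                 remaining_fraction: float = 1 / 3) -> datetime:
    """Return the moment at which `remaining_fraction` of the lifetime is left."""
    lifetime = not_after - not_before
    return not_after - lifetime * remaining_fraction


def retry_budget(renew_at: datetime, not_after: datetime,
                 retry_interval: timedelta) -> int:
    """Count how many full retry intervals fit between first attempt and expiry."""
    return (not_after - renew_at) // retry_interval


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for days in (90, 47):
        expires = issued + timedelta(days=days)
        renew_at = renewal_time(issued, expires)
        budget = retry_budget(renew_at, expires, timedelta(days=1))
        print(f"{days}-day cert: renew at {renew_at:%Y-%m-%d}, {budget} daily retries before expiry")
```

Under these assumptions, a shorter lifetime shrinks the window in which a silently failing renewal can be noticed and fixed, which is one reason both sides of the debate tie shorter lifetimes to automation quality.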
Operational discipline & runbooks
- Anecdotes highlight autorenewal silently failing (DNS changes, bad reloads, broken alerts) and outages on first on-call shifts or yearly wildcard renewals.
- Recommended mitigations: clear expiry thresholds, automatic tickets, CI/CD-run runbooks for renewal and rotation, and “upside-down pyramid” monitoring where failure of monitoring itself is a first-class alert.
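The “monitor the monitors” idea can be sketched as a deadman check: every monitoring job records a heartbeat, and a separate check alerts when any heartbeat goes stale. The job names, the 24-hour limit, and the `stale_monitors` helper are all hypothetical; real setups in the thread use hosted healthcheck services or alerting rules rather than this exact code.

```python
from datetime import datetime, timedelta, timezone


def stale_monitors(last_seen: dict[str, datetime], now: datetime,
                   max_age: timedelta) -> list[str]:
    """Return the names of monitors whose last heartbeat is older than max_age."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_age)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical heartbeat records; in practice these would be written by
    # each job on success and read from shared storage.
    heartbeats = {
        "cert-expiry-check": now - timedelta(minutes=5),
        "acme-renewal-job": now - timedelta(hours=30),
    }
    for name in stale_monitors(heartbeats, now, max_age=timedelta(hours=24)):
        # A silent monitor is treated as a first-class alert, not a footnote.
        print(f"ALERT: no heartbeat from {name} within the last 24 hours")
```

The point of the design is that silence is treated as failure: a renewal job that stops running produces an alert even though it never reports an error.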
Alternatives and broader security views
- A minority suggests serving personal sites over both HTTP and HTTPS to avoid hard failures; most push back, emphasizing that TLS provides integrity and authentication, not merely encryption.
- Some discuss SSH-style trust-on-first-use or DANE/DNSSEC, but note scalability, UX, and deployment challenges.