2024-07-27

Show HN: I built an open-source tool to make on-call suck less

Overall reaction to the tool

Many welcome another open‑source, Slack-integrated on-call tool and like the focus on reducing alert fatigue, surfacing context, and providing post‑shift analytics.
Some see strong overlap with existing incident and on-call tools and ask how this one differs or how it plans to compete.
A few ask for similar tooling for data/business metrics, and others mention adjacent/open‑source projects in the same space.

Slack / IM as alert channels

Broad agreement that Slack/Telegram/IM are bad as primary alert mechanisms: messages scroll away, don’t re‑alert, and are easy to miss.
Common pattern: send alerts to PagerDuty, OpsGenie, or a similar service for paging, and mirror them to Slack or email for collaboration and visibility.
Some orgs are on Microsoft Teams or can’t use Slack due to security, reliability, or regulatory concerns, so Slack‑only support is seen as limiting.
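
The "page first, mirror second" pattern described above could be sketched roughly as follows. This is only an illustration: `Alert`, `route`, and the two severity values are hypothetical names, and the senders are plain callables standing in for whatever paging or chat client an org actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    name: str
    severity: str  # "page" or "info" (hypothetical levels for this sketch)
    summary: str


# A sender is any callable that delivers an alert somewhere.
Sender = Callable[[Alert], None]


def route(alert: Alert, pager: Sender, mirrors: List[Sender]) -> List[str]:
    """Page for urgent alerts, then mirror to chat/email for visibility."""
    delivered = []
    if alert.severity == "page":
        # The pager is the primary channel: it re-alerts and escalates.
        pager(alert)
        delivered.append("pager")
    for i, mirror in enumerate(mirrors):
        try:
            mirror(alert)
            delivered.append(f"mirror{i}")
        except Exception:
            # A chat or email outage must never block or undo a page.
            pass
    return delivered
```

Keeping the mirrors as a list means a Teams or email sender can be dropped in where Slack is not an option.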

Alert fatigue, culture, and management

Many argue on-call problems are mostly cultural/organizational: understaffing, lack of observability, tolerance for noisy alerts, and refusal to prioritize reliability work.
Suggested remedies: “no broken windows” culture (features stop when things are broken), clear SLOs, strong incident systems of record, better reporting on alert load, and putting managers on or near the on-call rotation.
Others note that in many enterprises IT/ops are seen as a cost center, making change slow and political.

LLMs for alert classification

Supporters like using LLMs to classify alerts as noisy vs. actionable, especially to:
- Reduce cognitive load in the moment.
- Produce after‑the‑fact data about which alerts are wasteful and should be tuned or removed.
Critics see this as a risky band‑aid:
- Worry about hallucinations or misclassifying mission‑critical alerts.
- Argue it may entrench bad alert hygiene instead of fixing root causes.
- Emphasize “assist, don’t decide”; use ML for prioritization and analytics, not for silencing pages autonomously.
Several note that even good orgs have structurally noisy but sometimes‑useful alerts; tools that help triage and analyze those can still be valuable.
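
One way to read "assist, don't decide" is sketched below, under stated assumptions: `classify` is a hypothetical callable wrapping whatever model is used, and the labels and severity values are invented for illustration. The classifier's label is attached as a note and counted for later review, but whether a page goes out depends only on the alert's configured severity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Triage:
    label: str  # "actionable", "noisy", or "unknown" (advisory only)
    page: bool  # whether a human is paged
    note: str   # text shown next to the alert


def triage(alert_text: str, severity: str,
           classify: Callable[[str], str]) -> Triage:
    try:
        label = classify(alert_text)
    except Exception:
        # A model outage or timeout degrades to "no suggestion".
        label = "unknown"
    if label not in ("actionable", "noisy"):
        # Anything unexpected from the model is treated as no suggestion.
        label = "unknown"
    # Paging depends on the configured severity alone;
    # the classifier's label never suppresses a page.
    page = severity == "page"
    return Triage(label, page, f"classifier suggests: {label}")


def noisy_counts(history: Iterable[Tuple[str, Triage]]) -> Counter:
    """Post-shift tally of which alert names were labeled noisy."""
    return Counter(name for name, t in history if t.label == "noisy")
```

The tally is what feeds the after-the-fact tuning discussion: a human still decides whether an alert that is often labeled noisy gets tuned or removed.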

On-call expectations, pay, and ergonomics

Many stress that paying fairly for on-call (cash, overtime, or generous comp time) and setting realistic uptime expectations are crucial to “making on-call suck less.”
Some push back on the normalization of constant on-call for non‑critical SaaS, especially when uncompensated.
Practical pain points raised: unreliable phone notifications, desire for dedicated hardware or phones, complex scheduling (multiple shifts, holidays, training/shadowing), and calendar integration quirks.

Best practices and alternative approaches

Recurrent advice:
- Only alert on actionable conditions with clear runbooks.
- Use priority levels; keep “smoke test” / low‑priority alerts distinct from pages.
- Continuously tune alerts (thresholds, grace periods, auto‑remediation) and run regular “alert hygiene” sessions.
- Ensure every alert has an owner and gather feedback on whether alerts were actually useful.
Some reference prior art in telecom fault/alarm management and note IT is essentially reinventing this, often with less structured data and more ad‑hoc channels like Slack.
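
As an illustration of the threshold and grace-period tuning mentioned in the advice above, here is a minimal sketch. `GraceAlert` and its fields are hypothetical names, not taken from any particular monitoring tool. It fires only once a value has stayed above the threshold continuously for the whole grace period, so brief spikes do not page anyone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraceAlert:
    threshold: float
    grace_seconds: float
    _breach_started: Optional[float] = None  # when the current breach began

    def observe(self, value: float, now: float) -> bool:
        """Return True once value has exceeded threshold for grace_seconds."""
        if value <= self.threshold:
            # Recovery resets the clock, so short spikes never fire.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.grace_seconds


# Usage: a disk-usage check with a 60-second grace period.
check = GraceAlert(threshold=0.9, grace_seconds=60)
check.observe(0.95, now=0)    # breach starts, no alert yet
check.observe(0.95, now=30)   # still within the grace period
check.observe(0.95, now=60)   # sustained for 60 seconds, alert fires
```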
