Show HN: I built an open-source tool to make on-call suck less
Overall reaction to the tool
- Many welcome another open‑source, Slack-integrated on-call tool and like the focus on reducing alert fatigue, surfacing context, and providing post‑shift analytics.
- Some see strong overlap with existing incident / on-call tools and ask how this differs or will compete.
- A few ask for similar tooling for data/business metrics, and others mention adjacent or open‑source projects in the same space.
Slack / IM as alert channels
- Broad agreement that Slack/Telegram/IM are bad as primary alert mechanisms: messages scroll away, don’t re‑alert, and are easy to miss.
- Common pattern: send alerts to PagerDuty/OpsGenie (or similar) for paging, and mirror to Slack/Email for collaboration and visibility.
- Some orgs are on Microsoft Teams or can’t use Slack due to security, reliability, or regulatory concerns, so Slack‑only support is seen as limiting.
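The "page first, mirror to chat" pattern described above can be sketched roughly as follows. This is a minimal illustration, not any particular tool's implementation: `Alert`, `Router`, and the `"page"` severity value are hypothetical names, and the pager and mirror sinks are plain callables standing in for whatever paging service and chat integration an org actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Alert:
    name: str
    severity: str  # hypothetical levels: "page" or "info"
    summary: str


@dataclass
class Router:
    # pager: stand-in for the paging service (the channel that re-alerts).
    pager: Callable[[Alert], None]
    # mirrors: stand-ins for chat/email sinks used for visibility only.
    mirrors: List[Callable[[Alert], None]] = field(default_factory=list)

    def dispatch(self, alert: Alert) -> None:
        # Page first, so delivery never depends on a chat integration.
        if alert.severity == "page":
            self.pager(alert)
        for mirror in self.mirrors:
            try:
                mirror(alert)
            except Exception:
                # A failing mirror is ignored here; a real system would log it.
                pass


paged, chat = [], []
router = Router(pager=paged.append, mirrors=[chat.append])
router.dispatch(Alert("db_down", "page", "primary unreachable"))
router.dispatch(Alert("deploy_done", "info", "release finished"))
```

In this sketch the chat channel sees everything, but only the pager is responsible for waking someone up.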
Alert fatigue, culture, and management
- Many argue on-call problems are mostly cultural/organizational: understaffing, lack of observability, tolerance for noisy alerts, and refusal to prioritize reliability work.
- Suggested remedies: “no broken windows” culture (features stop when things are broken), clear SLOs, strong incident systems of record, better reporting on alert load, and putting managers on or near the on-call rotation.
- Others note that in many enterprises IT/ops are seen as a cost center, making change slow and political.
LLMs for alert classification
- Supporters like using LLMs to classify alerts as noisy vs. actionable, especially to:
  - Reduce cognitive load in the moment.
  - Produce after‑the‑fact data about which alerts are wasteful and should be tuned or removed.
- Critics see this as a risky band‑aid:
  - Worry about hallucinations or misclassifying mission‑critical alerts.
  - Argue it may entrench bad alert hygiene instead of fixing root causes.
  - Emphasize “assist, don’t decide”; use ML for prioritization and analytics, not for silencing pages autonomously.
- Several note that even good orgs have structurally noisy but sometimes‑useful alerts; tools that help triage and analyze those can still be valuable.
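One way to read "assist, don't decide" is that the classifier's output is attached to the alert as an annotation and tallied for later review, but never controls delivery. The sketch below assumes a hypothetical `classify` callable returning a `(label, confidence)` pair; it stands in for an LLM or any other model and is not the API of the tool under discussion.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Triage:
    deliver: bool      # always True: the classifier never silences an alert
    label: str
    confidence: float
    note: str          # advisory text shown next to the alert


# Running tally of advisory labels, for after-the-fact tuning discussions.
label_tally: Counter = Counter()


def triage(alert_text: str,
           classify: Callable[[str], Tuple[str, float]]) -> Triage:
    try:
        label, confidence = classify(alert_text)
    except Exception:
        # If the classifier fails, fall back to an unannotated delivery.
        label, confidence = "unknown", 0.0
    label_tally[label] += 1
    note = f"[advisory: {label}, confidence {confidence:.2f}]"
    return Triage(deliver=True, label=label, confidence=confidence, note=note)


def fake_classifier(text: str) -> Tuple[str, float]:
    # Hypothetical stand-in for a model call.
    return ("noisy", 0.9) if "disk" in text else ("actionable", 0.7)


result = triage("disk usage at 81%", fake_classifier)
```

Because `deliver` is hard-coded to `True`, a misclassification can mislead the responder but cannot hide a page, which is the failure mode the critics are most worried about.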
On-call expectations, pay, and ergonomics
- Many stress that paying fairly for on-call (cash, overtime, or generous comp time) and setting realistic uptime expectations are crucial to “making on-call suck less.”
- Some push back on the normalization of constant on-call for non‑critical SaaS, especially when uncompensated.
- Practical pain points raised: unreliable phone notifications, desire for dedicated hardware or phones, complex scheduling (multiple shifts, holidays, training/shadowing), and calendar integration quirks.
Best practices and alternative approaches
- Recurrent advice:
  - Only alert on actionable conditions with clear runbooks.
  - Use priority levels; keep “smoke test” / low‑priority alerts distinct from pages.
  - Continuously tune alerts (thresholds, grace periods, auto‑remediation) and run regular “alert hygiene” sessions.
  - Ensure every alert has an owner and gather feedback on whether alerts were actually useful.
- Some reference prior art in telecom fault/alarm management and note IT is essentially reinventing this, often with less structured data and more ad‑hoc channels like Slack.
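The recurrent advice about owners, runbooks, and usefulness feedback could feed a simple report for an "alert hygiene" session. The following is a rough sketch under assumed data shapes: `definitions` maps an alert name to a dict with `owner` and `runbook` keys, `feedback` is a list of `(alert_name, was_useful)` pairs, and the 0.5 threshold is an arbitrary illustrative default.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hygiene_report(definitions: Dict[str, dict],
                   feedback: List[Tuple[str, bool]],
                   min_useful_ratio: float = 0.5) -> Dict[str, List[str]]:
    fired = defaultdict(int)
    useful = defaultdict(int)
    for name, was_useful in feedback:
        fired[name] += 1
        if was_useful:
            useful[name] += 1

    report: Dict[str, List[str]] = {}
    for name, meta in definitions.items():
        issues = []
        if not meta.get("owner"):
            issues.append("no owner")
        if not meta.get("runbook"):
            issues.append("no runbook")
        # Only judge usefulness for alerts that actually fired.
        if fired[name] and useful[name] / fired[name] < min_useful_ratio:
            issues.append("mostly not useful")
        if issues:
            report[name] = issues
    return report


definitions = {
    "disk_full": {"owner": "storage", "runbook": "runbooks/disk_full"},
    "cpu_high": {"owner": "", "runbook": "runbooks/cpu_high"},
}
feedback = [("disk_full", True), ("cpu_high", False),
            ("cpu_high", False), ("cpu_high", True)]
report = hygiene_report(definitions, feedback)
```

Here `cpu_high` would be flagged for having no owner and for being useful in only one of its three firings, making it a candidate for tuning or removal.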