Show HN: I built an open-source tool to make on-call suck less

Overall reaction to the tool

  • Many welcome another open‑source, Slack-integrated on-call tool and like the focus on reducing alert fatigue, surfacing context, and providing post‑shift analytics.
  • Some see strong overlap with existing incident/on-call tools and ask how this one differs or how it will compete.
  • A few ask for similar tooling for data/business metrics, and others mention adjacent open‑source projects in the same space.

Slack / IM as alert channels

  • Broad agreement that Slack/Telegram/IM are bad as primary alert mechanisms: messages scroll away, don’t re‑alert, and are easy to miss.
  • Common pattern: send alerts to PagerDuty/OpsGenie (or similar) for paging, and mirror them to Slack/email for collaboration and visibility.
  • Some orgs are on Microsoft Teams or can’t use Slack due to security, reliability, or regulatory concerns, so Slack‑only support is seen as limiting.
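
The "page first, mirror to chat" pattern described above can be sketched roughly as follows. This is a minimal illustration, not any particular tool's API: the `Alert` and `Router` names, the `"page"`/`"info"` severity values, and the in-memory lists standing in for a paging service and a chat channel are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    severity: str  # hypothetical values: "page" or "info"


@dataclass
class Router:
    # In-memory stand-ins for a paging service and a chat channel (assumptions).
    paged: list = field(default_factory=list)
    mirrored: list = field(default_factory=list)

    def route(self, alert: Alert) -> list:
        """Send page-worthy alerts to the pager; mirror everything to chat."""
        destinations = []
        if alert.severity == "page":
            # The pager is the primary mechanism: it can re-alert and escalate.
            self.paged.append(alert.name)
            destinations.append("pager")
        # Chat is only a mirror for collaboration and visibility.
        self.mirrored.append(alert.name)
        destinations.append("chat")
        return destinations
```

The point of the design is that chat never becomes the only destination for something that needs a human to wake up.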

Alert fatigue, culture, and management

  • Many argue on-call problems are mostly cultural/organizational: understaffing, lack of observability, tolerance for noisy alerts, and refusal to prioritize reliability work.
  • Suggested remedies: “no broken windows” culture (features stop when things are broken), clear SLOs, strong incident systems of record, better reporting on alert load, and putting managers on or near the on-call rotation.
  • Others note that in many enterprises IT/ops are seen as a cost center, making change slow and political.

LLMs for alert classification

  • Supporters like using LLMs to classify alerts as noisy vs. actionable, especially to:
    • Reduce cognitive load in the moment.
    • Produce after‑the‑fact data about which alerts are wasteful and should be tuned or removed.
  • Critics see this as a risky band‑aid:
    • Worry about hallucinations or misclassifying mission‑critical alerts.
    • Argue it may entrench bad alert hygiene instead of fixing root causes.
    • Emphasize “assist, don’t decide”; use ML for prioritization and analytics, not for silencing pages autonomously.
  • Several note that even good orgs have structurally noisy but sometimes‑useful alerts; tools that help triage and analyze those can still be valuable.
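
One way to read "assist, don't decide" is that a classifier's guess may label and reorder alerts but may never suppress one. The sketch below assumes the classifier's output arrives as a plain boolean (`predicted_noisy`); the `Triage` type, the severity strings, and the priority numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    deliver: bool   # always True in this sketch: nothing is silenced
    label: str      # advisory hint shown to the responder
    priority: int   # lower number = shown first (hypothetical scale)


def triage(severity: str, predicted_noisy: bool) -> Triage:
    """Annotate an alert with a classifier's guess without ever dropping it."""
    label = "likely-noisy" if predicted_noisy else "likely-actionable"
    if severity == "critical":
        # Critical alerts keep top priority regardless of the prediction.
        return Triage(deliver=True, label=label, priority=0)
    # For everything else the prediction only affects ordering.
    priority = 2 if predicted_noisy else 1
    return Triage(deliver=True, label=label, priority=priority)
```

Because the label is stored alongside the alert, it can also serve as the after-the-fact data supporters mention, without the autonomous silencing critics worry about.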

On-call expectations, pay, and ergonomics

  • Many stress that paying fairly for on-call (cash, overtime, or generous comp time) and setting realistic uptime expectations are crucial to “making on-call suck less.”
  • Some push back on the normalization of constant on-call for non‑critical SaaS, especially when uncompensated.
  • Practical pain points raised: unreliable phone notifications, desire for dedicated hardware or phones, complex scheduling (multiple shifts, holidays, training/shadowing), and calendar integration quirks.

Best practices and alternative approaches

  • Recurrent advice:
    • Only alert on actionable conditions with clear runbooks.
    • Use priority levels; keep “smoke test” / low‑priority alerts distinct from pages.
    • Continuously tune alerts (thresholds, grace periods, auto‑remediation) and run regular “alert hygiene” sessions.
    • Ensure every alert has an owner and gather feedback on whether alerts were actually useful.
  • Some reference prior art in telecom fault/alarm management and note IT is essentially reinventing this, often with less structured data and more ad‑hoc channels like Slack.
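
The advice above about gathering feedback on whether alerts were useful could feed a simple report for an "alert hygiene" session. The following is a rough sketch under stated assumptions: the `(alert_name, was_actionable)` feedback format, the `hygiene_report` function, and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict


def hygiene_report(feedback, min_actionable_ratio=0.5):
    """List alerts whose share of actionable firings falls below a threshold.

    `feedback` is an iterable of (alert_name, was_actionable) pairs, one per
    firing, as recorded by whoever handled the alert (assumed format).
    """
    counts = defaultdict(lambda: [0, 0])  # name -> [fired, actionable]
    for name, was_actionable in feedback:
        counts[name][0] += 1
        if was_actionable:
            counts[name][1] += 1

    report = []
    for name, (fired, actionable) in counts.items():
        ratio = actionable / fired
        if ratio < min_actionable_ratio:
            report.append((name, fired, ratio))

    # Most frequently firing candidates first, ties broken by name.
    report.sort(key=lambda row: (-row[1], row[0]))
    return report
```

Alerts that appear in the report are candidates for tuning or removal; the decision itself stays with the alert's owner.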