Ask HN: What do you monitor on your servers?

What people actually monitor

  • Beyond CPU/RAM/disk, many emphasize:
    • Disk I/O, latency, IOPS, and filesystem inodes.
    • Network throughput, connection counts/states, retransmits, error rates.
    • Swap and memory pressure (page faults, PSI, OOM events), with less focus on raw “RAM used”.
    • Service and unit health: systemd failures, restarts, timers, cron jobs.
    • HTTP metrics: uptime, error rates (4xx/5xx), latency percentiles, throughput.
    • DB metrics: query rates, slow queries, cache/index usage.
    • Security/health: RAID, SMART, TLS/domain expiry, failed logins, firewall events.
    • “Pressure/saturation” metrics and thread/task pool utilization as better indicators than raw CPU/RAM.
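As an illustration of the pressure metrics mentioned above, here is a minimal sketch that parses Linux PSI output. It assumes the standard two-line format of /proc/pressure/memory (a "some" line and a "full" line with avg10/avg60/avg300/total fields). The function names, the sample values, and the 1% threshold are hypothetical choices for illustration, not recommendations from the thread.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Split each "key=value" field and convert the value to a float.
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def memory_under_pressure(
    psi: Dict[str, Dict[str, float]], threshold_pct: float = 1.0
) -> bool:
    """Return True if the 10-second 'some' stall average exceeds the threshold.

    The default threshold is an arbitrary, hypothetical value.
    """
    return psi["some"]["avg10"] > threshold_pct


# Made-up sample data in the PSI file format, used instead of reading /proc
# so the sketch also runs on machines without PSI support.
SAMPLE = (
    "some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.50 avg60=0.10 avg300=0.02 total=65432\n"
)

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"], memory_under_pressure(psi))
```

On a Linux host with PSI enabled, the same parser could be fed the output of `open("/proc/pressure/memory").read()`. The cpu and io files under /proc/pressure use the same line format.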

Application vs server monitoring

  • One camp: server metrics alone are low-value noise once sizing is done; the real value is app metrics, APM, and tracing.
  • Another: you must also monitor hardware/OS (logs, MCEs, disk errors, network) or you’ll misdiagnose weird failures.
  • Widely agreed: monitor both, but app-level SLOs (uptime, latency, error rate) are the primary “is it working?” signals.
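To make the app-level SLO signals above concrete, the following sketch computes an error rate and a latency percentile from a batch of request records. The `Request` type, the nearest-rank percentile method, and the 300 ms target are assumptions chosen for illustration. Real setups usually derive these figures from histograms in a metrics backend rather than from raw samples.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # request duration in milliseconds


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_summary(requests: List[Request], latency_slo_ms: float = 300.0) -> Dict[str, object]:
    """Summarize server error rate and p95 latency against a hypothetical SLO."""
    if not requests:
        raise ValueError("no requests to summarize")
    errors = sum(1 for r in requests if r.status >= 500)
    p95 = percentile([r.latency_ms for r in requests], 95)
    return {
        "error_rate": errors / len(requests),
        "p95_ms": p95,
        "latency_ok": p95 <= latency_slo_ms,
    }


# Made-up sample: 20 requests with latencies 10, 20, ..., 200 ms, one of them a 500.
sample = [Request(200, 10.0 * i) for i in range(1, 20)] + [Request(500, 200.0)]
print(slo_summary(sample))
```

Only 5xx responses count as errors here. Whether 4xx responses should count against an SLO depends on the service, so that choice is left to the reader.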

Tools and ecosystems

  • Very common stacks:
    • Prometheus + Grafana (+ node_exporter, process_exporter, Alertmanager; Mimir/Thanos for scale; Loki for logs).
    • Netdata for “instant everything” on single hosts.
    • Nagios/Icinga/Checkmk/Monit/Zabbix for classic active checks.
    • Commercial APM/monitoring: Datadog, New Relic, Azure App Insights.
    • Homelab/simple: Uptime Kuma, HetrixTools, PRTG, basic scripts + Grafana.
  • VictoriaMetrics gets both praise (performance, simplicity, cheaper storage) and criticism (API/PromQL differences, perceived FUD marketing).

Logs, tracing, and collectors

  • Logs: Loki+promtail, Vector, Fluentd, journald collectors, syslog-based setups; some want a “poor man’s” central logging solution.
  • Tracing is seen as highly valuable; some feel most other telemetry is “noise”.
  • Several recommend single, lightweight host agents (e.g., vmagent, Coroot’s eBPF agent, Telegraf/collectd).

Push vs pull and OTEL

  • Push advocates: easier to scale; central pullers have timing and load issues.
  • Pull advocates: Prometheus-style scraping plus service discovery makes it obvious which targets have gone missing.
  • OpenTelemetry:
    • Proponents: it is an industry standard, avoids reinventing protocols, and lets you build custom distros.
    • Critics: overengineered, opaque errors, immature in some languages; prefer Prometheus/StatsD-style simplicity.
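The pull advocates' point about seeing "who's missing" can be sketched as a comparison between the targets that service discovery says should exist and the targets that were recently scraped. This is a simplified illustration, not Prometheus's implementation (Prometheus itself records an `up` series per target, set to 1 or 0 after each scrape). The target names, timestamps, and the 60-second staleness window are hypothetical.

```python
from typing import Dict, Iterable, List


def missing_targets(
    expected: Iterable[str],
    last_scrape_ts: Dict[str, float],
    now: float,
    max_age_s: float = 60.0,
) -> List[str]:
    """Return expected targets that were never scraped or whose last
    successful scrape is older than max_age_s seconds."""
    missing = []
    for target in sorted(expected):
        ts = last_scrape_ts.get(target)
        if ts is None or now - ts > max_age_s:
            missing.append(target)
    return missing


# Hypothetical inventory from service discovery.
expected = ["web-1:9100", "web-2:9100", "db-1:9100"]
# Hypothetical timestamps (in seconds) of each target's last successful scrape.
last_scrape_ts = {"web-1:9100": 995.0, "db-1:9100": 900.0}

print(missing_targets(expected, last_scrape_ts, now=1000.0))
```

In a push model the collector only sees what arrives, so an equivalent check needs a separately maintained inventory of expected senders, which is the gap pull advocates point to.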

Meta and philosophy

  • Strong advice not to “reinvent the wheel” unless there’s a clear, differentiated vision.
  • Emphasis on actionable metrics and anomaly detection, not “pretty graphs”.
  • Some only monitor cost-relevant or business-impacting metrics; others try to “monitor everything” to ease debugging.