Ask HN: What do you monitor on your servers?
What people actually monitor
- Beyond CPU/RAM/disk, many emphasize:
  - Disk I/O, latency, IOPS, and filesystem inodes.
  - Network throughput, connection counts/states, retransmits, error rates.
  - Swap and memory pressure (page faults, PSI, OOM events), with less focus on raw “RAM used”.
  - Service and unit health: systemd failures, restarts, timers, cron jobs.
  - HTTP metrics: uptime, error rates (4xx/5xx), latency percentiles, throughput.
  - DB metrics: query rates, slow queries, cache/index usage.
  - Security/health: RAID, SMART, TLS/domain expiry, failed logins, firewall events.
  - “Pressure/saturation” metrics and thread/task pool utilization, seen as better indicators than raw CPU/RAM.
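To make the “pressure over usage” point concrete, here is a minimal sketch that parses Linux PSI output, the format exposed in files such as /proc/pressure/memory on kernels with PSI enabled. The sample text, the names parse_psi and memory_pressure_high, and the 1.0 threshold are assumptions for illustration, not values recommended in the thread.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field is "key=value"; split once on "=" and convert to float.
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def memory_pressure_high(psi: dict[str, dict[str, float]],
                         threshold: float = 1.0) -> bool:
    """Return True if the 10-second "some" average exceeds the threshold.

    The threshold is a hypothetical starting point; tune it per workload.
    """
    return psi["some"]["avg10"] > threshold


# Hypothetical sample, shaped like the contents of /proc/pressure/memory.
SAMPLE = """some avg10=1.50 avg60=0.75 avg300=0.10 total=123456
full avg10=0.20 avg60=0.05 avg300=0.00 total=7890"""

psi = parse_psi(SAMPLE)
print(memory_pressure_high(psi))
```

On a real host the text would come from reading the file, for example `open("/proc/pressure/memory").read()`, which only works on Linux systems where PSI is available.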
Application vs server monitoring
- One camp: server metrics alone are low-value noise once sizing is done; the real value is app metrics, APM, and tracing.
- Another: you must also monitor hardware/OS (logs, MCEs, disk errors, network) or you’ll misdiagnose weird failures.
- Widely agreed: monitor both, but app-level SLOs (uptime, latency, error rate) are the primary “is it working?” signals.
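The app-level signals named above (error rate and latency percentiles) can be sketched in a few lines. This is an illustrative calculation over an in-memory list of requests, assuming that only 5xx responses count as errors and that a nearest-rank percentile is acceptable; the function names and sample data are hypothetical, and real setups usually compute these in the metrics backend instead.

```python
import math


def error_rate(statuses: list[int], min_error_status: int = 500) -> float:
    """Fraction of responses whose status code is >= min_error_status."""
    if not statuses:
        return 0.0
    errors = sum(1 for status in statuses if status >= min_error_status)
    return errors / len(statuses)


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request log: status codes and latencies in milliseconds.
statuses = [200, 200, 500, 404]
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(error_rate(statuses))          # 0.25 (one 5xx out of four)
print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 95))  # 100
```

Whether 4xx responses belong in the error budget is a per-service decision; passing `min_error_status=400` counts them too.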
Tools and ecosystems
- Very common stacks:
  - Prometheus + Grafana (+ node_exporter, process_exporter, Alertmanager; Mimir/Thanos for scale; Loki for logs).
  - Netdata for “instant everything” on single hosts.
  - Nagios/Icinga/Checkmk/Monit/Zabbix for classic active checks.
  - Commercial APM/monitoring: Datadog, New Relic, Azure App Insights.
  - Homelab/simple: Uptime Kuma, HetrixTools, PRTG, basic scripts + Grafana.
- VictoriaMetrics gets both praise (performance, simplicity, cheaper storage) and criticism (API/PromQL differences, perceived FUD marketing).
Logs, tracing, and collectors
- Logs: Loki+promtail, Vector, Fluentd, journald collectors, syslog-based setups; some want “poor man’s” central log solutions.
- Tracing is seen as highly valuable; some feel most other telemetry is “noise”.
- Several recommend single, lightweight host agents (e.g., vmagent, Coroot’s eBPF agent, Telegraf/collectd).
Push vs pull and OTEL
- Push advocates: pushing is easier to scale, while central pullers run into timing and load issues.
- Pull advocates: Prometheus-style scraping plus service discovery makes it clear which targets are missing.
- OpenTelemetry:
  - Proponents: it is an industry standard, avoids reinventing protocols, and allows building custom distros.
  - Critics: it is overengineered, produces opaque errors, and is immature in some languages; they prefer Prometheus/StatsD-style simplicity.
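The pull-side argument, that a known target list makes absences visible, comes down to a set comparison. The sketch below assumes a hypothetical list of expected targets (as service discovery would supply) and a hypothetical map of scrape results; it does not call any real Prometheus API.

```python
def missing_targets(expected: set[str], scrape_ok: dict[str, bool]) -> set[str]:
    """Return expected targets that were not scraped or whose scrape failed."""
    return {target for target in expected if not scrape_ok.get(target, False)}


# Hypothetical inventory from service discovery.
expected = {"web-1:9100", "web-2:9100", "db-1:9100"}

# Hypothetical results of the latest scrape round: True means the scrape succeeded.
scrape_ok = {"web-1:9100": True, "web-2:9100": False}

print(sorted(missing_targets(expected, scrape_ok)))
# ['db-1:9100', 'web-2:9100']
```

With a pure push model the receiver has no `expected` set unless one is maintained separately, which is the gap the pull advocates point to.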
Meta and philosophy
- Strong advice not to “reinvent the wheel” unless there’s a clear, differentiated vision.
- Emphasis on actionable metrics and anomaly detection, not “pretty graphs”.
- Some only monitor cost-relevant or business-impacting metrics; others try to “monitor everything” to ease debugging.