2024-08-13

Ask HN: What do you monitor on your servers?

What people actually monitor

Application vs server monitoring

One camp: server metrics alone are low-value noise once sizing is done; the real value is app metrics, APM, and tracing.
Another: you must also monitor hardware/OS (logs, machine check exceptions (MCEs), disk errors, network) or you’ll misdiagnose weird failures.
Widely agreed: monitor both, but app-level SLOs (uptime, latency, error rate) are the primary “is it working?” signals.
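As a rough illustration of the app-level signals mentioned above, the sketch below derives an error rate and a p95 latency from a batch of request records and checks them against targets. The `Request` record, the 1% error budget, and the 300 ms latency budget are assumptions made up for this example, not values from the thread.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    latency_ms: float
    status: int  # HTTP status code


def error_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests that returned a 5xx status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def latency_percentile(requests: Sequence[Request], pct: float) -> float:
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests to summarize")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_met(
    requests: Sequence[Request],
    max_error_rate: float = 0.01,   # assumed target: at most 1% errors
    p95_budget_ms: float = 300.0,   # assumed target: p95 under 300 ms
) -> bool:
    """True when both the error-rate and latency targets hold."""
    return (
        error_rate(requests) <= max_error_rate
        and latency_percentile(requests, 95) <= p95_budget_ms
    )
```

In practice these numbers would usually come from an APM tool or a metrics backend rather than raw records; the point is only that "is it working?" reduces to a few thresholds on user-visible behavior.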

Tools and ecosystems

Very common stacks:
- Prometheus + Grafana (+ node_exporter, process_exporter, Alertmanager; Mimir/Thanos for scale; Loki for logs).
- Netdata for “instant everything” on single hosts.
- Nagios/Icinga/Checkmk/Monit/Zabbix for classic active checks.
- Commercial APM/monitoring: Datadog, New Relic, Azure App Insights.
- Homelab/simple: Uptime Kuma, HetrixTools, PRTG, basic scripts + Grafana.
VictoriaMetrics gets both praise (performance, simplicity, cheaper storage) and criticism (API/PromQL differences, perceived FUD marketing).

Logs, tracing, and collectors

Logs: Loki+promtail, Vector, Fluentd, journald collectors, syslog-based setups; some want “poor man’s” central log solutions.
Tracing seen as highly valuable; some feel most other telemetry is “noise”.
Several recommend single, lightweight host agents (e.g., vmagent, Coroot’s eBPF agent, Telegraf/collectd).

Push vs pull and OTEL

Push advocates: push is easier to scale; central pullers have timing and load issues.
Pull advocates: Prometheus-style scraping plus service discovery makes it clear who’s missing.
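The "who's missing" argument can be sketched in a few lines. In the pull case, the monitor holds an independent list of expected targets, so a host that never reported still shows up as missing. In the push case, the monitor only learns about hosts that have pushed at least once. The function names, target strings, and the 60-second staleness window below are all hypothetical.

```python
from typing import Dict, Set


def missing_targets(discovered: Set[str], scrape_ok: Dict[str, bool]) -> Set[str]:
    """Pull model: targets that service discovery expects but whose
    last scrape failed or never happened."""
    return {t for t in discovered if not scrape_ok.get(t, False)}


def stale_pushers(
    last_push: Dict[str, float], now: float, max_age_s: float = 60.0
) -> Set[str]:
    """Push model: hosts seen before that have gone quiet for longer
    than max_age_s. A host that never pushed is not in last_push and
    therefore cannot be flagged."""
    return {host for host, ts in last_push.items() if now - ts > max_age_s}
```

A push-based setup can close the gap by comparing against an inventory, which is effectively the same service-discovery list the pull model already relies on.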
OpenTelemetry:
- Proponents: it is the industry standard, avoids reinventing protocols, and lets you build custom distros.
- Critics: overengineered, opaque errors, immature in some languages; prefer Prometheus/StatsD-style simplicity.

Meta and philosophy

Strong advice not to “reinvent the wheel” unless there’s a clear, differentiated vision.
Emphasis on actionable metrics and anomaly detection, not “pretty graphs”.
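One simple way to turn a graph into something actionable is a trailing-window z-score: flag a sample when it sits far from the recent mean. This is a minimal sketch of that idea only; the window size and threshold are arbitrary assumptions, and the thread does not endorse any particular detection method.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def anomalies(
    values: Sequence[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indices whose value deviates from the mean of the
    preceding `window` samples by more than `threshold` standard
    deviations."""
    flagged: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: the z-score is undefined, so skip this sample.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

For example, `anomalies([10, 11, 10, 9, 10, 10, 50, 10, 11])` flags only index 6, the spike to 50. Note that the spike then inflates the window's spread for the next few samples, which is one reason production systems tend to use more robust statistics.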
Some only monitor cost-relevant or business-impacting metrics; others try to “monitor everything” to ease debugging.
