Scaling our observability platform by embracing wide events and replacing OTel
Data Volume, Retention, and “Waste”
- Some argue that collecting 100PB of observability data is fundamentally wasteful; most systems “don’t need” more than 60–90 days of logs, and GDPR encourages short retention for anything possibly containing personal data.
- Others counter that logs and traces are essential for compliance, forensics (e.g., discovering past exploitation of a newly found vuln), long-term trends, and rare, slow-burning incidents.
- A further view: storage is now cheap (especially when tiered or offloaded to S3), and discarding observability data to save space is often shortsighted, especially for high-cardinality, unsampled traces.
Log Quality vs Quantity
- Several comments note that logging is often undisciplined: verbose “connection successful” spam, poorly chosen log levels, and little thought given to future use.
- Suggested alternative: treat important “logs” as structured business events or domain data, with explicit modeling and refinement instead of firehosing arbitrary text (a small sketch follows this list).
- Disagreement on how much success noise is useful: some see it as a way to bisect where execution went wrong; others see it as drowning out the failures.
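To make the structured-event suggestion concrete, a minimal sketch; the function and field names are hypothetical, not taken from the thread:

```python
import json
import time

# Hypothetical illustration of the "structured business event" suggestion:
# instead of a free-text line like
#     log.info("connection successful, order %s processed OK", order_id)
# emit one explicitly modeled event whose fields are chosen with future
# queries in mind. Field names here are illustrative assumptions.
def emit_order_processed(order_id: str, amount_cents: int, duration_ms: float) -> None:
    event = {
        "event": "order_processed",
        "order_id": order_id,
        "amount_cents": amount_cents,
        "duration_ms": duration_ms,
        "ts": time.time(),
    }
    print(json.dumps(event))  # stand-in for whatever log shipper/sink is in use

emit_order_processed("ord-42", 1999, 12.7)
```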
Data Representation, OTel, and Efficiency
- Strong criticism of JSON-based and naive “wide log” representations; OpenTelemetry is described as flexible but not designed with efficiency as the first priority.
- Examples given of extreme compression via binary diffs, RLE, and columnar storage; modern metrics/log databases (ClickHouse, VictoriaMetrics, Prometheus-like systems) rely on these tricks to reach sub-byte-per-sample compression (see the sketch after this list).
- The ClickHouse change is summarized as eliminating JSON (de)serialization and doing (almost) zero-copy raw-byte ingestion, drastically reducing CPU usage.
- At petabyte scale, each extra serialization/network hop (e.g., OTel collectors) can cost real money; eliminating a hop can justify dedicated custom ingestion code.
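To ground the sub-byte-per-sample claim, a toy sketch of delta plus run-length encoding on a regularly scraped series; this is illustrative only, not the actual ClickHouse or VictoriaMetrics codecs:

```python
# Why regular time series compress to well under a byte per sample: deltas of
# a regularly spaced timestamp column collapse to a single repeated value,
# which run-length encoding stores in a handful of bytes.

def deltas(values):
    """First value, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def rle(values):
    """Run-length encode a list into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A counter scraped every 15s for an hour: 240 samples.
timestamps = [1_700_000_000 + 15 * i for i in range(240)]
print(rle(deltas(timestamps)))
# -> [[1700000000, 1], [15, 239]]  (two runs describe all 240 timestamps)
```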
ClickHouse vs Postgres and Operational Pain Points
- ClickHouse is praised for analytics on append-only/immutable data (logs, metrics, events, embeddings, archives) with massive speedups over Postgres.
- It’s seen as painful or “full of footguns” for mutable/OLTP workloads; guidance is to keep Postgres for OLTP and use ClickHouse for OLAP.
- Operational complexity of ZooKeeper/ClickHouse Keeper is heavily criticized, especially around cluster restarts and quorum handling.
Logs vs Metrics/Traces and Observability “Maximalism”
- Some see “log everything forever” as observability maximalism—a costly “digital landfill” and security liability, especially with EU personal data.
- Others insist it’s safer to ingest everything and then filter, using:
- severity-based routing (errors to hot store, debug to cheap archive),
- tiered storage (NVMe → HDD → tape/S3),
- ability to re-hydrate archived logs on demand.
- Proposed idea: “attention-weighted retention” – auto-prune log patterns that never appear in queries or alerts; some report large cost savings with query/alert-driven TTLs (a rough TTL sketch follows below).
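A rough sketch of how severity-based retention and tiering can be expressed as ClickHouse TTL rules; the schema, volume name, and intervals are assumptions for illustration, not figures from the article or thread:

```python
# Illustrative only: DEBUG rows are deleted early, everything else ages to a
# cheap volume and is eventually dropped. Assumes a storage policy with a
# 'cold' volume exists; names and intervals are made up for the example.
LOGS_DDL = """
CREATE TABLE logs
(
    ts       DateTime,
    severity LowCardinality(String),
    service  LowCardinality(String),
    body     String
)
ENGINE = MergeTree
ORDER BY (service, ts)
TTL ts + INTERVAL 7 DAY   DELETE WHERE severity = 'DEBUG',
    ts + INTERVAL 30 DAY  TO VOLUME 'cold',
    ts + INTERVAL 180 DAY DELETE
"""

# Apply with any ClickHouse client, e.g. `clickhouse-client --query "$LOGS_DDL"`.
print(LOGS_DDL)
```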
Wide Events Tradeoffs
- Concern: wide events that capture all context in a single record will inflate storage vs classic metrics + traces + sampled logs.
- Counterpoint: when done correctly (one wide event per external request with all relevant fields), they can reduce storage compared to chaotic, multi-line logging, and compress well in ClickHouse.
- Open question (unclear in the thread): how to model, within a single wide event, sub-operations such as outbound HTTP calls that would normally appear as separate spans; one possible shape is sketched below.
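A hedged sketch of what a single wide event per external request might look like, with one possible (not thread-endorsed) way of folding sub-operations in as parallel arrays; all field names are illustrative assumptions:

```python
# One wide event per external request: request-level context as flat columns,
# sub-operations (outbound HTTP calls, DB queries) folded in as parallel
# arrays. Columnar-friendly and compresses well, but loses the per-span
# nesting a trace would give you -- exactly the open question in the thread.
wide_event = {
    # request-level context
    "timestamp": "2024-05-01T12:00:00Z",
    "trace_id": "4bf92f3577b34da6",
    "service": "checkout",
    "route": "POST /orders",
    "status_code": 201,
    "duration_ms": 182.4,
    "user_tier": "pro",
    "region": "eu-west-1",
    # sub-operations as parallel arrays
    "calls.target":      ["payments", "inventory", "postgres"],
    "calls.duration_ms": [95.1, 22.3, 8.7],
    "calls.status":      ["ok", "ok", "ok"],
}
```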
Why ClickHouse over JSON Files / Elasticsearch
- For small-scale historical logs, files may suffice; at 100PB scale they become impractical.
- Columnar, log-optimized databases:
- compress far better than raw JSON (even compressed),
- skip reading irrelevant data, yielding orders-of-magnitude faster queries than grep,
- scale horizontally to query tens/hundreds of petabytes.
- Elasticsearch is acknowledged as strong for full-text search, but feasibility at 100PB (especially RAM for indexing) is questioned.
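Expanding on the “skip reading irrelevant data” point above, a toy model of column pruning, with an in-memory dict of lists standing in for on-disk column files:

```python
# A "table" stored column-by-column. A query that filters on severity and
# returns timestamps touches exactly two columns; the (typically huge) body
# column is never read. grep, by contrast, must scan every byte of every line.
table = {
    "ts":       [1, 2, 3, 4],
    "severity": ["INFO", "ERROR", "INFO", "ERROR"],
    "body":     ["...", "...", "...", "..."],  # imagine multi-KB payloads here
}

def error_timestamps(cols):
    # Reads only 'severity' and 'ts'; 'body' stays untouched.
    return [t for t, s in zip(cols["ts"], cols["severity"]) if s == "ERROR"]

print(error_timestamps(table))  # -> [2, 4]
```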
Crash-Time Collection and OTel
- The article’s claim that OTel is “passive” and captures stdout/stderr even when services are down is challenged as incomplete; many use OTel in fully active modes (e.g., Kubernetes filelog receivers tailing pod logs irrespective of ClickHouse health), as sketched below.
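For reference, a minimal sketch of the active pattern being described: an OTel Collector filelog receiver tailing Kubernetes pod log files, which keeps collecting whether or not the backend is reachable. The config (shown here as a Python string) and the exporter choice are assumptions, not quoted from the thread:

```python
# Minimal, illustrative OTel Collector config using the filelog receiver to
# tail Kubernetes pod logs. The debug exporter is a placeholder; swap in your
# real exporter (e.g. a ClickHouse exporter) in practice.
OTEL_COLLECTOR_CONFIG = """
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]
    start_at: beginning
exporters:
  debug: {}
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [debug]
"""
print(OTEL_COLLECTOR_CONFIG)
```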
Kubernetes Log Aggregation
- Frustration that Kubernetes offers no out-of-the-box “show me everything from this deployment now” view.
- Multiple tools/approaches are suggested (stern, kubetail, k9s, simple scripts) to aggregate pod logs per deployment; a small kubectl-based sketch follows below.
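One shape the “simple scripts” option can take, sketched with kubectl and a label selector; the `app=<name>` label is an assumption, adjust it to match your manifests (or reach for stern/kubetail for nicer output):

```python
import subprocess
import sys

# Stream logs from every pod of a deployment at once via its label selector.
def tail_deployment_logs(app_label: str) -> None:
    subprocess.run(
        [
            "kubectl", "logs",
            "-l", f"app={app_label}",
            "--all-containers=true",
            "--prefix=true",   # prepend pod/container names to each line
            "-f",              # follow
        ],
        check=False,
    )

if __name__ == "__main__":
    tail_deployment_logs(sys.argv[1] if len(sys.argv) > 1 else "my-deployment")
```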
Retention and Compression Numbers
- For ClickHouse’s own platform: 100PB is quoted as raw, uncompressed volume over 180 days.
- With their compression and schema optimization, they report around 15× compression, storing ~6.5PB at rest.
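A quick arithmetic check that the quoted figures are mutually consistent:

```python
# 100 PB raw vs ~6.5 PB at rest implies roughly a 15x effective compression
# ratio, matching the reported "around 15x".
raw_pb = 100
at_rest_pb = 6.5
print(f"effective ratio: {raw_pb / at_rest_pb:.1f}x")  # -> ~15.4x
```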
Open and Unresolved Topics
- Debate over whether industry standards such as OTel (and, by analogy, GraphQL, OpenAPI, etc.) are inherently “half-baked” or simply evolve through trial and error.
- A question is raised about better tooling and techniques for correlating stateful, multi-party workflows (e.g., SFU video calls with complex signaling paths); no concrete “state of the art” answer is provided in the thread.