Scaling our observability platform by embracing wide events and replacing OTel

Data Volume, Retention, and “Waste”

  • Some argue that collecting 100PB of observability data is fundamentally wasteful; most systems “don’t need” more than 60–90 days of logs, and GDPR encourages short retention for anything possibly containing personal data.
  • Others counter that logs and traces are essential for compliance, forensics (e.g., discovering past exploitation of a newly found vuln), long-term trends, and rare, slow-burning incidents.
  • A third view holds that storage is now cheap (especially tiered object storage such as S3) and that discarding observability data to save space is often shortsighted, especially for high-cardinality, unsampled traces.

Log Quality vs Quantity

  • Several comments note logging is often undisciplined: verbose “connection successful” spam, poor log levels, and no thought about future use.
  • Suggested alternative: treat important “logs” as structured business events or domain data, with explicit modeling and refinement instead of firehosing arbitrary text (a sketch follows this list).
  • Disagreement on how much success noise is useful: some see routine success logs as a way to bisect where execution went wrong; others see them as drowning out the failures that matter.
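
As a rough illustration of the “structured business events” suggestion, here is a minimal Python sketch; the event type, field names, and the stdout sink are all illustrative assumptions, not anything specified in the thread.

```python
# Minimal sketch: emit an important "log" as a typed, structured business
# event instead of free-form text. Every name here is illustrative.
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class PaymentCapturedEvent:
    event_type: str
    order_id: str
    amount_cents: int
    currency: str
    duration_ms: float
    timestamp: float

def emit(event) -> None:
    # In practice this would go to an event queue or an events table;
    # printing JSON keeps the sketch self-contained.
    print(json.dumps(asdict(event)))

emit(PaymentCapturedEvent(
    event_type="payment.captured",
    order_id=str(uuid.uuid4()),
    amount_cents=499,
    currency="EUR",
    duration_ms=12.7,
    timestamp=time.time(),
))
```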

Data Representation, OTel, and Efficiency

  • Strong criticism of JSON-based and naive “wide log” representations; OpenTelemetry is described as flexible but not designed with efficiency as a first-class goal.
  • Examples given of extreme compression via binary diffs, RLE, and columnar storage; modern metrics/log databases (ClickHouse, VictoriaMetrics, Prometheus-like systems) rely on these tricks to reach sub-byte-per-sample compression (a toy sketch follows this list).
  • The ClickHouse change is summarized as eliminating JSON (de)serialization and doing (almost) zero-copy raw-byte ingestion, drastically reducing CPU usage.
  • At petabyte scale, each extra serialization/network hop (e.g., OTel collectors) can cost real money; eliminating a hop can justify dedicated custom ingestion code.
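
To make the compression point concrete, here is a toy Python sketch of delta encoding plus run-length encoding over regular metric samples; real engines layer far more elaborate codecs on top of columnar layout, so this only shows why regular timestamps and slowly changing values shrink so dramatically.

```python
# Toy illustration of why columnar metric stores reach sub-byte-per-sample
# sizes: regular timestamps delta-encode to a constant, and slowly changing
# values run-length encode into a handful of (value, count) pairs.
from itertools import groupby

timestamps = list(range(1_700_000_000, 1_700_000_000 + 3600, 15))  # one sample every 15s
values = [7] * 230 + [8] * 10  # a slowly changing gauge, e.g. active replicas

def delta(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def rle(xs):
    return [(v, len(list(g))) for v, g in groupby(xs)]

print(rle(delta(timestamps)))  # [(1700000000, 1), (15, 239)] -- two pairs for 240 samples
print(rle(values))             # [(7, 230), (8, 10)]
```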

ClickHouse vs Postgres and Operational Pain Points

  • ClickHouse is praised for analytics on append-only/immutable data (logs, metrics, events, embeddings, archives) with massive speedups over Postgres.
  • It’s seen as painful or “full of footguns” for mutable/OLTP workloads; the common guidance is to keep Postgres for OLTP and use ClickHouse for OLAP (a table sketch follows this list).
  • Operational complexity of ZooKeeper/ClickHouse Keeper is heavily criticized, especially around cluster restarts and quorum handling.
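
A hedged sketch of the “keep Postgres for OLTP, ClickHouse for OLAP” guidance: an append-only logs table created through the clickhouse-connect Python client. The client usage is standard, but the table name, columns, and ordering key are illustrative assumptions.

```python
# Sketch: an append-only, analytics-oriented logs table in ClickHouse.
# Assumes a local ClickHouse server and the clickhouse-connect package.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # adjust connection details as needed

client.command("""
CREATE TABLE IF NOT EXISTS app_logs
(
    ts         DateTime64(3),
    service    LowCardinality(String),
    severity   LowCardinality(String),
    message    String,
    attributes Map(String, String)
)
ENGINE = MergeTree
ORDER BY (service, severity, ts)
""")
```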

Logs vs Metrics/Traces and Observability “Maximalism”

  • Some see “log everything forever” as observability maximalism—a costly “digital landfill” and security liability, especially with EU personal data.
  • Others insist it’s safer to ingest everything and then filter, using:
    • severity-based routing (errors to hot store, debug to cheap archive),
    • tiered storage (NVMe → HDD → tape/S3),
    • ability to re-hydrate archived logs on demand.
  • Proposed idea of “attention-weighted retention”: auto-prune log patterns that never appear in queries or alerts; some report large cost savings from query/alert-driven TTLs (a sketch follows this list).
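
A minimal sketch of the “attention-weighted retention” idea, assuming pattern-level hit counts can be pulled from the log store’s query/alert logs; the function, thresholds, and pattern names are illustrative only.

```python
# Sketch: give log patterns that nobody queries or alerts on a much shorter
# TTL than patterns that are actually used. Hit counts are hard-coded here;
# in practice they would come from the query log of the log store.
def retention_days(pattern: str, hits: dict, base_days: int = 90, min_days: int = 7) -> int:
    # Unused patterns fall back to the minimum retention.
    return base_days if hits.get(pattern, 0) > 0 else min_days

hits = {"payment.failed": 42, "cache.hit": 0}
for pattern in ("payment.failed", "cache.hit", "connection.successful"):
    print(pattern, retention_days(pattern, hits))
```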

Wide Events Tradeoffs

  • Concern: wide events that capture all context in a single record will inflate storage vs classic metrics + traces + sampled logs.
  • Counterpoint: when done correctly (one wide event per external request with all relevant fields), they can reduce storage compared to chaotic, multi-line logging, and compress well in ClickHouse.
  • Open question (unresolved in the thread): how to model sub-operations, such as outbound HTTP calls that would normally be separate spans, inside a single wide event (one possible shape is sketched below).
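
One possible shape for such a wide event, with sub-operations folded into a nested array on the same record; every field name here is invented for illustration, and this is not presented as the thread’s answer to the open question.

```python
# Sketch: one wide event per external request, with outbound calls that would
# classically be separate spans recorded as a nested array on the same record.
import json, time, uuid

wide_event = {
    "request_id": str(uuid.uuid4()),
    "timestamp": time.time(),
    "http_method": "POST",
    "http_route": "/api/checkout",
    "status_code": 200,
    "duration_ms": 184.2,
    "user_tier": "pro",
    "region": "eu-west-1",
    "outbound_calls": [  # sub-operations kept inside the single wide event
        {"target": "payments-svc", "duration_ms": 92.1, "status": 200},
        {"target": "inventory-svc", "duration_ms": 33.4, "status": 200},
    ],
}
print(json.dumps(wide_event, indent=2))
```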

Why ClickHouse over JSON Files / Elasticsearch

  • For small-scale historical logs, files may suffice; at 100PB scale they become impractical.
  • Columnar, log-optimized databases:
    • compress far better than raw JSON (even compressed),
    • skip reading irrelevant data, yielding orders-of-magnitude faster queries than grep (see the query sketch after this list),
    • scale horizontally to query tens/hundreds of petabytes.
  • Elasticsearch is acknowledged as strong for full-text search, but feasibility at 100PB (especially RAM for indexing) is questioned.
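
A hedged example of the “read only what you need” point, reusing the illustrative app_logs table and clickhouse-connect client from the earlier sketch: the query touches only the columns it references and can prune data parts via the ordering key, which is where the advantage over grepping raw JSON files comes from.

```python
# Sketch: an aggregation that reads only the `service`, `severity`, and `ts`
# columns of the illustrative app_logs table, instead of scanning whole rows.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
result = client.query("""
    SELECT service, count() AS errors
    FROM app_logs
    WHERE severity = 'ERROR'
      AND ts >= now() - INTERVAL 1 DAY
    GROUP BY service
    ORDER BY errors DESC
    LIMIT 20
""")
for service, errors in result.result_rows:
    print(service, errors)
```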

Crash-Time Collection and OTel

  • The article’s claim that OTel is “passive” and captures stdout/stderr even when services are down is challenged as incomplete; many use OTel in fully active modes (e.g., Kubernetes filelog receivers tailing pod logs irrespective of ClickHouse health).

Kubernetes Log Aggregation

  • Frustration that Kubernetes offers no built-in “show me everything from this deployment right now” view.
  • Multiple tools/approaches are suggested (stern, kubetail, k9s, simple scripts) to aggregate pod logs per deployment (one such script is sketched below).
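
In the spirit of the “simple scripts” option, here is a small Python wrapper around kubectl that streams logs from every pod matching a label selector; the namespace and label are placeholders.

```python
# Sketch: aggregate logs for one deployment by streaming from all of its pods.
# Relies only on standard kubectl flags (-l, --all-containers, --prefix, --follow).
import subprocess

subprocess.run([
    "kubectl", "logs",
    "-n", "default",          # namespace (placeholder)
    "-l", "app=myapp",        # label selector matching the deployment's pods (placeholder)
    "--all-containers=true",  # include every container in each pod
    "--prefix",               # prepend pod/container names to each line
    "--follow",               # keep streaming, like `tail -f`
])
```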

Retention and Compression Numbers

  • For ClickHouse’s own platform: 100PB is quoted as raw, uncompressed volume over 180 days.
  • With their compression and schema optimization, they report around 15× compression, storing ~6.5PB at rest (a quick arithmetic check follows).
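
A quick consistency check on the quoted figures, using only the numbers above:

```python
# 100PB raw at roughly 15x compression should land near the quoted ~6.5PB.
raw_pb = 100
compressed_pb = 6.5
print(raw_pb / compressed_pb)  # ~15.4, consistent with the "around 15x" claim
```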

Open and Unresolved Topics

  • Debate over whether industry standards (OTel, GraphQL, OpenAPI, etc.) are inherently “half-baked” or just evolving through trial and error.
  • A question is raised about better tooling and techniques for correlating stateful, multi-party workflows (e.g., SFU video calls with complex signaling paths); no concrete “state of the art” answer is provided in the thread.