Scaling our observability platform by embracing wide events and replacing OTel

Data Volume, Retention, and “Waste”

  • Some argue that collecting 100PB of observability data is fundamentally wasteful; most systems “don’t need” more than 60–90 days of logs, and GDPR encourages short retention for anything possibly containing personal data.
  • Others counter that logs and traces are essential for compliance, forensics (e.g., discovering past exploitation of a newly found vuln), long-term trends, and rare, slow-burning incidents.
  • A third view holds that storage is now cheap (especially tiered object storage such as S3) and that discarding observability data to save space is often shortsighted, especially for high-cardinality, unsampled traces.

Log Quality vs Quantity

  • Several comments note logging is often undisciplined: verbose “connection successful” spam, poor log levels, and no thought about future use.
  • Suggested alternative: treat important “logs” as structured business events or domain data, with explicit modeling and refinement instead of firehosing arbitrary text (a sketch follows this list).
  • Disagreement on how much success noise is useful: some see routine success logs as a way to bisect where execution went wrong; others see them as drowning out the failures that matter.
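
As a rough illustration of the “structured business events” suggestion, here is a minimal Python sketch; the event type, field names, and the stdout sink are all illustrative assumptions, not anything specified in the thread.

```python
# Minimal sketch: emit an important "log" as a typed, structured business
# event instead of free-form text. Every name here is illustrative.
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class PaymentCapturedEvent:
    event_type: str
    order_id: str
    amount_cents: int
    currency: str
    duration_ms: float
    timestamp: float

def emit(event) -> None:
    # In practice this would go to an event queue or an events table;
    # printing JSON keeps the sketch self-contained.
    print(json.dumps(asdict(event)))

emit(PaymentCapturedEvent(
    event_type="payment.captured",
    order_id=str(uuid.uuid4()),
    amount_cents=499,
    currency="EUR",
    duration_ms=12.7,
    timestamp=time.time(),
))
```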

Data Representation, OTel, and Efficiency

  • Strong criticism of JSON-based and naive “wide log” representations; OpenTelemetry is described as flexible but not designed with efficiency as a first-class goal.
  • Examples given of extreme compression via binary diffs, RLE, and columnar storage; modern metrics/log databases (ClickHouse, VictoriaMetrics, Prometheus-like systems) rely on these tricks to reach sub-byte-per-sample compression (a toy sketch follows this list).
  • The ClickHouse change is summarized as eliminating JSON (de)serialization and doing (almost) zero-copy raw-byte ingestion, drastically reducing CPU usage.
  • At petabyte scale, each extra serialization/network hop (e.g., OTel collectors) can cost real money; eliminating a hop can justify dedicated custom ingestion code.
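
To make the compression point concrete, here is a toy Python sketch of delta encoding plus run-length encoding over regular metric samples; real engines layer far more elaborate codecs on top of columnar layout, so this only shows why regular timestamps and slowly changing values shrink so dramatically.

```python
# Toy illustration of why columnar metric stores reach sub-byte-per-sample
# sizes: regular timestamps delta-encode to a constant, and slowly changing
# values run-length encode into a handful of (value, count) pairs.
from itertools import groupby

timestamps = list(range(1_700_000_000, 1_700_000_000 + 3600, 15))  # one sample every 15s
values = [7] * 230 + [8] * 10  # a slowly changing gauge, e.g. active replicas

def delta(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def rle(xs):
    return [(v, len(list(g))) for v, g in groupby(xs)]

print(rle(delta(timestamps)))  # [(1700000000, 1), (15, 239)] -- two pairs for 240 samples
print(rle(values))             # [(7, 230), (8, 10)]
```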

ClickHouse vs Postgres and Operational Pain Points

  • ClickHouse is praised for analytics on append-only/immutable data (logs, metrics, events, embeddings, archives) with massive speedups over Postgres.
  • It’s seen as painful or “full of footguns” for mutable/OLTP workloads; the common guidance is to keep Postgres for OLTP and use ClickHouse for OLAP (a table sketch follows this list).
  • Operational complexity of ZooKeeper/ClickHouse Keeper is heavily criticized, especially around cluster restarts and quorum handling.
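
A hedged sketch of the “keep Postgres for OLTP, ClickHouse for OLAP” guidance: an append-only logs table created through the clickhouse-connect Python client. The client usage is standard, but the table name, columns, and ordering key are illustrative assumptions.

```python
# Sketch: an append-only, analytics-oriented logs table in ClickHouse.
# Assumes a local ClickHouse server and the clickhouse-connect package.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # adjust connection details as needed

client.command("""
CREATE TABLE IF NOT EXISTS app_logs
(
    ts         DateTime64(3),
    service    LowCardinality(String),
    severity   LowCardinality(String),
    message    String,
    attributes Map(String, String)
)
ENGINE = MergeTree
ORDER BY (service, severity, ts)
""")
```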

Logs vs Metrics/Traces and Observability “Maximalism”

  • Some see “log everything forever” as observability maximalism—a costly “digital landfill” and security liability, especially with EU personal data.
  • Others insist it’s safer to ingest everything and then filter, using:
    • severity-based routing (errors to hot store, debug to cheap archive),
    • tiered storage (NVMe → HDD → tape/S3),
    • ability to re-hydrate archived logs on demand.
  • Proposed idea of “attention-weighted retention”: auto-prune log patterns that never appear in queries or alerts; some report large cost savings from query/alert-driven TTLs (a sketch follows this list).
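
A minimal sketch of the “attention-weighted retention” idea, assuming pattern-level hit counts can be pulled from the log store’s query/alert logs; the function, thresholds, and pattern names are illustrative only.

```python
# Sketch: give log patterns that nobody queries or alerts on a much shorter
# TTL than patterns that are actually used. Hit counts are hard-coded here;
# in practice they would come from the query log of the log store.
def retention_days(pattern: str, hits: dict, base_days: int = 90, min_days: int = 7) -> int:
    # Unused patterns fall back to the minimum retention.
    return base_days if hits.get(pattern, 0) > 0 else min_days

hits = {"payment.failed": 42, "cache.hit": 0}
for pattern in ("payment.failed", "cache.hit", "connection.successful"):
    print(pattern, retention_days(pattern, hits))
```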

Wide Events Tradeoffs

  • Concern: wide events that capture all context in a single record will inflate storage vs classic metrics + traces + sampled logs.
  • Counterpoint: when done correctly (one wide event per external request with all relevant fields), they can reduce storage compared to chaotic, multi-line logging, and compress well in ClickHouse.
  • Open question (unresolved in the thread): how to model sub-operations, such as outbound HTTP calls that would normally be separate spans, inside a single wide event (one possible shape is sketched below).
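
One possible shape for such a wide event, with sub-operations folded into a nested array on the same record; every field name here is invented for illustration, and this is not presented as the thread’s answer to the open question.

```python
# Sketch: one wide event per external request, with outbound calls that would
# classically be separate spans recorded as a nested array on the same record.
import json, time, uuid

wide_event = {
    "request_id": str(uuid.uuid4()),
    "timestamp": time.time(),
    "http_method": "POST",
    "http_route": "/api/checkout",
    "status_code": 200,
    "duration_ms": 184.2,
    "user_tier": "pro",
    "region": "eu-west-1",
    "outbound_calls": [  # sub-operations kept inside the single wide event
        {"target": "payments-svc", "duration_ms": 92.1, "status": 200},
        {"target": "inventory-svc", "duration_ms": 33.4, "status": 200},
    ],
}
print(json.dumps(wide_event, indent=2))
```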

Why ClickHouse over JSON Files / Elasticsearch

  • For small-scale historical logs, files may suffice; at 100PB scale they become impractical.
  • Columnar, log-optimized databases:
    • compress far better than raw JSON (even compressed),
    • skip reading irrelevant data, yielding orders-of-magnitude faster queries than grep (see the query sketch after this list),
    • scale horizontally to query tens/hundreds of petabytes.
  • Elasticsearch is acknowledged as strong for full-text search, but feasibility at 100PB (especially RAM for indexing) is questioned.
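
A hedged example of the “read only what you need” point, reusing the illustrative app_logs table and clickhouse-connect client from the earlier sketch: the query touches only the columns it references and can prune data parts via the ordering key, which is where the advantage over grepping raw JSON files comes from.

```python
# Sketch: an aggregation that reads only the `service`, `severity`, and `ts`
# columns of the illustrative app_logs table, instead of scanning whole rows.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
result = client.query("""
    SELECT service, count() AS errors
    FROM app_logs
    WHERE severity = 'ERROR'
      AND ts >= now() - INTERVAL 1 DAY
    GROUP BY service
    ORDER BY errors DESC
    LIMIT 20
""")
for service, errors in result.result_rows:
    print(service, errors)
```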

Crash-Time Collection and OTel

  • The article’s claim that OTel is “passive” and captures stdout/stderr even when services are down is challenged as incomplete; many use OTel in fully active modes (e.g., Kubernetes filelog receivers tailing pod logs irrespective of ClickHouse health).

Kubernetes Log Aggregation

  • Frustration that Kubernetes offers no built-in “show me everything from this deployment right now” view.
  • Multiple tools/approaches are suggested (stern, kubetail, k9s, simple scripts) to aggregate pod logs per deployment (one such script is sketched below).
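
In the spirit of the “simple scripts” option, here is a small Python wrapper around kubectl that streams logs from every pod matching a label selector; the namespace and label are placeholders.

```python
# Sketch: aggregate logs for one deployment by streaming from all of its pods.
# Relies only on standard kubectl flags (-l, --all-containers, --prefix, --follow).
import subprocess

subprocess.run([
    "kubectl", "logs",
    "-n", "default",          # namespace (placeholder)
    "-l", "app=myapp",        # label selector matching the deployment's pods (placeholder)
    "--all-containers=true",  # include every container in each pod
    "--prefix",               # prepend pod/container names to each line
    "--follow",               # keep streaming, like `tail -f`
])
```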

Retention and Compression Numbers

  • For ClickHouse’s own platform: 100PB is quoted as raw, uncompressed volume over 180 days.
  • With their compression and schema optimization, they report around 15× compression, storing ~6.5PB at rest (a quick arithmetic check follows).
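
A quick consistency check on the quoted figures, using only the numbers above:

```python
# 100PB raw at roughly 15x compression should land near the quoted ~6.5PB.
raw_pb = 100
compressed_pb = 6.5
print(raw_pb / compressed_pb)  # ~15.4, consistent with the "around 15x" claim
```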

Open and Unresolved Topics

  • Debate over whether industry standards (OTel, GraphQL, OpenAPI, etc.) are inherently “half-baked” or just evolving through trial and error.
  • A question is raised about better tooling and techniques for correlating stateful, multi-party workflows (e.g., SFU video calls with complex signaling paths); no concrete “state of the art” answer is provided in the thread.