Binance built a 100PB log service with Quickwit

Scale and Use Case for 100 PB of Logs

  • Logs are primarily application/observability logs (what a smaller org might send to Datadog), not blockchain or transaction ledgers.
  • Binance reportedly logs huge volumes from APIs, microservices, HFT bots, and user activity; some commenters estimate trillions of events and millions of orders per second at peak.
  • Use cases include debugging, operational observability, customer support (“what happened to this user 2 months ago?”), and regulatory/audit investigations that may arrive months late.
  • Several commenters are skeptical that sub‑second search and months of full‑fidelity logs are really necessary; others argue finance and security/liability make deep retention and fast search justifiable.

Quickwit vs. Elasticsearch and Other Tech Choices

  • Elasticsearch was seen as too expensive and complex at this scale; Quickwit stores indices directly on object storage (e.g., S3) and queries them in place, with optional RAM caching.
  • Quickwit uses Zstd compression, Lucene-like indexing, and object-storage-friendly dictionaries; building inverted indices is described as strongly CPU‑bound.
  • Reported ingest is ~1.6 PB/day, with ~100 PB of raw logs stored as ~20 PB compressed (≈5:1); some consider this compression weak for logs and point to more aggressive log-specific schemes.
  • Regex search is limited due to the chosen dictionary structure; prefix and tokenized search are supported.
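The prefix-versus-regex point can be illustrated with a generic sorted term dictionary. This is a minimal sketch, not Quickwit's actual implementation: the term list and the helper names `prefix_lookup` and `regex_lookup` are hypothetical. It only shows why a sorted dictionary answers prefix queries with two binary searches, while an arbitrary regex has to test every term.

```python
import bisect
import re

# Hypothetical term dictionary, kept sorted as an on-disk dictionary would be.
TERMS = sorted(["error", "errno", "order", "order_id", "timeout", "trade", "user"])


def prefix_lookup(terms, prefix):
    """Return all terms starting with `prefix` using two binary searches.

    Assumption: terms contain no code points above U+FFFF, so
    prefix + "\uffff" sorts after every term sharing the prefix.
    """
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]


def regex_lookup(terms, pattern):
    """Return all terms fully matching `pattern`.

    Without an automaton-friendly structure, every term must be tested,
    so the cost grows with the size of the whole dictionary.
    """
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


if __name__ == "__main__":
    print(prefix_lookup(TERMS, "order"))
    print(regex_lookup(TERMS, r".*out"))
```

The prefix lookup touches only a contiguous slice of the dictionary, which suits object storage where each read is expensive; the regex lookup scans everything.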

Cost, Storage, and Infrastructure Debates

  • Rough cost estimates: ~US$460k/month for 20 PB on S3 and ~US$100k/month for compute; volume discounts, spot instances, and cheaper storage classes can reduce this.
  • Others argue self‑hosted HDD arrays with erasure coding and colo could be significantly cheaper over 5 years, but with higher operational complexity and IOPS constraints for “hot” data.
  • There is extensive discussion of write amplification from verbose JSON logs; many argue changing logging formats and sampling could save more than sophisticated indexing.
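The storage estimate can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a list price of US$0.023 per GB-month and decimal units (1 PB = 1,000,000 GB); both are assumptions chosen because they reproduce the ~US$460k figure, not numbers confirmed in the thread. The retention figure further assumes ingest stays steady at the reported rate.

```python
# Figures quoted in the discussion.
RAW_PB = 100.0             # uncompressed log volume
COMPRESSION_RATIO = 5.0    # roughly 5:1
INGEST_PB_PER_DAY = 1.6    # reported ingest rate

# Assumptions, not taken from the thread.
USD_PER_GB_MONTH = 0.023   # assumed object-storage list price
GB_PER_PB = 1_000_000      # decimal petabytes


def compressed_pb(raw_pb, ratio):
    """Stored size after compression."""
    return raw_pb / ratio


def monthly_storage_usd(stored_pb, usd_per_gb_month=USD_PER_GB_MONTH):
    """Monthly storage bill at a flat per-GB price, before any discounts."""
    return stored_pb * GB_PER_PB * usd_per_gb_month


def implied_retention_days(raw_pb, ingest_pb_per_day):
    """Days of logs the raw volume represents at a steady ingest rate."""
    return raw_pb / ingest_pb_per_day


if __name__ == "__main__":
    stored = compressed_pb(RAW_PB, COMPRESSION_RATIO)
    print(f"stored: {stored:.0f} PB")
    print(f"storage: ${monthly_storage_usd(stored):,.0f}/month")
    print(f"retention: {implied_retention_days(RAW_PB, INGEST_PB_PER_DAY):.1f} days")
```

Changing `COMPRESSION_RATIO` shows why commenters focus on log format: at a flat price, doubling the ratio halves the storage bill.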

Logs vs. Metrics vs. Traces

  • A large sub‑thread argues that logs should not be the primary source of operational truth: metrics are cheaper, better for SLAs/alerts, and easier to keep organized.
  • Opposing view: for application debugging, forensics, and financial traceability, detailed logs are irreplaceable; metrics and even traces can be derived from them.
  • A commonly suggested compromise: structured logs, strict retention windows, aggressive sampling/tail-sampling (especially for “OK” traces), and separate high‑assurance audit trails for money flows.
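The tail-sampling idea in that compromise can be sketched as a keep-or-drop decision made after a trace has finished. The `Trace` fields, the 1% sample rate, and the 1-second slowness threshold below are hypothetical values chosen for illustration, not settings anyone in the thread reported using.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    status: str         # "OK" or "ERROR" (hypothetical status values)
    duration_ms: float  # total duration of the completed trace


def keep_trace(trace, ok_sample_rate=0.01, slow_ms=1000.0, rng=random.random):
    """Decide whether to retain a completed trace.

    Failed and slow traces are always kept; fast "OK" traces are kept
    with probability `ok_sample_rate`. `rng` is injectable for testing.
    """
    if trace.status != "OK":
        return True
    if trace.duration_ms >= slow_ms:
        return True
    return rng() < ok_sample_rate


if __name__ == "__main__":
    traces = [
        Trace("t1", "ERROR", 12.0),
        Trace("t2", "OK", 2500.0),
        Trace("t3", "OK", 8.0),
    ]
    for t in traces:
        print(t.trace_id, keep_trace(t))
```

Because the decision happens once the outcome is known, the rare failing or slow requests survive at full fidelity while the bulk of uneventful traffic is thinned out.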