Binance built a 100PB log service with Quickwit
Scale and Use Case for 100 PB of Logs
- Logs are primarily application/observability logs (what a smaller org might send to Datadog), not blockchain or transaction ledgers.
- Binance reportedly logs huge volumes from APIs, microservices, HFT bots, and user activity; some estimate trillions of events and millions of orders per second at peak.
- Use cases include debugging, operational observability, customer support (“what happened to this user 2 months ago?”), and regulatory/audit investigations that may arrive months late.
- Several commenters are skeptical that sub‑second search and months of full‑fidelity logs are really necessary; others argue finance and security/liability make deep retention and fast search justifiable.
Quickwit vs. Elasticsearch and Other Tech Choices
- Elasticsearch was seen as too expensive and complex at this scale; Quickwit stores indices directly on object storage (e.g., S3) and queries them in place, with optional RAM caching.
- Quickwit uses Zstd compression, Lucene-like indexing, and object-storage-friendly dictionaries; building inverted indices is described as strongly CPU‑bound.
- Reported ingest: ~1.6 PB/day; the ~100 PB of logs is stored as ~20 PB compressed (≈5:1). Some consider this compression weak for logs and point to more aggressive log-specific schemes.
- Regex search is limited due to the chosen dictionary structure; prefix and tokenized search are supported.
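The ingest and compression figures quoted above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-envelope check, not anything from Binance or Quickwit: it assumes decimal petabytes (10^15 bytes) and treats the 100 PB in the title as the uncompressed total, neither of which the thread confirms. The function names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted in the thread.
# Assumption: decimal units (1 PB = 10**15 bytes); the thread does not say PB vs PiB.
BYTES_PER_PB = 10**15
BYTES_PER_GB = 10**9
SECONDS_PER_DAY = 86_400


def ingest_rate_gb_per_s(pb_per_day: float) -> float:
    """Average sustained ingest rate in GB/s for a given daily volume."""
    return pb_per_day * BYTES_PER_PB / SECONDS_PER_DAY / BYTES_PER_GB


def compression_ratio(raw_pb: float, stored_pb: float) -> float:
    """Ratio of uncompressed size to stored size."""
    return raw_pb / stored_pb


def days_to_accumulate(total_raw_pb: float, pb_per_day: float) -> float:
    """Days of ingest needed to reach a given uncompressed total."""
    return total_raw_pb / pb_per_day


if __name__ == "__main__":
    print(f"ingest: {ingest_rate_gb_per_s(1.6):.1f} GB/s")
    print(f"compression: {compression_ratio(100, 20):.1f}:1")
    print(f"days to reach 100 PB: {days_to_accumulate(100, 1.6):.1f}")
```

Under these assumptions, 1.6 PB/day averages out to roughly 18.5 GB/s, and 100 PB corresponds to about two months of ingest, which lines up with the "what happened to this user 2 months ago?" use case mentioned earlier.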
Cost, Storage, and Infrastructure Debates
- Rough cost estimates: ~US$460k/month for 20 PB on S3 and ~US$100k/month for compute; volume discounts, spot instances, and cheaper storage classes can reduce this.
- Others argue self‑hosted HDD arrays with erasure coding and colo could be significantly cheaper over 5 years, but with higher operational complexity and IOPS constraints for “hot” data.
- There is extensive discussion of write amplification from verbose JSON logs; many argue changing logging formats and sampling could save more than sophisticated indexing.
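The storage estimate above is easy to reproduce. The sketch below assumes a flat rate of US$0.023 per GB-month and decimal units; that rate is an assumption chosen because it reproduces the thread's ~US$460k figure, and real pricing is tiered and negotiable. The `discount` parameter is a hypothetical knob for modelling the discounts commenters mention.

```python
# Rough monthly object-storage cost, reproducing the thread's estimate.
# Assumptions: flat per-GB price (no tiering), decimal units, no request or egress fees.
GB_PER_PB = 1_000_000


def monthly_storage_cost_usd(
    stored_pb: float,
    usd_per_gb_month: float = 0.023,  # assumed flat list-style rate
    discount: float = 0.0,            # hypothetical negotiated discount, 0.0 to 1.0
) -> float:
    """Estimated monthly storage bill in US dollars."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0.0 and 1.0")
    return stored_pb * GB_PER_PB * usd_per_gb_month * (1.0 - discount)


if __name__ == "__main__":
    print(f"20 PB, no discount: ${monthly_storage_cost_usd(20):,.0f}/month")
    print(f"20 PB, 30% discount: ${monthly_storage_cost_usd(20, discount=0.30):,.0f}/month")
```

Because the cost scales linearly with stored bytes, anything that shrinks the stored volume (better compression, leaner log formats, sampling) cuts the bill by the same proportion, which is the core of the argument for fixing logging formats first.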
Logs vs. Metrics vs. Traces
- A large sub‑thread argues that logs should not be the primary source of operational truth: metrics are cheaper, better for SLAs/alerts, and easier to keep organized.
- Opposing view: for application debugging, forensics, and financial traceability, detailed logs are irreplaceable; metrics and even traces can be derived from them.
- Common suggested compromise: structured logs, strict retention windows, aggressive sampling/tail-sampling (especially for “OK” traces), and separate high‑assurance audit trails for money flows.
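The tail-sampling idea in the compromise above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular collector implements it: the function name, the 1% default rate for "OK" traces, and the 1-second slowness threshold are all hypothetical. The decision is made after a trace finishes, so errors and slow requests are always kept, while healthy traces are kept deterministically based on a hash of their ID.

```python
import hashlib


def keep_trace(
    trace_id: str,
    had_error: bool,
    duration_ms: float,
    ok_sample_rate: float = 0.01,       # hypothetical: keep 1% of healthy traces
    slow_threshold_ms: float = 1000.0,  # hypothetical: anything slower is always kept
) -> bool:
    """Decide, once a trace is complete, whether to retain it."""
    # Interesting traces (errors, slow requests) are never dropped.
    if had_error or duration_ms >= slow_threshold_ms:
        return True
    # Hash the trace ID into [0, 1) so every service makes the same decision
    # for the same trace without coordinating.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < ok_sample_rate


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", False, 20.0) for i in range(100_000))
    print(f"kept {kept} of 100000 healthy traces")
```

Hashing the trace ID instead of calling a random number generator keeps the choice reproducible, so all spans of one trace are kept or dropped together. Audit trails for money flows would bypass a sampler like this entirely, as the thread suggests.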