ClickHouse gets lazier and faster: Introducing lazy materialization
Performance & Algorithms for Top‑K Queries
- Many were struck by the demo: finding top‑3 from 150M values in ~70ms with ~3.6 MiB peak memory.
- Discussion dug into why this is plausible:
- Columnar layout only reads the single needed column.
- Back‑of‑envelope bandwidth math (SSD, RAM, cache) shows 600MB can be scanned in far under 70ms with modern hardware.
- Debate over how ClickHouse likely implements top‑k:
- Several participants describe streaming top‑k in O(n) with O(k) memory (e.g., selection algorithms, Floyd–Rivest, heap‑based or buffered approaches).
- A ClickHouse contributor confirms ClickHouse uses a selection algorithm (Floyd–Rivest) plus block‑wise processing (e.g., 64k blocks, “top of tops”).
- Side discussion clarifies differences between:
- Top‑k by value (max/min) vs. top‑k most frequent (which typically needs more space or approximate streaming algorithms).
Lazy / Late Materialization Behavior
- Users note this is a known “late materialization” class of optimizations in column stores, but still non‑trivial to implement correctly and efficiently.
- Confirmed via EXPLAIN that lazy materialization applies even with complex ORDER BY expressions (e.g., weighted random sampling), and shows ~5× speedups in some tests.
- A setting limits the default to small LIMIT values (e.g., 10) to avoid excessive random I/O when many rows are requested; larger limits may hurt in more typical workloads.
- It also works with
LIMIT … OFFSET.
Use Cases, Comparisons & Ecosystem
- Several teams report migrating analytics from Postgres to ClickHouse and seeing dramatic speedups with far less manual tuning.
- Discussion contrasts:
- Columnar OLAP systems like ClickHouse vs. row‑based OLTP systems; participants stress they target different workloads.
- ClickHouse vs. DuckDB/Polars:
- ClickHouse: network server, real‑time analytics, strong performance, more operational complexity.
- DuckDB/Polars: embedded, convenient on desktops; DuckDB’s concurrency and on‑disk format have trade‑offs.
- Some mention embedded ClickHouse (chdb) as a DuckDB‑like option, but note missing static builds and ergonomics in some languages.
Developer Experience, Deployment & Governance
- ClickHouse is widely praised for speed, CLI, SQL‑ish familiarity, and documentation, but some see it as “heavy‑duty” and low‑level to operate.
- Windows‑native support is seen as a barrier for desktop analytics users.
- There is discussion about ClickHouse’s partially proprietary cloud features and whether that dilutes its open‑source story, alongside similar concerns about other vendors in the space.