A sharded DuckDB on 63 nodes runs 1T row aggregation challenge in 5 sec
Sharded / Distributed Query Engines
- Question about open-source sharded planners over DuckDB/SQLite led to mentions of Apache DataFusion Ballista and DeepSeek’s “smallpond” as comparable approaches.
- GizmoEdge itself is not open source; the author intends to mature it into a product. Smallpond is cited as an OSS alternative for similar distributed DuckDB-style workloads.
- Other systems suggested as “already built for this”: Trino, ClickHouse, Spark, BigQuery, and Redshift; some see GizmoEdge as re-implementing a familiar MapReduce-style pattern (worker SQL + combinatorial SQL).
Hardware Scale, Cost, and Practicality
- The cluster used 63 Azure E64pds v6 nodes (64 vCPUs, ~500 GiB RAM each), totaling ~4,000 vCPUs and ~30 TiB RAM.
- Multiple commenters argue this is “overpowered” and question whether it’s cheaper than Snowflake/BigQuery.
- Rough cost math in the thread: about $236/hour on-demand (~$0.33 for a 5-second query) vs a single Snowflake 4XL at ~$384/hour, but critics note this ignores cluster setup, engineering, and always-on costs.
- A single-node DuckDB setup by the same author reportedly did the challenge in ~2 minutes for about $0.10, raising questions about where the scale-out point really pays off.
Challenge Methodology & Fairness
- Key caveat: the 5-second time excludes loading/materializing data. Workers spend 1–2 minutes downloading Parquet from cloud storage and converting to DuckDB files on local NVMe first.
- Some argue this violates the spirit of the “One Trillion Row Challenge,” which they interpret as timing from raw files to result; pre-materializing and then measuring only query latency is called “cheating” or at least misleading.
- Others request explicit cold-vs-hot-cache benchmarks and clearer disclosure; filesystem caching and lack of cache dropping may affect comparability.
Architecture & Implementation Choices
- Each node ran ~16 worker pods (3.8 vCPU, 30 GiB RAM) due to Kubernetes overhead and cloud quota; the author admits shard sizing is heuristic, not fully optimized.
- Workers execute DuckDB queries locally and stream Arrow IPC results back to a central server via WebSockets. The server merges partial results.
- A long subthread debates WebSockets vs raw TCP/UDP:
- Pro-WebSocket arguments: easy framing, TLS termination, existing libraries, multiplexing via HTTP routing.
- Skeptical views: extra protocol complexity, HTTP parser, and SHA-1 for minimal benefit in a non-browser context; alternatives like raw sockets, ZeroMQ, or Arrow Flight are mentioned.
- Filesystem choice (ext4 vs XFS) and Linux page cache behavior are raised as potentially material to performance; reproducibility concerns are noted.
OLAP vs OLTP and Other Databases
- Several comments contrast DuckDB (columnar, OLAP) with OLTP systems like MSSQL, explaining why analytical aggregations can be orders of magnitude faster on OLAP engines.
- DuckDB’s “OLAP-ness” is briefly questioned due to writer blocking readers, but others clarify “online” refers to interactive analytics, not realtime streaming.
- ClickHouse is cited as a market leader in real-time analytics, though some note it still favors throughput over ultra-low-latency ingestion.
- DuckLake is described as solving upserts over data lakes; some confusion remains about what it adds beyond reading Parquet directly.
Use Cases, Robustness, and Skepticism
- One commenter worries that DuckDB’s strength is single-node, one-off analytics and that bolting it into a persistent Kubernetes cluster sidesteps hard problems (fault tolerance, re-planning on failure, multi-query resource management, distributed joins).
- Others see the experiment as a “fun demo” and a proof-of-possibility for edge/observability scenarios, but not yet production-grade.
- A notable criticism is that sustaining this performance implies keeping 30 TiB of RAM and 4,000 vCPUs warm, which many organizations would balk at paying for continuously.
Miscellaneous Technical Points
- COUNT DISTINCT at scale is discussed: approximate HLL-based sketches vs exact bitmap-based methods, with mention of a DuckDB extension.
- Some joking asides: Tableau generating huge queries, quantum-computing hype, and sortbenchmark.org’s insistence on including I/O in benchmarks.