A sharded DuckDB on 63 nodes runs the 1-trillion-row aggregation challenge in ~5 seconds

Sharded / Distributed Query Engines

  • A question about open-source sharded query planners layered over DuckDB/SQLite led to mentions of Apache DataFusion Ballista and DeepSeek’s “smallpond” as comparable approaches.
  • GizmoEdge itself is not open source; the author intends to mature it into a product. Smallpond is cited as an OSS alternative for similar distributed DuckDB-style workloads.
  • Other systems suggested as “already built for this”: Trino, ClickHouse, Spark, BigQuery, and Redshift; some see GizmoEdge as re-implementing a familiar MapReduce-style pattern (per-worker SQL plus a combining SQL step on a coordinator).
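The “worker SQL + combining SQL” pattern the thread describes can be sketched in a few lines. This is a minimal, illustrative stand-in for the coordinator’s merge step, not GizmoEdge’s actual code: it assumes each worker runs a local GROUP BY and returns (station, min, sum, count, max) partials, which the coordinator re-aggregates.

```python
from collections import defaultdict

def combine_partials(worker_results):
    """Merge per-station partial aggregates from many workers.

    Each worker is assumed to return rows of
    (station, min_temp, sum_temp, row_count, max_temp) from a local
    GROUP BY; the coordinator re-aggregates them. Mins/maxes combine
    via min/max, means via (sum, count) -- never by averaging means.
    """
    acc = defaultdict(lambda: [float("inf"), 0.0, 0, float("-inf")])
    for rows in worker_results:
        for station, mn, sm, cnt, mx in rows:
            a = acc[station]
            a[0] = min(a[0], mn)
            a[1] += sm
            a[2] += cnt
            a[3] = max(a[3], mx)
    # Final result: station -> (min, mean, max)
    return {s: (a[0], a[1] / a[2], a[3]) for s, a in sorted(acc.items())}

# Two workers each aggregated a disjoint shard of the data:
worker_a = [("Oslo", -3.0, 10.0, 4, 8.0)]
worker_b = [("Oslo", -1.0, 6.0, 2, 5.0)]
result = combine_partials([worker_a, worker_b])
print(result)  # {'Oslo': (-3.0, 2.6666666666666665, 8.0)}
```

The key property is that the partials form a mergeable state (sums and counts, not means), which is what makes the two-phase MapReduce-style plan correct.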

Hardware Scale, Cost, and Practicality

  • The cluster used 63 Azure E64pds v6 nodes (64 vCPUs, ~500 GiB RAM each), totaling ~4,000 vCPUs (63 × 64 = 4,032) and ~30 TiB of RAM.
  • Multiple commenters argue this is “overpowered” and question whether it’s cheaper than Snowflake/BigQuery.
  • Rough cost math in the thread: about $236/hour on-demand (~$0.33 for a 5-second query) vs a single Snowflake 4XL at ~$384/hour, but critics note this ignores cluster setup, engineering, and always-on costs.
  • A single-node DuckDB setup by the same author reportedly did the challenge in ~2 minutes for about $0.10, raising questions about where the scale-out point really pays off.
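The thread’s cost figures are easy to reproduce. The $236/hour cluster rate and 5-second query time are from the discussion; the per-node rate is derived here, not quoted:

```python
cluster_rate_per_hour = 236.0          # thread's on-demand estimate for 63 nodes
per_node = cluster_rate_per_hour / 63  # derived: ~ $3.75 per node-hour
query_seconds = 5
cost_per_query = cluster_rate_per_hour / 3600 * query_seconds
print(f"{per_node:.2f} $/node/h, {cost_per_query:.2f} $/query")
# 3.75 $/node/h, 0.33 $/query
```

Set against the ~$0.10, ~2-minute single-node run, the cluster buys latency rather than cost efficiency — which is exactly the scale-out question the thread raises.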

Challenge Methodology & Fairness

  • Key caveat: the 5-second time excludes loading/materializing data. Workers spend 1–2 minutes downloading Parquet from cloud storage and converting to DuckDB files on local NVMe first.
  • Some argue this violates the spirit of the “One Trillion Row Challenge,” which they interpret as timing from raw files to result; pre-materializing and then measuring only query latency is called “cheating” or at least misleading.
  • Others request explicit cold-vs-hot-cache benchmarks and clearer disclosure; filesystem caching and lack of cache dropping may affect comparability.
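The cold-vs-hot request amounts to timing each phase separately instead of reporting only the final query. A generic stdlib sketch of such a harness, with hypothetical stand-in callables for the real steps (download Parquet, materialize to DuckDB on NVMe, run the aggregation):

```python
import time

def bench(phases):
    """Time named phases separately so load/materialization time is
    reported alongside query time instead of being excluded.
    `phases` is an ordered list of (name, callable)."""
    timings = {}
    for name, fn in phases:
        t0 = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - t0
    return timings

# Hypothetical stand-ins for the real workload steps:
timings = bench([
    ("load", lambda: time.sleep(0.02)),   # stands in for data materialization
    ("query", lambda: time.sleep(0.01)),  # stands in for the hot-cache query
])
print({k: round(v, 3) for k, v in timings.items()})
```

A true cold-cache number would additionally require dropping the Linux page cache between runs (e.g. via /proc/sys/vm/drop_caches), which is the disclosure commenters are asking for.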

Architecture & Implementation Choices

  • Each node ran ~16 worker pods (3.8 vCPUs, 30 GiB RAM each) to leave headroom for Kubernetes overhead and stay within cloud quotas; the author admits shard sizing is heuristic, not fully optimized.
  • Workers execute DuckDB queries locally and stream Arrow IPC results back to a central server via WebSockets. The server merges partial results.
  • A long subthread debates WebSockets vs raw TCP/UDP:
    • Pro-WebSocket arguments: easy framing, TLS termination, existing libraries, multiplexing via HTTP routing.
    • Skeptical views: extra protocol complexity (HTTP upgrade parsing, the SHA-1 handshake) for minimal benefit in a non-browser context; raw sockets, ZeroMQ, and Arrow Flight are mentioned as alternatives.
  • Filesystem choice (ext4 vs XFS) and Linux page cache behavior are raised as potentially material to performance; reproducibility concerns are noted.
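The core of the WebSockets-vs-raw-TCP debate is message framing: WebSocket gives you message boundaries for free, while raw TCP is a byte stream and needs explicit framing. A minimal stdlib sketch of the length-prefix framing a raw-socket alternative would have to hand-roll (the Arrow chunk names are illustrative placeholders):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a message with its 4-byte big-endian length --
    the hand-rolled equivalent of WebSocket's built-in framing."""
    return struct.pack("!I", len(payload)) + payload

def unframe(stream: bytes):
    """Split a concatenated byte stream back into whole messages."""
    msgs, off = [], 0
    while off + 4 <= len(stream):
        (n,) = struct.unpack_from("!I", stream, off)
        if off + 4 + n > len(stream):
            break  # partial message still in flight
        msgs.append(stream[off + 4 : off + 4 + n])
        off += 4 + n
    return msgs

# Two result chunks survive concatenation on the wire:
wire = frame(b"arrow-chunk-1") + frame(b"arrow-chunk-2")
print(unframe(wire))  # [b'arrow-chunk-1', b'arrow-chunk-2']
```

This is essentially the trade the thread debates: a few lines of framing code plus your own TLS and routing, versus taking WebSocket’s framing, TLS termination, and HTTP routing off the shelf.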

OLAP vs OLTP and Other Databases

  • Several comments contrast DuckDB (columnar, OLAP) with OLTP systems like MSSQL, explaining why analytical aggregations can be orders of magnitude faster on OLAP engines.
  • DuckDB’s “OLAP-ness” is briefly questioned because writers block readers, but others clarify that “online” refers to interactive analytics, not realtime streaming.
  • ClickHouse is cited as a market leader in real-time analytics, though some note it still favors throughput over ultra-low-latency ingestion.
  • DuckLake is described as solving upserts over data lakes; some confusion remains about what it adds beyond reading Parquet directly.
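The OLAP-vs-OLTP speed gap in the thread comes down largely to data layout: a columnar engine scans only the column an aggregate needs, stored contiguously, while a row store must walk every record object. A toy stdlib illustration of the two layouts (no benchmark claims implied):

```python
# Row layout (OLTP-style): each record is an object; an aggregate must
# visit every record and pull one field out of each.
rows = [{"id": i, "city": "x", "temp": float(i % 50)} for i in range(1000)]
row_sum = sum(r["temp"] for r in rows)

# Columnar layout (OLAP-style): the 'temp' column is one contiguous
# sequence, so the same aggregate is a tight scan over only the data
# it needs -- and it vectorizes and compresses far better.
temp_col = [float(i % 50) for i in range(1000)]
col_sum = sum(temp_col)

assert row_sum == col_sum  # same answer; the layouts differ in scan cost
print(col_sum)  # 24500.0
```

In a real engine the gap is amplified by vectorized execution, compression, and cache locality, which is why the thread reports order-of-magnitude differences against MSSQL-style row stores.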

Use Cases, Robustness, and Skepticism

  • One commenter worries that DuckDB’s strength is single-node, one-off analytics and that bolting it into a persistent Kubernetes cluster sidesteps hard problems (fault tolerance, re-planning on failure, multi-query resource management, distributed joins).
  • Others see the experiment as a “fun demo” and a proof-of-possibility for edge/observability scenarios, but not yet production-grade.
  • A notable criticism is that sustaining this performance implies keeping 30 TiB of RAM and 4,000 vCPUs warm, which many organizations would balk at paying for continuously.

Miscellaneous Technical Points

  • COUNT DISTINCT at scale is discussed: approximate HLL-based sketches vs exact bitmap-based methods, with mention of a DuckDB extension.
  • Some joking asides: Tableau generating huge queries, quantum-computing hype, and sortbenchmark.org’s insistence on including I/O in benchmarks.
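On the COUNT DISTINCT point: DuckDB’s built-in approx_count_distinct uses a HyperLogLog sketch. As an illustration of the fixed-memory sketching idea, here is a K-minimum-values estimator — a simpler cousin of HLL, not DuckDB’s actual algorithm — written with the stdlib only:

```python
import hashlib
import heapq

def kmv_estimate(items, k=512):
    """Approximate COUNT(DISTINCT) with a K-minimum-values sketch:
    hash each item to [0, 1), keep only the k smallest distinct hash
    values, and infer cardinality from how tightly packed they are.
    Fixed memory, like HLL (DuckDB's approx_count_distinct uses
    HyperLogLog, a related but different scheme)."""
    heap = []        # max-heap via negation: holds the k smallest hashes
    members = set()  # dedupe repeated items / hash values
    for item in items:
        h = hashlib.sha1(str(item).encode()).digest()
        v = int.from_bytes(h[:8], "big") / 2.0**64
        if v in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:
            members.discard(-heapq.heapreplace(heap, -v))
            members.add(v)
    if len(heap) < k:
        return len(heap)            # fewer than k distinct values: exact
    return int((k - 1) / -heap[0])  # k-th smallest hash ~ k/n density

print(kmv_estimate(range(100_000), k=512))
```

Relative error scales like 1/sqrt(k), which is the same memory-for-accuracy trade the thread contrasts with exact bitmap-based methods.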