A sharded DuckDB on 63 nodes runs the 1-trillion-row aggregation challenge in ~5 seconds

Sharded / Distributed Query Engines

  • A question about open-source sharded query planners layered over DuckDB/SQLite led to mentions of Apache DataFusion Ballista and DeepSeek’s “smallpond” as comparable approaches.
  • GizmoEdge itself is not open source; the author intends to mature it into a product. Smallpond is cited as an OSS alternative for similar distributed DuckDB-style workloads.
  • Other systems suggested as “already built for this”: Trino, ClickHouse, Spark, BigQuery, and Redshift; some see GizmoEdge as re-implementing a familiar MapReduce-style pattern (per-worker SQL plus a combining SQL step on a coordinator).
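The “worker SQL + combining SQL” pattern the thread describes can be sketched in a few lines. This is a minimal, illustrative stand-in for the coordinator’s merge step, not GizmoEdge’s actual code: it assumes each worker runs a local GROUP BY and returns (station, min, sum, count, max) partials, which the coordinator re-aggregates.

```python
from collections import defaultdict

def combine_partials(worker_results):
    """Merge per-station partial aggregates from many workers.

    Each worker is assumed to return rows of
    (station, min_temp, sum_temp, row_count, max_temp) from a local
    GROUP BY; the coordinator re-aggregates them. Mins/maxes combine
    via min/max, means via (sum, count) -- never by averaging means.
    """
    acc = defaultdict(lambda: [float("inf"), 0.0, 0, float("-inf")])
    for rows in worker_results:
        for station, mn, sm, cnt, mx in rows:
            a = acc[station]
            a[0] = min(a[0], mn)
            a[1] += sm
            a[2] += cnt
            a[3] = max(a[3], mx)
    # Final result: station -> (min, mean, max)
    return {s: (a[0], a[1] / a[2], a[3]) for s, a in sorted(acc.items())}

# Two workers each aggregated a disjoint shard of the data:
worker_a = [("Oslo", -3.0, 10.0, 4, 8.0)]
worker_b = [("Oslo", -1.0, 6.0, 2, 5.0)]
result = combine_partials([worker_a, worker_b])
print(result)  # {'Oslo': (-3.0, 2.6666666666666665, 8.0)}
```

The key property is that the partials form a mergeable state (sums and counts, not means), which is what makes the two-phase MapReduce-style plan correct.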

Hardware Scale, Cost, and Practicality

  • The cluster used 63 Azure E64pds v6 nodes (64 vCPUs, ~500 GiB RAM each), totaling ~4,000 vCPUs (63 × 64 = 4,032) and ~30 TiB of RAM.
  • Multiple commenters argue this is “overpowered” and question whether it’s cheaper than Snowflake/BigQuery.
  • Rough cost math in the thread: about $236/hour on-demand (~$0.33 for a 5-second query) vs a single Snowflake 4XL at ~$384/hour, but critics note this ignores cluster setup, engineering, and always-on costs.
  • A single-node DuckDB setup by the same author reportedly did the challenge in ~2 minutes for about $0.10, raising questions about where the scale-out point really pays off.
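The thread’s cost figures are easy to reproduce. The $236/hour cluster rate and 5-second query time are from the discussion; the per-node rate is derived here, not quoted:

```python
cluster_rate_per_hour = 236.0          # thread's on-demand estimate for 63 nodes
per_node = cluster_rate_per_hour / 63  # derived: ~ $3.75 per node-hour
query_seconds = 5
cost_per_query = cluster_rate_per_hour / 3600 * query_seconds
print(f"{per_node:.2f} $/node/h, {cost_per_query:.2f} $/query")
# 3.75 $/node/h, 0.33 $/query
```

Set against the ~$0.10, ~2-minute single-node run, the cluster buys latency rather than cost efficiency — which is exactly the scale-out question the thread raises.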

Challenge Methodology & Fairness

  • Key caveat: the 5-second time excludes loading/materializing data. Workers spend 1–2 minutes downloading Parquet from cloud storage and converting to DuckDB files on local NVMe first.
  • Some argue this violates the spirit of the “One Trillion Row Challenge,” which they interpret as timing from raw files to result; pre-materializing and then measuring only query latency is called “cheating” or at least misleading.
  • Others request explicit cold-vs-hot-cache benchmarks and clearer disclosure; filesystem caching and lack of cache dropping may affect comparability.
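The cold-vs-hot request amounts to timing each phase separately instead of reporting only the final query. A generic stdlib sketch of such a harness, with hypothetical stand-in callables for the real steps (download Parquet, materialize to DuckDB on NVMe, run the aggregation):

```python
import time

def bench(phases):
    """Time named phases separately so load/materialization time is
    reported alongside query time instead of being excluded.
    `phases` is an ordered list of (name, callable)."""
    timings = {}
    for name, fn in phases:
        t0 = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - t0
    return timings

# Hypothetical stand-ins for the real workload steps:
timings = bench([
    ("load", lambda: time.sleep(0.02)),   # stands in for data materialization
    ("query", lambda: time.sleep(0.01)),  # stands in for the hot-cache query
])
print({k: round(v, 3) for k, v in timings.items()})
```

A true cold-cache number would additionally require dropping the Linux page cache between runs (e.g. via /proc/sys/vm/drop_caches), which is the disclosure commenters are asking for.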

Architecture & Implementation Choices

  • Each node ran ~16 worker pods (3.8 vCPUs, 30 GiB RAM each) to leave headroom for Kubernetes overhead and stay within cloud quotas; the author admits shard sizing is heuristic, not fully optimized.
  • Workers execute DuckDB queries locally and stream Arrow IPC results back to a central server via WebSockets. The server merges partial results.
  • A long subthread debates WebSockets vs raw TCP/UDP:
    • Pro-WebSocket arguments: easy framing, TLS termination, existing libraries, multiplexing via HTTP routing.
    • Skeptical views: extra protocol complexity (HTTP upgrade parsing, the SHA-1 handshake) for minimal benefit in a non-browser context; raw sockets, ZeroMQ, and Arrow Flight are mentioned as alternatives.
  • Filesystem choice (ext4 vs XFS) and Linux page cache behavior are raised as potentially material to performance; reproducibility concerns are noted.
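The core of the WebSockets-vs-raw-TCP debate is message framing: WebSocket gives you message boundaries for free, while raw TCP is a byte stream and needs explicit framing. A minimal stdlib sketch of the length-prefix framing a raw-socket alternative would have to hand-roll (the Arrow chunk names are illustrative placeholders):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a message with its 4-byte big-endian length --
    the hand-rolled equivalent of WebSocket's built-in framing."""
    return struct.pack("!I", len(payload)) + payload

def unframe(stream: bytes):
    """Split a concatenated byte stream back into whole messages."""
    msgs, off = [], 0
    while off + 4 <= len(stream):
        (n,) = struct.unpack_from("!I", stream, off)
        if off + 4 + n > len(stream):
            break  # partial message still in flight
        msgs.append(stream[off + 4 : off + 4 + n])
        off += 4 + n
    return msgs

# Two result chunks survive concatenation on the wire:
wire = frame(b"arrow-chunk-1") + frame(b"arrow-chunk-2")
print(unframe(wire))  # [b'arrow-chunk-1', b'arrow-chunk-2']
```

This is essentially the trade the thread debates: a few lines of framing code plus your own TLS and routing, versus taking WebSocket’s framing, TLS termination, and HTTP routing off the shelf.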

OLAP vs OLTP and Other Databases

  • Several comments contrast DuckDB (columnar, OLAP) with OLTP systems like MSSQL, explaining why analytical aggregations can be orders of magnitude faster on OLAP engines.
  • DuckDB’s “OLAP-ness” is briefly questioned because writers block readers, but others clarify that “online” refers to interactive analytics, not realtime streaming.
  • ClickHouse is cited as a market leader in real-time analytics, though some note it still favors throughput over ultra-low-latency ingestion.
  • DuckLake is described as solving upserts over data lakes; some confusion remains about what it adds beyond reading Parquet directly.
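The OLAP-vs-OLTP speed gap in the thread comes down largely to data layout: a columnar engine scans only the column an aggregate needs, stored contiguously, while a row store must walk every record object. A toy stdlib illustration of the two layouts (no benchmark claims implied):

```python
# Row layout (OLTP-style): each record is an object; an aggregate must
# visit every record and pull one field out of each.
rows = [{"id": i, "city": "x", "temp": float(i % 50)} for i in range(1000)]
row_sum = sum(r["temp"] for r in rows)

# Columnar layout (OLAP-style): the 'temp' column is one contiguous
# sequence, so the same aggregate is a tight scan over only the data
# it needs -- and it vectorizes and compresses far better.
temp_col = [float(i % 50) for i in range(1000)]
col_sum = sum(temp_col)

assert row_sum == col_sum  # same answer; the layouts differ in scan cost
print(col_sum)  # 24500.0
```

In a real engine the gap is amplified by vectorized execution, compression, and cache locality, which is why the thread reports order-of-magnitude differences against MSSQL-style row stores.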

Use Cases, Robustness, and Skepticism

  • One commenter worries that DuckDB’s strength is single-node, one-off analytics and that bolting it into a persistent Kubernetes cluster sidesteps hard problems (fault tolerance, re-planning on failure, multi-query resource management, distributed joins).
  • Others see the experiment as a “fun demo” and a proof-of-possibility for edge/observability scenarios, but not yet production-grade.
  • A notable criticism is that sustaining this performance implies keeping 30 TiB of RAM and 4,000 vCPUs warm, which many organizations would balk at paying for continuously.

Miscellaneous Technical Points

  • COUNT DISTINCT at scale is discussed: approximate HLL-based sketches vs exact bitmap-based methods, with mention of a DuckDB extension.
  • Some joking asides: Tableau generating huge queries, quantum-computing hype, and sortbenchmark.org’s insistence on including I/O in benchmarks.
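On the COUNT DISTINCT point: DuckDB’s built-in approx_count_distinct uses a HyperLogLog sketch. As an illustration of the fixed-memory sketching idea, here is a K-minimum-values estimator — a simpler cousin of HLL, not DuckDB’s actual algorithm — written with the stdlib only:

```python
import hashlib
import heapq

def kmv_estimate(items, k=512):
    """Approximate COUNT(DISTINCT) with a K-minimum-values sketch:
    hash each item to [0, 1), keep only the k smallest distinct hash
    values, and infer cardinality from how tightly packed they are.
    Fixed memory, like HLL (DuckDB's approx_count_distinct uses
    HyperLogLog, a related but different scheme)."""
    heap = []        # max-heap via negation: holds the k smallest hashes
    members = set()  # dedupe repeated items / hash values
    for item in items:
        h = hashlib.sha1(str(item).encode()).digest()
        v = int.from_bytes(h[:8], "big") / 2.0**64
        if v in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:
            members.discard(-heapq.heapreplace(heap, -v))
            members.add(v)
    if len(heap) < k:
        return len(heap)            # fewer than k distinct values: exact
    return int((k - 1) / -heap[0])  # k-th smallest hash ~ k/n density

print(kmv_estimate(range(100_000), k=512))
```

Relative error scales like 1/sqrt(k), which is the same memory-for-accuracy trade the thread contrasts with exact bitmap-based methods.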