650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark

Scope of Single-Node vs. Distributed Tools

  • Many commenters argue that modern single-node engines (DuckDB, Polars, ClickHouse, etc.) can comfortably handle hundreds of GB to ~1 TB on a typical box; you often don’t need Spark until you’re in the multi‑TB or multi‑user regime (see the sketch after this list).
  • Spark is seen as overused “by default,” especially when the dataset is small enough that a well-written single-machine job (or even CLI tools) would suffice.
  • At the same time, several point out that once you have lots of concurrent jobs, SLAs, or multi-stage pipelines, distributed systems still make sense even for moderately sized datasets.
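
To make the single-node argument concrete, here is a minimal sketch of the kind of one-box job commenters have in mind: DuckDB aggregating a multi-hundred-GB Parquet dataset straight from S3, no cluster involved. The bucket, path, and column names are hypothetical placeholders, and S3 credentials are assumed to be available in the environment.

    # Sketch only: bucket/path/columns are hypothetical.
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL httpfs")  # S3 support
    con.sql("LOAD httpfs")

    # A straightforward group-by over a large Parquet dataset; DuckDB streams
    # the files and only materializes the aggregate, so this fits on one box.
    con.sql("""
        SELECT event_date,
               count(*)    AS events,
               sum(amount) AS total_amount
        FROM read_parquet('s3://example-bucket/events/*.parquet')
        GROUP BY event_date
        ORDER BY event_date
    """).show()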

IO, Network, and S3 vs. Local Storage

  • Many think the benchmark is fundamentally NIC/S3‑bound, not CPU‑bound: a 10 Gbps EC2 instance makes ~9 minutes a hard lower bound just to read 650 GB from S3 (see the back-of-envelope check after this list).
  • Column pruning means the query likely read far less than the nominal 650 GB, further complicating interpretation.
  • Local NVMe is repeatedly described as vastly faster and cheaper than S3 for this kind of workload; a decent desktop could likely outperform the chosen cloud setup.
  • Several stress working out theoretical resource limits (network, disk, RAM) before attributing performance differences to the engine.
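
A quick back-of-envelope check of the 10 Gbps / ~9 minute claim, assuming the nominal 650 GB actually crosses the NIC at line rate:

    # Theoretical lower bound for moving 650 GB over a 10 Gbps link,
    # before any CPU work, S3 request overhead, or throttling.
    data_gb = 650    # nominal dataset size in gigabytes
    nic_gbps = 10    # instance network bandwidth in gigabits per second

    seconds = data_gb * 8 / nic_gbps   # GB -> gigabits, then divide by Gb/s
    print(f"~{seconds / 60:.1f} minutes just to read the data")   # ~8.7 minutes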

Data Formats, Catalogs, and Engine Quirks

  • Polars’ Delta Lake support depends on delta-rs, which currently lacks deletion vector support (see the sketch after this list).
  • DuckDB’s new “DuckLake” catalog sparks debate:
    • Pro: RDBMS-backed metadata gives simple, strong ACID semantics and good performance.
    • Con: Needing a SQL database as the catalog undermines the “just files” simplicity that attracted people to Parquet; table formats with file-based metadata (e.g., Iceberg) are cited as alternatives, with their own concurrency trade-offs.
  • Some mention edge‑case limitations of DuckDB when spilling to disk and that DuckLake’s data inlining / flush-to-Parquet features are still maturing.
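
A minimal sketch of the Polars path discussed above: scanning a Delta table on S3 with pl.scan_delta, which goes through delta-rs (hence the deletion-vector caveat). The table URI and column names are hypothetical.

    # Sketch only: URI and columns are hypothetical; requires the `deltalake`
    # (delta-rs) package, and will not read tables that rely on deletion vectors.
    import polars as pl

    lazy = pl.scan_delta("s3://example-bucket/delta/events")

    summary = (
        lazy
        .select("event_date", "amount")   # projection pushdown: only these columns are read
        .group_by("event_date")
        .agg(pl.col("amount").sum().alias("total_amount"))
        .collect()
    )
    print(summary)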

How “Big” is 650 GB?

  • Opinions diverge: some call 650 GB trivial (“fits in RAM/a phone”), others work with PB‑scale S3 footprints.
  • Others counter that most real-world “big data” deployments are far smaller than a petabyte, and that 650 GB is a very relevant scale for typical companies.
  • Critiques note the benchmark uses a simple aggregation over one column that fits in memory; results may not generalize to complex joins or truly larger‑than‑memory workloads.
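
For contrast, a sketch of the kind of query the benchmark does not exercise: a join between two large tables that can exceed RAM and force spilling to disk. Paths, columns, and limits are hypothetical; memory_limit and temp_directory are real DuckDB settings used here to constrain and observe out-of-core behavior.

    # Sketch only: paths and columns are hypothetical.
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL httpfs")
    con.sql("LOAD httpfs")
    con.sql("SET memory_limit = '16GB'")                     # cap RAM usage
    con.sql("SET temp_directory = '/mnt/nvme/duck_spill'")   # spill to fast local disk

    # A large join plus aggregation; if the hash join exceeds the memory limit,
    # DuckDB spills intermediate state to temp_directory.
    con.sql("""
        SELECT c.country, sum(e.amount) AS total_amount
        FROM read_parquet('s3://example-bucket/events/*.parquet')    AS e
        JOIN read_parquet('s3://example-bucket/customers/*.parquet') AS c
          ON e.customer_id = c.customer_id
        GROUP BY c.country
    """).show()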

Organizational, Cost, and Platform Considerations

  • Distributed platforms (Spark/Databricks, Snowflake, Trino, etc.) are defended for: managed operations, governance, multi-team access, and integrations—not just raw speed.
  • Several stories describe Databricks or Snowflake chosen for “big vendor” comfort, sometimes followed by sticker shock and re‑architecture.
  • Some attribute cluster adoption partly to resume-padding and “big impressive systems,” while others emphasize real benefits of managed, ephemeral query clusters and data catalogs.