650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark
Scope of Single-Node vs. Distributed Tools
- Many commenters argue that modern single-node engines (DuckDB, Polars, ClickHouse, etc.) can comfortably handle hundreds of GB to ~1 TB on a typical box; you often don’t need Spark until you’re in the multi‑TB or multi‑user regime.
- Spark is seen as overused “by default,” especially when the dataset is small enough that a well-written single-machine job (or even CLI tools) would suffice.
- At the same time, several commenters point out that once you have many concurrent jobs, SLAs, or multi-stage pipelines, distributed systems still make sense even for moderately sized datasets.
I/O, Network, and S3 vs. Local Storage
- Many think the benchmark is fundamentally NIC/S3‑bound, not CPU‑bound: on a 10 Gbps EC2 instance, simply reading 650 GB from S3 imposes a hard lower bound of roughly 9 minutes, regardless of engine (see the back-of-envelope sketch after this list).
- Column pruning means the query likely read far less than the nominal 650 GB, further complicating interpretation.
- Local NVMe is repeatedly described as vastly faster and cheaper than S3 for this kind of workload; a decent desktop could likely outperform the chosen cloud setup.
- Several commenters stress working out the theoretical resource limits (network, disk, RAM) of the setup before attributing performance differences to the engine.
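As a rough illustration of why the network is the binding constraint, here is a back-of-envelope calculation in Python. The 650 GB full-scan size and 10 Gbps NIC figure come from the discussion; everything else is simplified and ignores S3 request parallelism, protocol overhead, and column pruning:

```python
# Back-of-envelope lower bound for pulling 650 GB over a 10 Gbps NIC.
# Real throughput depends on S3 parallelism, request overhead, and how
# much data projection/column pruning actually skips.

data_gb = 650                    # nominal table size in gigabytes
nic_gbps = 10                    # instance network bandwidth in gigabits/second

data_gbits = data_gb * 8         # 650 GB -> 5,200 gigabits
seconds = data_gbits / nic_gbps  # 5,200 / 10 = 520 seconds

print(f"~{seconds / 60:.1f} minutes just to move the bytes")  # ~8.7 minutes
```

At roughly 8.7 minutes of pure transfer time, engine-to-engine differences smaller than that are hard to interpret unless an engine reads substantially less than the full 650 GB.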
Data Formats, Catalogs, and Engine Quirks
- Polars’ Delta Lake support depends on delta-rs, which currently lacks deletion vector support (a minimal read sketch follows this list).
- DuckDB’s new “DuckLake” catalog sparks debate:
  - Pro: RDBMS-backed metadata gives simple, strong ACID semantics and good performance.
  - Con: Needing a SQL catalog undermines the “just files” simplicity that attracted people to Parquet; file-based catalogs (e.g., Iceberg) are cited as alternatives with concurrency trade-offs.
- Some mention edge‑case limitations of DuckDB when spilling to disk and that DuckLake’s data inlining / flush-to-Parquet features are still maturing.
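As referenced above, a minimal sketch of what a pruned Polars read of a Delta table on S3 looks like. The bucket path, region, and column names are hypothetical; `scan_delta` goes through delta-rs, which is why the deletion-vector caveat applies:

```python
import polars as pl

# Lazily scan the Delta table via delta-rs; no data is read yet.
lazy = pl.scan_delta(
    "s3://my-bucket/events",                      # hypothetical table location
    storage_options={"AWS_REGION": "us-east-1"},  # credentials resolved from the environment
)

# Only the projected columns are fetched from S3, so the bytes actually
# transferred can be far below the table's nominal 650 GB.
query = lazy.select(["user_id", "amount"])
print(query.explain())          # plan should show the projection pushed into the scan
print(query.head(5).collect())  # materialize a small sample
```

The same pruning argument applies to DuckDB, Daft, and Spark, which is part of why the nominal 650 GB overstates what any of the engines actually read.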
How “Big” is 650 GB?
- Opinions diverge: some call 650 GB trivial (“fits in RAM/a phone”), others work with PB‑scale S3 footprints.
- Others counter that most real-world “big data” deployments are far smaller than petabyte scale and that 650 GB is a very relevant size for typical companies.
- Critiques note the benchmark uses a simple aggregation over a single column that fits in memory; results may not generalize to complex joins or truly larger‑than‑memory workloads (a sketch of such a query follows this list).
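To make the critique concrete, the benchmarked query is roughly of this shape. This is a hedged DuckDB sketch: the bucket, file layout, and column name are invented, and it scans the underlying Parquet files directly rather than going through the Delta log:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # enable reading directly from S3
con.sql("SET s3_region = 'us-east-1';")  # credentials are picked up from the environment

# A single-column aggregation only touches that column's pages, and the
# running aggregate state is tiny, so nothing here stresses memory the way
# a large join or wide shuffle would.
total = con.sql("""
    SELECT sum(amount) AS total
    FROM read_parquet('s3://my-bucket/events/*.parquet')
""").fetchall()
print(total)
```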
Organizational, Cost, and Platform Considerations
- Distributed platforms (Spark/Databricks, Snowflake, Trino, etc.) are defended for managed operations, governance, multi-team access, and integrations, not just raw speed.
- Several stories describe Databricks or Snowflake chosen for “big vendor” comfort, sometimes followed by sticker shock and re‑architecture.
- Some attribute cluster adoption partly to resume-padding and “big impressive systems,” while others emphasize real benefits of managed, ephemeral query clusters and data catalogs.