650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark
Scope of Single-Node vs. Distributed Tools
- Many commenters argue that modern single-node engines (DuckDB, Polars, ClickHouse, etc.) can comfortably handle hundreds of GB to ~1 TB on a typical box; you often don’t need Spark until you’re in the multi‑TB or multi‑user regime.
- Spark is seen as overused “by default,” especially when the dataset is small enough that a well-written single-machine job (or even CLI tools) would suffice.
- At the same time, several commenters point out that once you have many concurrent jobs, SLAs, or multi-stage pipelines, distributed systems still make sense even for moderately sized datasets.
I/O, Network, and S3 vs. Local Storage
- Many think the benchmark is fundamentally NIC/S3‑bound, not CPU‑bound: on a 10 Gbps EC2 instance, simply reading 650 GB from S3 imposes a hard lower bound of roughly 9 minutes, regardless of engine (see the back-of-envelope sketch after this list).
- Column pruning means the query likely read far less than the nominal 650 GB, further complicating interpretation.
- Local NVMe is repeatedly described as vastly faster and cheaper than S3 for this kind of workload; a decent desktop could likely outperform the chosen cloud setup.
- Several commenters stress working out the theoretical resource limits (network, disk, RAM) of the setup before attributing performance differences to the engine.
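As a rough illustration of why the network is the binding constraint, here is a back-of-envelope calculation in Python. The 650 GB full-scan size and 10 Gbps NIC figure come from the discussion; everything else is simplified and ignores S3 request parallelism, protocol overhead, and column pruning:

```python
# Back-of-envelope lower bound for pulling 650 GB over a 10 Gbps NIC.
# Real throughput depends on S3 parallelism, request overhead, and how
# much data projection/column pruning actually skips.

data_gb = 650                    # nominal table size in gigabytes
nic_gbps = 10                    # instance network bandwidth in gigabits/second

data_gbits = data_gb * 8         # 650 GB -> 5,200 gigabits
seconds = data_gbits / nic_gbps  # 5,200 / 10 = 520 seconds

print(f"~{seconds / 60:.1f} minutes just to move the bytes")  # ~8.7 minutes
```

At roughly 8.7 minutes of pure transfer time, engine-to-engine differences smaller than that are hard to interpret unless an engine reads substantially less than the full 650 GB.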
Data Formats, Catalogs, and Engine Quirks
- Polars’ Delta Lake support depends on delta-rs, which currently lacks deletion vector support (a minimal read sketch follows this list).
- DuckDB’s new “DuckLake” catalog sparks debate:
  - Pro: RDBMS-backed metadata gives simple, strong ACID semantics and good performance.
  - Con: Needing a SQL catalog undermines the “just files” simplicity that attracted people to Parquet; file-based catalogs (e.g., Iceberg) are cited as alternatives with concurrency trade-offs.
- Some mention edge‑case limitations of DuckDB when spilling to disk and that DuckLake’s data inlining / flush-to-Parquet features are still maturing.
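As referenced above, a minimal sketch of what a pruned Polars read of a Delta table on S3 looks like. The bucket path, region, and column names are hypothetical; `scan_delta` goes through delta-rs, which is why the deletion-vector caveat applies:

```python
import polars as pl

# Lazily scan the Delta table via delta-rs; no data is read yet.
lazy = pl.scan_delta(
    "s3://my-bucket/events",                      # hypothetical table location
    storage_options={"AWS_REGION": "us-east-1"},  # credentials resolved from the environment
)

# Only the projected columns are fetched from S3, so the bytes actually
# transferred can be far below the table's nominal 650 GB.
query = lazy.select(["user_id", "amount"])
print(query.explain())          # plan should show the projection pushed into the scan
print(query.head(5).collect())  # materialize a small sample
```

The same pruning argument applies to DuckDB, Daft, and Spark, which is part of why the nominal 650 GB overstates what any of the engines actually read.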
How “Big” is 650 GB?
- Opinions diverge: some call 650 GB trivial (“fits in RAM/a phone”), others work with PB‑scale S3 footprints.
- Others counter that most real-world “big data” deployments are far smaller than petabyte scale and that 650 GB is a very relevant size for typical companies.
- Critiques note the benchmark uses a simple aggregation over a single column that fits in memory; results may not generalize to complex joins or truly larger‑than‑memory workloads (a sketch of such a query follows this list).
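To make the critique concrete, the benchmarked query is roughly of this shape. This is a hedged DuckDB sketch: the bucket, file layout, and column name are invented, and it scans the underlying Parquet files directly rather than going through the Delta log:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # enable reading directly from S3
con.sql("SET s3_region = 'us-east-1';")  # credentials are picked up from the environment

# A single-column aggregation only touches that column's pages, and the
# running aggregate state is tiny, so nothing here stresses memory the way
# a large join or wide shuffle would.
total = con.sql("""
    SELECT sum(amount) AS total
    FROM read_parquet('s3://my-bucket/events/*.parquet')
""").fetchall()
print(total)
```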
Organizational, Cost, and Platform Considerations
- Distributed platforms (Spark/Databricks, Snowflake, Trino, etc.) are defended for managed operations, governance, multi-team access, and integrations, not just raw speed.
- Several stories describe Databricks or Snowflake chosen for “big vendor” comfort, sometimes followed by sticker shock and re‑architecture.
- Some attribute cluster adoption partly to resume-padding and “big impressive systems,” while others emphasize real benefits of managed, ephemeral query clusters and data catalogs.