Should you ditch Spark for DuckDB or Polars?
Spark vs. Single-Node Scale
- Several commenters argue most “big data” workloads don’t justify Spark: 100 GB to a few TB is often manageable on a single large machine with fast NVMe storage.
- Others note real large-scale cases: 1TB/day FX or clickstream data, 40TB+ OLTP, and 100+ concurrent analysts on shared datasets. In such contexts, distributed systems like Spark are still warranted.
- Single-node limits are not just disk size but also CPU, RAM, and bandwidth, plus redundancy concerns. Some treat compute as ephemeral and rebuild from object storage if a node dies.
DuckDB: Strengths, Limits, and Usage Patterns
- Many report that DuckDB “just works,” especially due to out-of-core execution and good performance on TB-scale workloads.
- Some have fully migrated warehouses to DuckDB’s native format for cost and speed, keeping Parquet only for interoperability. Others still prefer Parquet/Delta as a stable interchange format.
- Concerns: no horizontal scaling, weaker catalog features than cloud warehouses, a less mature optimizer (more manual query tuning), and unsettled best practices for very large, constantly updated native files.
- Integration with Spark and Arrow is seen as a big plus; some advocate combining DuckDB for most workloads with Spark/Databricks only when scale demands it.
Polars: Capabilities and Pain Points
- Praised for complex transformations and a strong plugin story (e.g., Rust-based geospatial extensions with major speedups).
- Criticized for more frequent out-of-memory (OOM) failures than DuckDB on large workloads, and for API instability in its earlier releases.
- SQL support exists but is not as central as in Spark/DuckDB; typically used in lazy DataFrame style.
Data Formats, Catalogs, and Alternatives
- Debate over Delta vs. open formats: Delta is open but tightly tied to Spark/Databricks, with features (e.g., deletion vectors) that can break OSS compatibility. Iceberg is discussed as a unifying, multi-engine layer.
- Multiple alternatives surface: Ray (general distributed compute), Daft (on Ray), ClickHouse (fast but catalog and OOM concerns), Lake Sail, DataFusion, LanceDB, and translation layers like Fugue and Ibis.
Orchestration, DX, and LLMs
- Spark/Databricks win on built-in streaming, autoloading, checkpointing, and workflow management; with DuckDB/Polars you typically add Airflow/Kestra/dbt-like tools.
- Some emphasize that migration/ops complexity can outweigh performance gains.
- Commenters note LLM assistants are currently better with Pandas/Spark than with newer APIs like Polars or Ibis, which may slow adoption.