Should you ditch Spark for DuckDB or Polars?
Spark vs. Single-Node Scale
- Several commenters argue most “big data” workloads don’t justify Spark: 100 GB to a few TB is often manageable on a single large machine with fast NVMe storage.
- Others note real large-scale cases: 1TB/day FX or clickstream data, 40TB+ OLTP, and 100+ concurrent analysts on shared datasets. In such contexts, distributed systems like Spark are still warranted.
- Single-node limits are not just disk size but also CPU, RAM, and bandwidth, plus redundancy concerns. Some treat compute as ephemeral and rebuild from object storage if a node dies.
DuckDB: Strengths, Limits, and Usage Patterns
- Many report that DuckDB “just works,” especially due to out-of-core execution and good performance on TB-scale workloads.
- Some have fully migrated warehouses to DuckDB’s native format for cost and speed, keeping Parquet only for interoperability. Others still prefer Parquet/Delta as a stable interchange format.
- Concerns: no horizontal scaling, weaker catalog features than cloud warehouses, a less mature optimizer (more manual query tuning), and unsettled best practices for very large, constantly updated native files.
- Integration with Spark and Arrow is seen as a big plus; some advocate combining DuckDB for most workloads with Spark/Databricks only when scale demands it.
Polars: Capabilities and Pain Points
- Praised for complex transformations and a strong plugin story (e.g., Rust-based geospatial extensions with major speedups).
- Criticized for more frequent out-of-memory (OOM) failures than DuckDB on large workloads, and for API instability in its earlier releases.
- SQL support exists but is not as central as in Spark/DuckDB; typically used in lazy DataFrame style.
Data Formats, Catalogs, and Alternatives
- Debate over Delta vs. open formats: Delta is open but tightly tied to Spark/Databricks, with features (e.g., deletion vectors) that can break OSS compatibility. Iceberg is discussed as a unifying, multi-engine layer.
- Multiple alternatives surface: Ray (general distributed compute), Daft (on Ray), ClickHouse (fast but catalog and OOM concerns), Lake Sail, DataFusion, LanceDB, and translation layers like Fugue and Ibis.
Orchestration, DX, and LLMs
- Spark/Databricks win on built-in streaming, autoloading, checkpointing, and workflow management; with DuckDB/Polars you typically add Airflow/Kestra/dbt-like tools.
- Some emphasize that migration/ops complexity can outweigh performance gains.
- Commenters note LLM assistants are currently better with Pandas/Spark than with newer APIs like Polars or Ibis, which may slow adoption.