Should you ditch Spark for DuckDB or Polars?

Spark vs. Single-Node Scale

  • Several commenters argue that most “big data” workloads don’t justify Spark: in the 100 GB to a few TB range, a single large machine with fast NVMe storage is often enough.
  • Others note real large-scale cases: 1TB/day FX or clickstream data, 40TB+ OLTP, and 100+ concurrent analysts on shared datasets. In such contexts, distributed systems like Spark are still warranted.
  • Single-node limits are not just disk capacity but also CPU, RAM, and network bandwidth, plus redundancy concerns. Some treat compute as ephemeral and simply rebuild from object storage if a node dies.

DuckDB: Strengths, Limits, and Usage Patterns

  • Many report that DuckDB “just works,” especially due to out-of-core execution and good performance on TB-scale workloads.
  • Some have fully migrated warehouses to DuckDB’s native format for cost and speed, keeping Parquet only for interoperability. Others still prefer Parquet/Delta as a stable interchange format.
  • Concerns: no horizontal scaling, weaker catalog features than cloud warehouses, a less mature optimizer (more manual query tuning), and unsettled best practices for very large, constantly updated native files.
  • Integration with Spark and Arrow is seen as a big plus; some advocate combining DuckDB for most workloads with Spark/Databricks only when scale demands it.

Polars: Capabilities and Pain Points

  • Praised for complex transformations and a strong plugin story (e.g., Rust-based geospatial extensions with major speedups).
  • Criticized for more frequent out-of-memory (OOM) failures on large workloads than DuckDB, and for API instability in earlier releases.
  • SQL support exists but is less central than in Spark or DuckDB; Polars is typically used through its lazy DataFrame API.

Data Formats, Catalogs, and Alternatives

  • Debate over Delta vs. open formats: Delta is open but tightly tied to Spark/Databricks, with features (e.g., deletion vectors) that can break OSS compatibility. Iceberg is discussed as a unifying, multi-engine layer.
  • Multiple alternatives surface: Ray (general distributed compute), Daft (on Ray), ClickHouse (fast but catalog and OOM concerns), Lake Sail, DataFusion, LanceDB, and translation layers like Fugue and Ibis.

Orchestration, DX, and LLMs

  • Spark/Databricks win on built-in streaming, autoloading, checkpointing, and workflow management; with DuckDB/Polars you typically add Airflow/Kestra/dbt-like tools.
  • Some emphasize that migration/ops complexity can outweigh performance gains.
  • Commenters note LLM assistants are currently better with Pandas/Spark than with newer APIs like Polars or Ibis, which may slow adoption.