Why DuckDB is my first choice for data processing

File-based SQL and flexibility

  • Strong enthusiasm for DuckDB as a “Swiss army knife” for analytics over files and other sources: CSV, JSON/JSONL, Parquet, Excel, data on S3, in-memory dataframes, etc.
  • People like querying files directly with SQL (including globs and union-by-name across many files) instead of importing into a database first.
  • The CSV reader gets special praise for speed, type inference, and robustness with messy data.

Performance, indexing, and “medium data”

  • Concern: CSV/JSON lack indexes, implying full scans for non-trivial queries.
  • Responses:
    • Native DuckDB tables have zonemaps; Parquet gets predicate pushdown and columnar scanning.
    • Full scans are acceptable for analytical workloads where large portions of the data are touched anyway.
    • Users report smooth performance on 100M+ rows and hundreds of GB of Parquet on a single machine; multi‑TB pipelines moved from Spark/Dask to DuckDB successfully.
    • Many convert CSV/JSON to Parquet after first read to avoid repeated raw scans.
  • Some argue the “single big machine” vision holds for datasets from under ~10 GB up to hundreds of GB; others note you can still hit OOM or scaling limits, especially with many concurrent users.

Embedding, WASM, and client APIs

  • DuckDB is praised for embedding analytics in apps, including in-browser via WASM; used in survey explorers, dashboards, notebooks, and domain tools (e.g., biodiversity data validation).
  • WASM size is debated: some call it “small” and ideal for bundling; others note 30+ MB artifacts and want a more minimal build.
  • Clients exist for Python, R, Java, etc.; Android and Node/Bun integration are seen as promising but currently rough (JDBC mismatch, segfaults, dynamic linking issues).

SQL vs dataframe APIs (Polars/Pandas)

  • Split opinions:
    • SQL fans like portability, declarative optimization, and stability over decades. DuckDB’s engine improvements can speed up existing SQL without code changes.
    • Dataframe users prefer Polars/Pandas for complex transforms, UDFs, ML and viz workflows, and often find SQL more awkward.
  • Several people use DuckDB plus Ibis or interop with Polars to mix paradigms.

Ecosystem and comparisons

  • DuckLake is discussed as an emerging lakehouse-style format that keeps table metadata in a SQL database (e.g., DuckDB or Postgres) with data in Parquet, versus mature options like Iceberg.
  • Postgres is seen as less suited for analytics due to row storage; Spark remains necessary for true multi-node scale.
  • ClickHouse/chDB and Apache Arrow are mentioned as strong alternatives in similar problem spaces.

Limitations and concerns

  • Reports of memory leaks, Excel quirks, lack of HDF5 support, and complications from dynamic extension loading and LGPL components.
  • Some worry about SQL dialect differences and type-system mismatches with Arrow.