Why DuckDB is my first choice for data processing
File-based SQL and flexibility
- Strong enthusiasm for DuckDB as a “Swiss army knife” for analytics on local files: CSV, JSON/JSONL, Parquet, Excel, S3, dataframes, etc.
- People like querying files directly with SQL (including globs and union-by-name across many files) instead of importing into a database first.
- The CSV reader gets special praise for speed, type inference, and robustness with messy data.
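The file-querying workflow above can be sketched in a few lines of DuckDB SQL. The file glob and column names here are hypothetical, chosen only to illustrate the pattern:

```sql
-- Query a directory of CSV files directly, no import step.
-- 'logs/*.csv' and the columns are hypothetical examples.
SELECT origin, count(*) AS n
FROM read_csv('logs/*.csv', union_by_name = true)  -- align columns by name across files
GROUP BY origin
ORDER BY n DESC;
```

The same `read_csv` call infers types and delimiters automatically; `union_by_name` handles files whose columns appear in different orders or are partially missing.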
Performance, indexing, and “medium data”
- Concern: CSV/JSON lack indexes, implying full scans for non-trivial queries.
- Responses:
  - Native DuckDB tables have zonemaps (per-block min/max metadata); Parquet gets predicate pushdown and columnar scanning.
  - Full scans are acceptable for analytical workloads where large portions of the data are touched anyway.
  - Users report smooth performance on 100M+ rows and hundreds of GB of Parquet on a single machine; multi-TB pipelines have moved from Spark/Dask to DuckDB successfully.
  - Many convert CSV/JSON to Parquet after the first read to avoid repeated raw scans.
- Some argue the “single big machine” vision holds for anything from under ~10 GB up to hundreds of GB; others note you can still hit OOM or scaling limits, especially with many concurrent users.
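The convert-to-Parquet step mentioned above is a one-liner in DuckDB. The paths and the filter column are hypothetical, a sketch of the pattern rather than a specific pipeline:

```sql
-- One-time conversion: scan the raw CSV once, persist as Parquet.
COPY (SELECT * FROM read_csv('events_raw/*.csv'))
TO 'events.parquet' (FORMAT parquet);

-- Later queries scan the columnar file; filters can be pushed
-- down into the Parquet scan instead of rescanning raw CSV.
SELECT count(*) FROM 'events.parquet' WHERE event_type = 'click';
```

Because Parquet stores column statistics, the second query can skip row groups whose min/max range excludes the predicate, which is the main reason repeated queries get cheaper after conversion.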
Embedding, WASM, and client APIs
- DuckDB is praised for embedding analytics in apps, including in-browser via WASM; used in survey explorers, dashboards, notebooks, and domain tools (e.g., biodiversity data validation).
- WASM size is debated: some call it “small” and ideal for bundling; others note 30+ MB artifacts and want a more minimal build.
- Clients exist for Python, R, Java, etc.; Android and Node/Bun integration are seen as promising but currently rough (JDBC mismatch, segfaults, dynamic linking issues).
SQL vs dataframe APIs (Polars/Pandas)
- Split opinions:
  - SQL fans like portability, declarative optimization, and decades of stability; DuckDB's engine improvements can speed up existing SQL without code changes.
  - Dataframe users prefer Polars/Pandas for complex transforms, UDFs, and ML/visualization workflows, and often find SQL more awkward.
- Several people use DuckDB plus Ibis or interop with Polars to mix paradigms.
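One reason the gap feels smaller than classic SQL-vs-dataframe debates suggest is DuckDB's dialect extensions. A common dataframe-style operation, "keep the latest row per group", is a one-liner with `QUALIFY` (table and columns hypothetical):

```sql
-- Latest session per user, filtering directly on a window function
-- instead of wrapping it in a subquery.
SELECT *
FROM sessions
QUALIFY row_number() OVER (PARTITION BY user_id ORDER BY ts DESC) = 1;
```

In a Python session the same query can run directly against a registered Pandas or Polars dataframe, which is how several commenters mix the two paradigms.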
Ecosystem and comparisons
- DuckLake is discussed as an emerging lakehouse-style metadata layer backed by DuckDB tables, compared against mature options like Iceberg.
- Postgres is seen as less suited for analytics due to row storage; Spark remains necessary for true multi-node scale.
- ClickHouse/chDB and Apache Arrow are mentioned as strong alternatives in similar problem spaces.
Limitations and concerns
- Reports of memory leaks, Excel quirks, lack of HDF5 support, and complications from dynamic extension loading and LGPL components.
- Some worry about SQL dialect differences and type system mismatches vs Arrow.