Why DuckDB is my first choice for data processing

File-based SQL and flexibility

  • Strong enthusiasm for DuckDB as a “Swiss army knife” for analytics over files and other sources: CSV, JSON/JSONL, Parquet, Excel, data on S3, in-memory dataframes, etc.
  • People like querying files directly with SQL (including globs and union-by-name across many files) instead of importing into a database first.
  • The CSV reader gets special praise for speed, type inference, and robustness with messy data.

Performance, indexing, and “medium data”

  • Concern: CSV/JSON lack indexes, implying full scans for non-trivial queries.
  • Responses:
    • Native DuckDB tables have zonemaps; Parquet gets predicate pushdown and columnar scanning.
    • Full scans are acceptable for analytical workloads where large portions of the data are touched anyway.
    • Users report smooth performance on 100M+ rows and hundreds of GB of Parquet on a single machine; multi‑TB pipelines moved from Spark/Dask to DuckDB successfully.
    • Many convert CSV/JSON to Parquet after first read to avoid repeated raw scans.
  • Some argue the “single big machine” vision holds for datasets from under ~10 GB up to hundreds of GB; others note you can still hit OOM or scaling limits, especially with many concurrent users.

Embedding, WASM, and client APIs

  • DuckDB is praised for embedding analytics in apps, including in-browser via WASM; used in survey explorers, dashboards, notebooks, and domain tools (e.g., biodiversity data validation).
  • WASM size is debated: some call it “small” and ideal for bundling; others note 30+ MB artifacts and want a more minimal build.
  • Clients exist for Python, R, Java, etc.; Android and Node/Bun integration are seen as promising but currently rough (JDBC mismatch, segfaults, dynamic linking issues).

SQL vs dataframe APIs (Polars/Pandas)

  • Split opinions:
    • SQL fans like portability, declarative optimization, and stability over decades. DuckDB’s engine improvements can speed up existing SQL without code changes.
    • Dataframe users prefer Polars/Pandas for complex transforms, UDFs, ML and viz workflows, and often find SQL more awkward.
  • Several people use DuckDB plus Ibis or interop with Polars to mix paradigms.

Ecosystem and comparisons

  • DuckLake is discussed as an emerging lakehouse-style format that keeps table metadata in a SQL database (e.g., DuckDB or Postgres) with data in Parquet, versus mature options like Iceberg.
  • Postgres is seen as less suited for analytics due to row storage; Spark remains necessary for true multi-node scale.
  • ClickHouse/chDB and Apache Arrow are mentioned as strong alternatives in similar problem spaces.

Limitations and concerns

  • Reports of memory leaks, Excel quirks, lack of HDF5 support, and complications from dynamic extension loading and LGPL components.
  • Some worry about SQL dialect differences and type-system mismatches with Arrow.