DuckDB Doesn't Need Data to Be a Database
Delta Lake and External Table Support
- DuckDB can read Delta tables via the
duckdb_deltaextension, but some users report Arrow datatype errors on checkpointed tables. - A fix exists in the underlying Delta kernel, but DuckDB hasn’t yet pulled the latest version; users expect these issues to resolve once updated.
Federated Databases, SQL/MED, and FDWs
- Several commenters relate DuckDB’s external-data behavior to older concepts: DB2 “federated databases,” SQL/MED, PostgreSQL foreign data wrappers, and Oracle external tables.
- There is debate over the historical link between SQL/MED and medical data, but agreement that it is about managing external data.
- Tools like Steampipe are cited as examples of using SQL + FDWs instead of classic ETL/API glue.
RDBMS Features vs Modern App Design
- Large subthread debates stored procedures, triggers, and database-enforced constraints.
- Critics say stored procedures are hard to source-control, debug, and scale, and split business logic awkwardly between app and DB.
- Defenders argue DB constraints and some procedural logic protect data integrity and are underused due to poor education and tooling.
- Foreign keys are seen by some as essential last-line defense; others report running high-scale systems (including financial) without FKs, relying on app tests and reconciliation.
- Broader disagreement on whether the database should be treated as core shared interface vs “implementation detail” hidden behind microservice APIs.
DuckDB over S3: Performance, Caching, and Formats
- DuckDB supports Parquet projection/filter pushdown and range reads from S3, especially effective with Hive-style partitioned paths.
- There is no built-in caching for S3 reads; some suggest using
fsspecfile caching when integrating via Python. - Latency and many small reads on S3 can be an issue; within AWS the monetary cost is deemed negligible, but queries on large datasets can still be slow.
- Parquet is described as the dominant open-source columnar format; ORC may be slightly better technically, but Parquet wins on ubiquity. CarbonData is largely unknown in the thread.
Client-Side DuckDB/WASM Use Cases
- Multiple commenters describe loading Parquet from S3/R2 into DuckDB WASM in the browser to power interactive “sheets” or analytics dashboards.
- Benefits cited: one bulk, compressed transfer instead of many JSON API calls; local SQL for complex aggregations; predictable performance on “medium” datasets (~100k+ rows).
- Others argue that plain JavaScript arrays or Arrow/Parquet libraries can be sufficient; DuckDB is justified mainly when OLAP-style queries and statistics are needed.
Views, ETL, and Data Sharing
- Some view DuckDB views over S3 Parquet as a lightweight abstraction to share datasets: recipients attach a DuckDB file and always see the latest definition.
- Skeptics say you could just share an S3 URL and/or example SQL; maintaining intermediate views may belong with analysts, not app developers.
- Concerns about stacking views: harder debugging, silent breakage on source changes, and dependency management vs more explicit ETL stages.
- Others emphasize that this pattern is not meant to replace a warehouse or full pipeline, just to offer a novel, convenient sharing mechanism.
SQLite vs DuckDB for Sharing Data
- For small exports, some prefer SQLite due to universality and tooling.
- Others note DuckDB’s advantage for large analytical workloads (aggregations, window functions) despite SQLite’s simplicity.
File Format Stability and Catalog Ideas
- Earlier DuckDB versions caused compatibility issues when tools lagged behind the file format; stability is reported to be better from 0.10 onward.
- Some avoid the issue by storing only views over Parquet and recreating DB files as needed.
- There is interest in treating DuckDB itself as a catalog over S3 (snapshots, time travel); Iceberg integration is mentioned but not a full catalog solution yet.
Ecosystem and Tooling
- Mentions of managed/extended DuckDB services (e.g., serverless warehouses, ingestion Lambdas) and a desktop SQL IDE integrating DuckDB.
- DuckDB is compared to Trino/Presto conceptually but distinguished as in-process rather than cluster-based.