2024-05-29

DuckDB Doesn't Need Data to Be a Database

Delta Lake and External Table Support

DuckDB can read Delta tables via the duckdb_delta extension, but some users report Arrow datatype errors on checkpointed tables.
A fix exists in the underlying Delta kernel, but DuckDB hasn’t yet pulled the latest version; users expect these issues to resolve once updated.

Federated Databases, SQL/MED, and FDWs

Several commenters relate DuckDB’s external-data behavior to older concepts: DB2 “federated databases,” SQL/MED, PostgreSQL foreign data wrappers, and Oracle external tables.
There is debate over the historical link between SQL/MED and medical data, but agreement that it is about managing external data.
Tools like Steampipe are cited as examples of using SQL + FDWs instead of classic ETL/API glue.

RDBMS Features vs Modern App Design

Large subthread debates stored procedures, triggers, and database-enforced constraints.
Critics say stored procedures are hard to source-control, debug, and scale, and split business logic awkwardly between app and DB.
Defenders argue DB constraints and some procedural logic protect data integrity and are underused due to poor education and tooling.
Foreign keys are seen by some as essential last-line defense; others report running high-scale systems (including financial) without FKs, relying on app tests and reconciliation.
Broader disagreement on whether the database should be treated as core shared interface vs “implementation detail” hidden behind microservice APIs.

DuckDB over S3: Performance, Caching, and Formats

DuckDB supports Parquet projection/filter pushdown and range reads from S3, especially effective with Hive-style partitioned paths.
There is no built-in caching for S3 reads; some suggest using fsspec file caching when integrating via Python.
Latency and many small reads on S3 can be an issue; within AWS the monetary cost is deemed negligible, but queries on large datasets can still be slow.
Parquet is described as the dominant open-source columnar format; ORC may be slightly better technically, but Parquet wins on ubiquity. CarbonData is largely unknown in the thread.

Client-Side DuckDB/WASM Use Cases

Multiple commenters describe loading Parquet from S3/R2 into DuckDB WASM in the browser to power interactive “sheets” or analytics dashboards.
Benefits cited: one bulk, compressed transfer instead of many JSON API calls; local SQL for complex aggregations; predictable performance on “medium” datasets (~100k+ rows).
Others argue that plain JavaScript arrays or Arrow/Parquet libraries can be sufficient; DuckDB is justified mainly when OLAP-style queries and statistics are needed.

Views, ETL, and Data Sharing

Some view DuckDB views over S3 Parquet as a lightweight abstraction to share datasets: recipients attach a DuckDB file and always see the latest definition.
Skeptics say you could just share an S3 URL and/or example SQL; maintaining intermediate views may belong with analysts, not app developers.
Concerns about stacking views: harder debugging, silent breakage on source changes, and dependency management vs more explicit ETL stages.
Others emphasize that this pattern is not meant to replace a warehouse or full pipeline, just to offer a novel, convenient sharing mechanism.

SQLite vs DuckDB for Sharing Data

For small exports, some prefer SQLite due to universality and tooling.
Others note DuckDB’s advantage for large analytical workloads (aggregations, window functions) despite SQLite’s simplicity.

File Format Stability and Catalog Ideas

Earlier DuckDB versions caused compatibility issues when tools lagged behind the file format; stability is reported to be better from 0.10 onward.
Some avoid the issue by storing only views over Parquet and recreating DB files as needed.
There is interest in treating DuckDB itself as a catalog over S3 (snapshots, time travel); Iceberg integration is mentioned but not a full catalog solution yet.

Ecosystem and Tooling

Mentions of managed/extended DuckDB services (e.g., serverless warehouses, ingestion Lambdas) and a desktop SQL IDE integrating DuckDB.
DuckDB is compared to Trino/Presto conceptually but distinguished as in-process rather than cluster-based.

Related topics