DuckDB Doesn't Need Data to Be a Database

Delta Lake and External Table Support

  • DuckDB can read Delta tables via the duckdb_delta extension, but some users report Arrow datatype errors on checkpointed tables.
  • A fix exists in the underlying Delta kernel, but DuckDB hasn’t yet pulled the latest version; users expect these issues to resolve once updated.

Federated Databases, SQL/MED, and FDWs

  • Several commenters relate DuckDB’s external-data behavior to older concepts: DB2 “federated databases,” SQL/MED, PostgreSQL foreign data wrappers, and Oracle external tables.
  • There is debate over the historical link between SQL/MED and medical data, but agreement that it is about managing external data.
  • Tools like Steampipe are cited as examples of using SQL + FDWs instead of classic ETL/API glue.

RDBMS Features vs Modern App Design

  • A large subthread debates stored procedures, triggers, and database-enforced constraints.
  • Critics say stored procedures are hard to source-control, debug, and scale, and split business logic awkwardly between app and DB.
  • Defenders argue DB constraints and some procedural logic protect data integrity and are underused due to poor education and tooling.
  • Foreign keys are seen by some as an essential last line of defense; others report running high-scale systems (including financial ones) without FKs, relying on app tests and reconciliation.
  • There is broader disagreement over whether the database should be treated as a core shared interface or as an “implementation detail” hidden behind microservice APIs.
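
The two positions on foreign keys can be contrasted in a small sketch. It uses Python's built-in sqlite3 module rather than DuckDB, purely so it runs without extra dependencies; the accounts/payments schema and the insert_orphan helper are hypothetical and do not come from the thread. With enforcement on, the database rejects the orphan row; with enforcement off, the row lands and has to be found later by a reconciliation query.

```python
import sqlite3

# Hypothetical two-table schema used only for illustration.
SCHEMA = """
CREATE TABLE accounts (id INTEGER PRIMARY KEY);
CREATE TABLE payments (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES accounts(id)
);
"""

# Reconciliation query: payments whose account does not exist.
ORPHAN_QUERY = """
SELECT p.id
FROM payments AS p
LEFT JOIN accounts AS a ON a.id = p.account_id
WHERE a.id IS NULL
ORDER BY p.id
"""


def insert_orphan(enforce_fk: bool):
    """Try to insert a payment for a missing account.

    Returns (rejected_by_database, orphan_payment_ids).
    """
    con = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is ON.
    con.execute(f"PRAGMA foreign_keys = {'ON' if enforce_fk else 'OFF'}")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO accounts (id) VALUES (1)")
    con.execute("INSERT INTO payments (account_id) VALUES (1)")
    try:
        con.execute("INSERT INTO payments (account_id) VALUES (999)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    orphans = [row[0] for row in con.execute(ORPHAN_QUERY)]
    con.close()
    return rejected, orphans
```

Calling insert_orphan(True) models the "last line of defense" camp, and insert_orphan(False) models the camp that skips FKs and relies on periodic reconciliation instead.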

DuckDB over S3: Performance, Caching, and Formats

  • DuckDB supports Parquet projection/filter pushdown and range reads from S3, especially effective with Hive-style partitioned paths.
  • There is no built-in caching for S3 reads; some suggest using fsspec file caching when integrating via Python.
  • Latency and many small reads on S3 can be an issue; within AWS the monetary cost is deemed negligible, but queries on large datasets can still be slow.
  • Parquet is described as the dominant open-source columnar format; ORC may be slightly better technically, but Parquet wins on ubiquity. CarbonData is largely unknown in the thread.
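
Why Hive-style paths help can be shown with a rough sketch of partition pruning. This is an illustration of the idea only, not DuckDB's actual implementation: the object keys, the partition_values and prune helpers, and the bucket name in EXAMPLE_SQL are all hypothetical. The point is that key=value path segments let an engine discard whole files before issuing any S3 reads.

```python
from pathlib import PurePosixPath

# Hypothetical object keys laid out in Hive style (key=value directories).
KEYS = [
    "events/year=2023/month=12/part-0.parquet",
    "events/year=2024/month=01/part-0.parquet",
    "events/year=2024/month=02/part-0.parquet",
]

# For reference, the kind of DuckDB query this mirrors (bucket name is made up).
EXAMPLE_SQL = """
SELECT count(*)
FROM read_parquet('s3://example-bucket/events/*/*/*.parquet',
                  hive_partitioning = true)
WHERE year = 2024
"""


def partition_values(key: str) -> dict:
    """Extract key=value directory segments from an object key."""
    values = {}
    for segment in PurePosixPath(key).parts[:-1]:  # skip the file name
        name, sep, value = segment.partition("=")
        if sep:
            values[name] = value
    return values


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every filter."""
    return [
        key
        for key in keys
        if all(partition_values(key).get(n) == v for n, v in wanted.items())
    ]
```

For example, prune(KEYS, year="2024") keeps two of the three files, so the remaining file is never fetched. Projection and filter pushdown inside each Parquet file then reduce the bytes read further.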

Client-Side DuckDB/WASM Use Cases

  • Multiple commenters describe loading Parquet from S3/R2 into DuckDB WASM in the browser to power interactive “sheets” or analytics dashboards.
  • Benefits cited: one bulk, compressed transfer instead of many JSON API calls; local SQL for complex aggregations; predictable performance on “medium” datasets (~100k+ rows).
  • Others argue that plain JavaScript arrays or Arrow/Parquet libraries can be sufficient; DuckDB is justified mainly when OLAP-style queries and statistics are needed.

Views, ETL, and Data Sharing

  • Some see DuckDB views over S3 Parquet as a lightweight abstraction for sharing datasets: recipients attach a DuckDB file and always see the latest definition.
  • Skeptics say you could just share an S3 URL and/or example SQL; maintaining intermediate views may belong with analysts, not app developers.
  • Concerns are raised about stacking views: harder debugging, silent breakage when sources change, and weaker dependency management than with more explicit ETL stages.
  • Others emphasize that this pattern is not meant to replace a warehouse or full pipeline, just to offer a novel, convenient sharing mechanism.
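
The sharing pattern described above might look roughly like the following. This is a minimal sketch under assumptions: the view name, bucket path, and file name shared.duckdb are hypothetical, the duckdb Python package must be installed, and S3 access (httpfs and credentials) is assumed to be configured already, since DuckDB may need to reach the files when the view is created.

```python
def view_ddl(view_name: str, parquet_glob: str) -> str:
    """Build the DDL for a view that reads Parquet files in place."""
    if not view_name.isidentifier():
        raise ValueError(f"unsafe view name: {view_name!r}")
    escaped = parquet_glob.replace("'", "''")  # escape quotes in the path
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{escaped}')"
    )


def build_share_file(db_path: str, views: dict) -> None:
    """Write a small DuckDB file that holds only view definitions."""
    import duckdb  # third-party package, imported lazily

    con = duckdb.connect(db_path)
    try:
        for name, glob in views.items():
            con.execute(view_ddl(name, glob))
    finally:
        con.close()


# What a recipient would run after downloading the file (names are made up).
RECIPIENT_SQL = """
ATTACH 'shared.duckdb' AS shared (READ_ONLY);
SELECT count(*) FROM shared.main.events;
"""
```

A publisher would call something like build_share_file("shared.duckdb", {"events": "s3://example-bucket/events/*.parquet"}). The resulting file stays small because it stores definitions rather than rows, which is also why the skeptics' point holds: the same effect is available by sharing the URL and the SQL directly.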

SQLite vs DuckDB for Sharing Data

  • For small exports, some prefer SQLite due to universality and tooling.
  • Others note DuckDB’s advantage for large analytical workloads (aggregations, window functions) despite SQLite’s simplicity.

File Format Stability and Catalog Ideas

  • Earlier DuckDB versions caused compatibility issues when tools lagged behind the file format; stability is reported to be better from 0.10 onward.
  • Some avoid the issue by storing only views over Parquet and recreating DB files as needed.
  • There is interest in treating DuckDB itself as a catalog over S3 (snapshots, time travel); Iceberg integration is mentioned but not a full catalog solution yet.

Ecosystem and Tooling

  • Commenters mention managed or extended DuckDB services (e.g., serverless warehouses, ingestion Lambdas) and a desktop SQL IDE that integrates DuckDB.
  • DuckDB is compared to Trino/Presto conceptually but distinguished as in-process rather than cluster-based.