DuckLake is an integrated data lake and catalog format
Naming & Positioning
- Many like the idea but dislike the name “DuckLake” for a supposedly general standard; tying it to DuckDB is seen as branding-heavy and potentially limiting.
- The format itself appears open; some suggest giving the table format a more neutral name and reserving “DuckLake” for the DuckDB extension.
Relationship to Iceberg / Delta / Existing Lakes
- Widely viewed as an Iceberg-inspired system that fixes perceived issues (especially keeping metadata in blob storage), but not strictly a competitor:
  - Can read Iceberg and sync to Iceberg by writing manifests/metadata on demand (see the sketch after this list).
  - Several commenters expect both to be used together in a “bi-directional” way.
- Others note that SQL-backed catalogs already exist in Iceberg; the novelty here is pushing all metadata and stats into SQL.
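To make the Iceberg-read path concrete, here is a minimal sketch using the duckdb Python package, assuming the iceberg and ducklake extensions are available; the Iceberg table location `warehouse/events`, the catalog file `lake.ducklake`, and the data path are hypothetical, and the reverse sync (emitting Iceberg manifests from DuckLake on demand) is not shown.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Attach a DuckLake catalog; metadata lives in lake.ducklake (a DuckDB file),
# and data files are written under lake_data/.
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

# Read an existing Iceberg table and materialize it as a DuckLake table.
# 'warehouse/events' is a hypothetical Iceberg table location.
con.execute("""
    CREATE TABLE lake.events AS
    SELECT * FROM iceberg_scan('warehouse/events')
""")
```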
Metadata in SQL vs Object Storage
- Core value proposition: move metadata and stats from many small S3 files into a transactional SQL DB, so a single SQL query can resolve table state instead of many HTTP calls.
- Claimed benefits: lower latency, fewer conflicts, easier maintenance, plus:
  - Multi-statement / multi-table transactions (see the sketch after this list)
  - SQL views, delta queries, encryption
  - “Small inserts” inlined directly in the catalog
  - Better time travel even after compaction, via references to parts of Parquet files
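A minimal sketch of the multi-statement / multi-table transaction claim, assuming the ducklake extension and hypothetical table names; the catalog here is a local DuckDB file, but the same pattern is what the thread describes for a Postgres- or MySQL-backed catalog.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

con.execute("CREATE TABLE lake.orders (id INTEGER, amount DECIMAL(10, 2))")
con.execute("CREATE TABLE lake.audit_log (note VARCHAR)")

# Multi-statement, multi-table write: both inserts are committed to the
# catalog atomically, so readers see either both changes or neither.
con.execute("BEGIN TRANSACTION")
con.execute("INSERT INTO lake.orders VALUES (1, 99.50)")
con.execute("INSERT INTO lake.audit_log VALUES ('loaded batch 1')")
con.execute("COMMIT")
```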
Scale, Parallelism, and Use Cases
- Debate over scalability: DuckLake currently assumes single-node DuckDB engines with good multicore parallelism, rather than distributed engines like Spark or Trino.
- Some argue most orgs don’t need multi-node query execution; others question the manifesto’s claims about “hundreds of terabytes and thousands of compute nodes.”
- Horizontal scaling is framed as “many DuckDB nodes in parallel” (many independent queries, not one distributed query); see the sketch below.
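A sketch of that framing under stated assumptions: a shared Postgres-backed catalog reachable by every worker (the connection string, bucket, table, and column below are all hypothetical), with each worker being an independent single-node DuckDB process that answers one query end to end.

```python
import duckdb
from concurrent.futures import ProcessPoolExecutor

# Hypothetical shared catalog and data location.
CATALOG = "ducklake:postgres:dbname=lake_catalog host=catalog.internal"
DATA_PATH = "s3://example-bucket/lake/"

def count_events(day: str) -> int:
    # Each worker runs its own single-node DuckDB against the shared catalog.
    con = duckdb.connect()
    con.execute("INSTALL ducklake")
    con.execute("LOAD ducklake")
    con.execute("INSTALL postgres")  # needed here because the catalog is Postgres-backed
    con.execute("LOAD postgres")
    con.execute(f"ATTACH '{CATALOG}' AS lake (DATA_PATH '{DATA_PATH}')")
    return con.execute(
        "SELECT count(*) FROM lake.events WHERE event_date = ?", [day]
    ).fetchone()[0]

if __name__ == "__main__":
    # Many queries in parallel, each handled entirely by one node;
    # no single query is split across nodes.
    days = ["2025-05-01", "2025-05-02", "2025-05-03"]
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(count_events, days)))
```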
Data Ingestion & Existing Files
- Data must be written through DuckLake (INSERT/COPY) so the catalog is updated; just dropping Parquet files into S3 won’t work (see the sketch after this list).
- Multiple commenters want a way to “attach” existing immutable Parquet files without copying, by building catalog metadata over them.
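For the supported path, a minimal sketch (paths and table name hypothetical): the existing Parquet files are read and rewritten through the catalog. Note that this copies the data, which is exactly the step commenters would like to skip for already-immutable files.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

# Initial load: read the existing files and write them through the catalog,
# so DuckLake re-emits them as managed Parquet files and records them.
con.execute("""
    CREATE TABLE lake.trips AS
    SELECT * FROM read_parquet('existing/*.parquet')
""")

# Later batches go through the same door (INSERT/COPY), never straight to S3.
con.execute("INSERT INTO lake.trips SELECT * FROM read_parquet('new_batch/*.parquet')")
```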
Interoperability & Ecosystem
- The catalog DB can be any SQL database (e.g., Postgres or MySQL); the spec is published, so non-DuckDB implementations are possible, but none exist yet (see the sketch after this list).
- Unclear how/when Spark, Trino, Flink, etc. will integrate; current metadata layout is novel, so existing engines won’t understand it without adapters.
- Concern about BI/analytics support until more engines or connectors natively speak DuckLake.
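One consequence of the catalog being an ordinary SQL database is that nothing about it is tied to DuckDB’s query engine: in the sketches above the catalog is just a DuckDB file, and any SQL client can open it and inspect the metadata tables directly (the exact table names are defined by the published spec and not listed here); the same holds when the catalog lives in Postgres or MySQL.

```python
import duckdb

# The catalog created by the earlier ATTACH examples is a plain DuckDB
# database file; open it directly and list the metadata tables it contains.
cat = duckdb.connect("lake.ducklake", read_only=True)
for (name,) in cat.execute(
    "SELECT table_name FROM information_schema.tables ORDER BY table_name"
).fetchall():
    print(name)
```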
MotherDuck & Separation of Concerns
- DuckLake is pitched as an open lakehouse layer with transactional metadata and compute/storage separation on user-controlled infra.
- MotherDuck is described as hosted DuckDB with current limitations (e.g., single writer), but both sides say they’re working on tight integration, including hosting DuckLake catalogs.
Critiques, Open Questions, and Adoption
- Some ask how updates, row-level changes, and time travel work in detail; updates are confirmed supported, but questions remain about the stats tables and snapshot versioning (a time-travel sketch follows this list).
- Questions about “what’s really new vs Hive + catalog over Parquet” keep coming up; proponents point to transactional semantics, richer metadata, and latency improvements.
- Skepticism about big-enterprise adoption due to incumbent vendors and non-technical buying criteria, though others recall similar skepticism when Hadoop/Spark challenged legacy MPP databases.
- There’s a long subthread on time-range partitioning and filename schemes; several argue a simple, widely adopted range-based naming convention could solve some problems without a full lakehouse stack, but current tools are all anchored on Hive-style partitioning.
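On the time-travel questions above, a sketch of the snapshot-based syntax, assuming the `AT (VERSION => ...)` clause and the `ducklake_snapshots()` listing function described in the DuckLake documentation; table and catalog names reuse the hypothetical examples above, and the exact syntax may vary across DuckDB/DuckLake versions.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

# List the snapshots the catalog has recorded (assumed listing function).
print(con.execute("FROM ducklake_snapshots('lake')").fetchall())

# Query a table as of an earlier snapshot; a TIMESTAMP-based variant of the
# same clause is also described, but only the version form is shown here.
print(con.execute("SELECT * FROM lake.orders AT (VERSION => 1)").fetchall())
```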