DuckLake is an integrated data lake and catalog format
Naming & Positioning
- Many like the idea but dislike the name “DuckLake” for a supposedly general standard; tying it to DuckDB is seen as branding-heavy and potentially limiting.
- The format itself appears open; some suggest giving the table format a more neutral name and reserving “DuckLake” for the DuckDB extension.
Relationship to Iceberg / Delta / Existing Lakes
- Widely viewed as an Iceberg-inspired system that fixes perceived issues (especially keeping metadata in blob storage), but not strictly a competitor:
  - Can read Iceberg and sync to Iceberg by writing manifests/metadata on demand (see the sketch after this list).
  - Several commenters expect both to be used together in a “bi-directional” way.
- Others note that SQL-backed catalogs already exist in Iceberg; the novelty here is pushing all metadata and stats into SQL.
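To make the Iceberg-read path concrete, here is a minimal sketch using the duckdb Python package, assuming the iceberg and ducklake extensions are available; the Iceberg table location `warehouse/events`, the catalog file `lake.ducklake`, and the data path are hypothetical, and the reverse sync (emitting Iceberg manifests from DuckLake on demand) is not shown.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Attach a DuckLake catalog; metadata lives in lake.ducklake (a DuckDB file),
# and data files are written under lake_data/.
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

# Read an existing Iceberg table and materialize it as a DuckLake table.
# 'warehouse/events' is a hypothetical Iceberg table location.
con.execute("""
    CREATE TABLE lake.events AS
    SELECT * FROM iceberg_scan('warehouse/events')
""")
```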
Metadata in SQL vs Object Storage
- Core value proposition: move metadata and stats from many small S3 files into a transactional SQL DB, so a single SQL query can resolve table state instead of many HTTP calls.
- Claimed benefits: lower latency, fewer conflicts, easier maintenance, plus:
  - Multi-statement / multi-table transactions (see the sketch after this list)
  - SQL views, delta queries, encryption
  - “Small inserts” inlined directly in the catalog
  - Better time travel even after compaction, via references to parts of Parquet files
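A minimal sketch of the multi-statement / multi-table transaction claim, assuming the ducklake extension and hypothetical table names; the catalog here is a local DuckDB file, but the same pattern is what the thread describes for a Postgres- or MySQL-backed catalog.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

con.execute("CREATE TABLE lake.orders (id INTEGER, amount DECIMAL(10, 2))")
con.execute("CREATE TABLE lake.audit_log (note VARCHAR)")

# Multi-statement, multi-table write: both inserts are committed to the
# catalog atomically, so readers see either both changes or neither.
con.execute("BEGIN TRANSACTION")
con.execute("INSERT INTO lake.orders VALUES (1, 99.50)")
con.execute("INSERT INTO lake.audit_log VALUES ('loaded batch 1')")
con.execute("COMMIT")
```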
Scale, Parallelism, and Use Cases
- Debate over scalability: DuckLake currently assumes single-node DuckDB engines with good multicore parallelism, rather than distributed engines like Spark or Trino.
- Some argue most orgs don’t need multi-node query execution; others question the manifesto’s claims about “hundreds of terabytes and thousands of compute nodes.”
- Horizontal scaling is framed as “many DuckDB nodes in parallel” (many independent queries, not one distributed query); see the sketch below.
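A sketch of that framing under stated assumptions: a shared Postgres-backed catalog reachable by every worker (the connection string, bucket, table, and column below are all hypothetical), with each worker being an independent single-node DuckDB process that answers one query end to end.

```python
import duckdb
from concurrent.futures import ProcessPoolExecutor

# Hypothetical shared catalog and data location.
CATALOG = "ducklake:postgres:dbname=lake_catalog host=catalog.internal"
DATA_PATH = "s3://example-bucket/lake/"

def count_events(day: str) -> int:
    # Each worker runs its own single-node DuckDB against the shared catalog.
    con = duckdb.connect()
    con.execute("INSTALL ducklake")
    con.execute("LOAD ducklake")
    con.execute("INSTALL postgres")  # needed here because the catalog is Postgres-backed
    con.execute("LOAD postgres")
    con.execute(f"ATTACH '{CATALOG}' AS lake (DATA_PATH '{DATA_PATH}')")
    return con.execute(
        "SELECT count(*) FROM lake.events WHERE event_date = ?", [day]
    ).fetchone()[0]

if __name__ == "__main__":
    # Many queries in parallel, each handled entirely by one node;
    # no single query is split across nodes.
    days = ["2025-05-01", "2025-05-02", "2025-05-03"]
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(count_events, days)))
```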
Data Ingestion & Existing Files
- Data must be written through DuckLake (INSERT/COPY) so the catalog is updated; just dropping Parquet files into S3 won’t work (see the sketch after this list).
- Multiple commenters want a way to “attach” existing immutable Parquet files without copying, by building catalog metadata over them.
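For the supported path, a minimal sketch (paths and table name hypothetical): the existing Parquet files are read and rewritten through the catalog. Note that this copies the data, which is exactly the step commenters would like to skip for already-immutable files.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

# Initial load: read the existing files and write them through the catalog,
# so DuckLake re-emits them as managed Parquet files and records them.
con.execute("""
    CREATE TABLE lake.trips AS
    SELECT * FROM read_parquet('existing/*.parquet')
""")

# Later batches go through the same door (INSERT/COPY), never straight to S3.
con.execute("INSERT INTO lake.trips SELECT * FROM read_parquet('new_batch/*.parquet')")
```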
Interoperability & Ecosystem
- The catalog DB can be any SQL database (e.g., Postgres or MySQL); the spec is published, so non-DuckDB implementations are possible, but none exist yet (see the sketch after this list).
- Unclear how/when Spark, Trino, Flink, etc. will integrate; current metadata layout is novel, so existing engines won’t understand it without adapters.
- Concern about BI/analytics support until more engines or connectors natively speak DuckLake.
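One consequence of the catalog being an ordinary SQL database is that nothing about it is tied to DuckDB’s query engine: in the sketches above the catalog is just a DuckDB file, and any SQL client can open it and inspect the metadata tables directly (the exact table names are defined by the published spec and not listed here); the same holds when the catalog lives in Postgres or MySQL.

```python
import duckdb

# The catalog created by the earlier ATTACH examples is a plain DuckDB
# database file; open it directly and list the metadata tables it contains.
cat = duckdb.connect("lake.ducklake", read_only=True)
for (name,) in cat.execute(
    "SELECT table_name FROM information_schema.tables ORDER BY table_name"
).fetchall():
    print(name)
```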
MotherDuck & Separation of Concerns
- DuckLake is pitched as an open lakehouse layer with transactional metadata and compute/storage separation on user-controlled infra.
- MotherDuck is described as hosted DuckDB with current limitations (e.g., single writer), but both sides say they’re working on tight integration, including hosting DuckLake catalogs.
Critiques, Open Questions, and Adoption
- Some ask how updates, row-level changes, and time travel work in detail; updates are confirmed supported, but questions remain about the stats tables and snapshot versioning (a time-travel sketch follows this list).
- Questions about “what’s really new vs Hive + catalog over Parquet” keep coming up; proponents point to transactional semantics, richer metadata, and latency improvements.
- Skepticism about big-enterprise adoption due to incumbent vendors and non-technical buying criteria, though others recall similar skepticism when Hadoop/Spark challenged legacy MPP databases.
- There’s a long subthread on time-range partitioning and filename schemes; several argue a simple, widely adopted range-based naming convention could solve some problems without a full lakehouse stack, but current tools are all anchored on Hive-style partitioning.
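On the time-travel questions above, a sketch of the snapshot-based syntax, assuming the `AT (VERSION => ...)` clause and the `ducklake_snapshots()` listing function described in the DuckLake documentation; table and catalog names reuse the hypothetical examples above, and the exact syntax may vary across DuckDB/DuckLake versions.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:lake.ducklake' AS lake (DATA_PATH 'lake_data/')")

# List the snapshots the catalog has recorded (assumed listing function).
print(con.execute("FROM ducklake_snapshots('lake')").fetchall())

# Query a table as of an earlier snapshot; a TIMESTAMP-based variant of the
# same clause is also described, but only the version form is shown here.
print(con.execute("SELECT * FROM lake.orders AT (VERSION => 1)").fetchall())
```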