DuckLake is an integrated data lake and catalog format

Naming & Positioning

  • Many like the idea but dislike the name “DuckLake” for a supposedly general standard; tying it to DuckDB is seen as branding-heavy and potentially limiting.
  • Format itself appears open; some suggest a more neutral name for the table format, reserving “DuckLake” for the DuckDB extension.

Relationship to Iceberg / Delta / Existing Lakes

  • Widely viewed as an Iceberg-inspired system that fixes perceived issues (especially metadata-in-blob-storage), but not strictly a competitor:
    • Can read Iceberg and sync to Iceberg by writing manifests/metadata on demand (the read direction is sketched below).
    • Several commenters expect both to be used together in a “bi-directional” way.
  • Others note that SQL-backed catalogs already exist for Iceberg (e.g., its JDBC catalog); the novelty here is pushing all metadata and stats into SQL.
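  • Illustrative sketch of the read direction, assuming DuckDB’s iceberg and ducklake extensions; paths and table names are placeholders, and S3 credentials are assumed to be configured already (e.g., via CREATE SECRET):

      INSTALL iceberg;  LOAD iceberg;
      INSTALL ducklake; LOAD ducklake;

      -- Attach a DuckLake catalog (a local DuckDB file here) with data files on S3.
      ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/');

      -- Materialize an existing Iceberg table's current snapshot into DuckLake.
      CREATE TABLE lake.events AS
          SELECT * FROM iceberg_scan('s3://my-bucket/warehouse/events');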

Metadata in SQL vs Object Storage

  • Core value proposition: move metadata and stats from many small S3 files into a transactional SQL DB, so a single SQL query can resolve table state instead of many HTTP calls (a simplified sketch follows this list).
  • Claimed benefits: lower latency, fewer conflicts, easier maintenance, plus:
    • Multi-statement / multi-table transactions
    • SQL views, delta queries, encryption
    • Inlined “small inserts” in the catalog
    • Better time travel even after compaction, via references to parts of Parquet files.
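  • A simplified illustration of “metadata in SQL”: the table and column names below are made-up stand-ins, not the spec’s real schema, but they show how one catalog query can replace many object-store reads:

      -- Two toy catalog tables: snapshots, and the data files each snapshot added.
      CREATE TABLE snapshots (snapshot_id BIGINT, created_at TIMESTAMP);
      CREATE TABLE data_files (
          table_name  VARCHAR,
          snapshot_id BIGINT,     -- snapshot that added the file
          file_path   VARCHAR,    -- Parquet file in object storage
          row_count   BIGINT,
          min_ts      TIMESTAMP,  -- per-file stats usable for pruning
          max_ts      TIMESTAMP
      );

      -- One SQL round trip resolves the current file list plus pruning stats,
      -- instead of walking many small metadata objects over HTTP.
      SELECT file_path
      FROM data_files
      WHERE table_name = 'events'
        AND snapshot_id <= (SELECT max(snapshot_id) FROM snapshots)
        AND max_ts >= TIMESTAMP '2025-01-01';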

Scale, Parallelism, and Use Cases

  • Debate over scalability: DuckLake currently assumes single-node DuckDB engines with good multicore parallelism, in contrast to distributed systems (Spark/Trino).
  • Some argue most orgs don’t need multi-node query execution; others question manifesto claims about “hundreds of terabytes and thousands of compute nodes.”
  • Horizontal scaling is framed as “many DuckDB nodes in parallel”, each serving its own queries, not one distributed query (sketched below).
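  • A sketch of that scale-out model, assuming a shared Postgres-backed catalog and that the ducklake extension accepts a connection string of this shape; all names and hosts are placeholders:

      -- Run on each query node independently; nodes coordinate only through
      -- the shared catalog database, and no single query is distributed.
      INSTALL ducklake; LOAD ducklake;
      ATTACH 'ducklake:postgres:dbname=lake_catalog host=catalog.internal' AS lake
          (DATA_PATH 's3://my-bucket/lake/');

      -- Node A serves this query...
      SELECT country, count(*) FROM lake.events GROUP BY country;
      -- ...while node B, attached identically, serves other queries concurrently.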

Data Ingestion & Existing Files

  • Data must be written through DuckLake (INSERT/COPY) so the catalog is updated; just dropping Parquet files into S3 won’t work (sketched below).
  • Multiple commenters want a way to “attach” existing immutable Parquet files without copying, by building catalog metadata over them.
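  • A minimal ingestion sketch, assuming the ducklake extension and a matching Parquet schema; paths and names are placeholders. Bulk-loading existing files this way copies the data into catalog-managed files rather than attaching them in place:

      INSTALL ducklake; LOAD ducklake;
      ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://my-bucket/lake/');

      CREATE TABLE lake.events (event_id BIGINT, ts TIMESTAMP, payload VARCHAR);

      -- Small row-level writes go through the catalog (and may be inlined there).
      INSERT INTO lake.events VALUES (1, TIMESTAMP '2025-05-28 00:00:00', 'example');

      -- Bulk-loading existing Parquet files copies the data into catalog-managed
      -- files; it is not the zero-copy "attach existing files" commenters ask for.
      INSERT INTO lake.events
          SELECT * FROM read_parquet('s3://my-bucket/raw/events/*.parquet');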

Interoperability & Ecosystem

  • Catalog DB can be any SQL database (e.g., Postgres/MySQL); spec is published, so non-DuckDB implementations are possible, but none exist yet.
  • Unclear how/when Spark, Trino, Flink, etc. will integrate; current metadata layout is novel, so existing engines won’t understand it without adapters.
  • Concern about BI/analytics support until more engines or connectors natively speak DuckLake.

MotherDuck & Separation of Concerns

  • DuckLake is pitched as an open lakehouse layer with transactional metadata and compute/storage separation on user-controlled infra.
  • MotherDuck is described as hosted DuckDB with current limitations (e.g., single writer), but both sides say they’re working on tight integration, including hosting DuckLake catalogs.

Critiques, Open Questions, and Adoption

  • Some ask how updates, row-level changes, and time travel work in detail; updates are confirmed supported, but questions remain about the stats tables and snapshot versioning (a time-travel sketch follows this list).
  • Questions about “what’s really new vs Hive + catalog over Parquet” keep coming up; proponents point to transactional semantics, richer metadata, and latency improvements.
  • Skepticism about big-enterprise adoption due to incumbent vendors and non-technical buying criteria, though others recall similar skepticism when Hadoop/Spark challenged legacy MPP databases.
  • There’s a long subthread on time-range partitioning and filename schemes. Several argue a simple, widely adopted range-based naming convention could solve some of these problems without a full lakehouse stack (sketched below), but current tools are all anchored on Hive-style partitioning.
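  • A time-travel sketch, assuming the ducklake extension’s AT (VERSION => …) / AT (TIMESTAMP => …) clauses and a catalog attached as in the earlier sketches; names and values are placeholders:

      -- (ducklake extension loaded and catalog attached as in the earlier sketches)
      -- Query the table as of an earlier snapshot...
      SELECT count(*) FROM lake.events AT (VERSION => 3);
      -- ...or as of a point in time.
      SELECT count(*) FROM lake.events AT (TIMESTAMP => TIMESTAMP '2025-05-01 00:00:00');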
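  • A sketch of the range-based naming idea using plain DuckDB (read_parquet over httpfs); bucket and layout are placeholders:

      -- Hive-style layout most tools expect:  events/date=2024-05-01/part-0.parquet
      -- Range-based layout argued for above:  events/2024/05/01/part-0.parquet
      -- With range-based paths, a reader can prune to one month by path alone:
      SELECT count(*)
      FROM read_parquet('s3://my-bucket/events/2024/05/*/*.parquet');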