pg_lake: Postgres with Iceberg and data lake access

Overall reception & positioning

  • Many see pg_lake as a big milestone: “Postgres with an open data lake,” close to an “open source Snowflake” for some workloads.
  • Others stress it is not a Snowflake replacement: Snowflake still leads on cross-org sharing, governance, large-scale analytics, and broader platform features.

Vendor lock-in, cost, and Snowflake strategy

  • One camp: “Just pay Snowflake” – managed infra, reliability, and focus on product value outweigh theoretical lock-in; everything has some lock-in (cloud, hardware, OSS ecosystems).
  • Opposing view: compute on proprietary warehouses is very expensive; you “pay to see your own data,” especially for BI/visualization workloads. Iceberg/Parquet-on-S3 avoids this by letting many tools query the same storage (see the DuckDB sketch after this list).
  • Several call out Snowflake’s high compute and storage pricing relative to raw cloud costs.
  • Some argue Snowflake supports Iceberg for strategic reasons: to stay competitive as Iceberg becomes a standard and enable bi‑directional migration.
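The interoperability point is concrete: once tables live as Parquet/Iceberg on object storage, any engine that speaks the format can read them without a warehouse’s compute toll. A minimal sketch using stock DuckDB and its httpfs extension (the bucket and path are invented):

      -- Plain DuckDB session, entirely outside Postgres or Snowflake
      INSTALL httpfs;
      LOAD httpfs;

      -- Query the same Parquet files other engines wrote, paying only
      -- for the machine DuckDB runs on (assumes S3 credentials are
      -- already configured, e.g. via a DuckDB secret)
      SELECT date_trunc('day', event_time) AS day, count(*) AS events
      FROM read_parquet('s3://analytics-bucket/events/*.parquet')
      GROUP BY day
      ORDER BY day;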

Architecture & query execution model

  • Postgres remains the frontend and catalog; DuckDB is the analytical engine behind a separate pgduck_server process.
  • Foreign tables (USING iceberg) map to Iceberg/Parquet data; pg_lake analyzes each query and pushes down whatever DuckDB can execute efficiently (see the sketch after this list).
  • Simple queries can be fully executed in DuckDB; more complex ones are split between DuckDB and Postgres.
  • The separate-process design was chosen for threading, memory safety, shared caching, restartability, and clearer resource limits.
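As an illustration of the execution split, here is a rough sketch of the DDL and two query shapes. The syntax is approximated from the summary above; the server, table, and column names are invented, so check the pg_lake docs for the exact form.

      -- Hypothetical foreign table over Iceberg/Parquet data in S3
      CREATE FOREIGN TABLE telemetry (
          device_id   bigint,
          recorded_at timestamptz,
          reading     double precision
      ) SERVER pg_lake
        OPTIONS (path 's3://lake-bucket/telemetry/');

      -- A simple aggregation like this can run entirely in DuckDB
      SELECT device_id, count(*)
      FROM telemetry
      GROUP BY device_id;

      -- A join against a local heap table is split: DuckDB scans and
      -- filters the Parquet side, Postgres finishes the join
      SELECT d.name, count(*)
      FROM telemetry t
      JOIN devices d ON d.id = t.device_id
      GROUP BY d.name;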

Use cases and benefits

  • Periodically offloading older, colder Postgres data to cheap Iceberg/Parquet storage while still being able to query it (tiered storage).
  • Querying large S3/GCS-based datasets (e.g. logs, telemetry) from the same Postgres used for OLTP, including joins with local tables.
  • Simplifying ETL/ELT pipelines that currently shuffle data between Postgres and data lakes via custom jobs.
  • Easy COPY to/from Parquet, with schema auto-detection from existing Parquet files (see the sketch below).
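A sketch of the COPY and tiered-storage patterns described above (table and bucket names are made up, and the exact URL/format options belong to the pg_lake docs):

      -- Archive a year of cold rows to Parquet on S3
      COPY (SELECT * FROM orders WHERE created_at < '2024-01-01')
      TO 's3://archive-bucket/orders/2023.parquet';

      -- Load Parquet back in; column types can be inferred from the
      -- file's own schema
      COPY orders_archive FROM 's3://archive-bucket/orders/2023.parquet';

      -- Join lake-resident history with live OLTP tables in one query
      SELECT c.region, sum(o.total) AS revenue
      FROM orders_archive o
      JOIN customers c ON c.id = o.customer_id
      GROUP BY c.region;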

Comparisons to related projects

  • DuckLake inverts the arrangement: DuckDB is the frontend and engine with Postgres as the catalog, whereas pg_lake keeps Postgres as the frontend and catalog, uses DuckDB as the engine, and relies on Iceberg for interoperability.
  • pg_mooncake: similar vision (Postgres+lakehouse), but commenters describe pg_lake as more mature and already used in heavy production.
  • pg_duckdb: embeds DuckDB per Postgres backend; pg_lake’s authors prefer a single external DuckDB instance for stability and resource control.

Access control & security

  • S3 access is configured via DuckDB “secrets” (credentials/IAM roles) in pgduck_server (see the sketch after this list).
  • Postgres-side privileges are coarse-grained (pg_lake_read/write roles); finer-grained, per-table grants would need more work.
  • Some interest in integrating with enterprise IAM that understands SQL grants better than S3 policies.
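A rough sketch of the two layers (the CREATE SECRET form is stock DuckDB syntax; the pg_lake_read/pg_lake_write role names come from the discussion; the secret name, key values, and grantee roles are invented):

      -- In pgduck_server's DuckDB instance: register S3 credentials
      CREATE SECRET lake_s3 (
          TYPE S3,
          KEY_ID 'AKIA...',
          SECRET '...',
          REGION 'us-east-1'
      );

      -- On the Postgres side, access is coarse role membership
      GRANT pg_lake_read  TO analyst;   -- may read lake tables
      GRANT pg_lake_write TO etl_user;  -- may write/COPY to the lake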

Limitations, maturity, and open questions

  • Type mapping: most Postgres types are supported via Parquet equivalents or a text fallback; very high-precision numerics and some edge cases remain limited.
  • Read-only support for external Iceberg tables exists but is currently constrained; REST catalog support is new and not fully documented.
  • Scaling: today it is a single DuckDB instance on the same machine, good for many workloads but not a distributed engine. Commenters raised noisy-neighbor and memory-intensive-query concerns; the answer so far is mostly “use more RAM / size carefully.”
  • One commenter is broadly skeptical of data lakes and filesystem-backed analytics in general, calling the whole paradigm misguided.