Pg_lake: Postgres with Iceberg and data lake access
Overall reception & positioning
- Many see pg_lake as a big milestone: “Postgres with an open data lake,” close to an “open source Snowflake” for some workloads.
- Others stress it is not a Snowflake replacement: Snowflake still leads on cross-org sharing, governance, large-scale analytics, and broader platform features.
Vendor lock-in, cost, and Snowflake strategy
- One camp: “Just pay Snowflake” – managed infra, reliability, and focus on product value outweigh theoretical lock-in; everything has some lock-in (cloud, hardware, OSS ecosystems).
- Opposing view: compute on proprietary warehouses is very expensive; you “pay to see your own data,” especially for BI/visualization workloads. Iceberg/Parquet-on-S3 avoids this by letting many tools query the same storage.
- Several call out Snowflake’s high compute and storage pricing relative to raw cloud costs.
- Some argue Snowflake supports Iceberg for strategic reasons: to stay competitive as Iceberg becomes a standard and enable bi‑directional migration.
Architecture & query execution model
- Postgres remains the frontend and catalog; DuckDB is the analytical engine behind a separate `pgduck_server` process.
- Foreign tables (`USING iceberg`) map to Iceberg/Parquet data; pg_lake analyzes queries and pushes down what DuckDB can execute efficiently.
- Simple queries can be executed entirely in DuckDB; more complex ones are split between DuckDB and Postgres.
- Separate process chosen for threading, memory-safety, shared caching, restartability, and clearer resource limits.
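As an illustrative sketch of the model above (table name, bucket path, and `location` option are hypothetical; the exact pg_lake DDL may differ), an Iceberg-backed table sits alongside ordinary heap tables and can be joined with them:

```sql
-- Hypothetical: a table whose storage is an Iceberg table on S3.
-- Table name, bucket, and option names are made up for illustration.
CREATE TABLE events_lake ()
USING iceberg
WITH (location = 's3://example-bucket/lake/events');

-- A join between lake data and a local OLTP table; pg_lake decides
-- which parts of the plan to push down to the DuckDB engine running
-- in pgduck_server and which parts Postgres executes itself.
SELECT u.name, count(*)
FROM events_lake e
JOIN users u ON u.id = e.user_id
GROUP BY u.name;
```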
Use cases and benefits
- Periodically offloading “hot” Postgres data to cheap Iceberg/Parquet storage while still querying it (tiered storage).
- Querying large S3/GCS-based datasets (e.g. logs, telemetry) from the same Postgres used for OLTP, including joins with local tables.
- Simplifying ETL/ELT pipelines that currently shuffle data between Postgres and data lakes via custom jobs.
- Easy COPY to/from Parquet; schema auto-detection from existing Parquet files.
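The COPY and tiered-storage points might look roughly like this (paths and table names are invented; exact option syntax for schema auto-detection is not confirmed by the discussion):

```sql
-- Hypothetical: offload older rows to cheap Parquet storage on S3.
COPY (SELECT * FROM orders WHERE created_at < now() - interval '1 year')
TO 's3://example-bucket/archive/orders_old.parquet';

-- Hypothetical: load a Parquet file back; the notes say pg_lake can
-- auto-detect the schema from an existing Parquet file.
COPY orders_archive FROM 's3://example-bucket/archive/orders_old.parquet';
```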
Comparisons to related projects
- DuckLake: DuckDB as frontend and engine, with Postgres as the catalog. pg_lake inverts this: Postgres as frontend and catalog, DuckDB as the engine, using Iceberg for interoperability.
- pg_mooncake: similar vision (Postgres+lakehouse), but commenters describe pg_lake as more mature and already used in heavy production.
- pg_duckdb: embeds DuckDB per Postgres backend; pg_lake’s authors prefer a single external DuckDB instance for stability and resource control.
Access control & security
- S3 access is configured via DuckDB “secrets” (credentials/IAM roles) in `pgduck_server`.
- Postgres-side privileges are coarse-grained (`pg_lake_read`/`pg_lake_write` roles); finer-grained, per-table grants would need more work.
- Some interest in integrating with enterprise IAM that understands SQL grants better than S3 policies.
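A sketch of the access-control surface described above. The secret uses DuckDB's `CREATE SECRET` syntax; the secret name, credentials, and grantee role are placeholders:

```sql
-- Hypothetical: a DuckDB-style S3 secret, configured for pgduck_server.
-- Name, key, and region are placeholders.
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'AKIA...placeholder',
    SECRET '...placeholder...',
    REGION 'us-east-1'
);

-- The coarse-grained Postgres roles mentioned in the discussion:
-- membership gates lake access as a whole, not per table.
GRANT pg_lake_read TO analyst_role;
```

Per-table grants on lake data would require mapping SQL privileges onto object-store access, which is the gap the IAM-integration comments point at.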
Limitations, maturity, and open questions
- Type mapping: most Postgres types are supported via Parquet equivalents or text; very large numerics and some edge cases are limited.
- External Iceberg read-only support exists but is currently constrained; REST catalog support is new and not fully documented.
- Scaling: today it’s a single DuckDB instance on the same machine; good for many workloads but not a distributed engine. Concerns raised about “hot neighbor” problems and memory-intensive queries; answer is mostly “use more RAM / careful sizing.”
- One commenter is broadly skeptical of data lakes and filesystem-backed analytics in general, calling the whole paradigm misguided.