Pg_lake: Postgres with Iceberg and data lake access
Overall reception & positioning
- Many see pg_lake as a big milestone: “Postgres with an open data lake,” close to an “open source Snowflake” for some workloads.
- Others stress it is not a Snowflake replacement: Snowflake still leads on cross-org sharing, governance, large-scale analytics, and broader platform features.
Vendor lock-in, cost, and Snowflake strategy
- One camp: “Just pay Snowflake” – managed infra, reliability, and focus on product value outweigh theoretical lock-in; everything has some lock-in (cloud, hardware, OSS ecosystems).
- Opposing view: compute on proprietary warehouses is very expensive; you “pay to see your own data,” especially for BI/visualization workloads. Iceberg/Parquet-on-S3 avoids this by letting many tools query the same storage.
- Several call out Snowflake’s high compute and storage pricing relative to raw cloud costs.
- Some argue Snowflake supports Iceberg for strategic reasons: to stay competitive as Iceberg becomes a standard and enable bi‑directional migration.
Architecture & query execution model
- Postgres remains the frontend and catalog; DuckDB is the analytical engine behind a separate `pgduck_server` process.
- Foreign tables (`USING iceberg`) map to Iceberg/Parquet data; pg_lake analyzes queries and pushes down what DuckDB can execute efficiently.
- Simple queries can be executed entirely in DuckDB; more complex ones are split between DuckDB and Postgres.
- Separate process chosen for threading, memory-safety, shared caching, restartability, and clearer resource limits.
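As an illustrative sketch of the model above (table name, bucket path, and `location` option are hypothetical; the exact pg_lake DDL may differ), an Iceberg-backed table sits alongside ordinary heap tables and can be joined with them:

```sql
-- Hypothetical: a table whose storage is an Iceberg table on S3.
-- Table name, bucket, and option names are made up for illustration.
CREATE TABLE events_lake ()
USING iceberg
WITH (location = 's3://example-bucket/lake/events');

-- A join between lake data and a local OLTP table; pg_lake decides
-- which parts of the plan to push down to the DuckDB engine running
-- in pgduck_server and which parts Postgres executes itself.
SELECT u.name, count(*)
FROM events_lake e
JOIN users u ON u.id = e.user_id
GROUP BY u.name;
```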
Use cases and benefits
- Periodically offloading “hot” Postgres data to cheap Iceberg/Parquet storage while still querying it (tiered storage).
- Querying large S3/GCS-based datasets (e.g. logs, telemetry) from the same Postgres used for OLTP, including joins with local tables.
- Simplifying ETL/ELT pipelines that currently shuffle data between Postgres and data lakes via custom jobs.
- Easy COPY to/from Parquet; schema auto-detection from existing Parquet files.
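The COPY and tiered-storage points might look roughly like this (paths and table names are invented; exact option syntax for schema auto-detection is not confirmed by the discussion):

```sql
-- Hypothetical: offload older rows to cheap Parquet storage on S3.
COPY (SELECT * FROM orders WHERE created_at < now() - interval '1 year')
TO 's3://example-bucket/archive/orders_old.parquet';

-- Hypothetical: load a Parquet file back; the notes say pg_lake can
-- auto-detect the schema from an existing Parquet file.
COPY orders_archive FROM 's3://example-bucket/archive/orders_old.parquet';
```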
Comparisons to related projects
- DuckLake: DuckDB as frontend and engine, with Postgres as the catalog. pg_lake inverts this: Postgres as frontend and catalog, DuckDB as the engine, using Iceberg for interoperability.
- pg_mooncake: similar vision (Postgres+lakehouse), but commenters describe pg_lake as more mature and already used in heavy production.
- pg_duckdb: embeds DuckDB per Postgres backend; pg_lake’s authors prefer a single external DuckDB instance for stability and resource control.
Access control & security
- S3 access is configured via DuckDB “secrets” (credentials/IAM roles) in `pgduck_server`.
- Postgres-side privileges are coarse-grained (`pg_lake_read`/`pg_lake_write` roles); finer-grained, per-table grants would need more work.
- Some interest in integrating with enterprise IAM that understands SQL grants better than S3 policies.
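A sketch of the access-control surface described above. The secret uses DuckDB's `CREATE SECRET` syntax; the secret name, credentials, and grantee role are placeholders:

```sql
-- Hypothetical: a DuckDB-style S3 secret, configured for pgduck_server.
-- Name, key, and region are placeholders.
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'AKIA...placeholder',
    SECRET '...placeholder...',
    REGION 'us-east-1'
);

-- The coarse-grained Postgres roles mentioned in the discussion:
-- membership gates lake access as a whole, not per table.
GRANT pg_lake_read TO analyst_role;
```

Per-table grants on lake data would require mapping SQL privileges onto object-store access, which is the gap the IAM-integration comments point at.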
Limitations, maturity, and open questions
- Type mapping: most Postgres types are supported via Parquet equivalents or text; very large numerics and some edge cases are limited.
- External Iceberg read-only support exists but is currently constrained; REST catalog support is new and not fully documented.
- Scaling: today it’s a single DuckDB instance on the same machine; good for many workloads but not a distributed engine. Concerns raised about “hot neighbor” problems and memory-intensive queries; answer is mostly “use more RAM / careful sizing.”
- One commenter is broadly skeptical of data lakes and filesystem-backed analytics in general, calling the whole paradigm misguided.