Show HN: BemiDB – Postgres read replica optimized for analytics
Overall concept & architecture
- BemiDB is positioned as a Postgres read-replica for analytics.
- Embeds DuckDB as the query engine, stores data in Apache Iceberg tables with columnar Parquet files (often ZSTD-compressed).
- Runs as a separate process (no Postgres extension), connects over the Postgres protocol, and writes to S3 or local disk.
Primary use cases discussed
- Time-series / IoT: keep recent months in Postgres for fast app queries, archive older data to S3 in Parquet/Iceberg, and run analytical or visualization queries over the full history.
- Auditing / change capture: potential to combine with existing logical-replication-based auditing tooling from the same team.
- Machine learning feature/data pipelines: replacing bespoke Postgres→Parquet→DuckDB flows.
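The time-series retention pattern described above (recent data stays in Postgres, older data is archived) can be illustrated with a small sketch. This is not BemiDB code: the function name `split_by_retention`, the `ts` field, and the 90-day window are assumptions chosen for illustration, and a real setup would do this split in SQL or in the sync tool rather than in application memory.

```python
from datetime import datetime, timedelta, timezone


def split_by_retention(rows, now, keep_days=90, ts_key="ts"):
    """Split rows into (hot, cold) around a retention cutoff.

    hot  -> rows newer than the cutoff, which would stay in Postgres
    cold -> rows older than the cutoff, candidates for archiving
    The names and the 90-day default are hypothetical.
    """
    cutoff = now - timedelta(days=keep_days)
    hot = [r for r in rows if r[ts_key] >= cutoff]
    cold = [r for r in rows if r[ts_key] < cutoff]
    return hot, cold


if __name__ == "__main__":
    now = datetime(2024, 11, 1, tzinfo=timezone.utc)
    readings = [
        {"sensor": "a", "ts": now - timedelta(days=5), "value": 1.0},
        {"sensor": "a", "ts": now - timedelta(days=200), "value": 2.0},
    ]
    hot, cold = split_by_retention(readings, now)
    print(len(hot), "rows kept hot;", len(cold), "rows to archive")
```

Analytical queries over the full history would then read both sets, which is the role the discussion assigns to the Iceberg-backed replica.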
Syncing, consistency, and CDC
- Current implementation: periodic full-table re-sync via COPY to CSV, then into Iceberg.
- Incremental sync with logical replication (CDC) is on the roadmap; the planned approach is to buffer changes and flush them to S3 based on time/size thresholds.
- Strong consistency is not guaranteed; users must accept delayed data for analytics.
- Questions were raised about how updates/deletes, data retention, and very large tables will be handled. The authors' answer: future versions would write Iceberg “diff” files and stitch them together via metadata, which would also enable time travel and schema evolution.
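The planned "buffer changes, flush on time/size thresholds" approach can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `ChangeBuffer`, its parameters, and the `flush` callback are hypothetical, and the callback stands in for whatever would actually write a batch to S3.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ChangeBuffer:
    """Buffers change events and flushes them in batches (hypothetical sketch)."""

    flush: Callable[[List[dict]], None]          # stand-in for "write batch to S3"
    max_changes: int = 1000                      # size threshold (assumed value)
    max_age_seconds: float = 60.0                # time threshold (assumed value)
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _pending: List[dict] = field(default_factory=list, init=False)
    _oldest: Optional[float] = field(default=None, init=False)

    def add(self, change: dict) -> None:
        # Remember when the first change of the current batch arrived.
        if not self._pending:
            self._oldest = self.clock()
        self._pending.append(change)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush if either threshold is reached; return True if a flush happened."""
        if not self._pending:
            return False
        too_many = len(self._pending) >= self.max_changes
        too_old = self.clock() - self._oldest >= self.max_age_seconds
        if too_many or too_old:
            batch, self._pending, self._oldest = self._pending, [], None
            self.flush(batch)
            return True
        return False


if __name__ == "__main__":
    buf = ChangeBuffer(flush=lambda b: print("flushing", len(b), "changes"),
                       max_changes=2)
    buf.add({"op": "insert", "id": 1})
    buf.add({"op": "update", "id": 1})  # reaches the size threshold
```

In a real consumer, `maybe_flush` would also need to be called on a timer so that a quiet stream still flushes once the age threshold passes. The thresholds trade freshness against the number of small files written, which is consistent with the point above that analytics users must accept delayed data.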
Performance, scale, and latency
- Benchmarks cited: on TPC-H SF1/SF0.1, BemiDB’s Parquet data was much smaller than Postgres storage; some debate about the realism of unindexed Postgres baselines.
- One commenter questioned whether logical replication can keep up on multi-TB systems; the authors position the current target as small-to-medium Postgres databases and expect larger deployments to rely on more elaborate pipelines.
- S3-based analytics are said to have latency on the order of 1 second; local SSD-backed Iceberg is reported as “super fast.” Caching is on the roadmap.
Comparison with other tools
- DuckDB: used internally, but seen as still buggy by some; BemiDB adds Postgres-wire and Iceberg support, plus sync automation.
- ClickHouse: widely praised for performance and S3 support; some see it as a better production pairing with Postgres, others prefer BemiDB’s simpler single-binary + object storage model.
- Alternatives mentioned: pg_analytics (ParadeDB), pg-archiver, Debezium/Kafka→ClickHouse pipelines, Materialize/Feldera/Striim for incremental view maintenance.
Licensing debate
- AGPL choice sparked significant pushback due to perceived legal complexity and “fair source” dynamics.
- Others defended AGPL as aligned with user-freedom focused open source.
- Authors indicated openness to more permissive licensing over time.