2024-11-07

Show HN: BemiDB – Postgres read replica optimized for analytics

Overall concept & architecture

BemiDB is positioned as a Postgres read-replica for analytics.
Embeds DuckDB as the query engine, stores data in Apache Iceberg tables with columnar Parquet files (often ZSTD-compressed).
Runs as a separate process (no Postgres extension), connects over the Postgres protocol, and writes to S3 or local disk.

Primary use cases discussed

Time-series / IoT: keep recent months in Postgres for fast app queries, archive older data to S3 in Parquet/Iceberg, and run analytical or visualization queries over the full history.
Auditing / change capture: potential to combine with existing logical-replication-based auditing tooling from the same team.
Machine learning feature/data pipelines: replacing bespoke Postgres→Parquet→DuckDB flows.
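
The time-series use case above amounts to a hot/cold split by age. A minimal sketch of that split, assuming rows are dicts with a `ts` timestamp and a hypothetical 90-day retention window (neither the function name nor the window comes from the discussion):

```python
from datetime import datetime, timedelta, timezone


def split_hot_cold(rows, now, keep_days=90):
    """Partition rows into (hot, cold) by age.

    hot  -> stays in Postgres for fast application queries
    cold -> candidate for archiving to Parquet/Iceberg on S3
    `keep_days` is an illustrative assumption, not a BemiDB setting.
    """
    cutoff = now - timedelta(days=keep_days)
    hot = [r for r in rows if r["ts"] >= cutoff]
    cold = [r for r in rows if r["ts"] < cutoff]
    return hot, cold


if __name__ == "__main__":
    now = datetime(2024, 11, 7, tzinfo=timezone.utc)
    readings = [
        {"sensor": "a", "ts": now - timedelta(days=10)},
        {"sensor": "a", "ts": now - timedelta(days=200)},
    ]
    hot, cold = split_hot_cold(readings, now)
    print(len(hot), "recent rows,", len(cold), "rows to archive")
```

Analytical queries over the full history would then read both sets, which is the role the discussion assigns to the replica.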

Syncing, consistency, and CDC

Current implementation: periodic full-table re-sync, exporting each table via COPY to CSV and then writing it into Iceberg.
Incremental sync with logical replication (CDC) is on the roadmap; planned approach is to buffer changes and flush to S3 based on time/size thresholds.
Strong consistency is not guaranteed; users must accept delayed data for analytics.
Questions were raised about how updates/deletes, data retention, and very large tables will be handled. The answer given: future Iceberg “diff” files stitched together via metadata, which would also enable time travel and schema evolution.
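
The planned "buffer changes and flush based on time/size thresholds" approach can be sketched as follows. This is an illustration under stated assumptions, not BemiDB's implementation: the `ChangeBuffer` class, its threshold defaults, and the `flush` callback (which in practice might write a Parquet file to S3) are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangeBuffer:
    """Hypothetical buffer for replicated changes (not BemiDB's code)."""

    flush: Callable[[List[dict]], None]      # receives each flushed batch
    max_bytes: int = 1_000_000               # assumed size threshold
    max_age_seconds: float = 60.0            # assumed time threshold
    clock: Callable[[], float] = time.monotonic
    _changes: List[dict] = field(default_factory=list, init=False)
    _bytes: int = field(default=0, init=False)
    _first_at: float = field(default=0.0, init=False)

    def add(self, change: dict, size_bytes: int) -> None:
        """Buffer one change, then flush if a threshold is reached."""
        if not self._changes:
            # Age is measured from the oldest change still buffered.
            self._first_at = self.clock()
        self._changes.append(change)
        self._bytes += size_bytes
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush when the buffer is too big or too old; report whether it did."""
        if not self._changes:
            return False
        too_big = self._bytes >= self.max_bytes
        too_old = self.clock() - self._first_at >= self.max_age_seconds
        if not (too_big or too_old):
            return False
        batch, self._changes, self._bytes = self._changes, [], 0
        self.flush(batch)
        return True
```

A caller would invoke `add` for every replicated change and call `maybe_flush` on a timer, so that a quiet table still gets flushed once its oldest buffered change exceeds the age limit. The two thresholds trade freshness against the number of small files written to object storage.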

Performance, scale, and latency

Benchmarks cited: on TPC-H (scale factors 1 and 0.1), BemiDB’s Parquet data was much smaller than Postgres storage; there was some debate about how realistic the unindexed Postgres baselines were.
One commenter questioned logical replication’s ability to keep up on multi-TB systems; authors position current target as small/medium Postgres and expect more pipelines at larger scale.
S3-based analytics are said to have latency on the order of 1 second; local SSD-backed Iceberg is reported as “super fast.” Caching is on the roadmap.

Comparison with other tools

DuckDB: used internally, but seen as still buggy by some; BemiDB adds Postgres wire-protocol compatibility and Iceberg support, plus sync automation.
ClickHouse: widely praised for performance and S3 support; some see it as a better production pairing with Postgres, others prefer BemiDB’s simpler single-binary + object storage model.
Alternatives mentioned: pg_analytics (ParadeDB), pg-archiver, Debezium/Kafka→ClickHouse pipelines, Materialize/Feldera/Striim for incremental view maintenance.

Licensing debate

AGPL choice sparked significant pushback due to perceived legal complexity and “fair source” dynamics.
Others defended AGPL as aligned with user-freedom focused open source.
Authors indicated openness to more permissive licensing over time.
