2024-08-06

Full text search over Postgres: Elasticsearch vs. alternatives

Postgres FTS vs. Elasticsearch

Several commenters say Postgres FTS is fine for simple or low-scale use (blogs, small apps), but becomes painful at larger scales, especially for facets and relevance.
Postgres lacks BM25 and global term statistics in core, limiting ranking quality; extensions like RUM and ParadeDB/pg_search aim to close this gap.
Some argue you eventually end up re‑implementing an ETL‑like pipeline and specialized indexing inside Postgres, without the full feature set of a dedicated search engine.

Elasticsearch Reliability & Operations

Mixed experiences: some report “random” failures, GC issues, and operational headaches, others say Elasticsearch is reliable if sized and configured correctly.
Common pitfalls: dynamic mapping creating huge schemas, underprovisioned clusters, poor shard/index design, and weak backup practices.
Replication, corruption handling, and leader election reliability are debated; some feel it’s improved since ~7.x, others remain wary.
Serverless Elasticsearch/OpenSearch is seen as promising but with tradeoffs (S3-backed, eventual visibility of new data, cold starts).

Faceted Search & Analytics

Facets/aggregations over large result sets are cited as a major reason to prefer Elasticsearch/Solr/OpenSearch.
Postgres can do facets but it’s complex and slow at scale; extensions like pgfaceting and ParadeDB’s aggregations aim to help.
Facets + FTS + analytics in one query is described as a key “stickiness” factor for Elastic-style systems.

BM25, Ranking, and Multi‑Tenancy

BM25 support is viewed as a big differentiator; SQLite FTS5 and Elasticsearch have it; core Postgres does not.
Multi‑tenant BM25 is tricky: corpus‑wide stats can leak information or skew relevance between tenants. Suggested mitigations include per-tenant indexes, partial/partitioned indexes, or per-DB/collection isolation.
Shard-level stats can affect scoring; there are workarounds (e.g., special query modes, 1‑shard indices).

Semantic/Vector Search and Hybrid Approaches

Consensus that vector/semantic search complements, not replaces, keyword/BM25 search.
Hybrid approaches (BM25 + vectors, reciprocal rank fusion, query rewriting) are recommended, especially when domain-specific terms or IDs are important.
Keeping two separate datastores (vector + keyword engine) in sync is seen as operationally fragile; some advocate unified systems that combine both.

Alternatives: Meilisearch, Typesense, Solr, Others

Meilisearch and Typesense praised for simplicity, developer experience, and ease of self-hosting, though feature sets are narrower than Elastic.
Solr is positioned as offering most of Elasticsearch’s Lucene-based power without Elastic’s ecosystem baggage but with less mindshare and more manual scaling.
Other options mentioned: Manticore, ArangoDB (for FTS + graphs), Manticore’s MySQL-friendly protocol, and newer hybrid engines (e.g., Tantivy-based, Infinity).

ParadeDB / pg_search Notes

ParadeDB’s pg_search is described as bringing BM25 and “Elastic-like” FTS into Postgres, distributed as an extension or a separate, replicated Postgres instance.
Earlier versions were weakly consistent; current BM25 indexes are said to be strongly consistent, with minor transactional overhead.
Many users reportedly deploy ParadeDB as a separate “search Postgres” fed via logical replication from primary OLTP Postgres to decouple workloads.

Related topics