2025-05-23

OpenAI: Scaling PostgreSQL to the Next Level

Managed vs self‑hosted PostgreSQL

Several commenters initially assumed OpenAI self-hosts Postgres; clarification was given that they use Azure Database for PostgreSQL (managed).
Self-hosting is seen as attractive for flexibility (superuser, extensions) but “nerve‑wracking” for many due to responsibility for HA, backups, kernel/infra issues.
Others argue self-hosted multi-node Postgres can be very stable and “almost maintenance-free” once set up, but acknowledge it requires real DBA skill.

Oracle, Aurora, and other database options

One thread argues OpenAI would avoid many pain points by using managed Oracle (or Exadata) instead of Postgres: built‑in online schema changes, index invisibility, horizontal HA clusters, advanced pooling, rich telemetry, and no Postgres-style vacuum/bloat.
Counterpoints highlight Oracle licensing, audits, extra costs (DataGuard, backups), Unicode quirks, and non-standard isolation levels.
AWS Aurora is proposed as a simpler scaling solution; critics respond that it’s an over-marketed “black box” with underwhelming performance vs well-tuned self-hosted hardware. Supporters point to features like low-lag replication, parallel query, and cheap clones plus high-profile production users.
Some suggest NewSQL/distributed SQL systems (e.g., YugabyteDB) might be better suited than Postgres for this role.

Single-master architecture and sharding

Many are surprised OpenAI keeps a single primary with ~40 read replicas and no sharding, and has a “no new workloads” policy on that cluster.
Some argue sharding by user/org seems obvious and would ease pressure; others note retrofitting sharding into a large, complex app with hundreds of endpoints is extremely non-trivial.
The speaker’s message: if you’re read-heavy, you can scale quite far with one master plus replicas; sharding is deferred, not ruled out. Critics see accumulating tech debt and complex workarounds as the cost of avoiding sharding.

Operations, backups, and reliability

Strong emphasis on tested backups and periodic restore validation; these are seen as essential but time-consuming and error-prone.
Some say backup/restore is actually harder at scale; others argue you must validate backups regardless of managed vs self-hosted.
Practical tooling mentioned: barman, WAL archiving, separate hourly/daily restored instances used both for support/debugging and continuous backup validation.

Index management and planner control

A key wish: the ability to safely “disable” an index so the planner ignores it while it is still maintained, to assess whether it’s truly safe to drop.
Commenters stress that flipping pg_index.indisvalid is not a real feature, just poking internals without guarantees; managed services often block this.
Existing workarounds: planner GUCs per query, query tricks (e.g., indexed_col + 0 to avoid index use), and pg_hintplan to steer index selection.

ORMs and application design

The talk’s warning about ORMs causing inefficient queries resonates; several commenters argue generic ORMs push you toward least‑common‑denominator SQL and hide data access patterns.
Some advocate “Postgres-first” design with hand-written SQL and Postgres-specific features.
Others defend ORMs for portability and migrations (e.g., painless DB2→Postgres move), and tools like sqlc as a middle ground between raw SQL and full ORM.

Feature requests and contributing to PostgreSQL

Desired core features include: index invisibility, built-in schema change history/auditing, and more robust DDL tracking.
Commenters note many of these can be built today with event triggers and audit extensions, but acknowledge it’s complex and common enough to justify first-class support.
Discussion around “just open PRs” vs the reality of Postgres development: slow review cycles, heavy rebasing, consensus-driven mailing lists, and the need to work with existing committers rather than “railroading with money.”

Perception of the talk and OpenAI’s choices

Some find the content relatively basic and note that 25k QPS per replica isn’t exceptional; others praise the talk as a valuable “user story” at a developer-focused conference.
There is debate over whether choosing Postgres (on Azure) was the right decision for this workload: some see it as misusing a single-node RDBMS where distributed databases fit better; others argue the current architecture is reasonable given heavy read bias and the benefits of managed services.

Related topics