OpenAI: Scaling PostgreSQL to the Next Level

Managed vs self‑hosted PostgreSQL

  • Several commenters initially assumed OpenAI self-hosts Postgres; clarification was given that they use Azure Database for PostgreSQL (managed).
  • Self-hosting is seen as attractive for flexibility (superuser, extensions) but “nerve‑wracking” for many due to responsibility for HA, backups, kernel/infra issues.
  • Others argue self-hosted multi-node Postgres can be very stable and “almost maintenance-free” once set up, but acknowledge it requires real DBA skill.

Oracle, Aurora, and other database options

  • One thread argues OpenAI would avoid many pain points by using managed Oracle (or Exadata) instead of Postgres: built‑in online schema changes, index invisibility, horizontal HA clusters, advanced pooling, rich telemetry, and no Postgres-style vacuum/bloat.
  • Counterpoints highlight Oracle licensing, audits, extra costs (DataGuard, backups), Unicode quirks, and non-standard isolation levels.
  • AWS Aurora is proposed as a simpler scaling solution; critics respond that it’s an over-marketed “black box” with underwhelming performance vs well-tuned self-hosted hardware. Supporters point to features like low-lag replication, parallel query, and cheap clones plus high-profile production users.
  • Some suggest NewSQL/distributed SQL systems (e.g., YugabyteDB) might be better suited than Postgres for this role.

Single-master architecture and sharding

  • Many are surprised OpenAI keeps a single primary with ~40 read replicas and no sharding, and has a “no new workloads” policy on that cluster.
  • Some argue sharding by user/org seems obvious and would ease pressure; others note retrofitting sharding into a large, complex app with hundreds of endpoints is extremely non-trivial.
  • The speaker’s message: if you’re read-heavy, you can scale quite far with one master plus replicas; sharding is deferred, not ruled out. Critics see accumulating tech debt and complex workarounds as the cost of avoiding sharding.

Operations, backups, and reliability

  • Strong emphasis on tested backups and periodic restore validation; these are seen as essential but time-consuming and error-prone.
  • Some say backup/restore is actually harder at scale; others argue you must validate backups regardless of managed vs self-hosted.
  • Practical tooling mentioned: barman, WAL archiving, separate hourly/daily restored instances used both for support/debugging and continuous backup validation.

Index management and planner control

  • A key wish: the ability to safely “disable” an index so the planner ignores it while it is still maintained, to assess whether it’s truly safe to drop.
  • Commenters stress that flipping pg_index.indisvalid is not a real feature, just poking internals without guarantees; managed services often block this.
  • Existing workarounds: planner GUCs per query, query tricks (e.g., indexed_col + 0 to avoid index use), and pg_hintplan to steer index selection.

ORMs and application design

  • The talk’s warning about ORMs causing inefficient queries resonates; several commenters argue generic ORMs push you toward least‑common‑denominator SQL and hide data access patterns.
  • Some advocate “Postgres-first” design with hand-written SQL and Postgres-specific features.
  • Others defend ORMs for portability and migrations (e.g., painless DB2→Postgres move), and tools like sqlc as a middle ground between raw SQL and full ORM.

Feature requests and contributing to PostgreSQL

  • Desired core features include: index invisibility, built-in schema change history/auditing, and more robust DDL tracking.
  • Commenters note many of these can be built today with event triggers and audit extensions, but acknowledge it’s complex and common enough to justify first-class support.
  • Discussion around “just open PRs” vs the reality of Postgres development: slow review cycles, heavy rebasing, consensus-driven mailing lists, and the need to work with existing committers rather than “railroading with money.”

Perception of the talk and OpenAI’s choices

  • Some find the content relatively basic and note that 25k QPS per replica isn’t exceptional; others praise the talk as a valuable “user story” at a developer-focused conference.
  • There is debate over whether choosing Postgres (on Azure) was the right decision for this workload: some see it as misusing a single-node RDBMS where distributed databases fit better; others argue the current architecture is reasonable given heavy read bias and the benefits of managed services.