OpenAI: Scaling PostgreSQL to the Next Level
Managed vs self‑hosted PostgreSQL
- Several commenters initially assumed OpenAI self-hosts Postgres; clarification was given that they use Azure Database for PostgreSQL (managed).
- Self-hosting is seen as attractive for flexibility (superuser, extensions) but “nerve‑wracking” for many due to responsibility for HA, backups, kernel/infra issues.
- Others argue self-hosted multi-node Postgres can be very stable and “almost maintenance-free” once set up, but acknowledge it requires real DBA skill.
Oracle, Aurora, and other database options
- One thread argues OpenAI would avoid many pain points by using managed Oracle (or Exadata) instead of Postgres: built‑in online schema changes, index invisibility, horizontal HA clusters, advanced pooling, rich telemetry, and no Postgres-style vacuum/bloat.
- Counterpoints highlight Oracle licensing, audits, extra costs (DataGuard, backups), Unicode quirks, and non-standard isolation levels.
- AWS Aurora is proposed as a simpler scaling solution; critics respond that it’s an over-marketed “black box” with underwhelming performance vs well-tuned self-hosted hardware. Supporters point to features like low-lag replication, parallel query, and cheap clones plus high-profile production users.
- Some suggest NewSQL/distributed SQL systems (e.g., YugabyteDB) might be better suited than Postgres for this role.
Single-master architecture and sharding
- Many are surprised OpenAI keeps a single primary with ~40 read replicas and no sharding, and has a “no new workloads” policy on that cluster.
- Some argue sharding by user/org seems obvious and would ease pressure; others note retrofitting sharding into a large, complex app with hundreds of endpoints is extremely non-trivial.
- The speaker’s message: if you’re read-heavy, you can scale quite far with one master plus replicas; sharding is deferred, not ruled out. Critics see accumulating tech debt and complex workarounds as the cost of avoiding sharding.
Operations, backups, and reliability
- Strong emphasis on tested backups and periodic restore validation; these are seen as essential but time-consuming and error-prone.
- Some say backup/restore is actually harder at scale; others argue you must validate backups regardless of managed vs self-hosted.
- Practical tooling mentioned: barman, WAL archiving, separate hourly/daily restored instances used both for support/debugging and continuous backup validation.
Index management and planner control
- A key wish: the ability to safely “disable” an index so the planner ignores it while it is still maintained, to assess whether it’s truly safe to drop.
- Commenters stress that flipping
pg_index.indisvalidis not a real feature, just poking internals without guarantees; managed services often block this. - Existing workarounds: planner GUCs per query, query tricks (e.g.,
indexed_col + 0to avoid index use), andpg_hintplanto steer index selection.
ORMs and application design
- The talk’s warning about ORMs causing inefficient queries resonates; several commenters argue generic ORMs push you toward least‑common‑denominator SQL and hide data access patterns.
- Some advocate “Postgres-first” design with hand-written SQL and Postgres-specific features.
- Others defend ORMs for portability and migrations (e.g., painless DB2→Postgres move), and tools like sqlc as a middle ground between raw SQL and full ORM.
Feature requests and contributing to PostgreSQL
- Desired core features include: index invisibility, built-in schema change history/auditing, and more robust DDL tracking.
- Commenters note many of these can be built today with event triggers and audit extensions, but acknowledge it’s complex and common enough to justify first-class support.
- Discussion around “just open PRs” vs the reality of Postgres development: slow review cycles, heavy rebasing, consensus-driven mailing lists, and the need to work with existing committers rather than “railroading with money.”
Perception of the talk and OpenAI’s choices
- Some find the content relatively basic and note that 25k QPS per replica isn’t exceptional; others praise the talk as a valuable “user story” at a developer-focused conference.
- There is debate over whether choosing Postgres (on Azure) was the right decision for this workload: some see it as misusing a single-node RDBMS where distributed databases fit better; others argue the current architecture is reasonable given heavy read bias and the benefits of managed services.