2024-07-05

PostgreSQL and UUID as Primary Key

Performance and storage implications

Random UUIDs (especially v4) hurt B‑tree locality: more page fragmentation, cache misses, index and WAL bloat, and higher IOPS, particularly on network/EBS storage.
Time-ordered IDs (bigserial, UUIDv7, ULID, TSID, Snowflake-like) insert more sequentially and tend to be faster for inserts and friendlier to caches.
UUIDs are 16 bytes vs 4/8 for int/bigint; this multiplies across indexes and foreign keys. For very large tables, people report tens–hundreds of GB of extra storage.
Using Postgres’ native uuid type is consistently preferred over char(36)/text for UUID storage.

UUIDv4 vs UUIDv7 and other schemes

UUIDv4: fully random, great for distributed generation and security-sensitive tokens; worst for index locality.
UUIDv7: time-ordered, still random-suffixed; widely suggested as a good default for DB PKs, but has fewer random bits and embeds timestamps.
Alternatives mentioned: Snowflake IDs, ULID, TSID, TypeID, Stripe-style IDs; all aim for time-ordering + distribution + human/URL friendliness.

Security, information leakage, and “guessable” IDs

Concerns about auto-increment IDs: easy enumeration, German tank problem (inferring volume/growth), and some attack/abuse scenarios.
Concerns about UUIDv7/ULID: timestamp leakage can reveal creation times, growth patterns, even sensitive events (e.g., acquisitions); RFCs recommend v4 for security-critical uses.
Some argue these leaks are overblown for most apps; others treat “secure by default” as a strong requirement and avoid exposing ordered IDs at all.

Schema design: PK choice, foreign keys, and migration

Common pattern advocated: internal bigint/bigserial PK for joins and batching; separate external random ID (UUID/ULID/hash) for APIs and URLs.
“Bigint vs int vs smallint” debate:
- One side: always use bigint to avoid painful 32-bit exhaustion and future growth surprises.
- Other side: choose the smallest type that fits lifetime cardinality; storage and cache footprint matter, especially with many FKs.
Changing PK type later is possible but painful, especially with many foreign keys and external references.

Distributed systems and ID generation location

UUIDs shine when IDs must be generated client-side or across shards/tenants without coordination.
Some note you can still use monotonic integers in distributed settings with sharded sequences or central allocators, but that adds complexity.
Debate on whether IDs should be generated in the DB (simpler semantics) or app/clients (offload DB, offline-first, idempotency).

Collision risk and correctness

UUID collisions are treated as astronomically unlikely; most consider extra collision checking unnecessary beyond the DB’s unique index.
If a collision ever occurs, many would treat it as an indicator of a serious RNG or security problem rather than something to silently retry.

Tooling, ergonomics, and practicality

Debugging and ad-hoc querying with UUIDs is seen as more painful; numeric IDs are easier to read, compare, and batch (“high water mark” semantics).
Some engineers report billions-of-row systems on UUIDv4 without issues; others (especially DB-focused roles) see UUIDv4 as a consistent performance and cost footgun.
General sentiment: context matters; for many apps, UUID choice won’t be the first bottleneck, but making a sensible default (e.g., UUIDv7 + native type, or bigint PK + external ID) is cheap and avoids future regret.

Related topics