2024-08-25

TIL: Versions of UUID and when to use them

UUIDv4 vs UUIDv7 vs ULID

Many commenters recommend v7 as the default, with v4 reserved for cases where creation time is sensitive or IDs must be hard to guess.
v7 is k‑sortable by timestamp, good for database/index locality and time-based querying.
ULID previously filled this niche; now that v7 is standardized, some prefer v7 for ecosystem support while keeping ULID’s presentation format.
The current third‑party uuid7 package for Python is called out as unmaintained and non‑compliant: it uses nanosecond rather than millisecond timestamp precision, which can break v7 monotonicity.
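
To make the sortability point concrete, here is a minimal sketch of the v7 bit layout from RFC 9562: a 48-bit Unix millisecond timestamp, a 4-bit version, 12 random bits, a 2-bit variant, and 62 more random bits. The function name uuid7_sketch and its unix_ms parameter are made up for illustration. It is not a production generator, and it makes no attempt at ordering IDs created within the same millisecond.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-shaped value (hypothetical helper, illustration only)."""
    if unix_ms is None:
        # Millisecond precision, as the v7 layout expects.
        unix_ms = time.time_ns() // 1_000_000

    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits; 74 are used
    rand_a = (rand >> 62) & 0xFFF                  # 12 bits after the version
    rand_b = rand & ((1 << 62) - 1)                # 62 bits after the variant

    value = (unix_ms & ((1 << 48) - 1)) << 80      # timestamp in the top 48 bits
    value |= 0x7 << 76                             # version = 7
    value |= rand_a << 64
    value |= 0b10 << 62                            # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


if __name__ == "__main__":
    earlier = uuid7_sketch(1_700_000_000_000)
    later = uuid7_sketch(1_700_000_000_001)
    # The timestamp leads, so both the integer and the string form sort by time.
    print(earlier, later, earlier < later, str(earlier) < str(later))
```

Because the timestamp occupies the most significant bits, any two values from different milliseconds compare in creation order regardless of the random tail.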

Deterministic / hash-based IDs

Some want deterministic IDs that can be regenerated when reprocessing data.
UUIDv5 and v3 are highlighted as the hash‑based, deterministic options: each hashes a namespace UUID plus a name (v5 with SHA‑1, v3 with MD5).
There’s debate on how much standardization is possible, since what to hash is inherently application-specific.
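
A short sketch of the deterministic approach using the standard library's uuid.uuid5. The namespace derived from "example.com" and the "source:key" naming scheme are assumptions for illustration; choosing what to hash is exactly the application-specific part mentioned above.

```python
import uuid

# Hypothetical application namespace, derived once from a domain you control.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def record_id(source: str, key: str) -> uuid.UUID:
    """Return the same UUIDv5 every time for the same (source, key) pair."""
    return uuid.uuid5(APP_NAMESPACE, f"{source}:{key}")


if __name__ == "__main__":
    first = record_id("orders", "42")
    again = record_id("orders", "42")
    print(first, first == again)
```

Reprocessing the same input regenerates the same ID, so repeated imports can be matched without a lookup table.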

Security and privacy concerns

Timestamps in IDs may leak creation time; v4 avoids this, v7/ULID do not.
Some see parts of the security industry as overemphasizing unlikely threats; others argue it’s still valuable to at least consider security trade‑offs.
Using non‑guessability as a security control is discussed; if you truly need that, random (v4‑style) IDs are preferred.

MAC-based and hash-based versions

Advice: avoid MAC‑based versions, especially v1; they can leak hardware info.
MD5 is criticized for cryptography, but noted as still usable as a non‑crypto hash, with performance and ubiquity trade‑offs.
RFC 9562’s v8 example shows how to plug in stronger hashes like SHA‑256, though the digest must be truncated to fit a 128‑bit UUID, which is a limitation.
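
The v8 idea can be sketched as follows, in the spirit of the RFC's name-based example rather than a reproduction of it: hash the namespace bytes plus the name with SHA-256, keep the first 128 bits, then overwrite the version and variant fields. The function name uuid8_sha256_sketch is hypothetical, and this code is not guaranteed to match the RFC's test vector.

```python
import hashlib
import uuid


def uuid8_sha256_sketch(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Name-based UUIDv8 built from a truncated SHA-256 digest (illustration only)."""
    digest = hashlib.sha256(namespace.bytes + name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:16], "big")   # truncate 256 bits to 128

    value &= ~(0xF << 76)                        # clear the version field
    value |= 0x8 << 76                           # version = 8
    value &= ~(0x3 << 62)                        # clear the variant field
    value |= 0b10 << 62                          # RFC variant bits
    return uuid.UUID(int=value)


if __name__ == "__main__":
    print(uuid8_sha256_sketch(uuid.NAMESPACE_DNS, "www.example.com"))
```

Only 122 bits of the digest survive after truncation and the version and variant overwrite, which is the limitation noted above.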

Shorter and alternative ID schemes

Strong interest in “short UUIDs” or URL‑friendly IDs: base64/base58 encoded UUIDs, ULID, Nanoid, Sqids, custom hashed integers, YouTube-style IDs, etc.
Trade‑offs: length vs collision risk, human readability, URL safety, lexicographic sortability, and standardization.
Some work is underway to standardize shorter encodings for 128‑bit UUIDs.
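
One of the simpler options listed above, a base64url-encoded UUID, can be sketched with the standard library alone. The helper names to_short and from_short are made up for illustration.

```python
import base64
import uuid


def to_short(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes as unpadded, URL-safe base64 (22 characters)."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def from_short(s: str) -> uuid.UUID:
    """Restore the padding that to_short stripped and decode back to a UUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))


if __name__ == "__main__":
    original = uuid.uuid4()
    short = to_short(original)
    print(original, short, from_short(short) == original)
```

This shortens 36 characters to 22 without losing any bits, but the base64 alphabet is not in ASCII order, so the encoded strings do not sort the same way as the underlying bytes. That is one instance of the sortability trade-off above.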

Semantics, history, and standards

UUIDs are fundamentally 128‑bit numbers; the hyphenated string is just one encoding.
Several argue programs should treat UUIDs as opaque binary values and not infer semantics.
Commenters clarify that v2 is in fact specified (via DCE), contrary to the article’s “no known details” phrasing.
Historical context: early uses were for ephemeral message IDs keyed by time and hardware ID; later usage shifted to “canned” identifiers for objects and resources.
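
The point that a UUID is a 128-bit number with several encodings can be seen directly in Python's uuid module. The sample value below is arbitrary.

```python
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")

# The same 128-bit value, rebuilt from three different encodings.
from_int = uuid.UUID(int=u.int)        # a plain integer below 2**128
from_bytes = uuid.UUID(bytes=u.bytes)  # 16 raw bytes, big-endian
from_hex = uuid.UUID(hex=u.hex)        # 32 hex digits, no hyphens

if __name__ == "__main__":
    print(u.int, u.bytes, u.hex, sep="\n")
    print(from_int == from_bytes == from_hex == u)
```

Storing the 16-byte form rather than the 36-character string is a common consequence of treating the value as opaque binary.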

Database and system design considerations

For non‑distributed systems, many recommend simple auto‑increment integers; k‑sortable or encrypted integers can be exposed externally.
For distributed or client‑generated IDs, UUIDv7, snowflake-like schemes, or hash‑based IDs are preferred over central counters.
v7 is praised for improving performance in systems like S3 metadata stores and key‑value databases (e.g., DynamoDB) due to its timestamp ordering.
Reminder: even UUIDs do not guarantee zero collisions, only an extremely low probability of one, so requirements should match the actual risk and scale.
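
To put a rough number on "extremely low", the birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) can be evaluated for n IDs with b random bits. A v4 UUID has 122 random bits (128 minus 4 version bits and 2 variant bits). The function name collision_probability is made up, and the result is an estimate that assumes a sound random source.

```python
import math


def collision_probability(n_ids: int, random_bits: int = 122) -> float:
    """Birthday-bound estimate that at least two of n_ids random IDs collide."""
    exponent = -n_ids * (n_ids - 1) / 2 ** (random_bits + 1)
    # expm1 keeps precision when the probability is far below 1.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(n, collision_probability(n))
```

For v7 the same estimate applies per millisecond with 74 random bits, since IDs created in different milliseconds already differ in the timestamp.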

Overall practical guidance

Common pragmatic stance:
- Non‑distributed / simple app: use integers.
- Need distributed, opaque, query‑friendly IDs: use UUIDv7.
- Need maximum unpredictability or to hide time: use UUIDv4.
- Need deterministic IDs: use v5 (hash‑based) or an explicit hashing scheme.
