2025-09-12

UTF-8 is a brilliant design

Brilliance and Core Properties of UTF‑8

Commenters widely praise UTF‑8 as elegant, compact, and backwards‑compatible with ASCII without ugly hacks.
Key features highlighted: it is self‑synchronizing (continuation bytes always start with the bits 10), no NUL or / byte can appear inside a multibyte sequence (every byte of such a sequence has the high bit set), and a decoder can seek to an arbitrary offset or recover from truncation.
Continuation‑byte pattern also gives a strong heuristic for “is this UTF‑8?” on arbitrary data.
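
The self‑synchronizing property can be sketched in a few lines of Python. The helper names below (`is_continuation`, `align_to_boundary`, `looks_like_utf8`) are made up for this illustration and come from no library; the validity check simply leans on Python's built‑in strict decoder.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a sequence."""
    return (byte & 0b1100_0000) == 0b1000_0000


def align_to_boundary(data: bytes, offset: int) -> int:
    """Step back from an arbitrary offset to the start of the enclosing sequence."""
    while 0 < offset < len(data) and is_continuation(data[offset]):
        offset -= 1
    return offset


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: random or legacy-encoded bytes rarely decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 'a', U+65E5 (three bytes: E6 97 A5), 'b'
sample = "a\u65e5b".encode("utf-8")
boundaries = [align_to_boundary(sample, i) for i in range(len(sample))]
```

Landing on either continuation byte of the three‑byte character steps back to its lead byte, so a reader dropped at a random offset never has to scan back more than three bytes to find a boundary.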

21‑Bit Limit and UTF‑16 Entanglement

Several comments note that UTF‑8’s original design could encode 31 bits using sequences of up to six bytes; modern UTF‑8 is capped at four bytes and at U+10FFFF (21 bits) because Unicode chose to stay within the range that UTF‑16 surrogate pairs can address.
Disagreement on whether this is a real sacrifice: some argue 1.1M code points is effectively inexhaustible; others dislike the design coupling to UTF‑16 and would prefer UTF‑16 be deprecated in the long term.
Some point out the practical reality: today’s implementations, not the spec, will be the real limit.
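
The arithmetic behind the 21‑bit and 31‑bit figures is easy to check. This is a small sketch, with constant and function names chosen for illustration only:

```python
# UTF-16 addresses the Basic Multilingual Plane directly, plus
# 1024 high surrogates x 1024 low surrogates for everything above it.
BMP_SIZE = 0x10000
SURROGATE_PAIRS = 0x400 * 0x400
CODE_SPACE = BMP_SIZE + SURROGATE_PAIRS   # number of code points, surrogates included
MAX_CODE_POINT = CODE_SPACE - 1


def payload_bits(length: int) -> int:
    """Payload bits carried by a UTF-8 sequence of the given byte length (1 to 6)."""
    if length == 1:
        return 7                      # 0xxxxxxx
    lead_bits = 7 - length            # 110xxxxx has 5, 1110xxxx has 4, ...
    return lead_bits + 6 * (length - 1)


bits_by_length = {n: payload_bits(n) for n in range(1, 7)}
```

Four bytes already carry 21 bits, which is enough for U+10FFFF; the five‑ and six‑byte forms of the original design (26 and 31 bits) are simply not allowed in modern UTF‑8.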

UTF‑8 vs Other Encodings (UTF‑16, legacy code pages)

Many recount pain from pre‑UTF‑8 days (Shift‑JIS, EUC, GB2312, Big5, ISO‑8859‑x) and mojibake.
Debate over UTF‑16:
- Pro‑UTF‑16: simpler forward parsing, denser for many CJK texts.
- Anti‑UTF‑16: surrogates are easy to mishandle, endianness and BOM add complexity, real‑world documents often mix lots of ASCII so UTF‑8 is usually smaller overall.
Some note that early Windows, Java, JavaScript, and others locked in “16‑bit chars” before UTF‑8’s dominance.
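
The size argument on both sides can be tried directly. The following is a rough sketch with hypothetical sample strings, not a benchmark; real results depend entirely on the text being measured.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


pure_cjk = "\u65e5\u672c\u8a9e"                      # three CJK characters
markup = '<p class="x">\u65e5\u672c\u8a9e</p>'       # same text wrapped in ASCII markup
emoji = "\U0001F600"                                  # outside the BMP: surrogate pair in UTF-16

sizes = {name: encoded_sizes(s) for name, s in
         [("pure_cjk", pure_cjk), ("markup", markup), ("emoji", emoji)]}

# Endianness is a UTF-16 concern only: the same character has two byte orders.
little, big = "A".encode("utf-16-le"), "A".encode("utf-16-be")
```

Bare CJK text is smaller in UTF‑16, but once ASCII markup surrounds it the balance tips toward UTF‑8, which matches the point made in the thread about mixed documents.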

Error Handling, Invalid Sequences, and Security

Overlong encodings and invalid sequences are a known attack surface; the advice is to reject them or map them to the replacement character (U+FFFD), never to silently reinterpret them.
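
A classic overlong case is the two‑byte form C0 AF, which a careless decoder turns into "/". The sketch below contrasts a deliberately unsafe decoder (written only for this illustration) with Python's built‑in strict decoder; the function names are assumptions, not anyone's API.

```python
OVERLONG_SLASH = b"\xc0\xaf"   # two-byte encoding of a value that fits in one byte


def naive_two_byte_decode(seq: bytes) -> str:
    """Deliberately unsafe: combines payload bits without checking for overlong forms."""
    return chr(((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F))


def decode_or_reject(data: bytes):
    """Return the decoded text, or None if the input is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_with_replacement(data: bytes) -> str:
    """Map every ill-formed sequence to U+FFFD instead of reinterpreting it."""
    return data.decode("utf-8", errors="replace")


smuggled = naive_two_byte_decode(OVERLONG_SLASH)
rejected = decode_or_reject(OVERLONG_SLASH)
replaced = decode_with_replacement(OVERLONG_SLASH)
```

The unsafe decoder yields a slash that a path filter scanning the raw bytes would never have seen, which is why strict rejection or replacement is the recommended behavior.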
Discussion of alternative variable‑length schemes (VLQ/LEB128‑like, unary headers) weighing compactness vs self‑synchronization and SIMD‑friendliness.
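
To make the trade‑off concrete, here is a minimal LEB128‑style encoder applied to code points. It is only a sketch of one of the alternative schemes mentioned, with an illustrative function name, and is not a proposal from the thread.

```python
def leb128_encode(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit set on all but the last byte."""
    if value < 0:
        raise ValueError("unsigned values only")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)   # more bytes follow
        else:
            out.append(group)          # final byte: high bit clear
            return bytes(out)


code_point = 0x1780                    # chosen so the final group equals 0x2F
as_leb128 = leb128_encode(code_point)
as_utf8 = chr(code_point).encode("utf-8")
```

The LEB128 form is one byte shorter here, but its final byte is 0x2F, so a byte‑level search for "/" would match in the middle of a non‑ASCII character. UTF‑8 spends extra bits precisely to rule that out.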

Unicode Design and Scope Issues

Critiques target Unicode, not UTF‑8:
- Han (CJK) unification complicates fonts and mixed‑language documents.
- Emoji proliferation and zero‑width‑joiner sequences blur “character” vs glyph.
- Combining characters and variation selectors mean “length” and “character” are inherently fuzzy.
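
The fuzziness of "length" shows up as soon as the same string is measured in different units. This is a small sketch using only the standard library; the `lengths` helper is named for illustration.

```python
import unicodedata


def lengths(text: str) -> dict:
    """Measure one string three ways: code points, UTF-8 bytes, UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


decomposed = "e\u0301"                                    # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)       # single precomposed code point
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"     # three emoji joined by ZWJ

decomposed_lengths = lengths(decomposed)
family_lengths = lengths(family)
```

A reader sees one accented letter and one family emoji, yet none of the three counts says "1" for either string, and normalization changes the answer again.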

String Representations and Indexing

Debate over internal representations: UTF‑8 vs UTF‑16 vs “wide chars” with index‑by‑code‑point.
Many argue O(1) indexing on code points is rarely needed; slices, cursors, or opaque indices over UTF‑8 are usually better.
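
The "byte offsets as opaque indices" approach can be sketched over a plain `bytes` buffer. The generator name is an assumption made for this example; the search and slicing use ordinary `bytes` methods.

```python
def code_point_starts(data: bytes):
    """Yield the byte offset of every code point in well-formed UTF-8."""
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:      # anything that is not a continuation byte
            yield offset


data = "\u65e5\u672c\u8a9e text".encode("utf-8")
needle = "\u8a9e".encode("utf-8")

# A byte-level search returns an offset that is already a valid cursor:
# a well-formed needle can only match on a code point boundary.
cut = data.find(needle)
head = data[:cut].decode("utf-8")
starts = list(code_point_starts(data))
```

Searching, splitting, and slicing all work on offsets obtained from earlier operations, so counting to the Nth code point is never needed.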
