UTF-8 is a brilliant design

Brilliance and Core Properties of UTF‑8

  • Widely praised as elegant, compact, and backwards‑compatible with ASCII without ugly hacks.
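  • Key features highlighted: self‑synchronizing (continuation bytes always start with the bits 10, so a decoder can find the next sequence boundary from any offset), no NUL (0x00) or / (0x2F) bytes inside multibyte sequences, and random seeking and recovery from truncation are possible.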
  • Continuation‑byte pattern also gives a strong heuristic for “is this UTF‑8?” on arbitrary data.
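
The self‑synchronization and detection points above can be sketched in a few lines of Python. This is a minimal illustration, not anything posted in the discussion; the helper names (is_continuation, resync, looks_like_utf8) are made up for the example, and the heuristic simply relies on Python's strict decoder.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """From an arbitrary offset, skip forward to the next byte that starts a sequence."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: does the whole buffer decode as strict UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "naïve 日本語".encode("utf-8")
start = resync(text, 3)  # offset 3 is the second byte of "ï"
print(start, text[start:].decode("utf-8"))

# The same word in Latin-1 contains a lone 0xEF, which is not valid UTF-8.
print(looks_like_utf8(text), looks_like_utf8("naïve".encode("latin-1")))
```

Landing in the middle of a character costs at most three skipped bytes before decoding can resume, which is what makes seeking and truncation recovery cheap.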

21‑Bit Limit and UTF‑16 Entanglement

  • Several comments note that UTF‑8’s original design could encode 31 bits (sequences of up to six bytes); modern UTF‑8 stops at four bytes and at U+10FFFF, which fits in 21 bits, because Unicode limited the code space to what UTF‑16 surrogate pairs can reach.
  • Disagreement on whether this is a real sacrifice: some argue 1.1M code points is effectively inexhaustible; others dislike the design coupling to UTF‑16 and would prefer UTF‑16 be deprecated in the long term.
  • Some point out the practical reality: today’s implementations, not the spec, will be the real limit.
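
The arithmetic behind these numbers is short enough to check directly. The sketch below is only illustrative (the constant names are invented for the example); the last lines show one implementation, CPython, enforcing the cap.

```python
# Highest code point reachable with a UTF-16 surrogate pair:
# 0x10000 + (1024 high surrogates * 1024 low surrogates) - 1
UTF16_MAX = 0x10000 + (0x400 * 0x400) - 1
print(hex(UTF16_MAX))  # 0x10ffff

# Payload bits in a four-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
FOUR_BYTE_PAYLOAD_BITS = 3 + 6 + 6 + 6
print(FOUR_BYTE_PAYLOAD_BITS)  # 21

# Payload bits in the original six-byte form: 1111110x plus five continuation bytes
SIX_BYTE_PAYLOAD_BITS = 1 + 5 * 6
print(SIX_BYTE_PAYLOAD_BITS)  # 31

# The top code point still fits in four bytes...
print(len(chr(UTF16_MAX).encode("utf-8")))  # 4

# ...and CPython refuses anything above it.
try:
    chr(UTF16_MAX + 1)
    above_max_rejected = False
except ValueError:
    above_max_rejected = True
print(above_max_rejected)
```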

UTF‑8 vs Other Encodings (UTF‑16, legacy code pages)

  • Many recount pain from pre‑UTF‑8 days (Shift‑JIS, EUC, GB2312, Big5, ISO‑8859‑x) and mojibake.
  • Debate over UTF‑16:
    • Pro‑UTF‑16: simpler forward parsing, denser for many CJK texts.
    • Anti‑UTF‑16: surrogates are easy to mishandle; endianness and the BOM add complexity; and real‑world documents often contain so much ASCII that UTF‑8 is usually smaller overall.
  • Some note that early Windows, Java, JavaScript, and others locked in “16‑bit chars” before UTF‑8’s dominance.
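
The size argument is easy to test on your own data. A rough sketch follows; the function name encoded_sizes and the sample strings are assumptions chosen for illustration, not measurements from the thread.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # "-le" avoids counting a BOM
    return utf8_len, utf16_len


samples = {
    "plain ASCII": "The quick brown fox jumps over the lazy dog.",
    "plain CJK": "日本語",
    "CJK inside markup": '<p class="note"><a href="/index.html">日本語</a></p>',
}

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```

Pure CJK text comes out smaller in UTF‑16 (2 bytes per character against 3), but a handful of ASCII markup characters is enough to tip the balance back toward UTF‑8.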

Error Handling, Invalid Sequences, and Security

  • Overlong encodings and other invalid sequences are a known attack surface; the advice is to reject them or map them to the replacement character (U+FFFD), never to silently reinterpret them.
  • Discussion of alternative variable‑length schemes (VLQ/LEB128‑like, unary headers) weighing compactness vs self‑synchronization and SIMD‑friendliness.
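
To make the overlong‑encoding risk concrete, here is a small sketch. The function naive_two_byte_decode is a deliberately sloppy decoder written for this example (hypothetical, not any real library's code); the rest uses Python's built-in strict decoder and its "replace" error handler.

```python
OVERLONG_SLASH = b"\xc0\xaf"  # two-byte overlong form of "/" (U+002F)


def naive_two_byte_decode(data: bytes) -> str:
    """Decode a two-byte sequence with no validity checks (do not do this)."""
    code_point = ((data[0] & 0x1F) << 6) | (data[1] & 0x3F)
    return chr(code_point)


def strict_decode_ok(data: bytes) -> bool:
    """True only if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The sloppy decoder silently reinterprets the bytes as a slash,
# which could slip past a path filter that ran on the raw bytes.
print(repr(naive_two_byte_decode(OVERLONG_SLASH)))

# The strict decoder rejects the sequence outright...
print(strict_decode_ok(OVERLONG_SLASH))

# ...or, if asked, substitutes U+FFFD instead of producing "/".
replaced = OVERLONG_SLASH.decode("utf-8", errors="replace")
print("/" in replaced)
```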

Unicode Design and Scope Issues

  • Critiques target Unicode, not UTF‑8:
    • Han (CJK) unification complicates fonts and mixed‑language documents.
    • Emoji proliferation and zero‑width‑joiner sequences blur the line between “character” and “glyph”.
    • Combining characters and variation selectors mean “length” and “character” are inherently fuzzy.
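
A quick way to see why “length” is fuzzy is to measure the same visible text several ways. This is an illustrative sketch; the helper name measures and the sample strings are chosen for the example.

```python
import unicodedata


def measures(s: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 bytes, UTF-16 code units)."""
    return len(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")) // 2


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man, ZWJ, woman, ZWJ, girl

print(measures(composed))    # (1, 2, 1)
print(measures(decomposed))  # (2, 3, 2)
print(measures(family))      # (5, 18, 8)

# Both spellings of é are canonically equivalent after normalization.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

All three strings are typically drawn as one glyph where the font supports them, yet none of the counts agree on what “one” means. Python's standard library does not segment grapheme clusters, so that count is left out here.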

String Representations and Indexing

  • Debate over internal representations: UTF‑8 vs UTF‑16 vs “wide chars” with index‑by‑code‑point.
  • Many argue O(1) indexing on code points is rarely needed; slices, cursors, or opaque indices over UTF‑8 are usually better.
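
The “cursor or opaque index” idea can be sketched as an iterator that walks UTF‑8 bytes and hands back byte offsets, which can then be used for slicing without any code‑point indexing. This is one possible shape under simple assumptions (the function name iter_code_points is invented, and validation is delegated to Python's strict decoder), not a description of any particular language's string type.

```python
from typing import Iterator, Tuple


def iter_code_points(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, character) pairs from UTF-8 bytes."""
    pos = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            width = 1          # 0xxxxxxx
        elif lead >> 5 == 0b110:
            width = 2          # 110xxxxx
        elif lead >> 4 == 0b1110:
            width = 3          # 1110xxxx
        elif lead >> 3 == 0b11110:
            width = 4          # 11110xxx
        else:
            raise ValueError(f"invalid lead byte at offset {pos}")
        # decode() raises UnicodeDecodeError if the sequence is malformed.
        yield pos, data[pos:pos + width].decode("utf-8")
        pos += width


data = "aé日😀".encode("utf-8")
offsets = [offset for offset, _ in iter_code_points(data)]
print(offsets)

# Offsets act as opaque indices: slicing between two of them is always valid.
print(data[offsets[2]:offsets[3]].decode("utf-8"))
```

Finding the n‑th code point this way is O(n), but a caller that keeps the offsets it was handed never needs to count from the start again.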