2024-05-16

You probably don't need to validate UTF-8 strings

Scope of UTF-8 validation

One side argues you must validate if you normalize, index or split by character, use Unicode-aware regex, or interoperate with systems that assume valid UTF‑8; otherwise length and semantics are ill-defined.
Others claim you can often defer validation or skip it: treat non‑UTF‑8 as raw bytes, use byte-oriented regex, and only validate/normalize at specific boundaries (e.g., before emitting JSON or before using strings as hash keys).
There’s a meta-point that scanning for invalid sequences at any point is already “validation,” just with different error handling (e.g., replacement vs failure).
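A minimal Python sketch of the "defer it" position and of the meta-point, using only the standard library. The sample bytes and the helper names (is_valid_utf8, to_text_lossy) are made up for illustration; the point is that the strict and lossy paths run the same scan and differ only in what happens on an invalid byte.

```python
import re

# Mostly UTF-8, with one byte (0xFF) that can never appear in valid UTF-8.
data = b"key=caf\xc3\xa9;bad=\xff;end"

# Byte-oriented regex: splits the record without decoding or validating it.
fields = re.findall(rb"(\w+)=([^;]*)", data)


def is_valid_utf8(raw: bytes) -> bool:
    """Validation with 'fail' error handling: a strict decode that may raise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_text_lossy(raw: bytes) -> str:
    """The same scan with 'replace' error handling: invalid bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for name, value in fields:
        # Validate only at the boundary where text is actually needed.
        print(name, is_valid_utf8(value), repr(to_text_lossy(value)))
```

Nothing before the final loop needs to know whether the values are valid UTF‑8.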

Language string design (Rust, Go, Python, etc.)

Rust: str is guaranteed valid UTF‑8; this simplifies many operations and lets implementations assume correctness, but forces up-front validation or use of &[u8] for arbitrary data.
Some argue this is mostly “purity” plus minor performance wins; others say it’s crucial because it makes illegal states unrepresentable and centralizes robustness at the type boundary.
Go is cited as successful with “conventionally UTF‑8” strings: they may hold arbitrary bytes, and decoding operations such as ranging over a string map invalid sequences to the replacement character U+FFFD.
Python strings are sequences of Unicode code points and may contain lone surrogates, which cannot be encoded as UTF‑8; this shows that “Unicode string” and “UTF‑8 encodable string” differ.
Several commenters favor byte strings as the fundamental type, with Unicode as an optional layer; others note ecosystems that strongly prefer UTF‑8 (web, many tools).
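The Python point can be seen directly with standard-library calls. This is a small sketch; utf8_encodable is a hypothetical helper name, and the sample values are made up.

```python
def utf8_encodable(s: str) -> bool:
    """True if the str can be written out as UTF-8 with the default strict handler."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A lone high surrogate is a legal Python str of length 1,
# but it is not a Unicode scalar value and has no UTF-8 encoding.
lone = "\ud800"

# The surrogateescape error handler uses the same gap on purpose:
# it smuggles undecodable bytes through a str as lone surrogates.
raw = b"abc\xff"
smuggled = raw.decode("utf-8", errors="surrogateescape")
round_tripped = smuggled.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print(len(lone), utf8_encodable(lone))
    print(repr(smuggled), utf8_encodable(smuggled), round_tripped == raw)
```

So a Python str that arrived via surrogateescape still needs a check before it is handed to anything that expects strict UTF‑8.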

Semantics: equality, substrings, and length

Substring search in UTF‑8 is described as easier than in encodings like UTF‑16 because UTF‑8 is self-synchronizing (a byte-level match of a valid sequence always falls on character boundaries), but normalization and combining marks mean byte-substring search often misses semantically equivalent text.
Canonical Unicode normalization is considered expensive and table-driven; there is “no cheap canonical UTF‑8.”
“String equality” is framed as context-dependent: byte equality, code-point equality (after normalization), locale-aware collation, visual equivalence, or application-specific notions (addresses, names).
Length is also context-dependent: bytes vs code points vs grapheme clusters vs rendered width.
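A short Python sketch of why both equality and length depend on context, using the standard unicodedata module. The helper names (nfc_equal, measure, contains) are illustrative only. Grapheme clusters and rendered width are left out because the standard library has no segmentation API for them.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent


def nfc_equal(a: str, b: str) -> bool:
    """Code-point equality after canonical (NFC) normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def measure(s: str) -> dict:
    """Two of the possible 'lengths' of the same text."""
    return {"bytes": len(s.encode("utf-8")), "code_points": len(s)}


def contains(haystack: str, needle: str, normalize: bool) -> bool:
    """Byte-substring search, optionally after normalizing both sides to NFC."""
    if normalize:
        haystack = unicodedata.normalize("NFC", haystack)
        needle = unicodedata.normalize("NFC", needle)
    return needle.encode("utf-8") in haystack.encode("utf-8")


if __name__ == "__main__":
    print(composed == decomposed, nfc_equal(composed, decomposed))
    print(measure(composed), measure(decomposed))
    print(contains("cafe\u0301 au lait", "caf\u00e9", normalize=False))
    print(contains("cafe\u0301 au lait", "caf\u00e9", normalize=True))
```

Both strings render as the same "é", yet they differ in bytes, in code points, and in whether a raw byte search finds one inside the other.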

File paths and non‑Unicode data

Debate on whether programs should insist on UTF‑8 file paths: some say it simplifies cross-platform handling; others argue many tools only need opaque byte sequences and rejecting valid but non‑UTF‑8 paths is user-hostile.
Strategies mentioned: using dedicated UTF‑8 path types, WTF‑8 for Windows surrogates, or keeping paths as raw bytes and only decoding when displaying.
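The "raw bytes, decode only for display" strategy and the "insist on UTF‑8" strategy can be contrasted in a few lines of Python. This is a sketch with made-up path values and hypothetical helper names (display_name, require_utf8_path). Python's own os.fsdecode takes a third route on POSIX, round-tripping undecodable bytes through the surrogateescape handler.

```python
def display_name(raw_path: bytes) -> str:
    """Decode for display only; undecodable bytes are shown as U+FFFD."""
    return raw_path.decode("utf-8", errors="replace")


def require_utf8_path(raw_path: bytes) -> str:
    """Strict policy: raises UnicodeDecodeError for a non-UTF-8 path."""
    return raw_path.decode("utf-8")


# Two paths a POSIX filesystem could hold: one UTF-8, one with a Latin-1 "é" (0xE9).
paths = [b"notes/old-\xe9.txt", b"notes/caf\xc3\xa9.txt"]

# Opaque-bytes handling: sorting and deduplicating never need to decode.
ordered = sorted(set(paths))


def accepted_by_strict_policy(raw_paths) -> list:
    """Return the paths a UTF-8-only tool would accept; the rest are rejected."""
    ok = []
    for p in raw_paths:
        try:
            require_utf8_path(p)
            ok.append(p)
        except UnicodeDecodeError:
            pass
    return ok


if __name__ == "__main__":
    for p in ordered:
        print(display_name(p))
    print(accepted_by_strict_policy(ordered))
```

The opaque-bytes tool handles both files and degrades only in what it prints; the strict tool cannot operate on the Latin-1 path at all.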

Immutability and performance

Brief side thread: mutability gives O(1) in-place updates, while immutable (persistent) structures can add O(log N) overhead per update; the counterpoint is that immutability can enable better global optimizations and parallelization, though benchmarks favor mutable designs today.
