You probably don't need to validate UTF-8 strings

Scope of UTF-8 validation

  • One side argues you must validate whenever you normalize, index or split by character, run Unicode-aware regexes, or interoperate with systems that assume valid UTF-8; otherwise length and other string semantics are ill-defined.
  • Others claim validation can often be deferred or skipped: treat non‑UTF‑8 input as raw bytes, use byte-oriented regexes, and validate or normalize only at specific boundaries (e.g., before emitting JSON or before using strings as hash keys).
  • There’s a meta-point that scanning for invalid sequences at any point is already “validation,” just with different error handling (e.g., replacement vs failure).
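
To make the "defer or skip" position and the meta-point concrete, here is a minimal Python sketch. The function names and the sample payload are hypothetical, chosen for illustration only. It shows a byte-oriented regex running on unvalidated input, plus the two error-handling styles (fail vs replace) applied to the same bytes.

```python
import re

def to_text_strict(raw: bytes) -> str:
    """Validation by failure: raises UnicodeDecodeError on invalid UTF-8."""
    return raw.decode("utf-8", errors="strict")

def to_text_lossy(raw: bytes) -> str:
    """Validation by repair: each invalid byte becomes U+FFFD."""
    return raw.decode("utf-8", errors="replace")

def find_key(raw: bytes):
    """Byte-oriented regex: runs on arbitrary bytes, no decoding required."""
    m = re.search(rb"key=([^;]*)", raw)
    return m.group(1) if m else None

# Hypothetical input: one valid UTF-8 field followed by two invalid bytes.
payload = b"key=caf\xc3\xa9;junk=\xff\xfe"

value = find_key(payload)            # extracted without validating the whole payload
text = to_text_strict(value)         # validate only the part that crosses a boundary
print(text)
print(to_text_lossy(payload))
```

Both decoders scan for invalid sequences; they differ only in what happens when one is found, which is the meta-point in the last bullet.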

Language string design (Rust, Go, Python, etc.)

  • Rust: str is guaranteed valid UTF‑8; this simplifies many operations and lets implementations assume correctness, but forces up-front validation or use of &[u8] for arbitrary data.
  • Some argue this is mostly “purity” plus minor performance wins; others say it’s crucial because it makes illegal states unrepresentable and centralizes robustness at the type boundary.
  • Go is cited as a success story for “conventionally UTF‑8” strings: they may hold arbitrary bytes, and invalid sequences decode to the replacement character (U+FFFD) when iterated by rune.
  • Python strings are sequences of Unicode code points, not necessarily encodable as UTF‑8; lone surrogates show that “Unicode string” and “UTF‑8 encodable string” differ.
  • Several commenters favor byte strings as the fundamental type, with Unicode as an optional layer; others note ecosystems that strongly prefer UTF‑8 (web, many tools).
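
The Python point can be checked directly. The sketch below uses a hypothetical helper name, `is_utf8_encodable`, and shows that a lone surrogate is a legal `str` value that the strict UTF-8 codec refuses to encode.

```python
def is_utf8_encodable(s: str) -> bool:
    """Return True if s can be encoded with Python's strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

ordinary = "h\u00e9llo"   # ordinary text with one non-ASCII code point
lone = "\ud800"           # a lone high surrogate: a valid str of length 1

print(len(lone))
print(is_utf8_encodable(ordinary))
print(is_utf8_encodable(lone))
```

A type like Rust's str rules this state out at construction time, which is the "illegal states unrepresentable" argument in the bullets above.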

Semantics: equality, substrings, and length

  • Substring search in UTF‑8 is said to be easier than in encodings like UTF‑16 because UTF‑8 is self-synchronizing (a byte-level match cannot begin in the middle of a character), but normalization and combining marks mean byte-substring search often misses semantically equivalent text.
  • Canonical Unicode normalization is considered expensive and table-driven; there is “no cheap canonical UTF‑8.”
  • “String equality” is framed as context-dependent: byte equality, code-point equality (after normalization), locale-aware collation, visual equivalence, or application-specific notions (addresses, names).
  • Length is also context-dependent: bytes vs code points vs grapheme clusters vs rendered width.
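
A small Python sketch of these distinctions, using the standard `unicodedata` module. The helper names `nfc_equal` and `lengths` are hypothetical. Grapheme-cluster counts and rendered width are not covered here because the standard library has no direct function for them.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point, U+00E9
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent, U+0301

def nfc_equal(a: str, b: str) -> bool:
    """Code-point equality after canonical (NFC) normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def lengths(s: str) -> dict:
    """Two of the competing notions of length: UTF-8 bytes and code points."""
    return {"bytes": len(s.encode("utf-8")), "code_points": len(s)}

haystack = "un " + decomposed

# Byte-substring search misses the canonically equivalent spelling.
byte_hit = composed.encode("utf-8") in haystack.encode("utf-8")
# Normalizing both sides first finds it.
nfc_hit = composed in unicodedata.normalize("NFC", haystack)

print(composed == decomposed, nfc_equal(composed, decomposed))
print(byte_hit, nfc_hit)
print(lengths(composed), lengths(decomposed))
```

The two spellings render identically, yet they differ in byte length, code-point length, and raw equality, so "equal" and "length" only have answers once the context is fixed.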

File paths and non‑Unicode data

  • Debate on whether programs should insist on UTF‑8 file paths: some say it simplifies cross-platform handling; others argue many tools only need opaque byte sequences and rejecting valid but non‑UTF‑8 paths is user-hostile.
  • Strategies mentioned: using dedicated UTF‑8 path types, using WTF‑8 to represent unpaired surrogates in Windows paths, or keeping paths as raw bytes and decoding only for display.
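
As an illustration of the "keep raw bytes, decode only for display" strategy, here is a Python sketch. It uses Python's own `surrogateescape` error handler for lossless round-tripping, which is a different mechanism from WTF‑8 and is swapped in here only because it ships with the standard library. The file name and function names are hypothetical, and nothing touches the real filesystem.

```python
# Hypothetical file name in Latin-1: a legal POSIX name, but not valid UTF-8.
raw_name = b"caf\xe9.txt"

def for_display(name: bytes) -> str:
    """Lossy decode for showing to a user; invalid bytes become U+FFFD."""
    return name.decode("utf-8", errors="replace")

def to_roundtrip_str(name: bytes) -> str:
    """Lossless decode: invalid bytes are smuggled through as lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")

def back_to_bytes(s: str) -> bytes:
    """Inverse of to_roundtrip_str: recovers the original bytes exactly."""
    return s.encode("utf-8", errors="surrogateescape")

shown = for_display(raw_name)
carried = to_roundtrip_str(raw_name)

print(shown)
print(back_to_bytes(carried) == raw_name)
```

The display string cannot be used to reopen the file, while the round-trip string can, at the cost of holding a value that is not UTF‑8 encodable under the strict codec.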

Immutability and performance

  • Brief side thread: mutability gives O(1) in-place updates, while immutable (persistent) structures can add O(log N) overhead per update; the counterpoint is that immutability can enable better global optimizations and parallelization, though commenters note that benchmarks currently favor mutable designs.