You probably don't need to validate UTF-8 strings
Scope of UTF-8 validation
- One side argues you must validate if you do normalization, indexing/splitting by character, Unicode-aware regex, or interoperate with systems that assume valid UTF-8; otherwise length and semantics are ill-defined.
- Others claim you can often defer validation or skip it: treat non‑UTF‑8 as raw bytes, use byte-oriented regex, and only validate/normalize at specific boundaries (e.g., before emitting JSON or before using strings as hash keys).
- There’s a meta-point that scanning for invalid sequences at any point is already “validation,” just with different error handling (e.g., replacement vs failure).
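A minimal Python sketch of the three positions above: strict validation that fails, lossy decoding that substitutes U+FFFD, and byte-oriented matching that never decodes at all. The function names and the sample bytes are made up for illustration. The first two both scan the input for invalid sequences, which is the meta-point: they differ only in how the error is handled.

```python
import re

def decode_strict(data: bytes) -> str:
    """Validate up front: raises UnicodeDecodeError on invalid UTF-8."""
    return data.decode("utf-8")  # errors="strict" is the default

def decode_lossy(data: bytes) -> str:
    """Same scan, different error handling: invalid bytes become U+FFFD."""
    return data.decode("utf-8", errors="replace")

def ascii_words(data: bytes) -> list[bytes]:
    """Byte-oriented regex: works on the raw bytes, no validation at all."""
    return re.findall(rb"[a-z]+", data)

# Hypothetical input: valid "café", then a stray 0xFF byte, then "ok".
raw = b"caf\xc3\xa9 \xff ok"

if __name__ == "__main__":
    print(ascii_words(raw))
    print(decode_lossy(raw))
    try:
        decode_strict(raw)
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc.reason)
```

Which of the three is appropriate depends on the boundary: the byte-oriented version is enough for tools that only pass data through, while the strict version fits places where downstream code assumes valid text.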
Language string design (Rust, Go, Python, etc.)
- Rust: str is guaranteed valid UTF‑8; this simplifies many operations and lets implementations assume correctness, but forces up-front validation or use of &[u8] for arbitrary data.
- Some argue this is mostly “purity” plus minor performance wins; others say it’s crucial because it makes illegal states unrepresentable and centralizes robustness at the type boundary.
- Go is cited as successful with “conventionally UTF‑8” strings that gracefully map invalid sequences to replacement characters.
- Python strings are sequences of Unicode code points, not necessarily valid UTF‑8; surrogate issues show that “Unicode string” and “UTF‑8 encodable string” differ.
- Several commenters favor byte strings as the fundamental type, with Unicode as an optional layer; others note ecosystems that strongly prefer UTF‑8 (web, many tools).
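The Python point about surrogates can be reproduced in a few lines. This is only a sketch, and is_utf8_encodable is a hypothetical helper name: a lone surrogate is a legal element of a Python str, yet the strict UTF-8 encoder rejects it, so "Unicode string" and "UTF-8 encodable string" are not the same set.

```python
def is_utf8_encodable(text: str) -> bool:
    """Return True if the str can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# A lone high surrogate: a valid Python str of length 1, but not valid UTF-8.
lone_surrogate = "\ud800"

if __name__ == "__main__":
    print(len(lone_surrogate))
    print(is_utf8_encodable("plain text"))
    print(is_utf8_encodable(lone_surrogate))
    # The "surrogatepass" error handler emits the bytes anyway,
    # producing output that a strict UTF-8 decoder will reject.
    print(lone_surrogate.encode("utf-8", errors="surrogatepass"))
```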
Semantics: equality, substrings, and length
- Substring search in UTF‑8 is easier than in encodings like UTF‑16 due to self-delimiting code units, but normalization and combining marks mean byte-substring search often misses semantically equivalent text.
- Canonical Unicode normalization is considered expensive and table-driven; there is “no cheap canonical UTF‑8.”
- “String equality” is framed as context-dependent: byte equality, code-point equality (after normalization), locale-aware collation, visual equivalence, or application-specific notions (addresses, names).
- Length is also context-dependent: bytes vs code points vs grapheme clusters vs rendered width.
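To make the equality and length points concrete, here is a small Python sketch using the standard unicodedata module. The helper names are illustrative only, and grapheme clusters and rendered width are left out because the standard library does not compute them.

```python
import unicodedata

# Two spellings of "é": one precomposed code point vs "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"

def nfc_equal(a: str, b: str) -> bool:
    """Code-point equality after canonical (NFC) normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def lengths(text: str) -> dict[str, int]:
    """Two of the competing notions of length: UTF-8 bytes and code points."""
    return {"bytes": len(text.encode("utf-8")), "code_points": len(text)}

if __name__ == "__main__":
    print(composed == decomposed)          # raw comparison
    print(nfc_equal(composed, decomposed)) # comparison after normalization
    print(lengths(composed), lengths(decomposed))
    # Byte-substring search misses the canonically equivalent spelling.
    haystack = ("caf" + decomposed).encode("utf-8")
    print(("caf" + composed).encode("utf-8") in haystack)
```

Both strings render as a single "é", so neither the byte count nor the code-point count matches what a user would call the length.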
File paths and non‑Unicode data
- Debate on whether programs should insist on UTF‑8 file paths: some say it simplifies cross-platform handling; others argue many tools only need opaque byte sequences and rejecting valid but non‑UTF‑8 paths is user-hostile.
- Strategies mentioned: using dedicated UTF‑8 path types, WTF‑8 for Windows surrogates, or keeping paths as raw bytes and only decoding when displaying.
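The "keep paths as bytes, decode only for display" strategy might look like the following in Python. The file name and function names are invented for illustration. The surrogateescape handler used here is the same one Python's os.fsdecode applies on POSIX systems; it maps undecodable bytes to lone surrogates so the original bytes can be recovered exactly.

```python
def to_display(name: bytes) -> str:
    """Lossy decode for showing a path to a user; not reversible."""
    return name.decode("utf-8", errors="replace")

def to_roundtrip_str(name: bytes) -> str:
    """Reversible decode: invalid bytes become lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")

def from_roundtrip_str(text: str) -> bytes:
    """Recover the exact original bytes from a surrogateescape-decoded str."""
    return text.encode("utf-8", errors="surrogateescape")

# Hypothetical file name containing a byte (0xFF) that is never valid UTF-8.
raw_name = b"report-\xff.txt"

if __name__ == "__main__":
    print(to_display(raw_name))
    escaped = to_roundtrip_str(raw_name)
    print(from_roundtrip_str(escaped) == raw_name)
```

The round-trip string is exactly the kind of value discussed earlier under Python: a str that is not strictly UTF-8 encodable, so it should be converted back to bytes before it crosses a boundary that expects valid UTF-8.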
Immutability and performance
- Brief side thread: mutability gives O(1) in-place updates, while immutability can add O(log N) overhead per update; the counterpoint is that immutability can enable better global optimizations and parallelization, though benchmarks favor mutable designs today.