2024-11-24

Charset="WTF-8"

Human name validation pitfalls

Many examples of systems rejecting perfectly valid names: diacritics (e.g., “ł”, “æ”), hyphens, apostrophes, multiple surnames, no surname, or non‑Latin scripts.
Split “first/last” fields often fail for cultures with different name structures (no family name, multiple given names, patronymics, order differences).
Several commenters argue the only universally safe rule is “non‑empty Unicode string”; anything stricter will exclude real people.
Others note false assumptions: everyone has a single name, has exactly one legal name, name always matches one government record, etc.

What to validate (and what not to)

Common minimal checks proposed:
- Non‑zero length.
- Valid Unicode (no unpaired surrogates, no invalid code points).
- Exclude control characters and other non-text code points: category Cc (controls), Cs (surrogates), noncharacters (which sit in Cn), and often Co (private use).
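
The three checks above can be sketched in Python roughly as follows. This is a minimal illustration rather than a recommended policy: the function name `is_acceptable_name` is hypothetical, and rejecting private-use characters (Co) is the optional "often" case from the list, exposed here as a flag.

```python
import unicodedata

def is_acceptable_name(name: str, allow_private_use: bool = False) -> bool:
    """Hypothetical minimal check: non-empty, well-formed, no control-like code points."""
    if len(name) == 0:
        return False
    for ch in name:
        cat = unicodedata.category(ch)
        # Cc = control characters, Cs = (unpaired) surrogates
        if cat in ("Cc", "Cs"):
            return False
        # Co = private use; rejected unless the caller opts in
        if cat == "Co" and not allow_private_use:
            return False
        cp = ord(ch)
        # Noncharacters: U+FDD0..U+FDEF and the last two code points of every plane
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False
    return True
```

Note that this deliberately says nothing about letters, scripts, or punctuation, so names with hyphens, apostrophes, diacritics, or non-Latin scripts all pass.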
Some suggest allowing all Unicode letters plus space, hyphen, apostrophe, and comma, but that still trips over edge cases: click consonants, the ʻokina, interpuncts, zero‑width characters, and bidi controls.
Strict whitelists or ASCII‑only are widely criticized as unnecessary and hostile, though a few defend Latin‑only or even ASCII for specific domains.

Character sets, Unicode, and encodings

Multiple complaints that new software still blocks non‑ASCII decades after Unicode and UTF‑8 became mainstream.
Debate over Unicode’s complexity:
- One side blames emojis, invisible/control characters, combining marks, and CJK unification for pushing developers to ban “weird” characters.
- Others counter that these features are necessary to represent real languages and that better libraries and practices are the real missing piece.
WTF‑8 (the actual encoding) is discussed as a practical way to round‑trip ill‑formed UTF‑16, i.e., strings containing unpaired surrogates (e.g., Windows paths), but it is not intended as an internet charset.
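
To make the round-trip idea concrete, here is a rough Python sketch. It approximates WTF‑8 with Python's built-in `surrogatepass` error handler and is not a conforming WTF‑8 implementation; the function names are hypothetical.

```python
def utf16_units_to_wtf8(raw_utf16le: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16-LE bytes to a WTF-8-style byte string."""
    # Valid surrogate pairs are joined into one code point by the decoder;
    # unpaired surrogates are let through by "surrogatepass".
    text = raw_utf16le.decode("utf-16-le", "surrogatepass")
    # Unpaired surrogates are written as 3-byte sequences (e.g. ED A0 80).
    return text.encode("utf-8", "surrogatepass")

def wtf8_to_utf16_units(wtf8: bytes) -> bytes:
    """Inverse direction: recover the original UTF-16-LE code units."""
    text = wtf8.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")

# "a", an unpaired high surrogate U+D800, then "b"
raw = b"a\x00" + b"\x00\xd8" + b"b\x00"
encoded = utf16_units_to_wtf8(raw)
restored = wtf8_to_utf16_units(encoded)
```

A strict UTF‑8 decoder would reject `encoded`, which is why this form suits internal storage and not data sent over the network.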

Transliteration, legal vs display names, and external systems

Strong consensus: do not auto‑transliterate names for other systems; rules are language‑ and jurisdiction‑specific and often ambiguous.
Recommended patterns:
- Store the original name exactly.
- Ask users explicitly for additional forms: “name as on passport/MRZ,” “name as on card,” pronunciation, or romanized version.
- Possibly have separate “legal name” and “preferred/display name” fields.
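
One way to model these patterns is a record that keeps the original name untouched and treats every other form as optional, user-supplied data. The sketch below is only an illustration; the class and field names are hypothetical and not taken from any system mentioned in the discussion.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PersonName:
    """Hypothetical record: the original name plus forms the user typed in."""
    original: str                          # stored exactly as entered
    display: Optional[str] = None          # preferred/display name
    legal: Optional[str] = None            # legal name, if different
    passport_mrz: Optional[str] = None     # name as on passport/MRZ
    card: Optional[str] = None             # name as on payment card
    romanized: Optional[str] = None        # user-chosen romanization
    pronunciation: Optional[str] = None    # pronunciation hint

    def for_display(self) -> str:
        # Fall back to the original; never derive a form automatically.
        return self.display or self.original

    def for_airline(self) -> Optional[str]:
        # No transliteration fallback: if the user gave no MRZ form, ask them.
        return self.passport_mrz

name = PersonName(original="Łukasz Żółć", passport_mrz="LUKASZ ZOLC")
```

The design choice here is that a missing form yields `None`, which prompts the application to ask the user, instead of a guessed transliteration.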
GDPR in the EU is cited as giving people a right to correct spelling of their names; some see limited charsets as legally problematic.

Security, abuse, Zalgo, and robustness

Input validation is often misused as a substitute for parameterized queries (against SQL injection) and proper output escaping (against XSS). Several argue for accepting almost everything and sanitizing at output/integration boundaries.
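
As a small illustration of handling names at the boundaries instead of at input, the following sketch uses Python's standard `sqlite3` and `html` modules. The `people` table and the function names are assumptions made up for the example.

```python
import html
import sqlite3

def store_name(conn: sqlite3.Connection, name: str) -> None:
    # Parameterized query: the name is passed as data, never spliced into SQL.
    conn.execute("INSERT INTO people (name) VALUES (?)", (name,))

def render_greeting(name: str) -> str:
    # Escape at the HTML output boundary, not at input time.
    return f"<p>Hello, {html.escape(name)}!</p>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")  # hypothetical schema

hostile = "Robert'); DROP TABLE people;--"
store_name(conn, hostile)
stored = conn.execute("SELECT name FROM people").fetchone()[0]
```

The apostrophe in a name like O'Brien needs no special treatment at input; it is stored verbatim and escaped only when rendered.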
Others emphasize “defense in depth” and worry about upstream systems that can’t be fixed.
Zalgo text (excessive combining marks) is seen as a UI and performance attack vector. Suggested mitigations:
- Normalize (for example, canonically to NFC) and then limit the run of consecutive combining marks after each base character to a small N, tuned for languages that legitimately stack several diacritics.
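
A rough sketch of that mitigation in Python, using the standard `unicodedata` module. The function name and the default limit of 3 are assumptions for illustration; a real limit would need tuning against the languages a system serves.

```python
import unicodedata

def limit_combining_marks(text: str, max_marks: int = 3) -> str:
    """Hypothetical helper: keep at most max_marks combining marks in a row."""
    out = []
    run = 0
    # NFC first, so common base+mark pairs collapse into precomposed letters
    # and do not count against the limit.
    for ch in unicodedata.normalize("NFC", text):
        # Mn = nonspacing mark, Me = enclosing mark
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the limit
        else:
            run = 0
        out.append(ch)
    return "".join(out)

zalgo = "z" + "\u0336" * 10      # "z" with ten stroke overlays
cleaned = limit_combining_marks(zalgo)
```

Dropping the excess marks silently is only one option; a system could instead reject the input with an honest error message.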
Unicode normalization and homoglyphs (Latin vs Cyrillic/Greek letters, fullwidth/halfwidth, emoji modifiers, bidi controls) are flagged as real usability and security concerns, not just cosmetic ones.
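
To illustrate the homoglyph concern, here is a crude heuristic that folds compatibility forms with NFKC and then guesses each letter's script from the first word of its Unicode character name. This is an assumption-laden sketch: Python's standard library has no script property, the function names are hypothetical, and legitimately mixed names (for example, Japanese names combining kana and kanji) will be flagged, so the result is at most a signal for review, never a reason to reject a name.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Guess scripts from character names, e.g. 'LATIN', 'CYRILLIC', 'GREEK'."""
    scripts = set()
    # NFKC folds fullwidth/halfwidth and other compatibility forms first.
    for ch in unicodedata.normalize("NFKC", text):
        # Only cased and "other" letters; modifier letters (Lm) are skipped.
        if unicodedata.category(ch) in ("Lu", "Ll", "Lt", "Lo"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def looks_mixed_script(text: str) -> bool:
    return len(scripts_used(text)) > 1

spoof = "p\u0430ypal"            # second letter is Cyrillic U+0430, not Latin "a"
fullwidth = "\uff21lice"         # fullwidth "A" followed by "lice"
```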

UX, localization, and messaging

Many stories where non‑ASCII names cause crashes or silent breakage in OSes, Java apps, government systems, banks, airlines, and payment gateways.
Localized UIs are often poor; some users prefer English to avoid bad translations, while others stress that proper localization (including names) is essential for less technical populations.
Error message wording matters: “your name is invalid” is widely viewed as insulting; suggested alternative: admit system limitations (“Sorry, our system cannot handle these characters yet”).
