Charset="WTF-8"
Human name validation pitfalls
- Many examples of systems rejecting perfectly valid names: diacritics (e.g., “ł”, “æ”), hyphens, apostrophes, multiple surnames, no surname, or non‑Latin scripts.
- Split “first/last” fields often fail for cultures with different name structures (no family name, multiple given names, patronymics, order differences).
- Several commenters argue the only universally safe rule is “non‑empty Unicode string”; anything stricter will exclude real people.
- Others list common false assumptions: that everyone has a single name, that a person has exactly one legal name, that a name always matches a single government record, and so on.
What to validate (and what not to)
- Common minimal checks proposed:
  - Non‑zero length.
  - Valid Unicode (no unpaired surrogates, no invalid code points).
  - Exclude control characters (category Cc), surrogates (Cs), noncharacters (a subset of Cn), and often private‑use characters (Co).
- Some suggest allowing all Unicode letters plus space, hyphen, apostrophe, and comma; but that rule still trips over edge cases such as click‑consonant letters, the ʻokina, interpuncts, zero‑width characters, and bidi controls.
- Strict whitelists or ASCII‑only are widely criticized as unnecessary and hostile, though a few defend Latin‑only or even ASCII for specific domains.
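
The minimal checks listed above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `name_problems`, `is_noncharacter`, and `REJECTED_CATEGORIES` are hypothetical, private‑use characters (Co) are rejected here although the discussion only says "often", and format characters (Cf) such as zero‑width joiners are deliberately allowed.

```python
import unicodedata

# Assumption for this sketch: reject control (Cc), surrogate (Cs) and
# private-use (Co) characters. Format characters (Cf), e.g. zero-width
# joiners, are allowed because some scripts rely on them.
REJECTED_CATEGORIES = {"Cc", "Cs", "Co"}


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def name_problems(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name is accepted."""
    problems = []
    if not name:
        problems.append("empty")
    for ch in name:
        cp = ord(ch)
        category = unicodedata.category(ch)
        if category in REJECTED_CATEGORIES:
            problems.append(f"U+{cp:04X} has rejected category {category}")
        elif is_noncharacter(cp):
            problems.append(f"U+{cp:04X} is a noncharacter")
    return problems
```

Returning a list of problems rather than a boolean makes it easier to show the user which character the system could not handle, instead of a bare "invalid name".
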
Character sets, Unicode, and encodings
- Multiple complaints that new software still blocks non‑ASCII decades after Unicode and UTF‑8 became mainstream.
- Debate over Unicode’s complexity:
  - One side blames emojis, invisible/control characters, combining marks, and CJK unification for pushing developers to ban “weird” characters.
  - Others counter that these features are necessary to represent real languages and that better libraries and practices are the real missing piece.
- WTF‑8 (the actual encoding) is discussed as a practical way to round‑trip ill‑formed UTF‑16 containing unpaired surrogates (e.g., Windows paths); it is meant for internal use, not as an internet charset.
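
To make the round‑trip idea concrete, here is a small Python sketch. It is not a WTF‑8 implementation: it relies on Python's built‑in `surrogatepass` error handler, which for an unpaired surrogate happens to produce the same three bytes that WTF‑8 uses. Unlike real WTF‑8, it would also encode a surrogate pair as two separate three‑byte sequences. The file name is a made‑up example.

```python
# A string containing an unpaired high surrogate, as can appear in a
# file name that was stored as ill-formed UTF-16.
lone = "report_\ud800.txt"

# Strict UTF-8 refuses to encode the unpaired surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With "surrogatepass" the surrogate is written as a 3-byte sequence
# and can be read back unchanged.
wobbly = lone.encode("utf-8", "surrogatepass")
restored = wobbly.decode("utf-8", "surrogatepass")
```

The resulting bytes are not valid UTF‑8, which is why this kind of encoding suits internal storage and not data sent to other systems.
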
Transliteration, legal vs display names, and external systems
- Strong consensus: do not auto‑transliterate names for other systems; rules are language‑ and jurisdiction‑specific and often ambiguous.
- Recommended patterns:
  - Store the original name exactly.
  - Ask users explicitly for additional forms: “name as on passport/MRZ,” “name as on card,” pronunciation, or romanized version.
  - Possibly have separate “legal name” and “preferred/display name” fields.
- The EU's GDPR is cited as giving people a right to have the spelling of their name corrected; some therefore see limited charsets as legally problematic.
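
The recommended storage pattern (keep the original, ask for other forms, never derive them) might look like the record below. All class, field, and method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonName:
    original: str                        # stored exactly as the user entered it
    legal: Optional[str] = None          # legal name, if the user gave one
    display: Optional[str] = None        # preferred/display name
    passport_mrz: Optional[str] = None   # "name as on passport/MRZ", user-supplied
    card: Optional[str] = None           # "name as on card", user-supplied
    romanized: Optional[str] = None      # romanized form, user-supplied
    pronunciation: Optional[str] = None  # pronunciation hint, user-supplied

    def for_display(self) -> str:
        """Use the preferred name if present, otherwise the original."""
        return self.display or self.original

    def for_passport(self) -> Optional[str]:
        """Return the user-supplied passport form; never transliterate."""
        return self.passport_mrz
```

When `for_passport()` returns `None`, the caller is expected to ask the user for that form and not to guess a transliteration.
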
Security, abuse, Zalgo, and robustness
- Input validation is often misused as a substitute for parameterized queries (against SQL injection) and proper output escaping (against XSS). Several argue to accept almost everything and escape or sanitize at output/integration boundaries.
- Others emphasize “defense in depth” and worry about upstream systems that can’t be fixed.
- Zalgo text (excessive combining marks) is seen as a UI and performance attack vector. Suggested mitigations:
  - Normalize (e.g., to a canonical form such as NFC) and then limit consecutive combining marks to a small N per base character, tuned for languages that legitimately use multiple diacritics.
- Unicode normalization and homoglyphs (Latin vs Cyrillic/Greek letters, fullwidth/halfwidth, emoji modifiers, bidi controls) are flagged as real usability and security concerns, not just cosmetic ones.
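
The Zalgo mitigation described above (normalize, then cap the run of combining marks) could be sketched as follows. The function name `limit_combining_marks`, the default of 4 marks, and the choice to silently drop excess marks instead of rejecting the input are all assumptions of this sketch; the right limit depends on the languages being supported.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 4) -> str:
    """NFC-normalize, then keep at most max_marks consecutive combining marks."""
    normalized = unicodedata.normalize("NFC", text)
    out = []
    run = 0  # number of combining marks seen since the last base character
    for ch in normalized:
        if unicodedata.category(ch).startswith("M"):  # Mn, Mc or Me
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the limit
        else:
            run = 0
        out.append(ch)
    return "".join(out)
```

Normalizing to NFC first means that ordinary accented letters collapse into precomposed characters where those exist, so legitimate stacked diacritics use up less of the budget than they would in decomposed form.
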
UX, localization, and messaging
- Many stories where non‑ASCII names cause crashes or silent breakage in OSes, Java apps, government systems, banks, airlines, and payment gateways.
- Localized UIs are often poor; some users prefer English to avoid bad translations, while others stress that proper localization (including names) is essential for less technical populations.
- Error message wording matters: “your name is invalid” is widely viewed as insulting; suggested alternative: admit system limitations (“Sorry, our system cannot handle these characters yet”).