Charset="WTF-8"

Human name validation pitfalls

  • Many examples of systems rejecting perfectly valid names: diacritics (e.g., “ł”, “æ”), hyphens, apostrophes, multiple surnames, no surname, or non‑Latin scripts.
  • Split “first/last” fields often fail for cultures with different name structures (no family name, multiple given names, patronymics, order differences).
  • Several commenters argue the only universally safe rule is “non‑empty Unicode string”; anything stricter will exclude real people.
  • Others list common false assumptions: that everyone has a single name, that each person has exactly one legal name, that a name always matches a single government record, and so on.

What to validate (and what not to)

  • Common minimal checks proposed:
    • Non‑zero length.
    • Valid Unicode (no unpaired surrogates, no invalid code points).
    • Exclude control characters and other non‑text code points (categories Cc and Cs, the noncharacters within Cn, and often private‑use Co).
  • Some suggest allowing all Unicode letters plus space, hyphen, apostrophe, and comma, but edge cases remain: click consonants, the ʻokina, interpuncts, zero‑width characters, and bidi controls.
  • Strict whitelists or ASCII‑only are widely criticized as unnecessary and hostile, though a few defend Latin‑only or even ASCII for specific domains.
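
The minimal checks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (is_noncharacter, is_acceptable_name) are hypothetical, rejecting private-use characters (Co) is a policy choice the thread describes as "often" made rather than required, and whitespace-only input is not treated specially.

```python
import unicodedata

# General categories rejected by this sketch: control (Cc), surrogate (Cs),
# and private use (Co). Removing "Co" would allow private-use characters.
REJECTED_CATEGORIES = {"Cc", "Cs", "Co"}


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF and every U+xxFFFE / U+xxFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_acceptable_name(name: str) -> bool:
    """Hypothetical helper: apply only the minimal checks, nothing stricter."""
    if not name:  # non-zero length
        return False
    for ch in name:
        # A Python str can hold lone surrogates (category Cs), so this check
        # also covers the "no unpaired surrogates" requirement.
        if unicodedata.category(ch) in REJECTED_CATEGORIES:
            return False
        if is_noncharacter(ord(ch)):
            return False
    return True
```

Note that format characters (category Cf, which includes zero-width joiners and bidi controls) pass this check. Whether to restrict them is a separate decision that depends on the scripts you need to support.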

Character sets, Unicode, and encodings

  • Multiple complaints that new software still blocks non‑ASCII decades after Unicode and UTF‑8 became mainstream.
  • Debate over Unicode’s complexity:
    • One side blames emojis, invisible/control characters, combining marks, and CJK unification for pushing developers to ban “weird” characters.
    • Others counter that these features are necessary to represent real languages and that better libraries and practices are the real missing piece.
  • WTF‑8 (the actual encoding) is discussed as a practical way to round‑trip ill‑formed UTF‑16 (e.g., Windows paths), but it is not intended as an internet charset.
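
To make the round-trip idea concrete, here is a rough Python approximation. It is not a full WTF-8 implementation: it leans on the standard library's "surrogatepass" error handler, which lets lone surrogates through the UTF-16 and UTF-8 codecs, and the function names are hypothetical. A strict WTF-8 decoder would also reject a high-surrogate byte sequence directly followed by a low-surrogate one; this sketch does not check for that.

```python
def utf16_to_wtf8(raw_utf16le: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16LE bytes to WTF-8-style bytes."""
    # "surrogatepass" keeps lone surrogates as code points instead of raising;
    # well-formed surrogate pairs are still combined into one code point.
    text = raw_utf16le.decode("utf-16-le", errors="surrogatepass")
    # Lone surrogates become 3-byte sequences; everything else is plain UTF-8.
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_utf16(wtf8: bytes) -> bytes:
    """Inverse direction: recover the original UTF-16LE code units."""
    text = wtf8.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


# 'a', a lone high surrogate (0xD800), 'b': not valid UTF-16, but the kind of
# sequence that can appear in a Windows path.
ill_formed = b"a\x00\x00\xd8b\x00"
stored = utf16_to_wtf8(ill_formed)
restored = wtf8_to_utf16(stored)
```

For well-formed input the output is ordinary UTF-8, which is why the encoding works as an internal representation while remaining unsuitable as a declared charset on the wire.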

Transliteration, legal vs display names, and external systems

  • Strong consensus: do not auto‑transliterate names for other systems; rules are language‑ and jurisdiction‑specific and often ambiguous.
  • Recommended patterns:
    • Store the original name exactly.
    • Ask users explicitly for additional forms: “name as on passport/MRZ,” “name as on card,” pronunciation, or romanized version.
    • Possibly have separate “legal name” and “preferred/display name” fields.
  • GDPR in the EU is cited as giving people a right to correct spelling of their names; some see limited charsets as legally problematic.
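
One way to picture the recommended patterns is a record that keeps the original name untouched and treats every other form as optional, user-supplied data. The class and field names below are hypothetical, chosen only for illustration; a real schema would depend on which external systems are involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameRecord:
    # All field names are illustrative assumptions, not an established schema.
    full_name: str                         # stored exactly as the user entered it
    display_name: Optional[str] = None     # preferred name for greetings and UI
    legal_name: Optional[str] = None       # only if a process really needs it
    passport_name: Optional[str] = None    # "name as on passport/MRZ", typed by the user
    card_name: Optional[str] = None        # "name as on card", typed by the user
    romanized_name: Optional[str] = None   # user-chosen romanization
    pronunciation: Optional[str] = None    # free-text pronunciation hint

    def name_for_display(self) -> str:
        """Fall back to the original name when no preferred name was given."""
        return self.display_name or self.full_name

    def name_for_card_payment(self) -> Optional[str]:
        """Return the user-supplied card name, or None.

        There is deliberately no transliteration fallback: if the form is
        missing, the caller should ask the user rather than guess.
        """
        return self.card_name
```

For example, NameRecord(full_name="山田太郎").name_for_card_payment() returns None, which signals that the application should prompt for the card form instead of deriving one.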

Security, abuse, Zalgo, and robustness

  • Input validation is often misused as a substitute for proper output escaping and parameterized queries (the actual defenses against XSS and SQL injection). Several argue for accepting almost everything and sanitizing at output and integration boundaries.
  • Others emphasize “defense in depth” and worry about upstream systems that can’t be fixed.
  • Zalgo text (excessive combining marks) is seen as a UI and performance attack vector. Suggested mitigations:
    • Normalize (possibly to a canonical form such as NFC), then limit consecutive combining marks to a small N per base character, with N tuned for languages that legitimately stack multiple diacritics.
  • Unicode normalization and homoglyphs (Latin vs Cyrillic/Greek letters, fullwidth/halfwidth, emoji modifiers, bidi controls) are flagged as real usability and security concerns, not just cosmetic ones.
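
The Zalgo mitigation described above could look roughly like the following Python sketch. Several details are assumptions rather than anything the thread settled on: the choice of NFC, the default limit of 4, counting only nonspacing and enclosing marks (categories Mn and Me), and the function name itself.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 4) -> str:
    """Hypothetical helper: normalize, then cap runs of combining marks."""
    # NFC folds base + mark sequences into precomposed characters where they
    # exist, so ordinary accented letters use up little of the budget.
    normalized = unicodedata.normalize("NFC", text)
    kept = []
    run = 0  # combining marks seen since the last non-mark character
    for ch in normalized:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the per-base limit
        else:
            run = 0
        kept.append(ch)
    return "".join(kept)


# A base letter buried under twenty acute accents is trimmed, while a
# Vietnamese name with stacked diacritics passes through intact.
trimmed = limit_combining_marks("a" + "\u0301" * 20)
unchanged = limit_combining_marks("Nguyễn")
```

Silently dropping marks is itself a design choice. Rejecting the input with an honest error message, or only truncating at display time while storing the original, may suit some systems better.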

UX, localization, and messaging

  • Many stories where non‑ASCII names cause crashes or silent breakage in OSes, Java apps, government systems, banks, airlines, and payment gateways.
  • Localized UIs are often poor; some users prefer English to avoid bad translations, while others stress that proper localization (including names) is essential for less technical populations.
  • Error message wording matters: “your name is invalid” is widely viewed as insulting; suggested alternative: admit system limitations (“Sorry, our system cannot handle these characters yet”).