RFC 9839 and Bad Unicode

Role and Scope of RFC 9839

  • Seen as a small, focused spec to define a “normal, well‑behaved” subset of Unicode for text-based protocols and formats.
  • Intended for use in generic serialization/validation libraries (e.g., JSON encoders), not as an end-user policy for fields like usernames.
  • Some readers misread it as “JSON-specific” or as a recommendation to push all validation into the parser; others clarify it’s just a reusable definition of problematic code points.

Where Validation Should Happen

  • Strong split between:
    • Those who want parsers to reject ill‑formed or “bad” Unicode early (fail closed).
    • Those who insist low‑level protocols should pass through arbitrary byte or UTF‑16 sequences unchanged so higher layers can decide, and so legacy or corrupt data (filenames, logs) can roundtrip.
  • Several point out that invalid UTF‑8 and “weird but valid” code points are different problems and should be treated separately.
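
The distinction between ill‑formed input and well‑formed input that carries questionable code points can be sketched as two separate checks. This is a minimal illustration, not anything the RFC prescribes: the function name `classify` and the "questionable" policy (legacy controls other than tab, LF, CR) are assumptions chosen for the example.

```python
def classify(data: bytes) -> str:
    """Report whether bytes are ill-formed UTF-8, or well-formed but questionable."""
    try:
        # Python's strict decoder rejects ill-formed sequences, including
        # encoded surrogates such as ED A0 80.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "ill-formed"

    # Hypothetical policy for this sketch: flag C0 controls other than
    # tab/LF/CR, plus DEL and the C1 controls.
    for ch in text:
        cp = ord(ch)
        if (cp < 0x20 and ch not in "\t\n\r") or 0x7F <= cp <= 0x9F:
            return "well-formed, questionable"
    return "well-formed, clean"
```

The first check is an encoding question with a single right answer; the second is a policy question, which is why the two are worth keeping apart.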

Security and Problematic Code Points

  • Directional overrides and other bidi controls are raised as concrete attack vectors: Trojan Source attacks, unreadable admin pages, and URL or file‑extension spoofing.
  • Surrogates (when unpaired) and noncharacters can crash or confuse UTF‑16‑based systems.
  • Some argue protocols should not outright ban bidi controls for compatibility, and that enforcement belongs to application semantics (usernames vs email bodies, etc.).
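
To make the spoofing concern concrete, here is a small sketch that locates bidi control characters in a string. The table `BIDI_CONTROLS` and the function `find_bidi_controls` are names invented for this example; the code points listed are the explicit directional formatting characters and marks as I understand them from the Unicode bidirectional algorithm. Whether to reject, strip, or merely flag them is left to the caller, in line with the view that this is an application‑level decision.

```python
# Explicit bidi formatting characters and marks (assumed complete for this sketch).
BIDI_CONTROLS = {
    0x061C: "ALM", 0x200E: "LRM", 0x200F: "RLM",
    0x202A: "LRE", 0x202B: "RLE", 0x202C: "PDF",
    0x202D: "LRO", 0x202E: "RLO",
    0x2066: "LRI", 0x2067: "RLI", 0x2068: "FSI", 0x2069: "PDI",
}

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, abbreviation) for each bidi control found in text."""
    return [
        (i, BIDI_CONTROLS[ord(ch)])
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]

# A file name whose real extension is .exe; a bidi-aware renderer may
# display the characters after the override in reversed order.
suspicious = "photo\u202egpj.exe"
hits = find_bidi_controls(suspicious)
```

Here `hits` contains a single entry for the right‑to‑left override at index 5, while the stored name still ends in `.exe`.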

Unicode Complexity and Design Frustrations

  • Many comments describe Unicode as a “jungle” or an overgrown DSL: combining marks, surrogates, emoji sequences, flags, variation selectors, and different composition systems (Hangul jamo, ZWJ emoji chains).
  • Critiques include Han unification, the no‑retraction promise on code points, and inconsistent mechanisms across scripts and emoji.
  • Others defend Unicode as flawed but still better than any alternative.

Identifiers, Usernames, and Passwords

  • Some advocate ASCII-only for all machine-meaningful identifiers (usernames, passwords, logins) due to normalization, keyboard, and stability issues.
  • Others call that unnecessarily exclusionary, arguing for ASCII identifiers plus separate, less‑restricted display names.
  • PRECIS RFCs (8264/8265/8266) are cited as prior art for safely handling usernames/passwords/nicknames (e.g., disallowing bidi controls there).
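
The normalization issue behind the ASCII‑only argument can be shown in a few lines. This is only an illustration of the problem: the helper `same_identifier` is hypothetical, and NFC comparison alone is far less than what the PRECIS profiles specify.

```python
import unicodedata

def same_identifier(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (illustrative only)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "jos\u00e9"     # 'é' as one precomposed code point
decomposed = "jose\u0301"  # 'e' followed by a combining acute accent

# The two render identically but differ as code point sequences.
naive_match = composed == decomposed
normalized_match = same_identifier(composed, decomposed)
```

Without normalization the two spellings are distinct logins (`naive_match` is false); with it they collide (`normalized_match` is true), so a system has to pick one behavior and apply it consistently.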

Control Characters and Allowed Subsets

  • Debate over banning all legacy controls (C0/C1) except LF/HT:
    • Pro-ban: plain text should not contain ESC, NUL, etc.; that’s markup, not text.
    • Anti-ban: FF, RS, ESC, NUL have real use in source, printers, and data streams; rejecting them is too restrictive.
  • Some suggest a “safeunicode” profile that strips control/positioning characters, but there’s no consensus on where to draw the line.
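
One way to picture the disagreement is a filter whose allow‑list is a parameter rather than a constant. The sketch below is hypothetical (the name `strip_controls` and its default are assumptions for illustration); it removes C0 controls, DEL, and C1 controls unless the caller explicitly keeps them.

```python
def strip_controls(text: str, keep: str = "\t\n") -> str:
    """Drop C0 (U+0000-001F), DEL (U+007F) and C1 (U+0080-009F) controls,
    except for the characters listed in `keep`."""
    def is_legacy_control(ch: str) -> bool:
        cp = ord(ch)
        return cp <= 0x1F or 0x7F <= cp <= 0x9F

    return "".join(
        ch for ch in text
        if ch in keep or not is_legacy_control(ch)
    )

# Pro-ban default: ESC and NUL are removed, LF survives.
cleaned = strip_controls("a\x1b[31mb\x00\n")

# Anti-ban variant: a caller that needs form feeds opts in explicitly.
paged = strip_controls("page1\x0cpage2", keep="\t\n\x0c")
```

The code is trivial either way; the contested part is the value of `keep`, which is exactly where the thread found no consensus.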

Encodings and String Models

  • Long discussion contrasting:
    • Well‑formed UTF‑8 / Unicode scalars.
    • Potentially ill‑formed UTF‑16 (Windows/Java/JS).
    • WTF‑8 as a way to encode such strings into an 8‑bit channel.
  • Agreement that wire formats will see ill‑formed data; disagreement on whether internal string types should allow it.
  • Python, Rust, Go, JS, etc. are used as examples of differing philosophies on surrogates and validation.
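
Python makes a convenient illustration of these trade‑offs, since its `str` type can hold lone surrogates but its strict UTF‑8 codec refuses to encode them. The sketch below uses the standard `surrogatepass` and `surrogateescape` error handlers; the helper name `encodes_strictly` and the sample file name are made up for the example. Note that `surrogatepass` resembles WTF‑8 only for lone surrogates and should not be read as a full WTF‑8 implementation.

```python
def encodes_strictly(s: str) -> bool:
    """True if s can be encoded as well-formed UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

lone = "\ud800"  # an unpaired high surrogate, legal inside a Python str

# Strict encoding rejects it; surrogatepass emits a three-byte
# "generalized UTF-8" sequence instead.
generalized = lone.encode("utf-8", "surrogatepass")

# surrogateescape goes the other way: undecodable bytes are smuggled
# through as lone surrogates so the original bytes can be restored.
raw = b"report-\xff.txt"
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The second half is the round‑tripping behavior that the pass‑through camp wants for file names and logs: `round_tripped` equals `raw`, even though `as_text` is not encodable as strict UTF‑8.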

Implementation and Tooling Concerns

  • Questions on how RFC 9839 compares to language helpers like Go’s unicode.IsPrint; the answer given is that IsPrint is specific to one implementation, whereas RFC 9839 offers a definition that a protocol spec can cite directly.
  • Some find the ABNF listing of ranges awkward and ask for explicit test vectors; others point to the reference Go implementation as de facto tests.
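
As a rough idea of what explicit test vectors could look like, the sketch below checks boundary code points against a predicate for the RFC's most restrictive subset ("Unicode Assignables"). The rules encoded here reflect my reading of that subset (tab, LF and CR allowed; other legacy controls, surrogates and noncharacters excluded) and should be verified against the published ABNF. The names `is_assignable` and `VECTORS` are invented for this example and do not come from the reference implementation.

```python
def is_assignable(cp: int) -> bool:
    """Approximate membership test for the 'Unicode Assignables' subset."""
    if cp in (0x09, 0x0A, 0x0D):           # useful controls
        return True
    if cp < 0x20 or 0x7F <= cp <= 0x9F:    # other C0 controls, DEL, C1 controls
        return False
    if 0xD800 <= cp <= 0xDFFF:             # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF:             # noncharacter block
        return False
    if (cp & 0xFFFF) >= 0xFFFE:            # U+nFFFE / U+nFFFF noncharacters
        return False
    return cp <= 0x10FFFF

# (code point, expected) pairs sitting on the edges of each excluded range.
VECTORS = [
    (0x00, False), (0x09, True), (0x0A, True), (0x0D, True), (0x1B, False),
    (0x20, True), (0x7E, True), (0x7F, False), (0x9F, False), (0xA0, True),
    (0xD7FF, True), (0xD800, False), (0xDFFF, False), (0xE000, True),
    (0xFDCF, True), (0xFDD0, False), (0xFDEF, False), (0xFDF0, True),
    (0xFFFD, True), (0xFFFE, False), (0xFFFF, False), (0x10000, True),
    (0x1FFFD, True), (0x1FFFE, False), (0x10FFFD, True), (0x10FFFF, False),
]

failures = [(hex(cp), want) for cp, want in VECTORS if is_assignable(cp) != want]
```

A table like `VECTORS` is language‑neutral, so the same pairs could be replayed against any implementation, including the reference Go one.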