RFC 9839 and Bad Unicode
Role and Scope of RFC 9839
- Seen as a small, focused spec to define a “normal, well‑behaved” subset of Unicode for text-based protocols and formats.
- Intended for use in generic serialization/validation libraries (e.g., JSON encoders), not as an end-user policy for fields like usernames.
- Some readers misread it as “JSON-specific” or as a recommendation to push all validation into the parser; others clarify it’s just a reusable definition of problematic code points.
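To make "a reusable definition of problematic code points" concrete, here is a minimal sketch that labels a code point by the categories that come up in the thread (surrogates, legacy controls, noncharacters). The function names are hypothetical, and treating tab, LF, and CR as the only useful legacy controls is an assumption; the RFC's own ABNF remains the authoritative definition.

```python
from typing import Optional, Tuple


def problem_category(cp: int) -> Optional[str]:
    """Return a label for a problematic code point, or None if it looks fine."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Assumption: tab, LF and CR are the only legacy controls kept as useful.
    if cp < 0x20 and cp not in (0x09, 0x0A, 0x0D):
        return "legacy control (C0)"
    if 0x7F <= cp <= 0x9F:
        return "legacy control (DEL/C1)"
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return None


def first_problem(text: str) -> Optional[Tuple[int, str]]:
    """Return (index, label) of the first problematic code point, if any."""
    for index, ch in enumerate(text):
        label = problem_category(ord(ch))
        if label is not None:
            return index, label
    return None
```

A serializer could call `first_problem` on outgoing strings and decide for itself whether to reject, escape, or log; the check itself carries no policy.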
Where Validation Should Happen
- Strong split between:
  - Those who want parsers to reject ill‑formed or “bad” Unicode early (fail closed).
  - Those who insist low‑level protocols should pass through arbitrary byte or UTF‑16 sequences unchanged so higher layers can decide, and so legacy or corrupt data (filenames, logs) can roundtrip.
- Several point out that invalid UTF‑8 and “weird but valid” code points are different problems and should be treated separately.
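The two positions, and the separate "weird but valid" problem, can be sketched with Python's built-in codec error handlers. The helper names and sample bytes are illustrative only; `surrogateescape` is just one way a pass-through layer might keep undecodable bytes intact.

```python
def decode_strict(raw: bytes) -> str:
    """Fail closed: raises UnicodeDecodeError on ill-formed UTF-8."""
    return raw.decode("utf-8", errors="strict")


def decode_passthrough(raw: bytes) -> str:
    """Pass through: undecodable bytes become lone surrogates so they can round-trip."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_passthrough(text: str) -> bytes:
    """Inverse of decode_passthrough: restores the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# A filename with a stray 0xFF byte, which is never valid in UTF-8.
bad = b"report-\xff.txt"

try:
    decode_strict(bad)
    strict_accepts_bad = True
except UnicodeDecodeError:
    strict_accepts_bad = False

# The pass-through path preserves the exact bytes.
roundtrip = encode_passthrough(decode_passthrough(bad))

# A different problem: well-formed UTF-8 that carries a NUL.
# Strict decoding accepts it, so catching it needs a separate code point check.
weird = "name\u0000".encode("utf-8")
decoded_weird = decode_strict(weird)
```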
Security and Problematic Code Points
- Directional overrides and bidi controls raised as concrete attack vectors: trojan source, unreadable admin pages, URL and file‑extension spoofing.
- Unpaired surrogates can crash or confuse UTF‑16‑based systems; noncharacters are raised alongside them as code points that should not appear in interchanged text.
- Some argue protocols should not outright ban bidi controls for compatibility, and that enforcement belongs to application semantics (usernames vs email bodies, etc.).
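As an illustration of the file‑extension spoofing case, the sketch below scans a string for bidi control characters. The `BIDI_CONTROLS` set and the function name are assumptions made for this example. Whether to reject, escape, or merely flag a hit is the application-level decision debated above.

```python
import unicodedata

# Explicit bidi formatting characters (an illustrative list for this sketch).
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f"              # ALM, LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)


def find_bidi_controls(text: str):
    """Return (index, character name) for every bidi control in text."""
    return [
        (index, unicodedata.name(ch))
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# The real extension is ".exe"; a renderer that honors the override
# may show the tail reversed, so the name appears to end in ".png".
spoofed = "photo\u202egnp.exe"
hits = find_bidi_controls(spoofed)
```

Here `hits` contains one entry, pointing at index 5 with the name `RIGHT-TO-LEFT OVERRIDE`.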
Unicode Complexity and Design Frustrations
- Many comments describe Unicode as a “jungle” or an overgrown DSL: combining marks, surrogates, emoji sequences, flags, variation selectors, and different composition systems (Hangul jamo, ZWJ emoji chains).
- Critiques include Han unification, the no‑retraction promise on code points, and inconsistent mechanisms across scripts and emoji.
- Others defend Unicode as flawed but still better than any alternative.
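A few of the mechanisms named above, shown side by side as a small sketch using the standard `unicodedata` module. The sample strings are chosen for illustration; the point is only that "one visible character" maps to code points in several unrelated ways.

```python
import unicodedata

# Combining marks: the same visible letter in two different spellings.
composed = "\u00e9"      # e with acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Hangul jamo: three conjoining jamo compose into one syllable under NFC.
jamo = "\u1112\u1161\u11ab"
syllable = unicodedata.normalize("NFC", jamo)

# ZWJ emoji chain: man + ZWJ + woman + ZWJ + girl, typically drawn as one glyph.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"

# Flags: a pair of regional indicator symbols (here J and P).
flag = "\U0001F1EF\U0001F1F5"

lengths = {
    "composed": len(composed),
    "decomposed": len(decomposed),
    "jamo": len(jamo),
    "syllable": len(syllable),
    "family": len(family),
    "flag": len(flag),
}
```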
Identifiers, Usernames, and Passwords
- Some advocate ASCII-only for all machine-meaningful identifiers (usernames, passwords, logins) due to normalization, keyboard, and stability issues.
- Others call that unnecessarily exclusionary, arguing for ASCII identifiers plus separate, less‑restricted display names.
- PRECIS RFCs (8264/8265/8266) are cited as prior art for safely handling usernames/passwords/nicknames (e.g., disallowing bidi controls there).
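The "ASCII identifier plus separate display name" position might look like the sketch below. This is not a PRECIS implementation; the pattern, length limits, and function names are hypothetical policy choices made up for the example.

```python
import re
import unicodedata

# Hypothetical policy: lowercase ASCII letters, digits, and ". _ -", 3 to 32 characters.
LOGIN_ID = re.compile(r"[a-z0-9][a-z0-9._-]{2,31}")

# Bidi controls refused in display names (illustrative list).
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def valid_login_id(candidate: str) -> bool:
    """Machine-meaningful identifier: strict ASCII, no normalization questions."""
    return LOGIN_ID.fullmatch(candidate) is not None


def clean_display_name(name: str) -> str:
    """Human-facing name: any script allowed, NFC-normalized, bidi controls refused."""
    name = unicodedata.normalize("NFC", name)
    if any(ch in BIDI_CONTROLS for ch in name):
        raise ValueError("display name contains a bidi control character")
    return name
```

With this split, a lookalike such as `"t\u0456m"` (Cyrillic і) is rejected as a login ID, while `"Jose\u0301"` is accepted as a display name and stored in its composed form.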
Control Characters and Allowed Subsets
- Debate over banning all legacy controls (C0/C1) except LF/HT:
  - Pro-ban: plain text should not contain ESC, NUL, etc.; that’s markup, not text.
  - Anti-ban: FF, RS, ESC, NUL have real use in source, printers, and data streams; rejecting them is too restrictive.
- Some suggest a “safeunicode” profile that strips control/positioning characters, but there’s no consensus on where to draw the line.
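Since the disagreement is about where to draw the line, one way to picture it is a single check parameterized by an allowlist. Both allowlists below are illustrative readings of the two camps, not profiles taken from any spec, and all names are hypothetical.

```python
# Pro-ban reading: only line feed and horizontal tab survive.
STRICT_ALLOWED = frozenset("\n\t")
# Anti-ban reading: also keep FF, RS, ESC, and NUL.
PERMISSIVE_ALLOWED = STRICT_ALLOWED | frozenset("\x0c\x1e\x1b\x00")


def is_legacy_control(ch: str) -> bool:
    """True for C0 controls, DEL, and C1 controls."""
    cp = ord(ch)
    return cp < 0x20 or 0x7F <= cp <= 0x9F


def disallowed_controls(text: str, allowed: frozenset):
    """Return (index, 'U+XXXX') for each legacy control not in the allowlist."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if is_legacy_control(ch) and ch not in allowed
    ]


# Printer-style text: a form feed between pages and an ANSI escape sequence.
sample = "page one\x0cpage two\x1b[0m\n"
strict_hits = disallowed_controls(sample, STRICT_ALLOWED)
permissive_hits = disallowed_controls(sample, PERMISSIVE_ALLOWED)
```

The same input fails under the strict list and passes under the permissive one, which is the whole dispute in miniature.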
Encodings and String Models
- Long discussion contrasting:
  - Well‑formed UTF‑8 / Unicode scalars.
  - Potentially ill‑formed UTF‑16 (Windows/Java/JS).
  - WTF‑8 as a way to encode such strings into an 8‑bit channel.
- Agreement that wire formats will see ill‑formed data; disagreement on whether internal string types should allow it.
- Python, Rust, Go, JS, etc. are used as examples of differing philosophies on surrogates and validation.
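Python's own behavior is a convenient way to see the contrast: its `str` type can hold a lone surrogate, but the strict codecs refuse to put one on the wire. The sketch below uses the standard `surrogatepass` error handler. For a single lone surrogate the resulting bytes match the generalized UTF‑8 form that WTF‑8 uses, but `surrogatepass` is not a WTF‑8 codec in general (it does not merge surrogate pairs into one four-byte sequence).

```python
# A Python str may contain an unpaired surrogate code point.
lone = "\ud800"

# Strict UTF-8 encoding refuses it.
try:
    lone.encode("utf-8")
    strict_encodes = True
except UnicodeEncodeError:
    strict_encodes = False

# surrogatepass writes the three-byte generalized UTF-8 form instead.
generalized = lone.encode("utf-8", errors="surrogatepass")

# Ill-formed UTF-16 (little-endian): "a", a lone high surrogate, then "b".
ill_formed_utf16 = b"a\x00\x00\xd8b\x00"

try:
    ill_formed_utf16.decode("utf-16-le")
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

# A lenient decoder keeps the surrogate so the data can be inspected or round-tripped.
lenient = ill_formed_utf16.decode("utf-16-le", errors="surrogatepass")
```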
Implementation and Tooling Concerns
- Questions on how RFC 9839 compares to language helpers like Go’s unicode.IsPrint; answer: IsPrint is implementation-specific, RFC 9839 is protocol‑spec‑friendly.
- Some find the ABNF listing of ranges awkward and ask for explicit test vectors; others point to the reference Go implementation as de facto tests.
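To illustrate the "implementation-specific" point without relying on Go, the sketch below uses Python's `str.isprintable` as a loose analogue (an assumption made for this example; it is not claimed to match Go's `unicode.IsPrint`). Its answers come from the Unicode character database bundled with the runtime, whereas a spec built on fixed code point ranges gives the same answer everywhere.

```python
import unicodedata

# Version of the Unicode database this interpreter ships with.
database_version = unicodedata.unidata_version

# Small table of vectors: (character, general category, isprintable result).
vectors = [
    (ch, unicodedata.category(ch), ch.isprintable())
    for ch in ("A", " ", "\u00a0", "\u00ad", "\u200d")
]

# SOFT HYPHEN and ZERO WIDTH JOINER are format characters (Cf), so a filter
# built on isprintable would reject them, including inside ZWJ emoji sequences.
rejected_by_isprintable = [ch for ch, _, printable in vectors if not printable]
```

A table like `vectors` is also the shape explicit test vectors usually take: one row per code point with the expected verdict, independent of any one implementation.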