2025-08-23

RFC 9839 and Bad Unicode

Role and Scope of RFC 9839

Seen as a small, focused spec that defines a “normal, well‑behaved” subset of Unicode for text-based protocols and formats.
Intended for use in generic serialization/validation libraries (e.g., JSON encoders), not as an end-user policy for fields like usernames.
Some readers misread it as “JSON-specific” or as a recommendation to push all validation into the parser; others clarify it’s just a reusable definition of problematic code points.
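
To make "a reusable definition of problematic code points" concrete, here is a minimal Python sketch of a check modeled on the RFC's "Unicode Assignables" subset. The function names are invented for this example, and the ranges reflect one reading of the RFC's ABNF, so treat them as an assumption to verify against the RFC text.

```python
def is_unicode_assignable(cp: int) -> bool:
    """Return True if code point `cp` is in the assumed 'assignable' subset."""
    if cp in (0x09, 0x0A, 0x0D):         # tab, LF, CR: controls kept as useful
        return True
    if cp < 0x20 or 0x7F <= cp <= 0x9F:  # other C0 controls, DEL, C1 controls
        return False
    if 0xD800 <= cp <= 0xDFFF:           # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF:           # noncharacter block
        return False
    if (cp & 0xFFFE) == 0xFFFE:          # U+xxFFFE and U+xxFFFF in every plane
        return False
    return cp <= 0x10FFFF


def all_assignable(text: str) -> bool:
    """Check every code point of a string against the subset."""
    return all(is_unicode_assignable(ord(ch)) for ch in text)
```

A serializer could call all_assignable() on each string field before emitting it; whether to reject, escape, or replace on failure is a separate decision that the subset definition itself does not make.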

Where Validation Should Happen

Strong split between:
- Those who want parsers to reject ill‑formed or “bad” Unicode early (fail closed).
- Those who insist low‑level protocols should pass through arbitrary byte or UTF‑16 sequences unchanged so higher layers can decide, and so legacy or corrupt data (filenames, logs) can roundtrip.
Several point out that invalid UTF‑8 and “weird but valid” code points are different problems and should be treated separately.
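
The distinction between ill-formed UTF-8 and "weird but valid" code points can be sketched in a few lines of Python. The classify() function and its three labels are hypothetical, and the "flagged" set here (legacy controls other than tab/LF/CR, plus noncharacters) is an assumption chosen for illustration, not a rule taken from the discussion.

```python
def classify(data: bytes) -> str:
    """Separate encoding errors from well-formed text with flagged code points."""
    try:
        # Strict decoding rejects invalid byte sequences and encoded surrogates.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "ill-formed"

    for ch in text:
        cp = ord(ch)
        is_control = (cp < 0x20 and ch not in "\t\n\r") or 0x7F <= cp <= 0x9F
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        if is_control or is_nonchar:
            return "well-formed-but-flagged"
    return "clean"


print(classify(b"\xff\xfe"))              # not UTF-8 at all
print(classify(b"ok\x00"))                # valid UTF-8 carrying a NUL
print(classify("caf\u00e9".encode()))     # ordinary text
```

Keeping the two outcomes separate lets a low-level layer report the first without taking a position on the second.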

Security and Problematic Code Points

Directional overrides and bidi controls raised as concrete attack vectors: trojan source, unreadable admin pages, URL and file‑extension spoofing.
Surrogates, when unpaired, can crash or confuse UTF‑16‑based systems; noncharacters are named as a similar hazard.
Some argue protocols should not outright ban bidi controls for compatibility, and that enforcement belongs to application semantics (usernames vs email bodies, etc.).
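
A sketch of what "enforcement belongs to application semantics" might look like: the detection of bidi controls is generic, while the decision to reject depends on the field. The field names and the policy table are hypothetical; the code points are the standard Unicode bidi formatting characters (marks, embeddings, overrides, isolates).

```python
# Bidi formatting characters: ALM, LRM, RLM, LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, PDI.
BIDI_CONTROLS = {
    0x061C, 0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
}

# Hypothetical per-field policy; unknown fields are treated as disallowing.
FIELD_ALLOWS_BIDI = {"username": False, "filename": False, "email_body": True}


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each bidi control found in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]


def accept(field: str, text: str) -> bool:
    """Apply the field's policy: allow bidi controls only where configured."""
    if FIELD_ALLOWS_BIDI.get(field, False):
        return True
    return not find_bidi_controls(text)


spoofed = "invoice\u202Efdp.exe"   # an override before the extension
print(find_bidi_controls(spoofed))
print(accept("filename", spoofed), accept("email_body", spoofed))
```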

Unicode Complexity and Design Frustrations

Many comments describe Unicode as a “jungle” or an overgrown DSL: combining marks, surrogates, emoji sequences, flags, variation selectors, and different composition systems (Hangul jamo, ZWJ emoji chains).
Critiques include Han unification, the no‑retraction promise on code points, and inconsistent mechanisms across scripts and emoji.
Others defend Unicode as flawed but still better than any alternative.

Identifiers, Usernames, and Passwords

Some advocate ASCII-only for all machine-meaningful identifiers (usernames, passwords, logins) due to normalization, keyboard, and stability issues.
Others call that unnecessarily exclusionary, arguing for ASCII identifiers plus separate, less‑restricted display names.
PRECIS RFCs (8264/8265/8266) are cited as prior art for safely handling usernames/passwords/nicknames (e.g., disallowing bidi controls there).
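
The normalization concern can be shown with Python's standard unicodedata module. This is only an illustration of the underlying problem, not an implementation of PRECIS, whose profiles apply additional rules; same_identifier() is a made-up helper.

```python
import unicodedata

composed = "jos\u00e9"       # "é" as a single code point
decomposed = "jose\u0301"    # "e" followed by a combining acute accent


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two strings render identically but differ code point by code point.
print(composed == decomposed)
print(same_identifier(composed, decomposed))
```

Without the normalization step, the two spellings would be treated as different logins despite looking the same on screen.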

Control Characters and Allowed Subsets

Debate over banning all legacy controls (C0/C1) except LF/HT:
- Pro-ban: plain text should not contain ESC, NUL, etc.; that’s markup, not text.
- Anti-ban: FF, RS, ESC, NUL have real use in source, printers, and data streams; rejecting them is too restrictive.
Some suggest a “safeunicode” profile that strips control/positioning characters, but there’s no consensus on where to draw the line.
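
One way to see why the line is hard to draw is to make it a parameter. The sketch below strips legacy controls (C0 range, DEL, C1 range) except those in a keep set; both keep sets and the function name are hypothetical, and positioning characters outside those ranges are not handled.

```python
STRICT_KEEP = frozenset("\n\t")                       # the pro-ban position
PERMISSIVE_KEEP = frozenset("\n\t\r\f\x1b\x1e\x00")   # keeps FF, ESC, RS, NUL too


def strip_controls(text: str, keep: frozenset = STRICT_KEEP) -> str:
    """Drop C0/DEL/C1 controls that are not explicitly kept."""
    def is_legacy_control(ch: str) -> bool:
        cp = ord(ch)
        return cp < 0x20 or 0x7F <= cp <= 0x9F

    return "".join(ch for ch in text if ch in keep or not is_legacy_control(ch))


sample = "a\x1b[31mb\fc\n"
print(repr(strip_controls(sample)))                   # ESC and FF removed
print(repr(strip_controls(sample, PERMISSIVE_KEEP)))  # left unchanged
```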

Encodings and String Models

Long discussion contrasting:
- Well‑formed UTF‑8 / Unicode scalars.
- Potentially ill‑formed UTF‑16 (Windows/Java/JS).
- WTF‑8 as a way to encode such strings into an 8‑bit channel.
Agreement that wire formats will see ill‑formed data; disagreement on whether internal string types should allow it.
Python, Rust, Go, JS, etc. are used as examples of differing philosophies on surrogates and validation.

Implementation and Tooling Concerns

Questions on how RFC 9839 compares to language helpers like Go’s unicode.IsPrint. The answer given: IsPrint is specific to one implementation, whereas RFC 9839 is written so a protocol spec can reference it.
Some find the ABNF listing of ranges awkward and ask for explicit test vectors; others point to the reference Go implementation as de facto tests.
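
For readers who want test vectors, one approach is to derive them mechanically from the range list: probe the code point just below, at, and just above each boundary. The sketch below does this for a range table transcribed from one reading of the RFC's "Unicode Assignables" ABNF. The table, names, and resulting vectors are assumptions for illustration, not official test data.

```python
# Inclusive ranges; planes 1..16 each contribute U+xx0000..U+xxFFFD.
ASSIGNABLE_RANGES = [
    (0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x7E),
    (0xA0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
] + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)]


def in_ranges(cp: int) -> bool:
    """Membership test driven directly by the range table."""
    return any(lo <= cp <= hi for lo, hi in ASSIGNABLE_RANGES)


def boundary_vectors() -> list[tuple[int, bool]]:
    """Return sorted (code point, expected) pairs around every range edge."""
    vectors = {}
    for lo, hi in ASSIGNABLE_RANGES:
        for cp in (lo - 1, lo, hi, hi + 1):
            if 0 <= cp <= 0x110000:
                vectors[cp] = in_ranges(cp)
    return sorted(vectors.items())


for cp, expected in boundary_vectors()[:12]:
    print(f"U+{cp:04X} {'accept' if expected else 'reject'}")
```

Vectors generated this way only check an implementation against the same table, so they catch off-by-one errors in a port rather than errors in the table itself.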
