2024-12-06

Debian opens a can of username worms

Scope of Debian Username Changes

Debian (via shadow-utils and adduser) is loosening username rules, potentially allowing UTF‑8, purely numeric names, and more punctuation.
Many see this as risky: it diverges from long‑standing conventions and could break tooling that assumes conservative, ASCII‑only usernames.
Others argue the previous Debian‑specific patch was itself a mistake, and aligning with upstream / modern Unicode reality is overdue.

Unicode in Identifiers

Several comments note that identifier rules already exist (Unicode TR31; the IETF's PRECIS framework in RFC 8264/8265), as do security guidelines covering confusables and spoofing.
Libraries like ICU, libunistring, libidn, libu8ident exist, but adoption is patchy; many tools (e.g., grep variants) still handle Unicode poorly.
Advocates say: use these standards, apply normalization (e.g., NFKC), and restrict to safe categories (letters, digits) rather than “all of Unicode.”
Critics emphasize normalization, bidirectional text, and homoglyphs as a deep well of complexity and subtle bugs.
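
The advocates' recipe (normalize, then restrict to safe categories) might look roughly like the sketch below. It uses only Python's standard unicodedata module; the function name and the exact policy (letters plus decimal digits) are assumptions for illustration, not anything Debian or the cited standards prescribe.

```python
import unicodedata

def normalize_username(raw: str) -> str:
    """Apply NFKC, then accept only letters and decimal digits.

    Hypothetical policy for illustration only.
    """
    name = unicodedata.normalize("NFKC", raw)
    if not name:
        raise ValueError("empty username")
    for ch in name:
        cat = unicodedata.category(ch)
        # Categories starting with "L" are letters; "Nd" is a decimal digit.
        if not (cat.startswith("L") or cat == "Nd"):
            raise ValueError(f"disallowed character {ch!r} (category {cat})")
    return name
```

Note that this addresses only part of the critics' list: a Cyrillic "а" is a letter and passes the check, so homoglyph and mixed-script spoofing would still need separate handling.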

Internationalization vs ASCII-only Usernames

Pro‑Unicode side: legacy codepages were worse; many scripts (CJK, Cyrillic, accented Latin) were effectively excluded; it’s unfair and user‑hostile to keep ASCII only.
Anti‑Unicode or cautious side: usernames are low‑level identifiers; ASCII is a useful common denominator, especially when logging in from random keyboards or debugging over SSH.
Some propose: ASCII‑only for login names, but UTF‑8 for full names / display fields; others insist people should be able to log in with their real‑script names.

POSIX, Standards, and Practicality

The POSIX “portable username” set is [A-Za-z0-9._-] (hyphen not first). Purely numeric usernames are allowed there.
Some call this outdated and want UTF‑8 everywhere; others say POSIX’s role is to describe existing practice, and a UTF‑8 transition would be a massive, decades‑long compatibility project.
There is disagreement over whether standards bodies should “lead” (mandate UTF‑8) or “follow” (codify what major OSes already do).
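
For reference, the portable set described above is easy to check mechanically. This is a minimal sketch; the function name is made up for the example, and it tests only the character rule as stated in this summary, not length limits or any distribution-specific policy.

```python
import re

# Letters, digits, period, underscore, hyphen; hyphen not allowed first.
PORTABLE = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")

def is_portable_username(name: str) -> bool:
    """Return True if name uses only the portable character set."""
    return PORTABLE.fullmatch(name) is not None
```

Under this rule "1234" is accepted, while "-bob" and "josé" are rejected.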

Security, Shells, and Bug Compatibility

Allowing shell metacharacters, spaces, and exotic Unicode in usernames is seen as a security foot‑gun: shell injection, misparsed scripts, ambiguous logs.
Real vulnerabilities are reported where unsanitized usernames passed into scripts allowed ;, &, > etc. to execute arbitrary commands.
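
The failure mode is simple to reproduce on paper. The sketch below only builds command strings and never runs them; the hostile name and the function names are invented for the example. shlex.quote is the standard-library helper for quoting a value for a POSIX shell.

```python
import shlex

def unsafe_command(username: str) -> str:
    # Vulnerable pattern: the name is pasted into a shell command line as-is.
    return f"echo Welcome {username}"

def safe_command(username: str) -> str:
    # shlex.quote wraps the value so a POSIX shell sees it as a single word.
    return f"echo Welcome {shlex.quote(username)}"

hostile = "bob; touch /tmp/pwned"
print(unsafe_command(hostile))  # a shell would run "touch" as a second command
print(safe_command(hostile))    # echo Welcome 'bob; touch /tmp/pwned'
```

Passing an argument list to subprocess.run without shell=True avoids the shell entirely, which is usually the more robust fix.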
Some argue broken scripts are already wrong and should break so they get fixed; others stress that enterprises care about systems working today, not theoretical correctness.
A comparison is drawn to filenames with spaces: Unix tools historically broke on them, while Windows forced software to adapt by using spaces in system paths.

Numeric Usernames and Identifier Design

Purely numeric usernames are criticized for colliding conceptually with numeric UIDs; tools often interpret an all‑digit argument as a UID and anything else as a name.
This can create confusing or insecure behavior if a numeric name doesn’t match its UID or matches someone else’s UID.
Others note POSIX allows it; they propose local policy (e.g., disallow names equal to existing UIDs) or better ID schemes (prefixes, checksums, redundancy).
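
A toy model makes the ambiguity concrete. Everything here is hypothetical: the account table, the two lookup orders, and the policy check are illustrations of the behaviors described above, not the logic of any particular tool.

```python
# Hypothetical account table (name -> UID). The user named "1000" has
# UID 1001, while UID 1000 belongs to "alice".
ACCOUNTS = {"alice": 1000, "1000": 1001}

def is_ascii_digits(spec: str) -> bool:
    return spec.isascii() and spec.isdigit()

def resolve_digits_first(spec: str) -> int:
    """Treat an all-digit argument as a UID, anything else as a name."""
    if is_ascii_digits(spec):
        return int(spec)
    return ACCOUNTS[spec]

def resolve_name_first(spec: str) -> int:
    """Assumed alternative: try a name lookup, then fall back to a UID."""
    if spec in ACCOUNTS:
        return ACCOUNTS[spec]
    if is_ascii_digits(spec):
        return int(spec)
    raise KeyError(spec)

def collides_with_uid(name: str) -> bool:
    """Local policy idea: reject names equal to an existing UID."""
    return is_ascii_digits(name) and int(name) in ACCOUNTS.values()
```

With this table, the argument "1000" resolves to alice's UID under the first rule and to UID 1001 under the second, which is exactly the kind of disagreement the critics worry about. The policy check would have refused the name "1000" at creation time.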

User Experience Anecdotes

Many recount systems failing on:
- Diacritics in names (é, å, ç), apostrophes, or non‑Latin scripts.
- Non‑ASCII passwords that can be set but not used to log in.
- Windows and other systems mishandling Unicode in usernames or profile directories.
As a result, even users with non‑ASCII names often deliberately stick to ASCII for usernames and sometimes passwords.

Alternative Ideas and Side Discussions

Suggestions include:
- Punycode‑like encodings for usernames (machine‑safe, user‑friendly display).
- Treating usernames as opaque byte strings, punting encoding to higher layers.
- Keeping login identifiers simple and using richer UTF‑8 identifiers only where genuinely needed.
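
The Punycode idea can be tried with Python's built-in "punycode" codec. This is only a sketch: the function names are invented, and a real scheme would presumably also need normalization and a marker prefix (as IDNA does with "xn--") to tell encoded names from plain ones.

```python
def to_machine_name(display: str) -> str:
    """Encode a Unicode name into an ASCII-only form for storage."""
    return display.encode("punycode").decode("ascii")

def to_display_name(machine: str) -> str:
    """Decode the stored ASCII form back into the Unicode name."""
    return machine.encode("ascii").decode("punycode")

stored = to_machine_name("münchen")
print(stored)                   # mnchen-3ya
print(to_display_name(stored))  # münchen
```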
A tangential discussion explores graphical / visual programming vs text, concluding that visual systems often become unmanageable “spaghetti,” and text remains the most practical representation.
