2024-10-08

A popular but wrong way to convert a string to uppercase or lowercase

Scope of the problem (case conversion & Unicode)

Many agree the article correctly shows that naïve per-character upper/lowercasing is broken for Unicode, especially in UTF‑16/UTF‑32.
Key issues: multi-code-unit characters, case mappings that change string length, and language-specific rules (e.g., German ß/ẞ, Turkish dotted/dotless i, French accents, Greek sigma forms).
Several note that even defining “correct” casing is context- and time-dependent (orthography changes, historical data, versioned locales).
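
A minimal Python sketch of the failure modes listed above. The helpers `naive_upper` and `naive_lower` are hypothetical names standing in for the per-character approach; they are not taken from the article. Python's built-in `str.upper` and `str.lower` apply the default (locale-independent) Unicode mappings to the whole string, so they illustrate the length and context issues but not the language-specific ones.

```python
def naive_upper(text: str) -> str:
    """Per-character uppercasing that only accepts one-to-one mappings,
    mimicking a fixed-width, in-place transform (hypothetical helper)."""
    out = []
    for ch in text:
        up = ch.upper()
        # Keep the original character when the mapping would change the length.
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def naive_lower(text: str) -> str:
    """Per-character lowercasing with no view of neighbouring characters."""
    return "".join(ch.lower() for ch in text)


# Length-changing mapping: the one-to-one version cannot expand ß.
print(naive_upper("straße"))
print("straße".upper())

# Context-dependent mapping: only the whole-string call can see that the
# capital sigma is word-final.
print(naive_lower("ΟΔΟΣ") == "ΟΔΟΣ".lower())

# Lowercasing can also grow a string: capital I with dot above becomes
# "i" followed by a combining dot in the default mapping.
print(len("İ"), len("İ".lower()))
```

The whole-string calls still know nothing about the language of the text, which is why the default result for "İ" is not what a Turkish reader would expect.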

ASCII-only vs real-world text

One camp: 90–99% of their string manipulation is internal, ASCII-only (logs, config keys, protocols, parser internals). There, simple ASCII case ops are “good enough,” and Unicode is overkill.
Counter-camp: most user-visible software deals with real names, addresses, UI text, and search, where ASCII-only breaks many users and languages. These argue ASCII assumptions reflect a narrow “US/English bubble.”
Several point out that user-provided data and filenames can contain arbitrary bytes or invalid UTF sequences, so you often must treat them as opaque byte sequences.
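
For the ASCII-only camp, the "simple, good enough" operation can be sketched as below. The function name `ascii_lower` is an assumption for illustration. It touches only the bytes for A to Z and passes everything else through, so it stays safe on data that is not valid UTF-8 and never changes the length.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase only the bytes 0x41-0x5A (A-Z); all other bytes pass through."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


header = ascii_lower(b"Content-Type")

# Not valid UTF-8, but still handled: the non-ASCII bytes are left as they are.
opaque = ascii_lower(b"\xff\xfeAB\xc3\x89")

print(header, opaque)
```

Python's built-in `bytes.lower()` has the same ASCII-only behaviour; the explicit version is shown only to make the rule visible.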

Locale and language dependence

Strong theme: you cannot correctly process human text without knowing its language/locale; rules can differ between countries and regions that share a language (e.g., Swiss vs German conventions for written German).
Strings may even mix languages, making locale assignment per-string ambiguous.
Case conversion is only one part of normalization; proper search, collation, and display need more.
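
As a rough illustration of why casing alone is not enough for search, here is a sketch of caseless comparison that combines case folding with Unicode normalization, using only Python's standard library. The names `caseless_key` and `caseless_equal` are hypothetical. This uses the default, language-neutral folding, so it does not address the locale problems described above, and it is not a substitute for real collation.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Build a comparison key: normalize, case-fold, then normalize again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case and canonical-equivalent encodings."""
    return caseless_key(a) == caseless_key(b)


# Folding expands ß, so these two spellings match.
print(caseless_equal("Straße", "STRASSE"))

# Precomposed É and "e" plus a combining acute accent match after normalization.
print(caseless_equal("\u00c9", "e\u0301"))

# Default folding has no Turkish tailoring: I and dotless ı do not match.
print(caseless_equal("I", "\u0131"))
```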

C, C++, and libraries (ICU, Qt, others)

Widespread frustration that C/C++ standard facilities (char, wchar_t, std::tolower/std::toupper, std::string/std::wstring) are ill-suited for Unicode and locale-safe casing.
Backwards compatibility and binary size are cited as reasons full Unicode (ICU‑like) support isn’t in the C++ standard library.
Some argue legacy APIs should be deprecated; others defend them as necessary for ancient platforms and codebases.
Qt’s QString and the string types in Java, C#, Rust, Swift, and Python are mentioned as having “better” or at least clearer Unicode-aware APIs, though none are perfect and many still need ICU or an equivalent for locale-specific behaviour.

Practical advice & coping strategies

Frequent advice:
- Avoid changing case on user text at all; store and display as entered.
- Use locale-aware, whole-string APIs (OS or ICU) when you truly must.
- For internal identifiers, constrain yourself to ASCII and use simple, explicit logic.
- Treat many “simple text operations” (casing, splitting names, collation) as fundamentally hard, not something the language can magically get right.
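
Two of these points can be sketched in Python under loud assumptions. `normalize_identifier` and `lower_for_language` are hypothetical helpers, the allowed identifier alphabet is an arbitrary choice, and the Turkish/Azerbaijani handling is a hand-written simplification (it ignores, for example, a capital I followed by a combining dot). Real code would more likely call an OS or ICU API with an explicit locale.

```python
import string

# Assumed alphabet for internal identifiers; adjust to the actual format.
_ASCII_ID_CHARS = frozenset(string.ascii_letters + string.digits + "_-")


def normalize_identifier(name: str) -> str:
    """Internal keys: reject anything outside a small ASCII set, then lowercase."""
    if not name or not set(name) <= _ASCII_ID_CHARS:
        raise ValueError(f"not an ASCII identifier: {name!r}")
    # The input is known to be ASCII here, so lower() only maps A-Z to a-z.
    return name.lower()


def lower_for_language(text: str, language: str) -> str:
    """Whole-string lowercasing that takes the language as an explicit argument."""
    if language in ("tr", "az"):
        # Simplified tailoring: dotted capital İ -> i, plain capital I -> dotless ı.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(normalize_identifier("Max-Retries"))
print(lower_for_language("ISPARTA", "tr"))
print(lower_for_language("ISPARTA", "en"))
```

Passing the language explicitly, rather than reading a process-wide locale, keeps the caller honest about where that information comes from.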
