A popular but wrong way to convert a string to uppercase or lowercase
Scope of the problem (case conversion & Unicode)
- Many commenters agree the article correctly shows that naïve per-character upper/lowercasing is broken for Unicode, especially in UTF‑16/UTF‑32.
- Key issues: multi-code-unit characters, case mappings that change string length, and language-specific rules (e.g., German ß/ẞ, Turkish dotted/dotless i, French accents, Greek sigma forms).
- Several note that even defining “correct” casing is context- and time-dependent (orthography changes, historical data, versioned locales).
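A few of the issues above can be reproduced in a short sketch. It uses Python only for illustration (the thread itself is mostly about C and C++), and `per_char_lower` is a hypothetical helper standing in for the naïve per-character loop. Python's `str.upper()` and `str.lower()` apply the Unicode default mappings and take no locale, so the Turkish case is shown as a limitation rather than solved.

```python
def per_char_lower(text: str) -> str:
    """Naive approach: lowercase each code point in isolation (hypothetical helper)."""
    return "".join(ch.lower() for ch in text)


# 1. Case mapping can change the length of a string.
sharp_s_upper = "ß".upper()          # German sharp s becomes two letters
assert sharp_s_upper == "SS"
assert len("ß") == 1 and len(sharp_s_upper) == 2

# 2. Greek capital sigma lowercases to a different letter at the end of a
#    word. A per-character loop never sees the surrounding context.
word = "ΟΔΟΣ"
assert word.lower().endswith("\u03c2")           # final sigma
assert per_char_lower(word).endswith("\u03c3")   # non-final sigma (wrong here)

# 3. No locale parameter: Turkish text would expect "I" -> U+0131 (dotless i),
#    but the default mapping always yields a plain ASCII "i".
assert "I".lower() == "i"

# 4. Capital I with dot above lowercases to "i" plus a combining dot above,
#    again changing the length.
assert "\u0130".lower() == "i\u0307"
```

The point of the sketch is only that a whole-string API can apply context-sensitive rules that a per-character loop cannot, and that even a whole-string API without a locale argument cannot cover language-specific rules.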
ASCII-only vs real-world text
- One camp says 90–99% of its string manipulation is internal and ASCII-only (logs, config keys, protocols, parser internals). There, simple ASCII case operations are “good enough,” and full Unicode handling is overkill.
- Counter-camp: most user-visible software deals with real names, addresses, UI text, and search, where ASCII-only handling breaks for many users and languages. This camp argues that ASCII assumptions reflect a narrow “US/English bubble.”
- Several point out that user-provided data and filenames can contain arbitrary bytes or invalid UTF-8 or UTF-16, so you often must treat them as opaque byte sequences.
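The "opaque byte sequence" point can be sketched as follows, assuming Python and a made-up filename. The helpers `to_text_lossless` and `to_bytes_lossless` are hypothetical names. They rely on the standard `surrogateescape` error handler, which smuggles undecodable bytes through a `str` so the original bytes can be recovered exactly.

```python
def to_text_lossless(raw: bytes) -> str:
    """Decode as UTF-8, keeping undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes_lossless(text: str) -> bytes:
    """Inverse of to_text_lossless: restore the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


# Illustrative filename: Latin-1 encoded "café.txt", which is not valid UTF-8.
raw_name = b"caf\xe9.txt"

try:
    raw_name.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

text_name = to_text_lossless(raw_name)
round_trip = to_bytes_lossless(text_name)

assert strict_decode_ok is False      # strict decoding rejects the name
assert round_trip == raw_name         # the bytes survive unchanged
```

The decoded string is suitable for passing around and handing back to the OS, not for case conversion or display, since the escaped bytes have no defined meaning as text.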
Locale and language dependence
- Strong theme: you cannot correctly process human text without knowing its language/locale; different countries and even regions (e.g., Swiss vs German rules) can differ.
- Strings may even mix languages, making locale assignment per-string ambiguous.
- Case conversion is only one part of normalization; proper search, collation, and display need more.
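As a rough illustration of why lowercasing alone is not enough for search, the sketch below compares strings using Unicode case folding plus normalization. It assumes Python's standard `unicodedata` module; `caseless_key` and `caseless_equal` are hypothetical helper names. It is still locale-independent and does nothing for collation or display, so it covers only part of what the commenters describe.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Build a comparison key: normalize, case-fold, then normalize again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Locale-independent caseless comparison (not a substitute for collation)."""
    return caseless_key(a) == caseless_key(b)


# Lowercasing both sides does not make these match, because "ß" stays "ß".
assert "Straße".lower() != "STRASSE".lower()

# Case folding maps "ß" to "ss", so the folded keys agree.
assert caseless_equal("Straße", "STRASSE")

# Precomposed "é" and "E" followed by a combining acute accent also match
# once both sides are normalized.
assert caseless_equal("\u00e9", "E\u0301")
```

Folding before comparison is a design choice for matching only; the stored text should stay as entered.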
C, C++, and libraries (ICU, Qt, others)
- Widespread frustration that C/C++ standard facilities (char, wchar_t, std::tolower/std::toupper, std::string/std::wstring) are ill-suited for Unicode and locale-safe casing.
- Backwards compatibility and binary size are cited as reasons full Unicode (ICU‑like) support isn’t in the C++ standard library.
- Some argue legacy APIs should be deprecated; others defend them as necessary for ancient platforms and codebases.
- Qt’s QString and the string APIs of Java, C#, Rust, Swift, and Python are mentioned as having “better” or at least clearer Unicode-aware behavior, though none are perfect and they often still need ICU or an equivalent.
Practical advice & coping strategies
- Frequent advice:
  - Avoid changing case on user text at all; store and display as entered.
  - Use locale-aware, whole-string APIs (OS or ICU) when you truly must.
  - For internal identifiers, constrain yourself to ASCII and use simple, explicit logic.
  - Treat many “simple text operations” (casing, splitting names, collation) as fundamentally hard, not something the language can magically get right.
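The "constrain yourself to ASCII and use simple, explicit logic" advice might look like the sketch below. It assumes Python; `ascii_lower` is a hypothetical helper, and rejecting non-ASCII input with an exception is one possible policy, not something the thread prescribes.

```python
import string

# Translation table that touches only A-Z; every other character is left alone.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(identifier: str) -> str:
    """Lowercase an internal identifier, refusing anything outside ASCII."""
    if not identifier.isascii():
        raise ValueError(f"identifier is not ASCII: {identifier!r}")
    return identifier.translate(_ASCII_LOWER)


header = ascii_lower("Content-Type")
assert header == "content-type"

# Why be explicit: the general-purpose str.lower() maps the Kelvin sign
# (U+212A) to an ASCII "k", so non-ASCII input can silently turn into
# something that looks like a valid ASCII key.
assert "\u212a".lower() == "k"

try:
    ascii_lower("\u212aey")
    rejected = False
except ValueError:
    rejected = True
assert rejected
```

Failing loudly on non-ASCII input keeps the ASCII assumption visible at the boundary instead of hiding it inside a general Unicode call.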