The Turkish İ Problem and Why You Should Care (2012)
Unicode design and the Turkish İ choice
- Debate centers on whether Turkish should have had distinct lowercase “Turkish i” and “Latin i” code points.
- One side: the current situation “breaks the idea of code points” because case mapping must depend on locale; they’d prefer separate Turkish letters, just as Greek and Cyrillic letters are encoded separately from their Latin lookalikes.
- Other side: Unicode’s goal is graphemes, not semantics; Turkish reuse of ASCII i reflects that, and the real “violations” are visually identical code points with different semantics elsewhere.
- Some question this principle given the existence of invisible, semantic-only characters (e.g., zero-width space).
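To make the "case mapping must depend on locale" point concrete, here is a minimal Python sketch. Python's built-in `str.upper` and `str.lower` apply Unicode's default, locale-independent mappings. The helpers `turkish_upper` and `turkish_lower` are hypothetical names written for this illustration, not a standard API, and they only cover the four i-like letters.

```python
# Hypothetical helpers: Python's str methods are not locale-aware, so the
# Turkish special cases are applied by hand before the default mapping.
def turkish_upper(text: str) -> str:
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text: str) -> str:
    return text.replace("İ", "i").replace("I", "ı").lower()


# Same input, two different "correct" answers depending on the language.
default_upper = "istanbul".upper()
turkish_upper_result = turkish_upper("istanbul")

# The default lowercase of İ (U+0130) is "i" plus COMBINING DOT ABOVE
# (U+0307), i.e. two code points rather than plain ASCII "i".
default_lower_of_dotted_capital = "İ".lower()

print(default_upper, turkish_upper_result, len(default_lower_of_dotted_capital))
```

The point of the sketch is only that the same code point `i` needs different uppercase forms depending on the language of the text, which is exactly what the first side of the debate objects to.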
Legacy encodings and round-tripping
- A core justification: Unicode had to round-trip with existing 8‑bit Turkish encodings that used ASCII i plus a high-bit İ.
- Critics argue you could simply define that Latin i is not encodable in the Turkish codepage, as is already true for many characters.
- Defenders insist the design goal was: any string encodable in a legacy codepage must survive legacy→Unicode→legacy unchanged, including mixed-language text.
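The round-trip requirement can be checked with a short Python sketch. It assumes ISO 8859-9 (Latin-5) as the representative 8-bit Turkish encoding, which Python exposes as the `iso8859_9` codec. The function name `round_trips` and the sample sentence are made up for this illustration.

```python
# ISO 8859-9 (Latin-5) is used here as an example of an 8-bit Turkish
# encoding; the thread does not name a specific code page.
LEGACY = "iso8859_9"


def round_trips(legacy_bytes: bytes) -> bool:
    """True if legacy -> Unicode -> legacy returns the same bytes."""
    return legacy_bytes.decode(LEGACY).encode(LEGACY) == legacy_bytes


# Mixed English/Turkish text in one legacy byte string.
mixed = "İstanbul is nice, ılık çay".encode(LEGACY)

dotted_small = "i".encode(LEGACY)    # plain ASCII byte
dotted_capital = "İ".encode(LEGACY)  # single high-bit byte

print(round_trips(mixed), dotted_small, dotted_capital)
```

Because the legacy bytes give no indication of whether a given `i` belongs to an English or a Turkish word, a separate "Turkish i" code point would have left the decoder with no way to choose, which is the defenders' argument.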
Keyboards, locales, and usability
- Turkish keyboards already have distinct keys for I/ı and İ/i, but not for “English i” vs “Turkish i”.
- Adding a separate lowercase code point would require an extra key or constant locale switching, seen as impractical.
- Others note similar layout-dependent confusions already exist (e.g., Greek question mark vs semicolon, Cyrillic і vs Latin i).
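The two lookalike pairs mentioned above can be inspected with Python's standard `unicodedata` module. This is a small sketch for illustration; the variable names are arbitrary.

```python
import unicodedata

# (lookalike, ASCII/Latin counterpart)
pairs = [("\u037e", ";"), ("\u0456", "i")]

for a, b in pairs:
    print(f"U+{ord(a):04X} {unicodedata.name(a)}  vs  "
          f"U+{ord(b):04X} {unicodedata.name(b)}")

# The Greek question mark is canonically equivalent to the semicolon,
# so normalization folds it away.
greek_folds = unicodedata.normalize("NFC", "\u037e") == ";"

# The Cyrillic letter is a distinct letter; even compatibility
# normalization does not turn it into Latin "i".
cyrillic_folds = unicodedata.normalize("NFKC", "\u0456") == "i"

print(greek_folds, cyrillic_folds)
```

So one of the two confusions disappears under normalization while the other does not, and which character a user types depends entirely on the active keyboard layout.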
Security and confusables
- Participants note many existing lookalike characters (Latin vs Cyrillic/Greek) are already exploited for phishing (homograph URLs).
- Some argue another confusable i would just join an already-large set of problems.
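As a rough illustration of how homograph lookalikes are sometimes flagged, the sketch below guesses each letter's script from the first word of its Unicode character name. That heuristic is an assumption made for brevity, not how browsers or registries actually decide, and the function names are hypothetical.

```python
import unicodedata


def scripts_used(label: str) -> set[str]:
    """Crude script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def looks_mixed(label: str) -> bool:
    """Flag labels whose letters appear to come from more than one script."""
    return len(scripts_used(label)) > 1


spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(scripts_used(spoofed), looks_mixed(spoofed))
print(looks_mixed("paypal"), looks_mixed("ılık"))
```

Note that the existing Turkish letters all carry LATIN names, so a check like this treats Turkish text as single-script. Whether a hypothetical separate "Turkish i" would have been flagged or not depends on how it had been encoded, which is part of what the thread argues about.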
Real-world bugs and ecosystem behavior
- Locale-sensitive case and parsing have caused long-lived bugs: PHP class-name handling, .NET capitalization, numeric parsing with commas vs periods.
- .NET has an “invariant globalization” switch now, but older frameworks and mixed responsibilities (UI vs protocol) remain fragile.
- SMS encoding (GSM 03.38 vs Unicode) likely contributed to some ı→i substitutions; several commenters doubt specific sensational stories.
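The comma-versus-period parsing bugs usually come from letting an ambient locale decide how to read a number. A minimal sketch of the alternative, where the caller states the separators explicitly, might look like this. `parse_decimal` is a hypothetical helper, not a library function.

```python
from decimal import Decimal


def parse_decimal(text: str, decimal_sep: str = ".", group_sep: str = ",") -> Decimal:
    """Parse a number using separators the caller states explicitly."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


# Protocol or file data: fixed, locale-independent convention.
protocol_value = parse_decimal("1234.56")

# User input typed with Turkish conventions: "." groups, "," is the decimal mark.
turkish_ui_value = parse_decimal("1.234,56", decimal_sep=",", group_sep=".")

# The same characters mean different numbers under the two conventions.
ambiguous_a = parse_decimal("1.234")
ambiguous_b = parse_decimal("1.234", decimal_sep=",", group_sep=".")

print(protocol_value, turkish_ui_value, ambiguous_a, ambiguous_b)
```

The last two values show why a parser that silently follows the machine's locale is fragile for protocol data: the input string is identical, but the result differs by a factor of a thousand.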
Analogous language issues
- German ß/ẞ shows non-invertible casing (ß→SS→ss), which is confusing but at least locale-independent.
- Typographic vs linguistic views of ß (ligature vs full letter) are debated.
- Other scripts (Thai, Hindi) illustrate how historic encodings also shaped inconsistent Unicode models.
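The ß→SS→ss chain can be reproduced directly with Python's built-in string methods, which use Unicode's default (locale-independent) mappings. The variable names are just for this sketch.

```python
upper = "ß".upper()   # default full case mapping expands to two letters
back = upper.lower()  # lowercasing does not restore the original

# The capital form ẞ (U+1E9E) lowercases to ß, but ß does not uppercase to it.
capital_eszett_lower = "\u1e9e".lower()

# Case folding is the tool intended for caseless comparison.
same_when_folded = "straße".casefold() == "STRASSE".casefold()

print(upper, back, capital_eszett_lower, same_when_folded)
```

Unlike the Turkish case, none of these results change with the language of the text, which is the "at least locale-independent" point made above.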
How software should handle text and locales
- Strong theme: separate “technical/identifier” text from user-facing text; the former should use invariant, ASCII-ish rules, the latter full Unicode with locale-aware operations.
- Critics counter that underspecifying allowed characters for identifiers or disallowing non-ASCII letters can be both exclusionary and brittle.
- Some wish for type systems with distinct locale-aware and locale-independent string types to force explicit intent.
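The wish for distinct string types could be approximated as follows. This is only a sketch of the idea: `IdentifierText` and `DisplayText` are hypothetical names, and the Turkish/Azeri branch handles nothing beyond the i-like letters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierText:
    """Protocol/identifier text: compared with locale-independent rules."""
    value: str

    def key(self) -> str:
        # Locale-independent folding; no language parameter exists here.
        return self.value.casefold()


@dataclass(frozen=True)
class DisplayText:
    """User-facing text: casing requires an explicit language tag."""
    value: str

    def upper(self, language: str) -> str:
        if language in ("tr", "az"):
            # Apply the dotted/dotless i rules before the default mapping.
            return self.value.replace("i", "İ").replace("ı", "I").upper()
        return self.value.upper()


header = IdentifierText("Title")
label = DisplayText("title")
print(header.key(), label.upper("tr"), label.upper("en"))
```

The design choice is that neither type offers a casing method that silently picks up the process locale: identifiers cannot be language-cased at all, and display text cannot be cased without naming a language.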
Cultural and social consequences
- The thread discusses a reported murder linked to ı/i mistransliteration.
- Many argue the root cause was violent individuals and/or honor culture, not Unicode; others stress that linguistic ambiguity can still act as a trigger in sensitive contexts.
- Turkish speakers provide real minimal pairs (e.g., boredom vs sex) to show how ı/i confusion can drastically change meaning, but most see the extreme violence as unrelated to mere typography.
Developer takeaways
- Locale and Unicode behavior are repeatedly described as “things programmers must know,” alongside addresses, names, and date/number formats.
- Commenters express a desire for a consolidated, practical guide to all these internationalization “falsehoods” to reduce recurring bugs.