The Turkish İ Problem and Why You Should Care (2012)

Unicode design and the Turkish İ choice

  • Debate centers on whether Turkish should have had distinct lowercase “Turkish i” and “Latin i” code points.
  • One side: the current situation “breaks the idea of code points” because case mapping must depend on locale; they’d prefer separate code points for the Turkish letters, the way Greek and Cyrillic lookalikes are encoded separately from Latin.
  • Other side: Unicode’s goal is graphemes, not semantics; Turkish reuse of ASCII i reflects that, and the real “violations” are visually identical code points with different semantics elsewhere.
  • Some question this principle given the existence of invisible, semantic-only characters (e.g., zero-width space).
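
The locale dependence at the heart of this debate can be shown in a few lines. This is a minimal sketch: Python's built-in `str.lower()` and `str.upper()` apply Unicode's default, locale-independent mappings, and `turkish_lower` / `turkish_upper` are hypothetical helpers written for this illustration, not a standard API. They assume precomposed (NFC) input.

```python
# Default (locale-independent) Unicode case mapping, as built into Python.
default_lower_I = "I".lower()          # "i"
default_upper_i = "i".upper()          # "I"
default_upper_dotless = "ı".upper()    # "I"  (so ı -> I -> i does not round-trip)
default_lower_dotted = "İ".lower()     # "i" followed by U+0307 COMBINING DOT ABOVE


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: İ -> i and I -> ı (hypothetical helper)."""
    # Handle the two special letters first, then fall back to default mapping.
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ and ı -> I (hypothetical helper)."""
    return text.replace("i", "İ").replace("ı", "I").upper()


if __name__ == "__main__":
    print(turkish_lower("ISPARTA"), turkish_upper("istanbul"))
```

The same code point U+0069 uppercases to two different letters depending on which function is called, which is the behavior one side of the thread objects to.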

Legacy encodings and round-tripping

  • A core justification: Unicode had to round-trip with existing 8‑bit Turkish encodings that used ASCII i plus a high-bit İ.
  • Critics argue you could simply define that Latin i is not encodable in the Turkish codepage, as is already true for many characters.
  • Defenders insist the design goal was: any string encodable in a legacy codepage must survive legacy→Unicode→legacy unchanged, including mixed-language text.
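
The round-trip requirement can be checked directly. The sketch below uses ISO 8859-9 (Python codec name `iso8859_9`) as one example of an 8-bit Turkish encoding; the thread does not name a specific codepage, so that choice is an assumption, and `legacy_round_trip` is a hypothetical helper.

```python
CODEC = "iso8859_9"  # assumed example of a legacy 8-bit Turkish encoding


def legacy_round_trip(raw: bytes) -> bytes:
    """Decode legacy bytes to Unicode, then encode back to the legacy codec."""
    return raw.decode(CODEC).encode(CODEC)


# I and i sit in the ASCII range; İ and ı use high-bit bytes.
sample = bytes([0x49, 0x69, 0xDD, 0xFD])
decoded = sample.decode(CODEC)  # "Iiİı"

# Every possible byte value survives legacy -> Unicode -> legacy unchanged.
all_bytes = bytes(range(256))
survives = legacy_round_trip(all_bytes) == all_bytes

if __name__ == "__main__":
    print([f"U+{ord(ch):04X}" for ch in decoded], survives)
```

Because the legacy byte 0x69 serves as both the English and the Turkish lowercase i, a lossless round trip leaves room for only one Unicode code point for it.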

Keyboards, locales, and usability

  • Turkish keyboards already have distinct keys for I/ı and İ/i, but not for “English i” vs “Turkish i”.
  • Adding a separate lowercase code point would require an extra key or constant locale switching, seen as impractical.
  • Others note similar layout-dependent confusions already exist (e.g., Greek question mark vs semicolon, Cyrillic і vs Latin i).

Security and confusables

  • Participants note many existing lookalike characters (Latin vs Cyrillic/Greek) are already exploited for phishing (homograph URLs).
  • Some argue another confusable i would just join an already-large set of problems.
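
A rough way to see the lookalike problem is to inspect character names with the standard `unicodedata` module. This is only a sketch: `script_of` and `scripts_in` are hypothetical helpers, and taking the first word of a character's Unicode name is a crude stand-in for real script detection, not how production confusable checks work.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Guess a script from the first word of the character name (heuristic)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(text: str) -> set[str]:
    """Collect the guessed scripts of all alphabetic characters in text."""
    return {script_of(ch) for ch in text if ch.isalpha()}


latin_word = "paypal"
spoofed_word = "p\u0430ypal"  # U+0430 CYRILLIC SMALL LETTER A replaces Latin a

# The Greek question mark is canonically equivalent to the ASCII semicolon,
# so NFC normalization folds it; Cyrillic lookalikes are NOT folded this way.
greek_question_mark = "\u037e"
normalized = unicodedata.normalize("NFC", greek_question_mark)

if __name__ == "__main__":
    print(scripts_in(latin_word), scripts_in(spoofed_word), normalized == ";")
```

A label that mixes scripts, as `spoofed_word` does, is one signal a homograph check might flag.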

Real-world bugs and ecosystem behavior

  • Locale-sensitive case conversion and parsing have caused long-lived bugs: PHP class-name handling, .NET capitalization under the Turkish culture, numeric parsing with commas vs periods.
  • .NET now has an “invariant globalization” switch, but older frameworks and code that mixes UI and protocol responsibilities remain fragile.
  • SMS encoding (GSM 03.38 vs Unicode) likely contributed to some ı→i substitutions; several commenters doubt specific sensational stories.
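
The comma-versus-period class of bug can be illustrated with a parser that takes its separators as explicit arguments instead of reading them from an ambient locale. This is a sketch under that assumption; `parse_decimal` is a hypothetical helper, not a library function.

```python
from decimal import Decimal


def parse_decimal(text: str, decimal_sep: str = ".", group_sep: str = ",") -> Decimal:
    """Parse a number using explicitly supplied separators (hypothetical helper)."""
    # Drop grouping separators first, then normalize the decimal separator.
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


# The same amount written in two conventions.
english_style = parse_decimal("1,234.56")
turkish_style = parse_decimal("1.234,56", decimal_sep=",", group_sep=".")

# Parsing with the wrong convention does not fail; it silently returns
# a different value, which is why these bugs survive for so long.
misparsed = parse_decimal("1,5")  # intended as one and a half

if __name__ == "__main__":
    print(english_style, turkish_style, misparsed)
```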

Analogous language issues

  • German ß/ẞ shows non-invertible casing (ß→SS→ss), which is confusing but at least locale-independent.
  • Typographic vs linguistic views of ß (ligature vs full letter) are debated.
  • Other writing systems (Thai, Devanagari as used for Hindi) illustrate how historic encodings also shaped inconsistent Unicode models.
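
The ß behavior is easy to reproduce with Python's built-in, locale-independent string methods. A small sketch; `same_ignoring_case` is a hypothetical helper showing one common workaround (case folding), not something proposed in the thread.

```python
word = "Straße"

upper = word.upper()         # ß expands to two letters: "STRASSE"
round_trip = upper.lower()   # "strasse", so the original ß is not recovered

capital_eszett = "\u1e9e"    # ẞ LATIN CAPITAL LETTER SHARP S
lower_of_capital = capital_eszett.lower()  # "ß"


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare caselessly via case folding rather than lower() or upper()."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(upper, round_trip, same_ignoring_case("Straße", "STRASSE"))
```

Unlike the Turkish i, none of these results change with the user's locale.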

How software should handle text and locales

  • Strong theme: separate “technical/identifier” text from user-facing text; the former should use invariant, ASCII-ish rules, the latter full Unicode with locale-aware operations.
  • Critics counter that underspecifying allowed characters for identifiers or disallowing non-ASCII letters can be both exclusionary and brittle.
  • Some wish for type systems with distinct locale-aware and locale-independent string types to force explicit intent.
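
One way to picture the "two string types" wish is a pair of small wrapper classes. Everything here is hypothetical and invented for illustration: the names `Identifier` and `DisplayText`, the locale tag format, and the hand-written Turkish rule (a real system would more likely delegate to a library such as ICU).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """Technical text: ASCII-only case rules, never locale-dependent."""
    value: str

    def lower(self) -> "Identifier":
        # Map only A-Z; every other character is left untouched.
        mapped = "".join(
            chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in self.value
        )
        return Identifier(mapped)

    def matches(self, other: "Identifier") -> bool:
        return self.lower() == other.lower()


@dataclass(frozen=True)
class DisplayText:
    """User-facing text: case operations need an explicit locale tag."""
    value: str
    locale: str  # e.g. "tr_TR" or "en_US" (assumed tag format)

    def lower(self) -> "DisplayText":
        text = self.value
        language = self.locale.replace("-", "_").split("_")[0]
        if language in ("tr", "az"):
            # Turkish and Azerbaijani: İ -> i and I -> ı before default mapping.
            text = text.replace("İ", "i").replace("I", "ı")
        return DisplayText(text.lower(), self.locale)


if __name__ == "__main__":
    print(Identifier("TITLE").lower().value)
    print(DisplayText("TITLE", "tr_TR").lower().value)
```

With separate types, a static type checker can reject code that passes a `DisplayText` where an `Identifier` is expected, which is the explicit intent the commenters ask for.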

Cultural and social consequences

  • The thread discusses a reported murder linked to an ı→i character substitution.
  • Many argue the root cause was violent individuals and/or honor culture, not Unicode; others stress that linguistic ambiguity can still act as a trigger in sensitive contexts.
  • Turkish speakers provide real minimal pairs (e.g., boredom vs sex) to show how ı/i confusion can drastically change meaning, but most see the extreme violence as unrelated to mere typography.

Developer takeaways

  • Locale and Unicode behavior are repeatedly described as “things programmers must know,” alongside addresses, names, and date/number formats.
  • Commenters express a desire for a consolidated, practical guide to all these internationalization “falsehoods” to reduce recurring bugs.