2025-05-06

The Turkish İ Problem and Why You Should Care (2012)

Unicode design and the Turkish İ choice

The debate centers on whether Unicode should have given Turkish its own lowercase “Turkish i” code point, distinct from the Latin i.
One side: the current situation “breaks the idea of code points” because case mapping must depend on locale; they would prefer separate letters, the way Greek and Cyrillic lookalikes are encoded separately from Latin.
Other side: Unicode’s goal is graphemes, not semantics; Turkish reuse of ASCII i reflects that, and the real “violations” are visually identical code points with different semantics elsewhere.
Some question this principle given the existence of invisible, semantic-only characters (e.g., zero-width space).
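
To make the locale dependence concrete, here is a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` apply Unicode's default, locale-independent mappings, so the Turkish behavior has to be layered on top. The helper names `upper_tr` and `lower_tr` are hypothetical, not part of any library.

```python
# Default (locale-independent) mappings, as applied by Python's str methods.
default_upper = "i".upper()   # "I": the dot is lost, which is wrong for Turkish
default_lower = "İ".lower()   # "i" followed by U+0307 COMBINING DOT ABOVE

# Hypothetical Turkish-aware helpers: remap the four i-letters first,
# then let the default mappings handle everything else.
_TR_UPPER = {ord("i"): "İ", ord("ı"): "I"}
_TR_LOWER = {ord("İ"): "i", ord("I"): "ı"}


def upper_tr(text: str) -> str:
    """Uppercase using Turkish rules: i -> İ and ı -> I."""
    return text.translate(_TR_UPPER).upper()


def lower_tr(text: str) -> str:
    """Lowercase using Turkish rules: İ -> i and I -> ı."""
    return text.translate(_TR_LOWER).lower()
```

The same input therefore has two correct uppercase forms depending on the language of the text: `"istanbul".upper()` gives `"ISTANBUL"`, while `upper_tr("istanbul")` gives `"İSTANBUL"`.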

Legacy encodings and round-tripping

A core justification: Unicode had to round-trip with existing 8‑bit Turkish encodings, which reused ASCII I and i and added İ and ı in the high-bit range.
Critics argue you could simply define that Latin i is not encodable in the Turkish codepage, as is already true for many characters.
Defenders insist the design goal was: any string encodable in a legacy codepage must survive legacy→Unicode→legacy unchanged, including mixed-language text.
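
The round-trip requirement can be checked directly. The sketch below assumes ISO 8859-9 (Python codec name `iso8859_9`) as the legacy Turkish code page; the thread does not name a specific encoding, and `survives_round_trip` is a hypothetical helper.

```python
LEGACY = "iso8859_9"  # assumed legacy Turkish code page (Latin-5)

# I and i sit in the ASCII range; İ and ı are the high-bit additions.
raw = bytes([0x49, 0x69, 0xDD, 0xFD])
text = raw.decode(LEGACY)                    # "Iiİı"
code_points = [f"U+{ord(ch):04X}" for ch in text]
# code_points == ["U+0049", "U+0069", "U+0130", "U+0131"]


def survives_round_trip(data: bytes, codec: str = LEGACY) -> bool:
    """True if legacy -> Unicode -> legacy returns the original bytes."""
    return data.decode(codec).encode(codec) == data
```

Because the legacy byte 0x69 serves as both the English i and the Turkish i, mapping it to a single code point (U+0069) is what keeps mixed-language text byte-identical after the round trip.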

Keyboards, locales, and usability

Turkish keyboards already have distinct keys for I/ı and İ/i, but not for “English i” vs “Turkish i”.
Adding a separate lowercase code point would require an extra key or constant locale switching, seen as impractical.
Others note similar layout-dependent confusions already exist (e.g., Greek question mark vs semicolon, Cyrillic і vs Latin i).

Security and confusables

Participants note many existing lookalike characters (Latin vs Cyrillic/Greek) are already exploited for phishing (homograph URLs).
Some argue another confusable i would just join an already-large set of problems.
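
A short Python sketch of why these lookalikes are hard to catch by eye. The script check is a crude heuristic that reads the first word of each character's Unicode name; it is an assumption for illustration, not the real Unicode Script property, and `scripts_used` and `looks_mixed_script` are hypothetical names.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def scripts_used(text: str) -> set[str]:
    """Rough script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


def looks_mixed_script(text: str) -> bool:
    """Flag strings whose letters appear to come from more than one script."""
    return len(scripts_used(text)) > 1


latin_i, cyrillic_i = "i", "\u0456"      # visually near-identical letters
semicolon, greek_question = ";", "\u037e"
spoofed = "p\u0430ypal"                  # second letter is Cyrillic а (U+0430)
```

Here `looks_mixed_script(spoofed)` returns `True` while the all-Latin `"paypal"` returns `False`. The Greek question mark is a special case: it is canonically equivalent to the semicolon, so NFC normalization folds it away, whereas Cyrillic і stays distinct from Latin i under every normalization form.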

Real-world bugs and ecosystem behavior

Locale-sensitive case conversion and parsing have caused long-lived bugs: PHP class-name handling, .NET capitalization, and numeric parsing that confuses decimal commas with periods.
.NET has an “invariant globalization” switch now, but older frameworks and mixed responsibilities (UI vs protocol) remain fragile.
SMS encoding (GSM 03.38 vs Unicode) likely contributed to some ı→i substitutions; several commenters doubt specific sensational stories.

Analogous language issues

German ß/ẞ shows non-invertible casing (ß→SS→ss), which is confusing but at least locale-independent.
Typographic vs linguistic views of ß (ligature vs full letter) are debated.
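
The non-invertible ß mapping is easy to reproduce with Python's default case operations. This is only an illustration of the behavior described above; `same_ignoring_case` is a hypothetical helper.

```python
lower = "ß"
upper = lower.upper()    # "SS": one letter becomes two
back = upper.lower()     # "ss": the original ß is not recovered

# The capital letter ẞ (U+1E9E) does lowercase to ß, but ß does not
# uppercase to it under the default mappings.
capital_sharp_s_lowered = "\u1e9e".lower()   # "ß"


def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison via casefold(), which maps ß to "ss"."""
    return a.casefold() == b.casefold()
```

Unlike the Turkish case, none of this depends on locale: `same_ignoring_case("Straße", "STRASSE")` is `True` on any system.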
Other scripts (Thai, and Devanagari as used for Hindi) illustrate how historic encodings also shaped inconsistent Unicode models.

How software should handle text and locales

Strong theme: separate “technical/identifier” text from user-facing text; the former should use invariant, ASCII-ish rules, the latter full Unicode with locale-aware operations.
Critics counter that underspecifying allowed characters for identifiers or disallowing non-ASCII letters can be both exclusionary and brittle.
Some wish for type systems with distinct locale-aware and locale-independent string types to force explicit intent.
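
One way to picture the "distinct string types" wish is the Python sketch below. Everything in it is hypothetical: the class names, the ASCII-only restriction on identifiers (which the critics above would dispute), and the simplified locale check are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass

_ASCII_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"
)
_TR_UPPER = {ord("i"): "İ", ord("ı"): "I"}


@dataclass(frozen=True)
class Identifier:
    """Technical text: protocol keywords, class names, config keys."""

    value: str

    def __post_init__(self) -> None:
        # Assumption for this sketch only: identifiers are ASCII.
        if not self.value.isascii():
            raise ValueError(f"non-ASCII identifier: {self.value!r}")

    def normalized(self) -> str:
        """Lowercase A-Z only, independent of any locale setting."""
        return self.value.translate(_ASCII_LOWER)


@dataclass(frozen=True)
class DisplayText:
    """User-facing text: every case operation must know the language."""

    value: str
    locale: str  # a language tag such as "tr" or "en-US"

    def upper(self) -> str:
        # Turkish and Azerbaijani use the dotted/dotless i mappings.
        if self.locale.split("-")[0] in ("tr", "az"):
            return self.value.translate(_TR_UPPER).upper()
        return self.value.upper()
```

With this split, `Identifier("TITLE").normalized()` is always `"title"`, while `DisplayText("title", "tr").upper()` yields `"TİTLE"`. The intent has to be stated at the point where the string is created, so a locale-sensitive operation cannot be applied to an identifier by accident.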

Cultural and social consequences

The thread discusses a reported murder linked to ı/i mistransliteration.
Many argue the root cause was violent individuals and/or honor culture, not Unicode; others stress that linguistic ambiguity can still act as a trigger in sensitive contexts.
Turkish speakers provide real minimal pairs (e.g., boredom vs sex) to show how ı/i confusion can drastically change meaning, but most see the extreme violence as unrelated to mere typography.

Developer takeaways

Locale and Unicode behavior are repeatedly described as “things programmers must know,” alongside addresses, names, and date/number formats.
Commenters express a desire for a consolidated, practical guide to all these internationalization “falsehoods” to reduce recurring bugs.
