2024-09-09

Bitten by Unicode

Dash‑like Unicode characters

Many visually similar hyphen, minus, and dash characters exist (e.g., HYPHEN-MINUS, HYPHEN, MINUS SIGN, EN/EM dashes, FIGURE DASH, HYPHEN BULLET).
Unicode “confusables” tables list numerous mappings between them, often to the ASCII hyphen-minus (-, U+002D).
Some characters (e.g., U+2010 HYPHEN) appear rarely in real documents, raising questions about their practical usefulness.
There is disagreement over whether these distinct code points add useful semantics or mostly introduce confusion.
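
To make the distinction concrete, here is a small sketch that prints the code point, general category, and official name for a hand-picked sample of the characters mentioned above. The CANDIDATES list and the describe helper are illustrative choices, not something proposed in the discussion, and the list is not exhaustive.

```python
import unicodedata

# A hand-picked sample of dash-like code points (illustrative, not exhaustive).
CANDIDATES = [0x002D, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2043, 0x2212]

def describe(code_point):
    """Return (U+XXXX label, general category, Unicode name) for a code point."""
    ch = chr(code_point)
    return (f"U+{code_point:04X}", unicodedata.category(ch), unicodedata.name(ch))

for cp in CANDIDATES:
    print(*describe(cp))
```

Note that MINUS SIGN is a math symbol (category Sm) rather than dash punctuation (Pd), so a filter based on a single category will miss some look-alikes.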

Unicode in source code

Some advocate ASCII-only (or highlighting non-ASCII) in code to avoid subtle bugs and copy/paste issues, especially from Word, Outlook, PDFs, LaTeX, etc.
Others strongly favor extensive Unicode use in code (identifiers, comments, tests, diagrams, non-English text) for readability and domain fidelity.
Linters and syntax highlighters are suggested as a middle ground: flag suspicious punctuation outside string literals while allowing rich text inside them.
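
A rough sketch of such a check follows. It assumes a language with simple single- or double-quoted string literals and ignores comments, triple-quoted strings, and raw strings. The function name suspicious_punctuation and the choice of categories to flag (punctuation, symbols, separators, and invisible format characters) are assumptions made for illustration.

```python
import re
import unicodedata

# Simplified string-literal pattern: handles backslash escapes only.
STRING_LITERAL = re.compile(
    r'"(?:\\.|[^"\\])*"'    # double-quoted literal
    r"|'(?:\\.|[^'\\])*'"   # single-quoted literal
)

def suspicious_punctuation(line):
    """Return (column, code point, name) for non-ASCII punctuation outside strings."""
    # Blank out string literals with spaces so column numbers stay valid.
    masked = STRING_LITERAL.sub(lambda m: " " * len(m.group()), line)
    hits = []
    for col, ch in enumerate(masked):
        if ord(ch) < 128:
            continue
        category = unicodedata.category(ch)
        # P* = punctuation, S* = symbols, Z* = separators, Cf = invisible format chars.
        if category[0] in "PSZ" or category == "Cf":
            hits.append((col, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits

print(suspicious_punctuation("x = a \u2212 b"))     # MINUS SIGN used as an operator
print(suspicious_punctuation('s = "a \u2013 b"'))   # EN DASH inside a string literal
```

Non-ASCII letters in identifiers are left alone, since only punctuation-like categories are flagged.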

Normalization, regex, and parsing strategies

Suggestions include:
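- Use Unicode properties in regex (\p{Dash}, the deprecated \p{Hyphen}, or general categories such as \p{Pd}) where the engine supports them.
- Apply Unicode compatibility normalization (NFKC/NFKD) or TR39 “confusables” skeleton mappings. Canonical NFC/NFD do not merge hyphen/minus variants, and NFKC/NFKD fold only the few that have compatibility decompositions (e.g., U+FF0D FULLWIDTH HYPHEN-MINUS to -); U+2010 HYPHEN, U+2212 MINUS SIGN, and the en/em dashes pass through unchanged.
- Normalize all hyphen- and minus-like characters to ASCII - (U+002D) before further parsing.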
Others warn that such normalization can erase legitimate semantic differences and that TR39 skeletons were designed for security, not storage/display.
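
The difference between compatibility normalization and an explicit fold can be sketched as follows. The DASH_LIKE set and the fold_dashes helper are illustrative assumptions, and which characters belong in the set is exactly the judgment call debated above. Python's standard re module does not support \p{...} classes, so this sketch uses a translation table instead.

```python
import unicodedata

MINUS_SIGN = "\u2212"
FULLWIDTH_HYPHEN_MINUS = "\uFF0D"

# NFKC folds only characters that have a compatibility decomposition.
print(unicodedata.normalize("NFKC", FULLWIDTH_HYPHEN_MINUS) == "-")
print(unicodedata.normalize("NFKC", MINUS_SIGN) == MINUS_SIGN)

# Explicit fold: an assumed, hand-picked set of dash-like characters.
DASH_LIKE = "\u2010\u2011\u2012\u2013\u2014\u2212\uFE63\uFF0D"
TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in DASH_LIKE})

def fold_dashes(text):
    """Replace every character in DASH_LIKE with ASCII hyphen-minus."""
    return text.translate(TO_ASCII_HYPHEN)

print(fold_dashes("pages 10\u201312, delta \u22123"))
```

Because the fold is lossy, it is probably best applied to a parsing copy of the text rather than to the stored or displayed original.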

PDFs, spreadsheets, and dirty data

PDF text extraction is highlighted as especially error‑prone: text is drawn as font glyph IDs, and mapping those back to Unicode often picks an obscure but “closest” character.
Real‑world datasets (spreadsheets, financial exports, HTML from Word) routinely contain mixed dashes, smart quotes, non‑breaking spaces, odd bullets, and invisible characters.
Many argue that robust pipelines inevitably accumulate data‑cleaning rules; some even resort to machine learning for extraction.
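
A minimal sketch of what one such cleaning rule set might look like, assuming the goal is ASCII-friendly cell text. The REPLACEMENTS table and the clean_cell name are hypothetical; real pipelines tend to grow these tables over time.

```python
import unicodedata

# Assumed replacement table for a few common offenders.
REPLACEMENTS = str.maketrans({
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2022": "*",   # BULLET
})

def clean_cell(text):
    """Apply the replacement table, drop invisible format characters, trim whitespace."""
    text = text.translate(REPLACEMENTS)
    # Category Cf covers ZERO WIDTH SPACE, BOM, SOFT HYPHEN and similar characters.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return text.strip()

print(repr(clean_cell("\u201cNet\u00a0total\u201d\u200b")))
```

Dropping every Cf character also removes joiners that are meaningful in some scripts, so this rule only suits data known to be in scripts that do not rely on them.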

Numbers, currency, and error handling

Using floats for money is debated: some say floats are common for interim calculations but not for settlement; others insist on fixed‑point/decimal types or integers counted in the smallest currency unit.
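
The usual illustration of the concern, as a small sketch; the amounts are arbitrary examples.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2

# Decimal built from strings keeps the exact decimal values.
decimal_total = Decimal("0.10") + Decimal("0.20")

# Integers in the smallest unit (here, cents) avoid fractions entirely.
cents_total = 10 + 20

print(float_total == 0.3)
print(decimal_total == Decimal("0.30"))
print(cents_total == 30)
```
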
Locale issues (decimal vs thousands separators, digit sets, grouping conventions) complicate regex‑based numeric parsing.
Several recommend strict parsing: match entire strings, reject any unexpected character, and surface errors rather than guessing intent.
Others prefer more forgiving normalization of all plausible minus signs, accepting that user intent may trump strict Unicode semantics.
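
Both positions can be sketched side by side. The accepted format (optional leading minus, ASCII digits, optional fractional part with a dot) and the names parse_strict, parse_forgiving, and MINUS_LIKE are assumptions for illustration; a real parser would need to settle locale conventions first.

```python
import re
from decimal import Decimal

# Explicit [0-9] rather than \d, which in Python str patterns also matches
# digits from other scripts.
AMOUNT = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")

def parse_strict(text):
    """Accept only the exact expected format; raise on anything else."""
    # fullmatch requires the whole string to match, including no trailing newline.
    if AMOUNT.fullmatch(text) is None:
        raise ValueError(f"unexpected amount format: {text!r}")
    return Decimal(text)

# Assumed set of characters a user plausibly meant as a minus sign.
MINUS_LIKE = str.maketrans({"\u2212": "-", "\u2010": "-", "\u2012": "-", "\u2013": "-"})

def parse_forgiving(text):
    """Fold plausible minus signs and surrounding whitespace, then parse strictly."""
    return parse_strict(text.strip().translate(MINUS_LIKE))

print(parse_strict("-12.50"))
print(parse_forgiving(" \u221212.50 "))
```

Keeping the forgiving path as a thin layer over the strict one means any character outside the explicitly tolerated set still surfaces as an error.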

Related topics