Bitten by Unicode

Dash‑like Unicode characters

  • Many visually similar hyphen/minus/dash characters exist (e.g., HYPHEN-MINUS, HYPHEN, MINUS SIGN, EN/EM dashes, figure dash, hyphen bullet, etc.).
  • Unicode “confusables” tables list numerous mappings between them, often to ASCII - (U+002D HYPHEN-MINUS).
  • Some characters (e.g., U+2010 HYPHEN) appear rarely in real documents, raising questions about their practical usefulness.
  • There is disagreement over whether these distinct code points add useful semantics or mostly introduce confusion.
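
To make the list above concrete, here is a small Python sketch that prints the code point, official name, and general category for a hand-picked sample of dash-like characters. The sample list and the `describe` helper are illustrative choices, not an exhaustive or standard set.

```python
import unicodedata

# A hand-picked sample of dash-like code points (illustrative, not exhaustive).
DASH_LIKE = [
    "\u002D",  # ASCII hyphen-minus
    "\u2010",
    "\u2011",
    "\u2012",
    "\u2013",
    "\u2014",
    "\u2043",
    "\u2212",
]

def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME (General_Category)'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"

for ch in DASH_LIKE:
    print(describe(ch))
```

Note that not all of these share a general category: most are dash punctuation (Pd), but MINUS SIGN is a math symbol (Sm) and HYPHEN BULLET is other punctuation (Po), so a check on category alone will miss some of them.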

Unicode in source code

  • Some advocate ASCII-only (or highlighting non-ASCII) in code to avoid subtle bugs and copy/paste issues, especially from Word, Outlook, PDFs, LaTeX, etc.
  • Others strongly favor extensive Unicode use in code (identifiers, comments, tests, diagrams, non-English text) for readability and domain fidelity.
  • A suggested middle ground is for linters and syntax highlighters to flag suspicious punctuation outside string literals while allowing rich text inside them.
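
The linter idea can be sketched in a few lines of Python. This is a deliberately naive, line-based check: the `STRING_LITERAL` pattern and the `flag_non_ascii` function are hypothetical names, and the pattern only understands simple single-line quoted literals (no comments, triple-quoted strings, or language-specific prefixes). A real linter would use the language's own tokenizer.

```python
import re

# Naive single-line string literal: double- or single-quoted, with backslash escapes.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'')

def flag_non_ascii(line: str):
    """Return (column, char) pairs for non-ASCII characters outside string literals."""
    # Blank out literals with same-length padding so reported columns stay aligned.
    masked = STRING_LITERAL.sub(lambda m: " " * len(m.group()), line)
    return [(col, ch) for col, ch in enumerate(masked) if ord(ch) > 127]

# U+2212 MINUS SIGN pasted into an expression is flagged;
# the en dash inside the string literal is left alone.
line = 'x = a \u2212 len("na\u00efve \u2013 text")'
for col, ch in flag_non_ascii(line):
    print(f"column {col}: U+{ord(ch):04X}")
```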

Normalization, regex, and parsing strategies

  • Suggestions include:
    • Use Unicode properties in regex where the engine supports them (\p{Dash}, the deprecated \p{Hyphen}, or general categories such as \p{Pd}).
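    • Apply Unicode compatibility normalization (NFKC/NFKD) or TR39 “confusables” skeleton mappings. NFC/NFD merge none of the hyphen/minus variants, and NFKC/NFKD fold only the compatibility forms (e.g., U+FF0D FULLWIDTH HYPHEN-MINUS to -), leaving U+2010 HYPHEN, U+2212 MINUS SIGN, and the en/em dashes distinct.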
    • Normalize all hyphen/minus-like characters to ASCII - before further parsing.
  • Others warn that such normalization can erase legitimate semantic differences and that TR39 skeletons were designed for security, not storage/display.
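
A short Python sketch of what normalization does and does not cover, followed by an explicit fold to ASCII. The `MINUS_LIKE` set and the `fold_minus` helper are assumptions chosen for illustration; which characters to fold is exactly the judgment call debated above, and here the em dash is deliberately left alone.

```python
import unicodedata

# Standard normalization only folds compatibility variants.
print(unicodedata.normalize("NFKC", "\uFF0D") == "-")        # fullwidth hyphen-minus folds
print(unicodedata.normalize("NFKC", "\u2212") == "\u2212")   # MINUS SIGN is unchanged
print(unicodedata.normalize("NFC", "\uFF0D") == "\uFF0D")    # NFC leaves even fullwidth alone

# Hand-picked hyphen/minus-like code points to fold (an assumption, not a standard list).
MINUS_LIKE = "\u2010\u2011\u2012\u2013\u2212\uFE63\uFF0D"
FOLD_TO_ASCII = str.maketrans({ch: "-" for ch in MINUS_LIKE})

def fold_minus(text: str) -> str:
    """Replace each listed hyphen/minus-like character with ASCII '-'."""
    return text.translate(FOLD_TO_ASCII)

print(fold_minus("\u22125 \u2013 3"))
```

Because the fold is lossy, one option is to apply it only to the copy of the text handed to the parser and keep the original string for storage and display.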

PDFs, spreadsheets, and dirty data

  • PDF text extraction is highlighted as especially error‑prone: fonts map to glyph IDs, and reverse mapping often picks obscure but “closest” Unicode characters.
  • Real‑world datasets (spreadsheets, financial exports, HTML from Word) routinely contain mixed dashes, smart quotes, non‑breaking spaces, odd bullets, and invisible characters.
  • Many argue that robust pipelines inevitably accumulate data‑cleaning rules; some even resort to machine learning for extraction.
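
As an illustration of how such cleaning rules tend to look, here is a minimal Python sketch for a single spreadsheet cell. The rule table, the `INVISIBLE` pattern, and the `clean_cell` function are hypothetical; real pipelines usually grow these tables from characters actually observed in their data.

```python
import re

# Illustrative replacements only; extend from characters seen in real inputs.
REPLACEMENTS = str.maketrans({
    "\u00A0": " ",   # NO-BREAK SPACE
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
})

# Zero-width and invisible characters that commonly survive copy/paste.
INVISIBLE = re.compile("[\u200B\u200C\u200D\u2060\uFEFF]")

def clean_cell(value: str) -> str:
    """Drop invisible characters, map typographic punctuation to ASCII, trim whitespace."""
    value = INVISIBLE.sub("", value)
    return value.translate(REPLACEMENTS).strip()

print(clean_cell("\u00A0\u201CQ1\u201D\u200B total\uFEFF "))
```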

Numbers, currency, and error handling

  • Debate over using floats for money: some say floats are common for interim calculations but not settlement; others insist on fixed‑point/decimal or smallest‑unit integers.
  • Locale issues (decimal vs thousands separators, digit sets, grouping conventions) complicate regex‑based numeric parsing.
  • Several recommend strict parsing: match entire strings, reject any unexpected character, and surface errors rather than guessing intent.
  • Others prefer more forgiving normalization of all plausible minus signs, accepting that user intent may trump strict Unicode semantics.
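
The strict and forgiving positions can be contrasted in a short Python sketch. The accepted format (optional ASCII minus, ASCII digits, up to two decimal places, no grouping separators) and the function names are assumptions for illustration; the pattern uses [0-9] rather than \d because \d in Python also matches non-ASCII digits.

```python
import re
from decimal import Decimal

# Assumed format: optional ASCII minus, ASCII digits, optional 1-2 decimal places.
STRICT_AMOUNT = re.compile(r"-?[0-9]+(?:\.[0-9]{1,2})?")

def parse_amount_strict(text: str) -> Decimal:
    """Match the entire string or raise; never guess at intent."""
    if STRICT_AMOUNT.fullmatch(text) is None:
        raise ValueError(f"unexpected amount format: {text!r}")
    return Decimal(text)

def parse_amount_lenient(text: str) -> Decimal:
    """Fold a few plausible minus signs to ASCII first, then parse strictly."""
    folded = text.strip().replace("\u2212", "-").replace("\u2013", "-")
    return parse_amount_strict(folded)

print(parse_amount_strict("-12.50"))
print(parse_amount_lenient("\u221212.50"))
```

With these definitions, parse_amount_strict("\u221212.50") raises ValueError, while the lenient variant accepts the same input. Both return Decimal rather than float, in line with the fixed-point side of the money debate.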