Bitten by Unicode
Dash‑like Unicode characters
- Many visually similar hyphen/minus/dash characters exist (e.g., HYPHEN-MINUS, HYPHEN, MINUS SIGN, EN/EM dashes, figure dash, hyphen bullet, etc.).
- Unicode “confusables” tables show numerous mappings between them, often to the ASCII hyphen-minus (-).
- Some characters (e.g., U+2010 HYPHEN) appear rarely in real documents, raising questions about their practical usefulness.
- There is disagreement over whether these distinct code points add useful semantics or mostly introduce confusion.
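To make the list of look-alikes concrete, here is a small Python sketch that prints the name and general category of a few dash-like code points. The sample list is hand-picked for illustration and is not exhaustive; `DASH_LIKE` and `describe` are names invented for this example.

```python
import unicodedata

# A hand-picked, non-exhaustive sample of dash-like code points (an assumption
# made for illustration; Unicode contains many more).
DASH_LIKE = [0x002D, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2043, 0x2212]


def describe(cp: int) -> str:
    """Return the code point, its Unicode name, and its general category."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})"


for cp in DASH_LIKE:
    print(describe(cp))
```

Not all of these share a general category: the dashes are dash punctuation (Pd), while MINUS SIGN is a math symbol (Sm), which is one reason a single category test does not catch every variant.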
Unicode in source code
- Some advocate ASCII-only (or highlighting non-ASCII) in code to avoid subtle bugs and copy/paste issues, especially from Word, Outlook, PDFs, LaTeX, etc.
- Others strongly favor extensive Unicode use in code (identifiers, comments, tests, diagrams, non-English text) for readability and domain fidelity.
- Some suggest using linters and syntax highlighters to tell the two cases apart: flag suspicious punctuation outside string literals while allowing rich text inside them.
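The linter idea could look roughly like the sketch below. It is deliberately simplified and rests on stated assumptions: it only understands single-line, double-quoted strings with backslash escapes, knows nothing about comments or any real language's grammar, and treats non-ASCII punctuation, symbols, separators, and control/format characters as "suspicious" while leaving non-ASCII letters alone. The function name `flag_suspicious` is hypothetical.

```python
import unicodedata


def flag_suspicious(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for suspicious characters outside strings.

    Simplifying assumptions: strings are double-quoted, stay on one line, and
    use backslash escapes. Comments are not treated specially.
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        in_string = False
        escaped = False
        for col, ch in enumerate(line, start=1):
            if in_string:
                if escaped:
                    escaped = False      # character after a backslash is literal
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False    # closing quote
            elif ch == '"':
                in_string = True         # opening quote
            elif ord(ch) > 127 and unicodedata.category(ch)[0] in "PSZC":
                # Non-ASCII punctuation, symbol, separator, or control/format
                # character in code position; letters and digits pass through.
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings


sample = 'x = a \u2212 b\nlabel = "caf\u00e9 \u2013 bar"\n'
print(flag_suspicious(sample))
```

In the sample, the MINUS SIGN between `a` and `b` is reported, while the accented letter and en dash inside the string literal are left alone.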
Normalization, regex, and parsing strategies
- Suggestions include:
  - Use Unicode properties in regex (\p{Hyphen}, \p{Dash}, categories) where available.
  - Apply Unicode compatibility normalization (NFKC/NFKD) or TR39 “confusables” skeleton mappings, though standard NFC/NFD do not merge hyphen/minus variants.
  - Normalize all hyphen/minus-like characters to the ASCII hyphen-minus (-) before further parsing.
- Others warn that such normalization can erase legitimate semantic differences and that TR39 skeletons were designed for security, not storage/display.
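A minimal Python sketch of the "normalize before parsing" approach follows. Python's built-in `re` module does not support `\p{...}` property classes, so this version uses `unicodedata` instead; treating "category Pd plus MINUS SIGN" as the set of dash-like characters is an assumption made for this example, and `is_dash_like` and `normalize_dashes` are hypothetical names.

```python
import unicodedata

MINUS_SIGN = "\u2212"  # general category Sm, so a Pd test alone would miss it


def is_dash_like(ch: str) -> bool:
    """Assumed definition: dash punctuation (Pd) or MINUS SIGN."""
    return unicodedata.category(ch) == "Pd" or ch == MINUS_SIGN


def normalize_dashes(text: str) -> str:
    """Replace every dash-like character with ASCII hyphen-minus (lossy)."""
    return "".join("-" if is_dash_like(ch) else ch for ch in text)


# Compatibility normalization only folds compatibility variants:
print(unicodedata.normalize("NFKC", "\uff0d") == "-")        # FULLWIDTH HYPHEN-MINUS
print(unicodedata.normalize("NFKC", MINUS_SIGN) == MINUS_SIGN)  # MINUS SIGN is unchanged

print(normalize_dashes("a\u2013b\u2212c\u2014d"))
```

This mapping is lossy in exactly the way the warning describes: category Pd also covers script-specific characters such as the Hebrew maqaf and the Armenian hyphen, so it seems better suited to a parsing step than to text that will be stored or displayed.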
PDFs, spreadsheets, and dirty data
- PDF text extraction is highlighted as especially error‑prone: fonts map to glyph IDs, and reverse mapping often picks obscure but “closest” Unicode characters.
- Real‑world datasets (spreadsheets, financial exports, HTML from Word) routinely contain mixed dashes, smart quotes, non‑breaking spaces, odd bullets, and invisible characters.
- Many argue that robust pipelines inevitably accumulate data‑cleaning rules; some even resort to machine learning for extraction.
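As an illustration of how such cleaning rules tend to accumulate, here is a hedged Python sketch of a per-cell cleaner. The replacement table is a small assumed sample rather than a complete rule set, and `clean_cell` is a hypothetical helper.

```python
import unicodedata

# Assumed sample of replacements; real pipelines usually grow this over time.
_REPLACEMENTS = str.maketrans({
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
    "\u2212": "-",   # MINUS SIGN
})


def clean_cell(text: str) -> str:
    """Apply the replacement table, drop format characters, trim whitespace."""
    text = text.translate(_REPLACEMENTS)
    # Category Cf covers invisible format characters such as ZERO WIDTH SPACE,
    # the byte order mark, and SOFT HYPHEN.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return text.strip()


print(clean_cell("\ufeff\u201cTotal\u201d\u00a0\u2013\u200b 5 "))
```

Dropping every Cf character is a blunt rule: joiners such as ZERO WIDTH JOINER are meaningful in some scripts and in emoji sequences, so a real pipeline would likely need a narrower list.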
Numbers, currency, and error handling
- Debate over using floats for money: some say floats are common for interim calculations but not settlement; others insist on fixed‑point/decimal or smallest‑unit integers.
- Locale issues (decimal vs thousands separators, digit sets, grouping conventions) complicate regex‑based numeric parsing.
- Several recommend strict parsing: match entire strings, reject any unexpected character, and surface errors rather than guessing intent.
- Others prefer more forgiving normalization of all plausible minus signs, accepting that user intent may trump strict Unicode semantics.
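The strict and forgiving positions can be contrasted in a short Python sketch. The accepted format (optional ASCII minus, ASCII digits, optional one or two decimal places after a dot), the two-decimal currency assumption, and all function names are choices made for this example, not something the discussion prescribes.

```python
import re
from decimal import Decimal

# [0-9] rather than \d: in Python str patterns, \d also matches non-ASCII digits.
_AMOUNT = re.compile(r"-?[0-9]+(?:\.[0-9]{1,2})?")

# Assumed set of "plausible minus signs" for the lenient variant.
_MINUS_VARIANTS = str.maketrans({"\u2212": "-", "\u2013": "-", "\u2012": "-"})


def parse_amount_strict(text: str) -> Decimal:
    """Accept only the exact expected format; raise on anything else."""
    if not _AMOUNT.fullmatch(text):
        raise ValueError(f"unexpected amount format: {text!r}")
    return Decimal(text)


def parse_amount_lenient(text: str) -> Decimal:
    """Trim whitespace and map minus look-alikes to '-', then parse strictly."""
    return parse_amount_strict(text.strip().translate(_MINUS_VARIANTS))


def to_minor_units(amount: Decimal) -> int:
    """Convert to integer cents, assuming a currency with two decimal places."""
    return int(amount * 100)


print(parse_amount_lenient(" \u221212.50 "))
print(to_minor_units(parse_amount_strict("-12.50")))
```

Using `fullmatch` means any stray character, including a thousands separator or a Unicode minus, fails loudly in the strict variant instead of being silently dropped; the lenient variant keeps the same check but widens what counts as a minus sign first.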