Weird Lexical Syntax

Lexing, comments, and Unicode edge cases

  • Forth: multiple string forms (.", S", C") and easily extensible comment syntaxes show that you often have to execute code in order to lex it.
  • Java: Unicode escapes are translated before tokenization, so they take effect everywhere, even in comments, causing surprises and potential security/obfuscation risks (e.g., an escaped line terminator that ends a line comment early).
  • C and ML: C lacks nested comments; ML/Pascal support them, which some find more ergonomic. Others argue nested comments aren’t worth the extra complexity or the backward-compatibility problems.
  • Git commit messages: # as the comment marker clashes with issue IDs (e.g. #123) and Markdown-style messages; workarounds include changing the comment character (core.commentChar) or the cleanup mode.
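
The Java point can be illustrated with a small Python model. This is a simplified sketch, not the real Java front end: the helper names are hypothetical, the comment stripper ignores string literals, and the rule about an even number of preceding backslashes is left out.

```python
import re

# Simplified model of Java's early translation step: a backslash, one or more
# 'u' characters and four hex digits become the character they name, before
# comments or tokens are recognized.
# Assumption: the "even number of preceding backslashes" rule is ignored.
_UNICODE_ESCAPE = re.compile(r"\\u+([0-9a-fA-F]{4})")

def translate_unicode_escapes(source: str) -> str:
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)

def strip_line_comments(source: str) -> str:
    """Naive: drop everything from // to end of line (ignores string literals)."""
    return "\n".join(line.split("//", 1)[0].rstrip() for line in source.split("\n"))

# Looks like a single comment line to a reader who does not expand escapes.
java_like = r"// harmless note \u000a System.exit(1);"

translated = translate_unicode_escapes(java_like)
visible_code = strip_line_comments(translated)
```

Without the translation step the whole line is a comment; with it, \u000a becomes a newline and `System.exit(1);` survives in `visible_code` as live code.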

Strings, quoting, and interpolation complexity

  • Many languages (C#, Python, JS, Ruby, shell, Make, Julia, Scala, SQL/PostgreSQL, Perl, etc.) have elaborate interpolation and quoting: heredocs, dollar-quoting, raw strings, user-chosen delimiters, nested comments in strings, and even whitespace as a Ruby string delimiter.
  • Interpolation often embeds arbitrary expressions, making lexing non-regular. Some argue this pushes lexers into parser territory (pushdown automata); others show designs where the parser owns the nesting while the lexer just segments “string parts.”
  • Python f-strings evolved (Python 3.12 / PEP 701 allows arbitrary nesting and reuse of the enclosing quotes inside expressions), breaking some highlighters and making behavior version-dependent.
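
A minimal sketch of the "lexer just segments string parts" design, assuming a hypothetical `${ ... }` interpolation syntax. The brace counting is the part a plain regular expression cannot express for arbitrary nesting. String literals inside the embedded expression are deliberately not handled here, and the function name is illustrative only.

```python
def split_interpolated(body: str):
    """Split a string literal's body into ('text', s) and ('expr', s) parts.

    Hypothetical syntax: ${ ... } marks an embedded expression. The expression
    text is returned unparsed; a parser would handle it in a second step.
    """
    parts = []
    text_start = 0
    i = 0
    while i < len(body):
        if body.startswith("${", i):
            if i > text_start:
                parts.append(("text", body[text_start:i]))
            depth = 1              # we are inside one open brace
            j = i + 2
            while j < len(body) and depth > 0:
                if body[j] == "{":
                    depth += 1
                elif body[j] == "}":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("unterminated ${ ... } expression")
            parts.append(("expr", body[i + 2 : j - 1]))
            i = text_start = j
        else:
            i += 1
    if text_start < len(body):
        parts.append(("text", body[text_start:]))
    return parts

parts = split_interpolated("a ${ f({x: 1}) } b")
# parts == [("text", "a "), ("expr", " f({x: 1}) "), ("text", " b")]
```

Handing the raw expression text to the parser keeps the segmenter small, at the cost of scanning the expression twice.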

Undecidable or context-sensitive syntax

  • Perl, Bash, GNU Make, POSIX shell, C++ templates: parsing can depend on runtime info or type information, making full static parsing undecidable or context-sensitive.
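  • Even C lexing can require symbol tables (the "lexer hack") if you want to distinguish typedef names from ordinary identifiers: `T * x;` is a pointer declaration if T names a type and a multiplication otherwise.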
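
The lexer hack can be sketched as a tokenizer that consults a set of typedef names maintained by the parser. This is a toy under stated assumptions: the token kinds and the `lex` function are hypothetical, and anything that is not an identifier (including digits) is emitted as a one-character PUNCT token.

```python
import re

# Group 1: an identifier. Group 2: any other single non-space character.
TOKEN = re.compile(r"\s*(?:([A-Za-z_]\w*)|(\S))")

def lex(source: str, typedef_names: set):
    """Tag identifiers as TYPE_NAME or IDENT using a symbol table that a real
    compiler's parser would update as it processes typedef declarations."""
    tokens = []
    pos = 0
    while True:
        m = TOKEN.match(source, pos)
        if m is None:          # only whitespace (or nothing) is left
            break
        pos = m.end()
        if m.group(1) is not None:
            word = m.group(1)
            kind = "TYPE_NAME" if word in typedef_names else "IDENT"
            tokens.append((kind, word))
        else:
            tokens.append(("PUNCT", m.group(2)))
    return tokens

as_expression = lex("T * x;", set())     # T is an ordinary identifier
as_declaration = lex("T * x;", {"T"})    # T was declared with typedef
```

The same four characters of input produce different token streams, which is why the lexer cannot be run in isolation from the parser's state.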

Syntax highlighting strategies and limitations

  • Strong debate: DFA/regex-style highlighters (like Vim/joe) vs. full parsers or AST-based approaches.
  • Many argue full parsing is expensive and tricky on invalid code; good highlighters must work on incomplete or broken programs.
  • Some propose using LLMs or Tree-sitter for highlighting; concerns raised about dependencies (Rust, C++ package managers), coverage breadth, performance, and complexity.
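
As a rough illustration of the regex-style side of this debate, here is a small highlighter that never rejects its input. The token classes and keyword set are assumptions for a made-up C-like language; this is not how Vim or joe are implemented.

```python
import re

# Hypothetical token classes; alternatives are tried left to right.
RULES = re.compile(
    r'(?P<comment>//[^\n]*)'
    r'|(?P<string>"(?:\\.|[^"\\\n])*"?)'   # closing quote optional: tolerate unterminated strings
    r'|(?P<number>\d+)'
    r'|(?P<word>[A-Za-z_]\w*)'
    r'|(?P<other>\s+|.)',                  # fallback so every character is classified
    re.DOTALL,
)
KEYWORDS = {"if", "else", "while", "return"}

def highlight(source: str):
    """Return (kind, text) spans covering the whole input, valid or not."""
    spans = []
    for m in RULES.finditer(source):
        kind = m.lastgroup
        text = m.group()
        if kind == "word":
            kind = "keyword" if text in KEYWORDS else "ident"
        spans.append((kind, text))
    return spans

# Unbalanced parenthesis and an unterminated string: still highlighted.
broken = 'if (x { return "oops\n}'
spans = highlight(broken)
```

Because of the fallback alternative, concatenating the span texts always reproduces the input, so a half-typed program degrades gracefully instead of losing all coloring.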

Lisp, TeX, and extensible syntax

  • Lisp reader macros and TeX (and some Forths) can reprogram their own lexers, making full highlighting or static lexing impossible in general.
  • Discussion on Lisp’s “simple” syntax: some see it as reducing surface complexity; others note that macros and special forms reintroduce substantial syntactic/semantic complexity at a different layer.
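
The "reprogram your own lexer" problem can be shown with a toy Forth-style reader. Everything here is a sketch: the `comment-word:` directive is invented for illustration and is not real Forth, though the backslash line comment is.

```python
def read_words(source: str):
    """Toy reader: words are whitespace-delimited, and some words are
    'parsing words' that consume the input that follows them."""
    pos = 0
    words = []
    line_comment_words = {"\\"}      # Forth's backslash comment word

    def next_word():
        nonlocal pos
        while pos < len(source) and source[pos].isspace():
            pos += 1
        start = pos
        while pos < len(source) and not source[pos].isspace():
            pos += 1
        return source[start:pos]

    def skip_line():
        nonlocal pos
        end = source.find("\n", pos)
        pos = len(source) if end == -1 else end + 1

    while True:
        word = next_word()
        if not word:
            break
        if word in line_comment_words:
            skip_line()
        elif word == "comment-word:":            # hypothetical directive
            line_comment_words.add(next_word())  # the input extends the reader
        else:
            words.append(word)
    return words

program = "1 2 + \\ add\ncomment-word: REM\nREM this is now a comment\n3 *"
words = read_words(program)
# words == ["1", "2", "+", "3", "*"]
```

Whether `REM` starts a comment depends on text that appeared earlier in the same input, so a highlighter that does not run the directive cannot know how to classify the rest of the line.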