Weird Lexical Syntax
Lexing, comments, and Unicode edge cases
- Forth: multiple string forms (.", S", C") and easily extensible comment syntaxes show that you often must execute code to lex it.
- Java: Unicode escapes are processed everywhere, even in comments, causing surprises and potential security/obfuscation risks (e.g., terminating comments via escapes).
- C and ML: C lacks nested comments; ML/Pascal support them, which some find more ergonomic. Others argue nested comments aren’t worth the extra complexity or backward-compat issues.
- Git commit messages: # as the comment marker clashes with issue IDs and Markdown-style messages; workarounds include changing the comment character (core.commentChar) or the cleanup mode (--cleanup).
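The Java point can be made concrete with a small sketch in Python. It only approximates the escape-translation step Java applies before tokenizing; translate_unicode_escapes and strip_line_comments are hypothetical helpers written for this illustration, and the comment stripper is deliberately naive (it ignores string literals and block comments).

```python
import re
from string import hexdigits


def translate_unicode_escapes(src: str) -> str:
    """Replace \\uXXXX escapes the way a Java-like front end would,
    before any tokenizing happens (simplified sketch)."""
    out = []
    i, n = 0, len(src)
    backslashes = 0  # consecutive backslashes immediately before position i
    while i < n:
        ch = src[i]
        # Only a backslash preceded by an even number of backslashes is eligible.
        if ch == "\\" and backslashes % 2 == 0:
            j = i + 1
            while j < n and src[j] == "u":  # one or more 'u' characters allowed
                j += 1
            digits = src[j:j + 4]
            if j > i + 1 and len(digits) == 4 and all(c in hexdigits for c in digits):
                out.append(chr(int(digits, 16)))
                i = j + 4
                backslashes = 0
                continue
        backslashes = backslashes + 1 if ch == "\\" else 0
        out.append(ch)
        i += 1
    return "".join(out)


def strip_line_comments(src: str) -> str:
    # Naive on purpose: does not understand strings or /* */ comments.
    return re.sub(r"//[^\n]*", "", src)


# The escape becomes a real newline, so the "comment" ends early.
source = r"// just a note \u000a System.exit(1);"
translated = translate_unicode_escapes(source)
live_code = strip_line_comments(translated).strip()
print(live_code)  # prints: System.exit(1);
```

A highlighter that looks for comments before translating escapes would color the whole line as a comment, which is the mismatch the discussion warns about.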
Strings, quoting, and interpolation complexity
- Many languages (C#, Python, JS, Ruby, shell, Make, Julia, Scala, SQL/PostgreSQL, Perl, etc.) have elaborate interpolation and quoting rules: heredocs, dollar-quoting, raw strings, user-chosen delimiters, nested comments in strings, and even whitespace as a Ruby string delimiter.
- Interpolation often embeds arbitrary expressions, making lexing non-regular. Some argue this pushes lexers into parser territory (pushdown automata); others show designs where the parser owns the nesting while the lexer just segments “string parts.”
- Python f-strings evolved (nesting, quotes inside expressions), breaking some highlighters and making behavior version-dependent.
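As a rough illustration of why interpolation is not regular, the sketch below segments a string into literal and expression parts using a brace-depth counter, which a plain regular expression cannot express. The ${...} mini-syntax and the segment function are assumptions made for this example, not any particular language's rules, and quotes or comments inside the expression are deliberately not handled.

```python
def segment(template: str) -> list[tuple[str, str]]:
    """Split a template into ("lit", text) and ("expr", text) parts.

    Hypothetical ${...} syntax; nested braces are tracked with a counter.
    """
    parts = []
    literal = []
    i, n = 0, len(template)
    while i < n:
        if template.startswith("${", i):
            if literal:
                parts.append(("lit", "".join(literal)))
                literal = []
            depth = 1
            j = i + 2
            # Scan to the matching close brace, counting nested braces.
            while j < n and depth > 0:
                if template[j] == "{":
                    depth += 1
                elif template[j] == "}":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("unterminated interpolation")
            parts.append(("expr", template[i + 2:j - 1]))
            i = j
        else:
            literal.append(template[i])
            i += 1
    if literal:
        parts.append(("lit", "".join(literal)))
    return parts


print(segment("sum: ${f({1: 2}[1])}!"))
# [('lit', 'sum: '), ('expr', 'f({1: 2}[1])'), ('lit', '!')]
```

In the design mentioned in the discussion, the lexer would stop at this segmentation and hand each expression part to the parser, which owns the real nesting.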
Undecidable or context-sensitive syntax
- Perl, Bash, GNU Make, POSIX shell, C++ templates: parsing can depend on runtime info or type information, making full static parsing undecidable or context-sensitive.
- Even C lexing can require symbol tables (lexer hack) if you want to distinguish certain constructs precisely.
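A minimal sketch of the lexer hack, written in Python for illustration: the tokenizer consults a set of known typedef names, so the same characters produce different token kinds. The tokenize function, the token kind names, and the tiny token set are assumptions made for this example, not how any specific C compiler is structured.

```python
import re

# Identifiers plus a handful of punctuation marks; enough for the example.
TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<punct>[()*;=]))")


def tokenize(src: str, typedef_names: set[str]) -> list[tuple[str, str]]:
    """Classify identifiers as TYPE_NAME or IDENT using a symbol table."""
    src = src.rstrip()
    tokens = []
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        name = m.group("ident")
        if name is not None:
            kind = "TYPE_NAME" if name in typedef_names else "IDENT"
            tokens.append((kind, name))
        else:
            tokens.append(("PUNCT", m.group("punct")))
        pos = m.end()
    return tokens


# With T declared as a typedef, "T * x;" reads as a pointer declaration.
print(tokenize("T * x;", {"T"}))
# Without it, the same text reads as a multiplication expression.
print(tokenize("T * x;", set()))
```

In a real front end the parser would update the typedef set as it processes declarations, which is why the lexer cannot run fully ahead of the parser.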
Syntax highlighting strategies and limitations
- Strong debate: DFA/regex-style highlighters (like Vim/joe) vs. full parsers or AST-based approaches.
- Many argue full parsing is expensive and tricky on invalid code; good highlighters must work on incomplete or broken programs.
- Some propose using LLMs or Tree-sitter for highlighting; concerns raised about dependencies (Rust, C++ package managers), coverage breadth, performance, and complexity.
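To show what the regex-style camp means by working on broken programs, here is a small tolerant highlighter sketch in Python. The token classes, the keyword list, and the highlight function are invented for this example and do not reflect how Vim or joe are implemented; the point is only that every character gets some classification and nothing raises on invalid input.

```python
import re

TOKEN_RE = re.compile(
    r"""
    (?P<comment>\#[^\n]*)                   # comment to end of line
  | (?P<string>"(?:\\.|[^"\\\n])*"?)        # closing quote is optional
  | (?P<number>\d+)
  | (?P<keyword>\b(?:def|return|if|else)\b)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<space>\s+)
  | (?P<other>.)                            # fallback: any single character
    """,
    re.VERBOSE,
)


def highlight(src: str) -> list[tuple[str, str]]:
    """Return (kind, text) spans covering the whole input, valid or not."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(src)]


# Unbalanced parenthesis and an unterminated string still get colored.
for kind, text in highlight('def f(: return "oops'):
    print(f"{kind:8} {text!r}")
```

The trade-off raised in the discussion is visible here: the scanner never fails, but it has no idea that the parenthesis is unbalanced or what the identifier refers to.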
Lisp, TeX, and extensible syntax
- Lisp reader macros and TeX (and some Forths) can reprogram their own lexers, making full highlighting or static lexing impossible in general.
- Discussion on Lisp’s “simple” syntax: some see it as reducing surface complexity; others note that macros and special forms reintroduce substantial syntactic/semantic complexity at a different layer.
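The self-reprogramming point can be sketched with a toy Forth-like reader in Python. Everything here is hypothetical, including the DEFINE-LINE-COMMENT directive and the read_words function; real Forth and Lisp mechanisms are richer, but the effect is the same: the meaning of later text depends on having executed earlier text.

```python
def read_words(src: str) -> list[str]:
    """Read whitespace-separated words, letting the source extend the reader."""
    n = len(src)

    def skip_line(pos: int) -> int:
        end = src.find("\n", pos)
        return n if end == -1 else end

    def next_word(pos: int) -> tuple[str, int]:
        while pos < n and src[pos].isspace():
            pos += 1
        start = pos
        while pos < n and not src[pos].isspace():
            pos += 1
        return src[start:pos], pos

    # Words that take over the input stream; "\" is the built-in line comment.
    handlers = {"\\": skip_line}
    words = []
    pos = 0
    while True:
        word, pos = next_word(pos)
        if not word:
            break
        if word == "DEFINE-LINE-COMMENT":
            # Hypothetical directive: the next word becomes a comment starter.
            name, pos = next_word(pos)
            if name:
                handlers[name] = skip_line
        elif word in handlers:
            pos = handlers[word](pos)
        else:
            words.append(word)
    return words


program = "1 2 + // not yet a comment\nDEFINE-LINE-COMMENT //\n3 4 // now it is\n5"
print(read_words(program))
```

The first // is an ordinary word and the second starts a comment, so a highlighter that does not run the directive cannot classify both occurrences correctly.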