2024-11-02

Weird Lexical Syntax

Lexing, comments, and Unicode edge cases

Forth: strings come in several forms (.", S", C"), and comment syntax is easy for users to extend, so lexing a program can require executing it.
Java: Unicode escapes are translated before tokenization, so they take effect everywhere, including inside comments. An escaped line terminator such as \u000a ends a line comment, which surprises readers and opens the door to obfuscation and security problems.
C and ML: C comments do not nest; ML and Pascal support nesting, which some find more ergonomic. Others argue nested comments aren’t worth the extra complexity or the backward-compatibility problems.
Git commit messages: the default # comment marker clashes with IDs such as #123 and with Markdown-style messages; workarounds include changing the comment character (core.commentChar) or the cleanup mode (commit.cleanup).
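
The nested-comment trade-off can be made concrete with a small sketch. The function below is a hypothetical illustration, not code from any compiler: it strips /* ... */ comments and tracks nesting depth only when asked to. It ignores string literals, which a real lexer could not do.

```python
def strip_comments(src: str, nested: bool) -> str:
    """Remove /* ... */ comments; honor nesting only when `nested` is True."""
    out = []
    depth = 0  # number of comments currently open
    i = 0
    while i < len(src):
        pair = src[i:i + 2]
        if pair == "/*" and (depth == 0 or nested):
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(src[i])  # keep only text outside comments
            i += 1
    return "".join(out)


code = "a /* outer /* inner */ still outer? */ b"
flat = strip_comments(code, nested=False)  # C-style: first */ closes the comment
deep = strip_comments(code, nested=True)   # ML-style: comments balance
```

In the flat mode the first */ ends the comment and the rest leaks back into the code, which is the failure people hit when commenting out a block that already contains a comment. The nested mode needs the depth counter, which is the extra complexity the other side of the argument objects to.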

Strings, quoting, and interpolation complexity

Many languages (C#, Python, JavaScript, Ruby, Perl, shell, Make, Julia, Scala, SQL and PostgreSQL in particular) have elaborate quoting and interpolation rules: heredocs, dollar-quoting, raw strings, user-chosen delimiters, nested comments in strings, and even whitespace as a Ruby string delimiter.
Interpolation often embeds arbitrary expressions, making lexing non-regular. Some argue this pushes lexers into parser territory (pushdown automata); others show designs where the parser owns the nesting while the lexer just segments “string parts.”
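Python f-strings changed over time: since Python 3.12 (PEP 701), expressions inside an f-string may reuse the enclosing quote character and nest arbitrarily. This broke some highlighters and made the valid syntax version-dependent.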
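
As a rough sketch of why interpolation stops being regular, the toy lexer below keeps a stack of modes and segments input into string parts and code parts. The "${...}" syntax, the token names, and the function name lex are all assumptions made for this example; escapes are not handled.

```python
def lex(src: str) -> list[tuple[str, str]]:
    """Segment src into (kind, text) tokens using a stack of lexer modes."""
    tokens = []
    stack = [["code", 0]]  # each entry: [mode, count of unmatched "{" in that mode]
    buf = ""
    i = 0

    def flush(kind):
        nonlocal buf
        if buf:
            tokens.append((kind, buf))
            buf = ""

    while i < len(src):
        mode = stack[-1]
        ch = src[i]
        if mode[0] == "str":
            if src.startswith("${", i):
                flush("STR_PART")
                tokens.append(("INTERP_START", "${"))
                stack.append(["code", 0])  # expression inside the string
                i += 2
                continue
            if ch == '"':
                flush("STR_PART")
                tokens.append(("STR_END", '"'))
                stack.pop()
            else:
                buf += ch
        elif ch == '"':
            flush("CODE")
            tokens.append(("STR_START", '"'))
            stack.append(["str", 0])
        elif ch == "}" and mode[1] == 0 and len(stack) > 1:
            flush("CODE")
            tokens.append(("INTERP_END", "}"))
            stack.pop()  # back to the enclosing string
        else:
            if ch == "{":
                mode[1] += 1
            elif ch == "}" and mode[1] > 0:
                mode[1] -= 1
            buf += ch
        i += 1
    flush("STR_PART" if stack[-1][0] == "str" else "CODE")
    return tokens


tokens = lex('print("a ${f("b ${x}")} c")')
kinds = [kind for kind, _ in tokens]
```

The stack is what a plain regular expression cannot express: a string may contain an expression that contains another string, to any depth. Whether that stack belongs in the lexer, as here, or in the parser is exactly the design disagreement described above.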

Undecidable or context-sensitive syntax

Perl, Bash, GNU Make, POSIX shell, C++ templates: parsing can depend on run-time state or on type information, so a fully static parse is undecidable in some cases and context-sensitive in others.
Even C can require the lexer to consult the symbol table (the “lexer hack”) to tell typedef names from ordinary identifiers, as in T * x;.
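
A minimal sketch of the lexer hack, assuming a toy subset of C with only typedef, int, identifiers, numbers, and a few punctuation marks. The names tokenize and classify are hypothetical, and splitting on ";" stands in for a real parser.

```python
import re

TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|[*=()])")
KEYWORDS = {"typedef", "int"}


def tokenize(stmt, typedef_names):
    """Split one statement into (kind, text) pairs, consulting the typedef table."""
    tokens = []
    pos = 0
    while pos < len(stmt):
        m = TOKEN_RE.match(stmt, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        text = m.group(1)
        if text in KEYWORDS:
            kind = "KEYWORD"
        elif text in typedef_names:
            kind = "TYPE_NAME"  # feedback from earlier declarations into the lexer
        elif text[0].isalpha() or text[0] == "_":
            kind = "IDENT"
        elif text.isdigit():
            kind = "NUMBER"
        else:
            kind = "PUNCT"
        tokens.append((kind, text))
        pos = m.end()
    return tokens


def classify(src):
    """Label each ';'-terminated statement, updating the typedef table as we go."""
    typedefs = set()
    results = []
    for stmt in src.split(";"):
        stmt = stmt.strip()
        if not stmt:
            continue
        toks = tokenize(stmt, typedefs)
        if toks[0] == ("KEYWORD", "typedef"):
            typedefs.add(toks[-1][1])  # the last token is the newly declared name
            results.append("typedef")
        elif toks[0][0] == "TYPE_NAME" or toks[0] == ("KEYWORD", "int"):
            results.append("declaration")
        else:
            results.append("expression")
    return results


with_typedef = classify("typedef int T; T * x;")
without_typedef = classify("T * x;")
```

The same characters T * x; come out as a pointer declaration in one program and a multiplication in the other, and the only difference is what the symbol table held when the lexer reached T.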

Syntax highlighting strategies and limitations

Strong debate: DFA/regex-style highlighters (like Vim/joe) vs. full parsers or AST-based approaches.
Many argue full parsing is expensive and tricky on invalid code; good highlighters must work on incomplete or broken programs.
Some propose using LLMs or Tree-sitter for highlighting; concerns raised about dependencies (Rust, C++ package managers), coverage breadth, performance, and complexity.
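
To illustrate the regex-style side of the debate, here is a small sketch of a highlighter that never fails on broken input. The token classes and the keyword list are arbitrary choices for the example and do not reflect how Vim or joe define theirs.

```python
import re

TOKEN_SPEC = [
    ("comment", r"#[^\n]*"),
    ("string", r'"(?:\\.|[^"\\\n])*"?'),  # closing quote optional: tolerate unterminated strings
    ("number", r"\d+"),
    ("keyword", r"\b(?:def|return|if|else)\b"),
    ("ident", r"[A-Za-z_]\w*"),
    ("space", r"\s+"),
    ("other", r"."),  # catch-all so every character gets some class
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def highlight(src):
    """Return (class, text) spans covering all of src, valid program or not."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(src)]


broken = 'def f(:\n  return "oops'
spans = highlight(broken)
```

The input above has a malformed parameter list and an unterminated string, yet every character still lands in a span and the keywords are still colored. A parser-based highlighter would need explicit error recovery to do as well on the same text.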

Lisp, TeX, and extensible syntax

Lisp reader macros and TeX (and some Forths) can reprogram their own lexers, making full highlighting or static lexing impossible in general.
Discussion on Lisp’s “simple” syntax: some see it as reducing surface complexity; others note that macros and special forms reintroduce substantial syntactic/semantic complexity at a different layer.
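
The point about languages that reprogram their own lexers can be sketched with a toy stack language. This is not real Forth or Lisp: the defining word COMMENT: and the function run are invented for this example, and words are simply whitespace-separated.

```python
def run(src):
    """Interpret a toy stack language whose comment syntax can change at run time."""
    stack = []
    comment_words = {"(": ")"}  # opening word -> closing word
    words = iter(src.split())
    for word in words:
        if word in comment_words:
            closer = comment_words[word]
            for skipped in words:  # consume input up to the closing word
                if skipped == closer:
                    break
        elif word == "COMMENT:":  # hypothetical defining word: COMMENT: <open> <close>
            opener = next(words)
            closer = next(words)
            comment_words[opener] = closer
        elif word == "+":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        else:
            stack.append(int(word))  # anything else must be an integer literal
    return stack


builtin_comment = run("1 ( 2 ) 3 +")
user_comment = run("COMMENT: [[ ]] 1 [[ 99 + ]] 2 +")
```

Whether [[ 99 + ]] is a comment depends on a definition that ran earlier in the same program, so a highlighter that only looks at the text cannot know without executing it. Without the COMMENT: line the same words are an error in this toy, since [[ is not an integer.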
