Scan HTML faster with SIMD instructions – Chrome edition

SIMD and HTML Parsing

  • Many are impressed that seemingly serial tasks like HTML or JSON parsing can gain large speedups from SIMD by isolating sub-tasks without strict byte-to-byte dependencies.
  • The specific Chrome optimization is seen as only one small piece of full HTML parsing, but still a useful, high-leverage hotspot.
  • Techniques discussed include:
    • Nibble-based lookup tables where target bytes share unique low bits.
    • Bitset-based membership tests via two lookups and shifts.
    • Using broadcast-from-memory and shuffles efficiently (e.g., PSHUFB).
  • There is debate about how far such tricks generalize, and mention of prior SIMD-parsing work (e.g., Parabix, simdjson).

Parallel Parsing and Language Design

  • Some argue HTML is “not fundamentally serial” and could be subdivided, though HTML’s lenient, context-sensitive rules make large-scale parallelism mostly an engineering challenge.
  • Others mention GPU- or array-language–style parsing and suggest grammars with fewer rules to make parallelization easier.
  • Ideas like bidirectional parsing (splitting the input and parsing forward/backward) are discussed as limited ~2× gains compared to broader SIMD approaches.

Rust, C++, and C# Debate

  • One side sees Rust’s memory safety as not worth the complexity, slow compilation, and difficulty of implementing pointer-heavy or highly mutable data structures.
  • Others counter that:
    • Rust data structures can favor indices/arrays over pointers, which can be faster and more compact.
    • Libraries like Rayon make parallel mutation patterns (e.g., matrix blocks) straightforward.
  • There is a parallel thread praising C#:
    • Strong memory safety with almost no unsafe in user data structures.
    • Good SIMD APIs in .NET and growing use of spans to avoid allocations.
  • GC tradeoffs are debated: some see .NET as viable even for realtime-ish media and games; others warn about allocation-heavy libraries (e.g., LINQ, YAML parsers).

Text vs Binary Formats

  • Some criticize HTML/JSON as human-oriented encodings that waste compute on serialization/deserialization, arguing for zero-copy binary formats.
  • Others defend text formats:
    • Human readability, debuggability, and rapid iteration outweigh raw efficiency.
    • JSON compresses well, is often not the bottleneck, and internal systems already use binary (protobuf/Thrift/Avro).
  • Benchmarks around JSON vs protobuf performance are contested; participants note parser quality and environment (e.g., JavaScript’s native JSON.parse) can dominate.

Practical Impact and Tooling

  • Even a 1% HTML parsing win in Chromium is framed as a large global energy and latency benefit.
  • Faster parsing could reduce the need for extra caching layers or heavy DOM structures.
  • Some see LLMs as helpful for understanding SIMD-heavy code; others warn they are unreliable, especially for nontrivial shuffle logic.

Miscellaneous Points

  • Clarification that CR (\r) detection is used for normalizing CRLF to LF in HTML parsing.
  • Questions arise about lack of modern C/C++ HTML parsers.
  • Some wonder if compilers could auto-vectorize such loops; others imply current compilers are not yet good enough at these patterns.