2024-06-11

Scan HTML faster with SIMD instructions – Chrome edition

SIMD and HTML Parsing

Many are impressed that seemingly serial tasks like HTML or JSON parsing can gain large speedups from SIMD by isolating sub-tasks without strict byte-to-byte dependencies.
The specific Chrome optimization is seen as only one small piece of full HTML parsing, but still a useful, high-leverage hotspot.
Techniques discussed include:
- Nibble-based lookup tables where target bytes share unique low bits.
- Bitset-based membership tests via two lookups and shifts.
- Using broadcast-from-memory and shuffles efficiently (e.g., PSHUFB).
There is debate about how far such tricks generalize, and mention of prior SIMD-parsing work (e.g., Parabix, simdjson).

Parallel Parsing and Language Design

Some argue HTML is “not fundamentally serial” and could be subdivided, though HTML’s lenient, context-sensitive rules make large-scale parallelism mostly an engineering challenge.
Others mention GPU- or array-language–style parsing and suggest grammars with fewer rules to make parallelization easier.
Ideas like bidirectional parsing (splitting the input and parsing forward/backward) are discussed as limited ~2× gains compared to broader SIMD approaches.

Rust, C++, and C# Debate

One side sees Rust’s memory safety as not worth the complexity, slow compilation, and difficulty of implementing pointer-heavy or highly mutable data structures.
Others counter that:
- Rust data structures can favor indices/arrays over pointers, which can be faster and more compact.
- Libraries like Rayon make parallel mutation patterns (e.g., matrix blocks) straightforward.
There is a parallel thread praising C#:
- Strong memory safety with almost no unsafe in user data structures.
- Good SIMD APIs in .NET and growing use of spans to avoid allocations.
GC tradeoffs are debated: some see .NET as viable even for realtime-ish media and games; others warn about allocation-heavy libraries (e.g., LINQ, YAML parsers).

Text vs Binary Formats

Some criticize HTML/JSON as human-oriented encodings that waste compute on serialization/deserialization, arguing for zero-copy binary formats.
Others defend text formats:
- Human readability, debuggability, and rapid iteration outweigh raw efficiency.
- JSON compresses well, is often not the bottleneck, and internal systems already use binary (protobuf/Thrift/Avro).
Benchmarks around JSON vs protobuf performance are contested; participants note parser quality and environment (e.g., JavaScript’s native JSON.parse) can dominate.

Practical Impact and Tooling

Even a 1% HTML parsing win in Chromium is framed as a large global energy and latency benefit.
Faster parsing could reduce the need for extra caching layers or heavy DOM structures.
Some see LLMs as helpful for understanding SIMD-heavy code; others warn they are unreliable, especially for nontrivial shuffle logic.

Miscellaneous Points

Clarification that CR (\r) detection is used for normalizing CRLF to LF in HTML parsing.
Questions arise about lack of modern C/C++ HTML parsers.
Some wonder if compilers could auto-vectorize such loops; others imply current compilers are not yet good enough at these patterns.

Related topics