Scan HTML faster with SIMD instructions – Chrome edition
SIMD and HTML Parsing
- Many are impressed that seemingly serial tasks like HTML or JSON parsing can gain large speedups from SIMD by isolating sub-tasks without strict byte-to-byte dependencies.
- The specific Chrome optimization is seen as only one small piece of full HTML parsing, but still a useful, high-leverage hotspot.
- Techniques discussed include:
- Nibble-based lookup tables where target bytes share unique low bits.
- Bitset-based membership tests via two lookups and shifts.
- Using broadcast-from-memory and shuffles efficiently (e.g., PSHUFB).
- There is debate about how far such tricks generalize, and mention of prior SIMD-parsing work (e.g., Parabix, simdjson).
Parallel Parsing and Language Design
- Some argue HTML is “not fundamentally serial” and could be subdivided, though HTML’s lenient, context-sensitive rules make large-scale parallelism mostly an engineering challenge.
- Others mention GPU- or array-language–style parsing and suggest grammars with fewer rules to make parallelization easier.
- Ideas like bidirectional parsing (splitting the input and parsing forward/backward) are discussed as limited ~2× gains compared to broader SIMD approaches.
Rust, C++, and C# Debate
- One side sees Rust’s memory safety as not worth the complexity, slow compilation, and difficulty of implementing pointer-heavy or highly mutable data structures.
- Others counter that:
- Rust data structures can favor indices/arrays over pointers, which can be faster and more compact.
- Libraries like Rayon make parallel mutation patterns (e.g., matrix blocks) straightforward.
- There is a parallel thread praising C#:
- Strong memory safety with almost no unsafe in user data structures.
- Good SIMD APIs in .NET and growing use of spans to avoid allocations.
- GC tradeoffs are debated: some see .NET as viable even for realtime-ish media and games; others warn about allocation-heavy libraries (e.g., LINQ, YAML parsers).
Text vs Binary Formats
- Some criticize HTML/JSON as human-oriented encodings that waste compute on serialization/deserialization, arguing for zero-copy binary formats.
- Others defend text formats:
- Human readability, debuggability, and rapid iteration outweigh raw efficiency.
- JSON compresses well, is often not the bottleneck, and internal systems already use binary (protobuf/Thrift/Avro).
- Benchmarks around JSON vs protobuf performance are contested; participants note parser quality and environment (e.g., JavaScript’s native JSON.parse) can dominate.
Practical Impact and Tooling
- Even a 1% HTML parsing win in Chromium is framed as a large global energy and latency benefit.
- Faster parsing could reduce the need for extra caching layers or heavy DOM structures.
- Some see LLMs as helpful for understanding SIMD-heavy code; others warn they are unreliable, especially for nontrivial shuffle logic.
Miscellaneous Points
- Clarification that CR (
\r) detection is used for normalizing CRLF to LF in HTML parsing. - Questions arise about lack of modern C/C++ HTML parsers.
- Some wonder if compilers could auto-vectorize such loops; others imply current compilers are not yet good enough at these patterns.