2024-06-20

Wc2: Investigates optimizing 'wc', the Unix word count program

State machines, types, and correctness

Many praise finite-state machines (FSMs) as a clean way to express parsers and avoid “partially initialized object” bugs.
Others argue that sum types / discriminated unions (e.g., Rust enums) are a better way to “make invalid states unrepresentable” than hand-written FSMs.
Discussion compares Go (always has a default, often unwanted) vs C++ vs Rust (opt‑in default, move semantics, Result instead of exceptions).
Some note that coroutines can hide an explicit state machine inside compiler-generated code.

How wc2’s approach differs

Core benefit seen as: one pass, fixed work per byte, using a DFA table instead of branches and heavyweight multibyte / locale routines like mbrtowc + iswspace.
Table encodes states like “in word”, “between words”, “after newline”; transitions depend on character class (word, space, newline).
This yields throughput that mostly depends only on input size, not content mix, unlike implementations with complex branching and library calls.
Several want better documentation of the state tables and generation process; existing comments are seen as opaque and “magic-number-like”.

SIMD, GPUs, and performance limits

Multiple comments highlight SIMD wc implementations that achieve 4–10× speedups over byte-at-a-time code on ASCII, sometimes disk- or SSD-limited.
Unicode-aware SIMD is seen as much harder; ASCII-only is called “cheating” by some.
One GPU prototype shows modest gains; others doubt GPU can beat a good SIMD CPU version once data transfer is considered.

“Asynchronous state-machine” terminology

Several point out the code is just a DFA over a byte stream, not truly asynchronous I/O.
Some argue “asynchronous” was misused; others interpret it loosely as “streaming, no full buffering”. Overall, meaning is considered unclear.

Unicode, locales, and POSIX compliance

wc2’s speedup is partly attributed to assuming UTF‑8 and bypassing locale‑aware C APIs; critics note this breaks correctness in non‑UTF‑8 locales and is not a drop-in POSIX wc.
Others counter that modern tools restricted to ASCII/UTF‑8 are acceptable in many domains, but agree that’s incompatible with being a full wc replacement.

Comparisons to existing wc implementations

Benchmarks show BSD wc can already be very fast; gains from wc2 are input- and option-dependent (e.g., line-only vs full counts).
Historical and alternative wc implementations (Unix, Plan 9, BusyBox, BSD, GNU, C++ versions) are discussed in terms of complexity, POSIX options, and readability.
Several note that unless these optimizations land in coreutils/BSD, most shell users will never benefit.

Related topics