2024-07-28

tolower() with AVX-512

ASCII vs Unicode case handling

Many comments stress the article is about ASCII-only lowercasing, which is common in protocols (DNS, some language runtimes) and far simpler than full Unicode case folding.
Several examples show Unicode complexity: German ß vs ẞ, length-changing uppercasing (“straße”→“STRASSE”), Turkish dotted/dotless i, and case conversions that cannot be round-tripped because the mapping is not invertible.
There is disagreement over the introduction and real-world usefulness of capital ß; some see it as confusing and historically weak, others as a welcome addition now recommended in some style guides.
People note that changing Unicode libraries or specs can alter language semantics over time, especially for case-insensitive identifiers.
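To make the ASCII-only case concrete, here is a minimal sketch in C. The names `ascii_tolower` and `ascii_tolower_buf` are illustrative and do not come from the article or the thread.

```c
#include <stddef.h>

/* Lowercase one byte using ASCII rules only. Bytes outside 'A'..'Z',
   including every byte with the high bit set (for example the bytes of
   UTF-8 encoded ß), pass through unchanged. */
static unsigned char ascii_tolower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

/* Lowercase len bytes from src into dst. The output length always
   equals the input length, unlike full Unicode case mapping. */
static void ascii_tolower_buf(char *dst, const char *src, size_t len) {
    for (size_t i = 0; i < len; i++)
        dst[i] = (char)ascii_tolower((unsigned char)src[i]);
}
```

Because the mapping is one byte in, one byte out, with no locale or table lookup, it is the kind of loop that vectorizes easily, which is not true of the Unicode cases listed above.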

DNS and case-insensitivity tricks

DNS names are ASCII-only on the wire but case-preserving and case-insensitive.
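A technique (“DNS-0x20”) randomizes the case of letters in a query name to add entropy against spoofing. Servers that preserve case echo the name back exactly as sent, so the resolver can reject any response whose case pattern does not match; an off-path attacker must then guess up to one extra bit per letter, which raises the attack cost dramatically.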
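The idea can be sketched in C as follows. This is an illustration only: the function names are made up, and the `bits` parameter stands in for output from a real cryptographic random number generator, which a resolver would need in practice.

```c
#include <stddef.h>
#include <stdint.h>

/* Set the case of each ASCII letter in name[0..len) from successive
   low bits of `bits` (1 = upper, 0 = lower). Non-letters are skipped
   and consume no bits. After 64 letters the bits run out and the
   remaining letters become lowercase (a limitation of this sketch). */
static void dns0x20_encode(char *name, size_t len, uint64_t bits) {
    for (size_t i = 0; i < len; i++) {
        unsigned char lower = (unsigned char)name[i] | 0x20;
        if (lower >= 'a' && lower <= 'z') {
            name[i] = (char)((bits & 1) ? (lower & ~0x20) : lower);
            bits >>= 1;
        }
    }
}

/* Resolver-side check: the echoed name must match byte for byte. */
static int dns0x20_matches(const char *sent, const char *echoed, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (sent[i] != echoed[i]) return 0;
    return 1;
}

/* Ordinary DNS comparison: ASCII letters compare case-insensitively. */
static int dns_name_equal(const char *a, const char *b, size_t len) {
    for (size_t i = 0; i < len; i++) {
        unsigned char x = (unsigned char)a[i], y = (unsigned char)b[i];
        if (x >= 'A' && x <= 'Z') x |= 0x20;
        if (y >= 'A' && y <= 'Z') y |= 0x20;
        if (x != y) return 0;
    }
    return 1;
}
```

The two comparison functions show the split the technique relies on: the name still resolves under the case-insensitive rule, while the exact-match check carries the extra entropy.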

AVX-512 masking, tails, and performance

The main praise for AVX-512 goes to masked loads and stores, which give smooth performance on short strings and on lengths that are not a multiple of the vector width, without branches or scalar tails.
Several commenters compare compiler-autovectorized loops with hand-written intrinsics: autovectorized code can be good for long loops but often handles tails poorly (e.g., large scalar cleanup loops), which makes throughput spiky as the string length varies.
Some detailed microarchitectural discussion (Zen 4, Ice Lake) suggests masking is effectively “free” versus scalar tails, especially for small strings and misaligned buffers.
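A masked-tail loop of the kind being praised might look like the sketch below. It is not the article's code. It assumes AVX-512BW and falls back to a scalar loop when the compiler is not targeting that extension, so the sketch still builds elsewhere; the name `tolower_masked` is illustrative.

```c
#include <stddef.h>
#if defined(__AVX512BW__)
#include <immintrin.h>
#endif

static void tolower_masked(char *dst, const char *src, size_t len) {
#if defined(__AVX512BW__)
    const __m512i base = _mm512_set1_epi8('A');
    const __m512i span = _mm512_set1_epi8(26);   /* 'Z' - 'A' + 1 */
    const __m512i bit  = _mm512_set1_epi8(0x20);
    while (len > 0) {
        size_t n = len < 64 ? len : 64;
        /* One mask bit per live byte; the final chunk simply gets a
           shorter mask instead of a separate scalar tail loop. */
        __mmask64 live = (n == 64) ? ~(__mmask64)0
                                   : (((__mmask64)1 << n) - 1);
        /* Masked-off bytes are not read (they load as zero). */
        __m512i v = _mm512_maskz_loadu_epi8(live, src);
        /* Unsigned (v - 'A') < 26 is true exactly for 'A'..'Z'. */
        __mmask64 upper =
            _mm512_cmplt_epu8_mask(_mm512_sub_epi8(v, base), span);
        __m512i lowered = _mm512_mask_add_epi8(v, upper, v, bit);
        /* Masked-off bytes are not written. */
        _mm512_mask_storeu_epi8(dst, live, lowered);
        src += n; dst += n; len -= n;
    }
#else
    /* Portable fallback, present only so the sketch builds everywhere. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        dst[i] = (char)((c >= 'A' && c <= 'Z') ? (c | 0x20) : c);
    }
#endif
}
```

The same loop body handles a 3-byte string and the last partial chunk of a long one, which is the property the thread credits for the smooth throughput.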

Compilers, intrinsics, and SWAR

Clang vs GCC differences are highlighted: Clang often rewrites intrinsics into more complex sequences; sometimes better, sometimes noticeably worse.
There is frustration that there’s no “don’t second-guess my intrinsics” mode. Some projects ended up maintaining inline assembly for critical paths.
SWAR (“SIMD within a register”) tricks are mentioned but noted as often alignment-sensitive and not always faster once you add prologue/epilogue code.
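For readers unfamiliar with SWAR, here is a minimal sketch of ASCII lowercasing eight bytes at a time in a 64-bit integer. The function names are illustrative. It uses `memcpy` for unaligned access and a scalar loop for the tail, which is the sort of epilogue the comments say can eat into the gains.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lowercase the ASCII letters in eight bytes packed into one word. */
static uint64_t swar_tolower8(uint64_t w) {
    const uint64_t ones = 0x0101010101010101ULL;
    /* Clear each byte's top bit so the adds below cannot carry
       from one byte into the next. */
    uint64_t ascii = w & (ones * 0x7f);
    /* Top bit of each byte becomes 1 when that byte is >= 'A'. */
    uint64_t ge_A = ascii + ones * (0x80 - 'A');
    /* Top bit of each byte becomes 1 when that byte is > 'Z'. */
    uint64_t gt_Z = ascii + ones * (0x80 - 'Z' - 1);
    /* Keep bytes in 'A'..'Z' whose original top bit was clear. */
    uint64_t is_upper = ge_A & ~gt_Z & ~w & (ones * 0x80);
    /* 0x80 >> 2 == 0x20: turn the flag into the ASCII case bit. */
    return w | (is_upper >> 2);
}

static void tolower_swar(char *dst, const char *src, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, 8);      /* alignment-safe load */
        w = swar_tolower8(w);
        memcpy(dst + i, &w, 8);      /* alignment-safe store */
    }
    for (; i < len; i++) {           /* scalar tail */
        unsigned char c = (unsigned char)src[i];
        dst[i] = (char)((c >= 'A' && c <= 'Z') ? (c | 0x20) : c);
    }
}
```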

Undefined behavior and out-of-bounds reads

A long subthread covers “unsafe read beyond end” optimizations, which are very fast on real hardware but formally undefined behavior in the C, Rust, and LLVM models.
Concerns: compilers may assume it never happens and misoptimize; sanitizers may miss or flag it awkwardly.
Masked AVX-512 loads that suppress faults are seen as the “proper” hardware solution; earlier masked AVX2 behavior on some AMD chips is called out as problematic.

RISC-V vectors and AVX adoption

RVV (the RISC-V vector extension) is described as a cleaner, more uniform vector model with masking and a scalable vector length, closer to ARM SVE than to AVX-512.
On x86, there is debate about real-world AVX-512 uptake: runtime dispatch exists in numerics/crypto/media, but many hesitate to require more than AVX2.
Intel’s fragmented AVX-512 support and upcoming AVX10 vs AMD’s more straightforward Zen 4/5 story lead to mixed optimism about future wide-SIMD usage.
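The runtime dispatch mentioned above is commonly built on a CPU feature check at startup. A minimal sketch follows, assuming GCC or Clang on x86, where `__builtin_cpu_supports` is available; on other targets it always picks the scalar path. `tolower_avx512_standin` is a hypothetical placeholder for a routine that a real project would compile separately with AVX-512 enabled.

```c
#include <stddef.h>

typedef void (*tolower_fn)(char *dst, const char *src, size_t len);

/* Baseline that runs on any CPU. */
static void tolower_scalar(char *dst, const char *src, size_t len) {
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        dst[i] = (char)((c >= 'A' && c <= 'Z') ? (c | 0x20) : c);
    }
}

/* Hypothetical stand-in for an AVX-512 implementation. It forwards to
   the scalar code so that this sketch stays self-contained. */
static void tolower_avx512_standin(char *dst, const char *src, size_t len) {
    tolower_scalar(dst, src, len);
}

/* Choose an implementation once, based on what the CPU reports. */
static tolower_fn pick_tolower(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512bw"))
        return tolower_avx512_standin;
#endif
    return tolower_scalar;
}
```

A caller would store the result of `pick_tolower()` in a function pointer at startup and call through it, so the binary only needs a baseline instruction set to load.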
