tolower() with AVX-512
ASCII vs Unicode case handling
- Many comments stress that the article covers ASCII-only lowercasing, which is common in protocols such as DNS and in some language runtimes, and is far simpler than full Unicode case folding.
- Several examples show the Unicode complexity: German ß vs ẞ, uppercasing that changes length (“straße”→“STRASSE”), Turkish dotted/dotless i, and case conversions that cannot round-trip because the mapping is not invertible.
- There is disagreement over the introduction and real-world usefulness of capital ß; some see it as confusing and lacking historical precedent, others as a welcome addition that some style guides now recommend.
- People note that changes to Unicode libraries or to the specs can alter the semantics of a programming language over time, especially where identifiers are case-insensitive.
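As a rough illustration of the ASCII-only operation the thread keeps coming back to, here is a minimal C sketch. The function names are made up for this example. It only changes the bytes 'A' to 'Z' and leaves everything else alone, including the UTF-8 bytes of characters such as ß.

```c
#include <stddef.h>
#include <string.h>

/* Lowercase one byte if it is an ASCII capital letter; every other byte
   (digits, punctuation, UTF-8 lead and continuation bytes) is returned
   unchanged. */
static unsigned char ascii_tolower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Lowercase a NUL-terminated string in place. The byte length never
   changes, which full Unicode case mapping cannot promise. */
void ascii_lower_str(char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        s[i] = (char)ascii_tolower((unsigned char)s[i]);
}
```

Because the mapping is per byte and fixed-length, it needs no locale and no tables, which is what makes it a good fit for SIMD.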
DNS and case-insensitivity tricks
- DNS names are ASCII-only on the wire; they compare case-insensitively, but the case in which a name was sent is preserved.
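- A technique (“DNS-0x20”) randomizes the case of letters in queries to add entropy against spoofing: a well-behaved server echoes the name with its case intact, so the resolver accepts only responses that match the exact case pattern, which dramatically raises the attack cost.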
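The DNS-0x20 idea can be sketched in a few lines of C. This is a toy model, not resolver code: the function names are hypothetical, names are handled as plain C strings rather than wire-format labels, and the random bits are passed in by the caller, where a real resolver would draw them from a cryptographic random source.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool is_ascii_letter(unsigned char c)
{
    c = (unsigned char)(c | 0x20);           /* fold to lowercase */
    return c >= 'a' && c <= 'z';
}

/* Set the case of each letter from one bit of random_bits, lowest bit
   first (1 = upper, 0 = lower). Non-letters consume no bits. Letters
   beyond the 64th see zero bits in this toy version. */
void dns0x20_encode(char *name, uint64_t random_bits)
{
    for (; *name != '\0'; name++) {
        unsigned char c = (unsigned char)*name;
        if (!is_ascii_letter(c))
            continue;
        if (random_bits & 1)
            c = (unsigned char)(c & ~0x20);  /* clear bit 5: uppercase */
        else
            c = (unsigned char)(c | 0x20);   /* set bit 5: lowercase */
        *name = (char)c;
        random_bits >>= 1;
    }
}

/* Ordinary DNS name comparison: ASCII case-insensitive. */
bool dns_name_equal(const char *a, const char *b)
{
    for (;; a++, b++) {
        unsigned char x = (unsigned char)*a, y = (unsigned char)*b;
        if (is_ascii_letter(x)) x = (unsigned char)(x | 0x20);
        if (is_ascii_letter(y)) y = (unsigned char)(y | 0x20);
        if (x != y) return false;
        if (x == '\0') return true;
    }
}

/* The extra 0x20 check: the echoed name must match byte for byte. */
bool dns0x20_response_ok(const char *sent, const char *echoed)
{
    return strcmp(sent, echoed) == 0;
}
```

With the bits 0x2A5, the name "example.com" becomes "ExAmpLe.CoM". It still names the same domain, but a spoofed response carrying the all-lowercase spelling would fail the exact-match check.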
AVX-512 masking, tails, and performance
- The main praise for AVX-512 goes to masked loads and stores, which give smooth performance on short strings and on lengths that are not a multiple of the vector width, without extra branches or a scalar tail.
- Several commenters compare compiler-autovectorized loops with hand-written intrinsics: auto-vectorized code can be good for long loops but often handles tails poorly (e.g., large scalar cleanups), so throughput becomes spiky as the string length varies.
- Some detailed microarchitectural discussion (Zen 4, Ice Lake) suggests masking is effectively “free” versus scalar tails, especially for small strings and misaligned buffers.
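To make the masking point concrete, here is a hedged C sketch of an ASCII lowercase loop in which the final partial block uses the same masked load and store as every full block, so there is no scalar cleanup loop. It is an illustration written for this summary, not the article's code. The name `tolower_masked` is made up, and the AVX-512 path is compiled only when the compiler defines `__AVX512BW__` (for example with `-mavx512bw`); otherwise a plain scalar loop is used so the example still builds.

```c
#include <stddef.h>

#if defined(__AVX512BW__)
#include <immintrin.h>
#endif

/* Lowercase ASCII letters in s[0..len) in place. */
void tolower_masked(char *s, size_t len)
{
#if defined(__AVX512BW__)
    const __m512i upper_a  = _mm512_set1_epi8('A');
    const __m512i upper_z  = _mm512_set1_epi8('Z');
    const __m512i to_lower = _mm512_set1_epi8(0x20);

    while (len > 0) {
        size_t n = len < 64 ? len : 64;
        /* One mask bit per live byte; all ones for a full block. */
        __mmask64 live = (n == 64) ? ~(__mmask64)0
                                   : (((__mmask64)1 << n) - 1);
        /* Masked-off lanes are not read from memory and load as zero. */
        __m512i v = _mm512_maskz_loadu_epi8(live, s);
        /* Signed compares are fine here: bytes >= 0x80 are negative and
           fail the >= 'A' test, so non-ASCII bytes are left alone. */
        __mmask64 is_upper = _mm512_cmpge_epi8_mask(v, upper_a)
                           & _mm512_cmple_epi8_mask(v, upper_z);
        v = _mm512_mask_add_epi8(v, is_upper, v, to_lower);
        /* Only the live bytes are written back. */
        _mm512_mask_storeu_epi8(s, live, v);
        s += n;
        len -= n;
    }
#else
    /* Portable fallback so the sketch builds without AVX-512. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z')
            s[i] = (char)(c + 0x20);
    }
#endif
}
```

A 67-byte string takes two iterations on the AVX-512 path: one full block and one block with three live lanes.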
Compilers, intrinsics, and SWAR
- Clang vs GCC differences are highlighted: Clang often rewrites intrinsics into more complex sequences; sometimes better, sometimes noticeably worse.
- There is frustration that there’s no “don’t second-guess my intrinsics” mode. Some projects ended up maintaining inline assembly for critical paths.
- SWAR (“SIMD within a register”) tricks are mentioned but noted as often alignment-sensitive and not always faster once you add prologue/epilogue code.
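For readers who have not seen SWAR, the following C sketch lowercases eight bytes at a time inside a `uint64_t`. It is one common way to write the trick, with names chosen for this example. It uses `memcpy` for the loads and stores to sidestep alignment, and the scalar loop at the end is the kind of epilogue the comments refer to.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lowercase the eight bytes packed in w, touching only 'A'..'Z'. */
static uint64_t swar_tolower8(uint64_t w)
{
    const uint64_t ones = 0x0101010101010101ULL;
    /* Drop bit 7 of each byte so the per-byte additions below can never
       carry into the neighbouring byte. */
    uint64_t low7 = w & (0x7F * ones);
    /* Bit 7 of each byte becomes set when that byte is > 'Z' ... */
    uint64_t gt_z = low7 + (0x7F - 'Z') * ones;
    /* ... and when that byte is >= 'A'. */
    uint64_t ge_a = low7 + (0x80 - 'A') * ones;
    /* Bit 7 set only for bytes that were ASCII to begin with. */
    uint64_t ascii = ~w & (0x80 * ones);
    /* Upper-case letters: ASCII, >= 'A', and not > 'Z'. */
    uint64_t is_upper = ascii & (ge_a ^ gt_z);
    /* Move the flag from bit 7 down to bit 5 (0x20) and set it. */
    return w | (is_upper >> 2);
}

void tolower_swar(char *s, size_t len)
{
    while (len >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);        /* unaligned-safe load */
        w = swar_tolower8(w);
        memcpy(s, &w, 8);
        s += 8;
        len -= 8;
    }
    /* Epilogue: up to seven leftover bytes, one at a time. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z')
            s[i] = (char)(c | 0x20);
    }
}
```

The per-byte logic does not depend on byte order, so the same code works on little-endian and big-endian machines.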
Undefined behavior and out-of-bounds reads
- Long subthread on “unsafe read beyond end” optimizations: very fast on real hardware but formally undefined in C/Rust/LLVM models.
- Concerns: compilers may assume it never happens and misoptimize; sanitizers may miss or flag it awkwardly.
- Masked AVX-512 loads that suppress faults are seen as the “proper” hardware solution; earlier masked AVX2 behavior on some AMD chips is called out as problematic.
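The hardware side of the over-read argument usually rests on a page-boundary check like the one sketched below in C. The helper name and the 4 KiB page size are assumptions for illustration; real code would ask the operating system for the page size. The function itself only does address arithmetic and never reads memory. Passing the check means a wide load cannot fault on typical hardware, but it does not make the read defined in the C, Rust, or LLVM models, which is the concern raised in the subthread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: 4 KiB pages. Query the OS in real code. */
#define ASSUMED_PAGE_SIZE 4096u

/* True if a load of `width` bytes starting at p stays inside the page
   that contains p, so it cannot touch an unmapped neighbouring page. */
bool load_stays_in_page(const void *p, size_t width)
{
    uintptr_t offset = (uintptr_t)p & (ASSUMED_PAGE_SIZE - 1);
    return offset + width <= ASSUMED_PAGE_SIZE;
}
```

For a 64-byte load, an address 64 bytes before the end of a page passes, while one 63 bytes before the end does not.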
RISC-V vectors and AVX adoption
- RVV (the RISC-V vector extension) is described as a cleaner, more uniform vector model with masking and scalable vector length, closer to ARM SVE than to AVX-512.
- On x86, there is debate about real-world AVX-512 uptake: runtime dispatch exists in numerics/crypto/media, but many hesitate to require more than AVX2.
- Intel’s fragmented AVX-512 support and upcoming AVX10 vs AMD’s more straightforward Zen 4/5 story lead to mixed optimism about future wide-SIMD usage.