tolower() with AVX-512

ASCII vs Unicode case handling

  • Many comments stress the article is about ASCII-only lowercasing, which is common in protocols (DNS, some language runtimes) and far simpler than full Unicode case folding.
  • Several examples show Unicode complexity: German ß vs ẞ, length-changing uppercasing (“straße”→“STRASSE”), Turkish dotted/dotless i, and case conversions that cannot round-trip because the mapping is not invertible.
  • There is disagreement over the introduction and real-world usefulness of capital ß; some see it as confusing and historically weak, others as a welcome addition now recommended in some style guides.
  • People note that changing Unicode libraries or specs can alter language semantics over time, especially for case-insensitive identifiers.
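
To make the ASCII-only scope concrete, here is a minimal sketch in C. The function names are made up for illustration and are not taken from the article. Only the bytes 'A' to 'Z' change, so the output always has the same length as the input, and the bytes of multi-byte UTF-8 sequences (all 0x80 or above) pass through untouched.

```c
#include <stddef.h>

/* Lowercase one byte if it is an ASCII capital letter; otherwise return
 * it unchanged. Bytes >= 0x80 (UTF-8 lead/continuation bytes) never match. */
static unsigned char ascii_tolower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* In-place and length-preserving, unlike full Unicode case mapping. */
static void ascii_tolower_buf(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        s[i] = (char)ascii_tolower((unsigned char)s[i]);
}
```

Applied to the UTF-8 bytes of “STRAßE”, this lowercases the six ASCII letters and leaves the two bytes of ß alone, which is exactly the behavior a protocol like DNS wants and exactly what a Unicode-aware caller would not want.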

DNS and case-insensitivity tricks

  • DNS names are ASCII-only on the wire but case-preserving and case-insensitive.
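  • A technique (“DNS-0x20”) randomizes the case of letters in the query name to add entropy against spoofing: servers echo the name with its case intact, and the resolver rejects any reply whose case pattern does not match exactly, so an off-path attacker has to guess one extra bit per letter.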
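
The idea can be sketched in a few lines of C. This is an illustration under stated assumptions, not code from any real resolver: the function names are hypothetical, names are handled as plain dotted strings rather than wire-format labels, and `bits` is assumed to come from a cryptographically secure random source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Set the case of each ASCII letter from one bit of `bits` (assumed to be
 * CSPRNG output). This sketch reuses the 64 bits cyclically for names with
 * more than 64 letters. */
static void dns0x20_encode(char *name, size_t len, uint64_t bits)
{
    unsigned k = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char lower = (unsigned char)name[i] | 0x20;
        if (lower >= 'a' && lower <= 'z') {
            /* bit set -> upper case, bit clear -> lower case */
            bool upper = (bits >> (k++ % 64)) & 1;
            name[i] = (char)(upper ? (lower & ~0x20) : lower);
        }
    }
}

/* Resolver-side check: the echoed query name must match byte for byte. */
static bool dns0x20_reply_matches(const char *sent, const char *echoed, size_t len)
{
    return memcmp(sent, echoed, len) == 0;
}

/* Ordinary name comparison stays ASCII case-insensitive. */
static bool dns_name_equal(const char *a, const char *b, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char x = (unsigned char)a[i], y = (unsigned char)b[i];
        if (x >= 'A' && x <= 'Z') x |= 0x20;
        if (y >= 'A' && y <= 'Z') y |= 0x20;
        if (x != y) return false;
    }
    return true;
}
```

With `bits = 0x5`, the name `www.example.com` becomes `WwW.example.com`: it still compares equal to the original under DNS rules, but a forged reply carrying the all-lowercase name fails the exact-match check.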

AVX-512 masking, tails, and performance

  • The central praise for AVX-512 goes to masked loads/stores, which keep performance smooth on short strings and on lengths that are not a multiple of the vector width, without branches or scalar tails.
  • Several commenters compare compiler-autovectorized loops with hand-written intrinsics: auto-vectorized code can be good for long loops but often handles tails poorly (e.g., large scalar cleanups), which makes throughput spiky as string length varies.
  • Some detailed microarchitectural discussion (Zen 4, Ice Lake) suggests masking is effectively “free” versus scalar tails, especially for small strings and misaligned buffers.
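
A rough sketch of the masked-tail pattern follows. It is an approximation of the approach under discussion rather than the article's exact code: the names `tolower64` and `copy_tolower` are illustrative, the AVX-512 path is only compiled when the compiler defines `__AVX512BW__` (for example with `-mavx512bw`), and the scalar fallback exists solely so the sketch builds elsewhere.

```c
#include <stddef.h>

#if defined(__AVX512BW__)
#include <immintrin.h>

/* Lowercase 64 bytes at once. The compares are signed, so bytes >= 0x80
 * are negative and can never fall in the 'A'..'Z' range. */
static inline __m512i tolower64(__m512i c)
{
    const __m512i A = _mm512_set1_epi8('A');
    const __m512i Z = _mm512_set1_epi8('Z');
    const __m512i delta = _mm512_set1_epi8('a' - 'A');
    __mmask64 is_upper = _mm512_cmpge_epi8_mask(c, A)
                       & _mm512_cmple_epi8_mask(c, Z);
    return _mm512_mask_add_epi8(c, is_upper, c, delta);
}

static void copy_tolower(char *dst, const char *src, size_t len)
{
    while (len >= 64) {
        __m512i v = _mm512_loadu_si512((const void *)src);
        _mm512_storeu_si512((void *)dst, tolower64(v));
        src += 64; dst += 64; len -= 64;
    }
    if (len > 0) {
        /* One mask bit per remaining byte (len is 1..63 here). Masked-off
         * lanes are neither read nor written, so no scalar tail is needed. */
        __mmask64 tail = (1ULL << len) - 1;
        __m512i v = _mm512_maskz_loadu_epi8(tail, src);
        _mm512_mask_storeu_epi8(dst, tail, tolower64(v));
    }
}
#else
/* Scalar fallback so the sketch still compiles without AVX-512BW. */
static void copy_tolower(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        dst[i] = (char)((c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c);
    }
}
#endif
```

The point of the design is that the tail costs one masked load and one masked store regardless of whether 1 or 63 bytes remain, which is where the smooth performance curve comes from.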

Compilers, intrinsics, and SWAR

  • Clang vs GCC differences are highlighted: Clang often rewrites intrinsics into more complex sequences; sometimes better, sometimes noticeably worse.
  • There is frustration that there’s no “don’t second-guess my intrinsics” mode. Some projects ended up maintaining inline assembly for critical paths.
  • SWAR (“SIMD within a register”) tricks are mentioned but noted as often alignment-sensitive and not always faster once you add prologue/epilogue code.
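
For readers who have not seen the SWAR approach, here is a minimal sketch that lowercases eight bytes at a time in a 64-bit integer. The function names are illustrative. The `memcpy` calls sidestep the alignment issue mentioned above, at the cost of relying on the compiler to turn them into plain loads and stores, and the scalar loop at the end is the kind of epilogue that eats into the gains on short strings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lowercase eight packed bytes without any per-byte branching. */
static uint64_t tolower8(uint64_t octets)
{
    const uint64_t ones = 0x0101010101010101ULL;
    uint64_t heptets = octets & (0x7F * ones);          /* drop high bits  */
    uint64_t is_gt_Z = heptets + (0x7F - 'Z') * ones;   /* bit 7 if > 'Z'  */
    uint64_t is_ge_A = heptets + (0x80 - 'A') * ones;   /* bit 7 if >= 'A' */
    uint64_t is_ascii = ~octets;                        /* bit 7 if < 0x80 */
    uint64_t is_upper = is_ascii & (is_ge_A ^ is_gt_Z) & (0x80 * ones);
    return octets | (is_upper >> 2);                    /* 0x80 -> 0x20    */
}

static void swar_tolower(char *s, size_t len)
{
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);   /* alignment-safe load */
        w = tolower8(w);
        memcpy(s + i, &w, 8);
    }
    for (; i < len; i++) {      /* scalar epilogue for the last 0..7 bytes */
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z')
            s[i] = (char)(c + ('a' - 'A'));
    }
}
```

Adding the per-byte constants cannot carry between bytes because each 7-bit value plus the constant stays below 0x100, which is why the high bit has to be masked off first and checked separately.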

Undefined behavior and out-of-bounds reads

  • Long subthread on “unsafe read beyond end” optimizations: very fast on real hardware but formally undefined in C/Rust/LLVM models.
  • Concerns: compilers may assume it never happens and misoptimize; sanitizers may miss or flag it awkwardly.
  • Masked AVX-512 loads that suppress faults are seen as the “proper” hardware solution; earlier masked AVX2 behavior on some AMD chips is called out as problematic.
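
For contrast with the over-read trick, here is one portable way to handle a short tail while staying inside the language rules: copy only the bytes that exist into a zeroed word, and write back only that many. The helper names are hypothetical and this is not taken from the thread. It trades a variable-length `memcpy` for defined behavior.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load up to 8 bytes into a 64-bit word without reading past p + n.
 * Bytes beyond n are zero-filled instead of being fetched from memory. */
static uint64_t load_upto8(const char *p, size_t n)
{
    uint64_t w = 0;
    memcpy(&w, p, n < 8 ? n : 8);
    return w;
}

/* Store only the first n (up to 8) bytes of the word back to p. */
static void store_upto8(char *p, uint64_t w, size_t n)
{
    memcpy(p, &w, n < 8 ? n : 8);
}
```

This keeps sanitizers quiet and gives the compiler nothing to misoptimize, but the variable-length copy is typically slower than a single wide load, which is why masked loads in hardware are the more attractive fix.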

RISC-V vectors and AVX adoption

  • RVV is pointed out as a cleaner, more uniform vector model with masking and scalable vector length, closer to ARM SVE than AVX-512.
  • On x86, there is debate about real-world AVX-512 uptake: runtime dispatch exists in numerics/crypto/media, but many hesitate to require more than AVX2.
  • Intel’s fragmented AVX-512 support and upcoming AVX10 vs AMD’s more straightforward Zen 4/5 story lead to mixed optimism about future wide-SIMD usage.
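
The runtime-dispatch pattern mentioned above can be sketched as follows. This assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available; the enum and function names are invented for the example, and on other compilers or architectures the sketch simply reports the scalar level.

```c
/* Hypothetical capability levels a library might dispatch on. */
enum simd_level { SIMD_SCALAR, SIMD_AVX2, SIMD_AVX512BW };

/* Probe the CPU once at startup and pick the widest usable level. */
static enum simd_level detect_simd_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512bw"))
        return SIMD_AVX512BW;
    if (__builtin_cpu_supports("avx2"))
        return SIMD_AVX2;
#endif
    return SIMD_SCALAR;
}

static const char *simd_level_name(enum simd_level level)
{
    switch (level) {
    case SIMD_AVX512BW: return "avx512bw";
    case SIMD_AVX2:     return "avx2";
    default:            return "scalar";
    }
}
```

A real library would map each level to a function pointer for its AVX-512, AVX2, and baseline implementations. The cost is building and testing every variant, which is part of why many projects stop at AVX2.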