Towards fearless SIMD, 7 years later
Rust SIMD abstractions and lane counts
- Several commenters describe hand-rolled SIMD wrappers in Rust (e.g., `f32x8`, `Vec3x8`, `Quaternionx8`) using a structure-of-arrays layout, used successfully in numerical code (molecular dynamics) with ~2–4× speedups over scalar code; see the sketch after this list.
- Concern: tying APIs to fixed widths (`x4`, `x8`, `x16`) harms performance portability across AVX, AVX-512, NEON, SVE, and RVV.
- Alternatives proposed: “machine-width” types like `f32xn`, or a single type whose lane count is target-dependent; Google Highway is frequently cited as a good design reference.
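A minimal sketch of what such a hand-rolled structure-of-arrays wrapper tends to look like (the type and method names here are illustrative, not taken from any commenter's actual code):

```rust
/// Eight lanes of f32; with `-C target-cpu=...` the fixed-size loops below
/// typically lower to single AVX instructions.
#[derive(Clone, Copy, Debug)]
struct F32x8([f32; 8]);

impl F32x8 {
    fn splat(v: f32) -> Self {
        F32x8([v; 8])
    }

    /// Element-wise `self * a + b`.
    fn mul_add(self, a: F32x8, b: F32x8) -> Self {
        let mut out = [0.0f32; 8];
        for i in 0..8 {
            out[i] = self.0[i] * a.0[i] + b.0[i];
        }
        F32x8(out)
    }
}

/// Eight 3D vectors stored as three lane arrays (structure-of-arrays),
/// so each component maps onto one SIMD register.
#[derive(Clone, Copy, Debug)]
struct Vec3x8 {
    x: F32x8,
    y: F32x8,
    z: F32x8,
}

impl Vec3x8 {
    /// Eight dot products at once, one per lane.
    fn dot(self, other: Vec3x8) -> F32x8 {
        let zero = F32x8::splat(0.0);
        self.x
            .mul_add(other.x, self.y.mul_add(other.y, self.z.mul_add(other.z, zero)))
    }
}
```

The hard-coded `8` in the type names is exactly the fixed-width concern above: the same code under-fills an AVX-512 register and over-commits a 128-bit NEON one.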
Compiler support, intrinsics, and auto‑vectorization
- Some examples show Rust nightly auto-vectorizing simple scalar functions (e.g., sigmoid); ongoing work to make intrinsics safe is also noted.
- Others report Rust miscompilations or ABI issues: SIMD args passed via the stack, `target_feature` scoped to single functions breaking down in practice and forcing whole-program `-C target-cpu=...`, and difficulty querying the actual microarchitecture in code.
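For readers unfamiliar with the friction being described: on stable Rust the usual pattern pairs a function-scoped `#[target_feature]` attribute with a runtime check, and everything outside the attributed function stays at the baseline feature level unless the whole program is built with `-C target-cpu=...`. A minimal x86_64 sketch:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    // Inside this function the compiler may emit AVX2 instructions,
    // but callers must prove AVX2 is present before calling it.
    xs.iter().sum()
}

fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // Safety: guarded by the runtime feature check above.
            return unsafe { sum_avx2(xs) };
        }
    }
    // Baseline fallback, compiled with the default target features.
    xs.iter().sum()
}

fn main() {
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    println!("{}", sum(&data));
}
```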
Portable vs architecture-specific SIMD
- One camp sees standardized SIMD types as marginal: compilers already autovectorize many regular loops; harder cases (byte-level parsing, var-length codecs, mixed precision, scatter/gather) need hand-crafted intrinsics.
- Counterexamples: projects using Highway (and some Rust crates) show that general-purpose SIMD wrappers can still handle complex byte-level, mixed-precision, and codec workloads with good performance.
- Mask/predicate abstraction across AVX2 vs AVX‑512 (vector-of-bools vs packed mask registers) is debated: considered hard but solvable with opaque mask types and conversion helpers.
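One way to read the “opaque mask types and conversion helpers” suggestion is to keep each backend's mask representation private and expose only operations. The sketch below uses scalar stand-ins for the real intrinsics and invented names, purely to show the shape of the abstraction:

```rust
// AVX2-style backend: a mask is just another vector, all-ones per selected lane.
#[derive(Clone, Copy)]
struct MaskAvx2([u32; 8]); // 0xFFFF_FFFF = lane selected, 0 = not

// AVX-512-style backend: a mask is a packed bitmask in a scalar register.
#[derive(Clone, Copy)]
struct MaskAvx512(u8); // bit i = lane i selected

trait Mask: Copy {
    fn from_bools(lanes: [bool; 8]) -> Self;
    fn any(self) -> bool;
    fn select(self, if_true: [f32; 8], if_false: [f32; 8]) -> [f32; 8];
}

impl Mask for MaskAvx2 {
    fn from_bools(lanes: [bool; 8]) -> Self {
        MaskAvx2(lanes.map(|b| if b { u32::MAX } else { 0 }))
    }
    fn any(self) -> bool {
        self.0.iter().any(|&m| m != 0)
    }
    fn select(self, if_true: [f32; 8], if_false: [f32; 8]) -> [f32; 8] {
        std::array::from_fn(|i| if self.0[i] != 0 { if_true[i] } else { if_false[i] })
    }
}

impl Mask for MaskAvx512 {
    fn from_bools(lanes: [bool; 8]) -> Self {
        MaskAvx512(lanes.iter().enumerate().fold(0, |acc, (i, &b)| acc | ((b as u8) << i)))
    }
    fn any(self) -> bool {
        self.0 != 0
    }
    fn select(self, if_true: [f32; 8], if_false: [f32; 8]) -> [f32; 8] {
        std::array::from_fn(|i| if ((self.0 >> i) & 1) == 1 { if_true[i] } else { if_false[i] })
    }
}
```

Generic code written against the `Mask` trait never sees whether the mask lives in a vector register or a `k` register, which is the essence of the proposed solution.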
Rust vs C/C++ for high-performance work
- One view: Rust makes exploiting cutting-edge hardware (AVX-512, AMX, SME, CUDA generations) too painful; better suited to “Python developers” than hardcore HPC.
- Others strongly disagree, citing competitive SIMD/string libraries, Bevy/game-engine work, and easier reasoning about concurrency and aliasing.
- Trade-off noted: Rust often reduces bugs and clarifies unsafe regions, but can feel over-abstracted, especially for mutable graphs, async runtimes, and bottom-up systems design; some find C++ faster for exploratory “advanced” projects, others the opposite.
Undefined behavior, hardware semantics, and SIMD
- Long subthread on C/C++ UB vs implementation-defined behavior (signed overflow, shifts, invalid deref, reserved opcodes).
- Point made that many scalar operations are UB in C but fully specified for SIMD intrinsics and vector ISAs, so SIMD code often leans into hardware realities rather than abstracting them away.
- Disagreement over whether more behavior should be implementation-defined (e.g., wrapping overflow) vs left UB for optimization and portability; security implications and compiler flags like `-fwrapv` and `-ftrapv` discussed.
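For concreteness, `-fwrapv` makes signed overflow wrap in C and `-ftrapv` turns it into a trap; Rust exposes both behaviors explicitly, which gives a rough picture of what “implementation-defined instead of UB” looks like:

```rust
fn main() {
    let x: i32 = i32::MAX;

    // Defined two's-complement wrap, the behavior -fwrapv mandates in C.
    assert_eq!(x.wrapping_add(1), i32::MIN);

    // Explicit overflow detection, roughly what -ftrapv turns overflow into.
    assert_eq!(x.checked_add(1), None);
    let (wrapped, overflowed) = x.overflowing_add(1);
    assert_eq!((wrapped, overflowed), (i32::MIN, true));
}
```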
Concurrency and parallel iteration
- Rust’s work-stealing libraries (Rayon, Bevy’s scheduler) are praised for making data-parallelism easy (“add `par_iter()` and if it compiles, it’s usually correct”); see the sketch after this list.
- Debate over cost: some argue a lock plus a simple data structure is often faster than sophisticated lock-free/concurrent structures; others benchmark work-stealing queues as much faster than a single mutex-protected global queue for many small tasks.
- Atomics are highlighted as expensive when contended; uncontended per-thread queues plus occasional steals are viewed as a good compromise.
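A minimal Rayon sketch of the “add `par_iter()`” claim (assumes the `rayon` crate as a dependency):

```rust
use rayon::prelude::*;

fn main() {
    let xs: Vec<u64> = (0..1_000_000).collect();

    // Sequential version.
    let sum_seq: u64 = xs.iter().map(|x| x * x).sum();

    // Parallel version: swap `iter()` for `par_iter()` and Rayon's
    // work-stealing pool splits the slice across threads; the closure must be
    // Send + Sync, which the compiler checks, hence "if it compiles, it's
    // usually correct".
    let sum_par: u64 = xs.par_iter().map(|x| x * x).sum();

    assert_eq!(sum_seq, sum_par);
    println!("{sum_par}");
}
```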
RISC‑V vector detection and multiversioning
- Current RISC‑V situation is seen as awkward: no direct user-space way to detect RVV universally.
- Solutions discussed: OS syscalls like `riscv_hwprobe`, aux vectors, emerging non-ISA C APIs, and the feature-detection patterns used by Highway.
- Rust’s `is_riscv_feature_detected!("v")` appears to just mirror compile-time `target_feature` rather than doing true runtime detection, which is called out as problematic.
- The open encoding space (and vendor extensions like Xtheadvector) complicates relying on SIGILL semantics for probing.
Other ecosystems and tools
- Portable SIMD in Rust is used for a Numpy-like array library targeting both NEON and x86; a minimal `std::simd` example follows this list.
- C#’s SIMD support and docs are linked as another model.
- A custom SIMD-oriented DSL (Singeli) is mentioned as a powerful, if niche, way to generate tuned vector code across ISAs.
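For reference, the nightly-only `std::simd` (portable SIMD) code that such an array library builds on looks roughly like this; the API is unstable and details may shift:

```rust
#![feature(portable_simd)] // nightly-only

use std::simd::prelude::*;

// Sum of element-wise products, eight lanes at a time; the same source
// compiles to NEON on aarch64 and SSE/AVX on x86_64.
fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    assert_eq!(xs.len(), ys.len());
    let mut acc = f32x8::splat(0.0);
    let vector_len = xs.len() / 8 * 8;
    for i in (0..vector_len).step_by(8) {
        let a = f32x8::from_slice(&xs[i..i + 8]);
        let b = f32x8::from_slice(&ys[i..i + 8]);
        acc += a * b;
    }
    // Horizontal reduction, plus a scalar loop for the leftover tail.
    let mut sum = acc.reduce_sum();
    for i in vector_len..xs.len() {
        sum += xs[i] * ys[i];
    }
    sum
}

fn main() {
    let xs: Vec<f32> = (0..20).map(|i| i as f32).collect();
    let ys = vec![2.0f32; 20];
    println!("{}", dot(&xs, &ys)); // 2 * (0 + 1 + ... + 19) = 380
}
```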