Towards fearless SIMD, 7 years later
Rust SIMD abstractions and lane counts
- Several commenters describe hand-rolled SIMD wrappers in Rust (e.g., `f32x8`, `Vec3x8`, `Quaternionx8`) using a structure-of-arrays layout, used successfully in numerical code (molecular dynamics) with ~2–4× speedups over scalar code; see the sketch after this list.
- Concern: tying APIs to fixed widths (`x4`, `x8`, `x16`) harms performance portability across AVX, AVX-512, NEON, SVE, and RVV.
- Alternatives proposed: “machine-width” types like `f32xn`, or a single type whose lane count is target-dependent; Google Highway is frequently cited as a good design reference.
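A minimal sketch of what such a hand-rolled structure-of-arrays wrapper tends to look like (the type and method names here are illustrative, not taken from any commenter's actual code):

```rust
/// Eight lanes of f32; with `-C target-cpu=...` the fixed-size loops below
/// typically lower to single AVX instructions.
#[derive(Clone, Copy, Debug)]
struct F32x8([f32; 8]);

impl F32x8 {
    fn splat(v: f32) -> Self {
        F32x8([v; 8])
    }

    /// Element-wise `self * a + b`.
    fn mul_add(self, a: F32x8, b: F32x8) -> Self {
        let mut out = [0.0f32; 8];
        for i in 0..8 {
            out[i] = self.0[i] * a.0[i] + b.0[i];
        }
        F32x8(out)
    }
}

/// Eight 3D vectors stored as three lane arrays (structure-of-arrays),
/// so each component maps onto one SIMD register.
#[derive(Clone, Copy, Debug)]
struct Vec3x8 {
    x: F32x8,
    y: F32x8,
    z: F32x8,
}

impl Vec3x8 {
    /// Eight dot products at once, one per lane.
    fn dot(self, other: Vec3x8) -> F32x8 {
        let zero = F32x8::splat(0.0);
        self.x
            .mul_add(other.x, self.y.mul_add(other.y, self.z.mul_add(other.z, zero)))
    }
}
```

The hard-coded `8` in the type names is exactly the fixed-width concern above: the same code under-fills an AVX-512 register and over-commits a 128-bit NEON one.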
Compiler support, intrinsics, and auto‑vectorization
- Some examples show Rust nightly auto-vectorizing simple scalar functions (e.g., sigmoid); ongoing work to make intrinsics safe is also noted.
- Others report Rust miscompilations or ABI issues: SIMD args passed via the stack, `target_feature` scoped to single functions breaking down in practice and forcing whole-program `-C target-cpu=...`, and difficulty querying the actual microarchitecture in code.
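For readers unfamiliar with the friction being described: on stable Rust the usual pattern pairs a function-scoped `#[target_feature]` attribute with a runtime check, and everything outside the attributed function stays at the baseline feature level unless the whole program is built with `-C target-cpu=...`. A minimal x86_64 sketch:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    // Inside this function the compiler may emit AVX2 instructions,
    // but callers must prove AVX2 is present before calling it.
    xs.iter().sum()
}

fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // Safety: guarded by the runtime feature check above.
            return unsafe { sum_avx2(xs) };
        }
    }
    // Baseline fallback, compiled with the default target features.
    xs.iter().sum()
}

fn main() {
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    println!("{}", sum(&data));
}
```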
Portable vs architecture-specific SIMD
- One camp sees standardized SIMD types as marginal: compilers already autovectorize many regular loops; harder cases (byte-level parsing, var-length codecs, mixed precision, scatter/gather) need hand-crafted intrinsics.
- Counterexamples: projects using Highway (and some Rust crates) show that general-purpose SIMD wrappers can still handle complex byte-level, mixed-precision, and codec workloads with good performance.
- Mask/predicate abstraction across AVX2 vs AVX‑512 (vector-of-bools vs packed mask registers) is debated: considered hard but solvable with opaque mask types and conversion helpers.
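One way to read the “opaque mask types and conversion helpers” suggestion is to keep each backend's mask representation private and expose only operations. The sketch below uses scalar stand-ins for the real intrinsics and invented names, purely to show the shape of the abstraction:

```rust
// AVX2-style backend: a mask is just another vector, all-ones per selected lane.
#[derive(Clone, Copy)]
struct MaskAvx2([u32; 8]); // 0xFFFF_FFFF = lane selected, 0 = not

// AVX-512-style backend: a mask is a packed bitmask in a scalar register.
#[derive(Clone, Copy)]
struct MaskAvx512(u8); // bit i = lane i selected

trait Mask: Copy {
    fn from_bools(lanes: [bool; 8]) -> Self;
    fn any(self) -> bool;
    fn select(self, if_true: [f32; 8], if_false: [f32; 8]) -> [f32; 8];
}

impl Mask for MaskAvx2 {
    fn from_bools(lanes: [bool; 8]) -> Self {
        MaskAvx2(lanes.map(|b| if b { u32::MAX } else { 0 }))
    }
    fn any(self) -> bool {
        self.0.iter().any(|&m| m != 0)
    }
    fn select(self, if_true: [f32; 8], if_false: [f32; 8]) -> [f32; 8] {
        std::array::from_fn(|i| if self.0[i] != 0 { if_true[i] } else { if_false[i] })
    }
}

impl Mask for MaskAvx512 {
    fn from_bools(lanes: [bool; 8]) -> Self {
        MaskAvx512(lanes.iter().enumerate().fold(0, |acc, (i, &b)| acc | ((b as u8) << i)))
    }
    fn any(self) -> bool {
        self.0 != 0
    }
    fn select(self, if_true: [f32; 8], if_false: [f32; 8]) -> [f32; 8] {
        std::array::from_fn(|i| if ((self.0 >> i) & 1) == 1 { if_true[i] } else { if_false[i] })
    }
}
```

Generic code written against the `Mask` trait never sees whether the mask lives in a vector register or a `k` register, which is the essence of the proposed solution.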
Rust vs C/C++ for high-performance work
- One view: Rust makes exploiting cutting-edge hardware (AVX-512, AMX, SME, CUDA generations) too painful; better suited to “Python developers” than hardcore HPC.
- Others strongly disagree, citing competitive SIMD/string libraries, Bevy/game-engine work, and easier reasoning about concurrency and aliasing.
- Trade-off noted: Rust often reduces bugs and clarifies unsafe regions, but can feel over-abstracted, especially for mutable graphs, async runtimes, and bottom-up systems design; some find C++ faster for exploratory “advanced” projects, others the opposite.
Undefined behavior, hardware semantics, and SIMD
- Long subthread on C/C++ UB vs implementation-defined behavior (signed overflow, shifts, invalid deref, reserved opcodes).
- Point made that many scalar operations are UB in C but fully specified for SIMD intrinsics and vector ISAs, so SIMD code often leans into hardware realities rather than abstracting them away.
- Disagreement over whether more behavior should be implementation-defined (e.g., wrapping overflow) vs left UB for optimization and portability; security implications and compiler flags like `-fwrapv` and `-ftrapv` discussed.
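For concreteness, `-fwrapv` makes signed overflow wrap in C and `-ftrapv` turns it into a trap; Rust exposes both behaviors explicitly, which gives a rough picture of what “implementation-defined instead of UB” looks like:

```rust
fn main() {
    let x: i32 = i32::MAX;

    // Defined two's-complement wrap, the behavior -fwrapv mandates in C.
    assert_eq!(x.wrapping_add(1), i32::MIN);

    // Explicit overflow detection, roughly what -ftrapv turns overflow into.
    assert_eq!(x.checked_add(1), None);
    let (wrapped, overflowed) = x.overflowing_add(1);
    assert_eq!((wrapped, overflowed), (i32::MIN, true));
}
```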
Concurrency and parallel iteration
- Rust’s work-stealing libraries (Rayon, Bevy’s scheduler) are praised for making data-parallelism easy (“add `par_iter()` and if it compiles, it’s usually correct”); see the sketch after this list.
- Debate over cost: some argue a lock plus a simple data structure is often faster than sophisticated lock-free/concurrent structures; others benchmark work-stealing queues as much faster than a single mutex-protected global queue for many small tasks.
- Atomics are highlighted as expensive when contended; uncontended per-thread queues plus occasional steals are viewed as a good compromise.
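A minimal Rayon sketch of the “add `par_iter()`” claim (assumes the `rayon` crate as a dependency):

```rust
use rayon::prelude::*;

fn main() {
    let xs: Vec<u64> = (0..1_000_000).collect();

    // Sequential version.
    let sum_seq: u64 = xs.iter().map(|x| x * x).sum();

    // Parallel version: swap `iter()` for `par_iter()` and Rayon's
    // work-stealing pool splits the slice across threads; the closure must be
    // Send + Sync, which the compiler checks, hence "if it compiles, it's
    // usually correct".
    let sum_par: u64 = xs.par_iter().map(|x| x * x).sum();

    assert_eq!(sum_seq, sum_par);
    println!("{sum_par}");
}
```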
RISC‑V vector detection and multiversioning
- Current RISC‑V situation is seen as awkward: no direct user-space way to detect RVV universally.
- Solutions discussed: OS syscalls like `riscv_hwprobe`, aux vectors, emerging non-ISA C APIs, and the feature-detection patterns used by Highway.
- Rust’s `is_riscv_feature_detected!("v")` appears to just mirror compile-time `target_feature` rather than doing true runtime detection, which is called out as problematic.
- The open encoding space (and vendor extensions like Xtheadvector) complicates relying on SIGILL semantics for probing.
Other ecosystems and tools
- Portable SIMD in Rust is used for a Numpy-like array library targeting both NEON and x86; a minimal `std::simd` example follows this list.
- C#’s SIMD support and docs are linked as another model.
- A custom SIMD-oriented DSL (Singeli) is mentioned as a powerful, if niche, way to generate tuned vector code across ISAs.
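For reference, the nightly-only `std::simd` (portable SIMD) code that such an array library builds on looks roughly like this; the API is unstable and details may shift:

```rust
#![feature(portable_simd)] // nightly-only

use std::simd::prelude::*;

// Sum of element-wise products, eight lanes at a time; the same source
// compiles to NEON on aarch64 and SSE/AVX on x86_64.
fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    assert_eq!(xs.len(), ys.len());
    let mut acc = f32x8::splat(0.0);
    let vector_len = xs.len() / 8 * 8;
    for i in (0..vector_len).step_by(8) {
        let a = f32x8::from_slice(&xs[i..i + 8]);
        let b = f32x8::from_slice(&ys[i..i + 8]);
        acc += a * b;
    }
    // Horizontal reduction, plus a scalar loop for the leftover tail.
    let mut sum = acc.reduce_sum();
    for i in vector_len..xs.len() {
        sum += xs[i] * ys[i];
    }
    sum
}

fn main() {
    let xs: Vec<f32> = (0..20).map(|i| i as f32).collect();
    let ys = vec![2.0f32; 20];
    println!("{}", dot(&xs, &ys)); // 2 * (0 + 1 + ... + 19) = 380
}
```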