Towards fearless SIMD, 7 years later

Rust SIMD abstractions and lane counts

  • Several commenters describe hand-rolled SIMD wrappers in Rust (e.g., f32x8, Vec3x8, Quaternionx8) built on a structure-of-arrays layout and used successfully in numerical code (molecular dynamics) with ~2–4× speedups over scalar code (see the sketch after this list).
  • Concern: tying APIs to fixed widths (x4, x8, x16) harms performance portability across AVX, AVX-512, NEON, SVE, RVV.
  • Alternatives proposed: “machine-width” types like f32xn or a single type whose lane count is target-dependent; Google Highway is frequently cited as a good design reference.
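A minimal sketch of the kind of hand-rolled, structure-of-arrays wrapper described above. The names and the scalar implementation are illustrative, not any commenter's actual code; the fixed-width loops are written so the optimizer can vectorize them for the target at hand.

```rust
// Illustrative structure-of-arrays wrapper. Plain scalar Rust with no
// intrinsics; the lanewise loops are simple enough for the optimizer to
// vectorize.

#[derive(Clone, Copy)]
pub struct F32x8(pub [f32; 8]);

impl F32x8 {
    pub fn splat(v: f32) -> Self {
        F32x8([v; 8])
    }

    /// Lanewise a * b + c.
    pub fn mul_add(self, b: F32x8, c: F32x8) -> F32x8 {
        let mut out = [0.0f32; 8];
        for i in 0..8 {
            out[i] = self.0[i] * b.0[i] + c.0[i];
        }
        F32x8(out)
    }
}

/// Eight 3D vectors stored as three lanes-of-eight (structure of arrays),
/// so each operation processes eight particles at once.
#[derive(Clone, Copy)]
pub struct Vec3x8 {
    pub x: F32x8,
    pub y: F32x8,
    pub z: F32x8,
}

impl Vec3x8 {
    /// Eight dot products in one call.
    pub fn dot(self, o: Vec3x8) -> F32x8 {
        self.x
            .mul_add(o.x, self.y.mul_add(o.y, self.z.mul_add(o.z, F32x8::splat(0.0))))
    }
}
```

A "machine-width" f32xn type would look the same from the outside but leave the lane count to the target, which is exactly the portability concern raised above.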

Compiler support, intrinsics, and auto‑vectorization

  • Some examples show Rust nightly auto-vectorizing simple scalar functions (e.g., sigmoid), and there is ongoing work to make intrinsics safe to call (a sketch of the auto-vectorization and dispatch patterns follows this list).
  • Others report Rust miscompilations and ABI issues: SIMD arguments passed via the stack, target_feature scoping that covers only single functions and breaks down in practice (forcing whole-program -C target-cpu=...), and difficulty querying the actual microarchitecture from code.
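A sketch of the two patterns mentioned above, using a trivial axpy kernel in place of the thread's sigmoid so it stays dependency-free: a scalar loop the optimizer can usually auto-vectorize, and the standard runtime-dispatch idiom that pairs per-function target_feature with is_x86_feature_detected!. The AVX2 target and the function names are illustrative assumptions.

```rust
/// Scalar fallback: a straightforward loop that LLVM can usually
/// auto-vectorize on its own. #[inline(always)] also lets it be inlined
/// into the AVX2-enabled wrapper below and re-vectorized there.
#[inline(always)]
fn axpy_scalar(a: f32, xs: &[f32], ys: &mut [f32]) {
    for (x, y) in xs.iter().zip(ys.iter_mut()) {
        *y += a * *x;
    }
}

/// The same loop, compiled with AVX2 enabled for this one function
/// (the per-function target_feature scoping discussed above).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn axpy_avx2(a: f32, xs: &[f32], ys: &mut [f32]) {
    axpy_scalar(a, xs, ys);
}

/// Runtime dispatch: take the AVX2 path only if the CPU actually has it.
pub fn axpy(a: f32, xs: &[f32], ys: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // Safe: we just verified at runtime that AVX2 is available.
            return unsafe { axpy_avx2(a, xs, ys) };
        }
    }
    axpy_scalar(a, xs, ys);
}
```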

Portable vs architecture-specific SIMD

  • One camp sees standardized SIMD types as marginal: compilers already auto-vectorize many regular loops, and the harder cases (byte-level parsing, variable-length codecs, mixed precision, scatter/gather) need hand-crafted intrinsics anyway.
  • Counterexamples: projects using Highway (and some Rust crates) show that general-purpose SIMD wrappers can still handle complex byte-level, mixed-precision, and codec workloads with good performance.
  • Mask/predicate abstraction across AVX2 vs AVX‑512 (vector-of-bools vs packed mask registers) is debated: considered hard but solvable with opaque mask types and conversion helpers.
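As a concrete example of the "opaque mask type" approach, Rust's nightly-only std::simd (Portable SIMD) exposes masks as an abstract predicate type and leaves the representation to the backend; a minimal sketch, with the caveat that the API is unstable:

```rust
// Nightly-only sketch using std::simd ("Portable SIMD").
#![feature(portable_simd)]

use std::simd::prelude::*;

/// Clamp negative lanes to zero. The comparison yields an opaque Mask whose
/// in-memory form is left to the backend (a wide vector-of-bools on
/// AVX2-class targets, a packed mask register where the ISA has one);
/// helpers like to_bitmask() cover cases that need a packed form explicitly.
fn relu8(v: f32x8) -> f32x8 {
    let negative = v.simd_lt(f32x8::splat(0.0)); // Mask<i32, 8>
    negative.select(f32x8::splat(0.0), v)
}
```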

Rust vs C/C++ for high-performance work

  • One view: Rust makes exploiting cutting-edge hardware (AVX-512, AMX, SME, CUDA generations) too painful; better suited to “Python developers” than hardcore HPC.
  • Others strongly disagree, citing competitive SIMD/string libraries, Bevy/game-engine work, and easier reasoning about concurrency and aliasing.
  • Trade-off noted: Rust often reduces bugs and clarifies unsafe regions, but can feel over-abstracted, especially for mutable graphs, async runtimes, and bottom-up systems design; some find C++ faster for exploratory “advanced” projects, others the opposite.

Undefined behavior, hardware semantics, and SIMD

  • Long subthread on C/C++ UB vs implementation-defined behavior (signed overflow, shifts, invalid deref, reserved opcodes).
  • Point made that many scalar operations are UB in C but fully specified for SIMD intrinsics and vector ISAs, so SIMD code often leans into hardware realities rather than abstracting them away.
  • Disagreement over whether more behavior should be implementation-defined (e.g., wrapping overflow) vs left UB for optimization and portability; security implications and compiler flags like -fwrapv and -ftrapv discussed.
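For comparison, Rust resolves the same question by making each overflow semantics a fully defined, explicit choice rather than UB; a minimal illustration:

```rust
/// Signed overflow is never UB in Rust: the default `+` panics in debug
/// builds and wraps in release, and every other semantics is an explicit
/// method on the integer types.
fn overflow_flavors(x: i32) -> (i32, Option<i32>, i32) {
    let wrapped = x.wrapping_add(1);     // two's-complement wraparound (cf. -fwrapv)
    let checked = x.checked_add(1);      // None on overflow (cf. -ftrapv, but recoverable)
    let saturated = x.saturating_add(1); // clamps at i32::MAX
    (wrapped, checked, saturated)
}
```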

Concurrency and parallel iteration

  • Rust’s work-stealing libraries (Rayon, Bevy’s scheduler) are praised for making data parallelism easy (“add par_iter() and if it compiles, it’s usually correct”); see the sketch after this list.
  • Debate over cost: some argue lock+simple data structure is often faster than sophisticated lock-free/concurrent structures; others benchmark work-stealing queues as much faster than a single mutex-protected global queue for many small tasks.
  • Atomics are highlighted as expensive when contended; uncontended per-thread queues plus occasional steals are viewed as a good compromise.
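A minimal illustration of the par_iter() claim, assuming the rayon crate as a dependency:

```rust
use rayon::prelude::*;

/// Sequential baseline.
fn sum_of_squares(xs: &[f64]) -> f64 {
    xs.iter().map(|x| x * x).sum()
}

/// Parallel version: the only change is iter() -> par_iter(). Rayon's
/// work-stealing pool splits the slice across threads, and the borrow
/// checker rejects the change if it would introduce a data race.
fn par_sum_of_squares(xs: &[f64]) -> f64 {
    xs.par_iter().map(|x| x * x).sum()
}
```

Rayon's per-thread deques with occasional steals are essentially the "uncontended queues plus occasional steals" compromise described above.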

RISC‑V vector detection and multiversioning

  • The current RISC‑V situation is seen as awkward: there is no direct, universal user-space way to detect RVV.
  • Solutions discussed: OS syscalls like riscv_hwprobe, aux vectors, emerging non-ISA C APIs, and feature detection patterns used by Highway.
  • Rust’s is_riscv_feature_detected!("v") appears to merely mirror the compile-time target_feature set rather than perform true runtime detection, which is called out as problematic (see the sketch after this list).
  • The open encoding space (and vendor extensions like Xtheadvector) complicates relying on SIGILL-based probing.
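A sketch of the shape such multiversioning takes in Rust today, with the caveats above: the detection macro is nightly-only (behind an unstable feature gate), and per the thread it may currently just echo the compile-time target_feature set rather than truly probe the hardware (e.g., via riscv_hwprobe). The helper names are hypothetical.

```rust
// Shape of the runtime-dispatch / multiversioning pattern under discussion.
#![feature(stdarch_riscv_feature_detection)]

pub fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "riscv64")]
    {
        if std::arch::is_riscv_feature_detected!("v") {
            return sum_rvv(xs);
        }
    }
    sum_scalar(xs)
}

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Hypothetical RVV path: in a real build this body would be compiled with
/// the `v` target feature enabled or written with RVV intrinsics.
#[cfg(target_arch = "riscv64")]
fn sum_rvv(xs: &[f32]) -> f32 {
    xs.iter().sum()
}
```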

Other ecosystems and tools

  • Portable SIMD in Rust is used for a NumPy-like array library targeting both NEON and x86.
  • C#’s SIMD support and docs are linked as another model.
  • A custom SIMD-oriented DSL (Singeli) is mentioned as a powerful, if niche, way to generate tuned vector code across ISAs.