Fundamental flaws of SIMD ISAs (2021)
Fixed-width vs variable-length SIMD
- Strong disagreement over whether fixed-width SIMD is a “flaw”.
- Pro–fixed-width: simpler to reason about, easier to design data structures and shuffles around a known width, better for highly tuned algorithms (e.g., hash tables, string tricks, regex, some crypto).
- Pro–variable-length: most data-parallel loops can be written vector-length-agnostic; a single implementation can scale across 128/256/512+ bits, avoiding multiple code paths and rewrites when widths change (see the sketch after this list).
- Several point out that current compilers and APIs often end up specializing by width anyway, so variable-length can devolve to “pick a size and stick to it”.
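A minimal sketch of what "vector-length-agnostic" means in practice, assuming an Arm SVE target and the ACLE intrinsics from <arm_sve.h>: the loop never names a width, so the same code runs on 128-, 256-, or 512-bit SVE hardware.

```
#include <arm_sve.h>
#include <cstdint>

// Element-wise add written without a fixed vector width (SVE sketch).
void add_f32(const float* a, const float* b, float* out, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntw()) {        // svcntw() = 32-bit lanes per vector
    svbool_t pg = svwhilelt_b32_s64(i, n);           // predicate also covers the tail
    svfloat32_t va = svld1_f32(pg, a + i);
    svfloat32_t vb = svld1_f32(pg, b + i);
    svst1_f32(pg, out + i, svadd_f32_m(pg, va, vb));
  }
}
```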
Reductions and horizontal ops
- Reductions (sum/min/max/etc.) are highlighted as an under-served area: inherently higher latency (tree depth) and less parallel than per-lane ops.
- Some ISAs have partial support (NEON, RVV, x86's psadbw, among others), but many recommend avoiding heavy reliance on reductions inside hot loops: keep accumulation per-lane and do the final scalar collapse only once, after the loop (sketched below).
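A sketch of the "keep it vertical, collapse once" pattern, assuming AVX on x86-64 and, for brevity, that n is a multiple of 8 (tail handling is the next topic):

```
#include <immintrin.h>

// Sum reduction: per-lane adds in the hot loop, one horizontal collapse at the end.
float sum_f32(const float* x, int n) {
  __m256 acc = _mm256_setzero_ps();
  for (int i = 0; i < n; i += 8)
    acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));  // vertical adds only
  // Horizontal collapse happens exactly once, outside the loop.
  __m128 lo = _mm256_castps256_ps128(acc);
  __m128 hi = _mm256_extractf128_ps(acc, 1);
  __m128 s  = _mm_add_ps(lo, hi);
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));
  s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
  return _mm_cvtss_f32(s);
}
```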
Tail handling and correctness
- Experiences differ by domain: some avoid tails entirely via padded containers; others (audio processing, SIMD-heavy library code) report that tail handling nearly doubles code complexity and is a major source of bugs and CVEs.
- Masked loads/stores and AVX-512 per-lane masks help (see the sketch below), but are not ubiquitous and can have nontrivial cost.
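A sketch of mask-based tail handling, assuming AVX-512F: the last partial vector is loaded and stored under a mask built from the remaining element count, instead of falling back to a scalar remainder loop.

```
#include <immintrin.h>

// Scale an array by k; the tail (fewer than 16 elements) is handled with a lane mask.
void scale_f32(const float* in, float* out, int n, float k) {
  const __m512 vk = _mm512_set1_ps(k);
  int i = 0;
  for (; i + 16 <= n; i += 16)
    _mm512_storeu_ps(out + i, _mm512_mul_ps(_mm512_loadu_ps(in + i), vk));
  int rem = n - i;
  if (rem > 0) {
    __mmask16 m = (__mmask16)((1u << rem) - 1);        // low `rem` lanes active
    __m512 v = _mm512_maskz_loadu_ps(m, in + i);       // masked-off lanes read as 0
    _mm512_mask_storeu_ps(out + i, m, _mm512_mul_ps(v, vk));
  }
}
```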
Compilers, abstractions, and autovectorization
- Skepticism that “sufficiently advanced” autovectorizers will soon make hand-written SIMD unnecessary, especially for non-BLAS workloads, complex shuffles, and layout-sensitive tricks.
- Others argue SIMD should be expressed in higher-level constructs (vector types, DSL-like loops), leaving width and tails to the compiler; some languages and libraries (e.g., Highway, Zig vectors, .NET numerics) are cited as partial steps (see the sketch below).
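As an illustration of the higher-level style, a sketch using the Parallelism TS 2 vector type in <experimental/simd> (assuming a recent libstdc++): the library picks the target's native width, and only the tail remains explicit.

```
#include <experimental/simd>
#include <cstddef>
namespace stdx = std::experimental;

// Element-wise add expressed through a width-agnostic vector type, not intrinsics.
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
  using V = stdx::native_simd<float>;
  std::size_t i = 0;
  for (; i + V::size() <= n; i += V::size()) {
    V va, vb;
    va.copy_from(a + i, stdx::element_aligned);
    vb.copy_from(b + i, stdx::element_aligned);
    (va + vb).copy_to(out + i, stdx::element_aligned);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];             // scalar tail
}
```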
WASM and portability
- Variable-width SIMD in WebAssembly is defended as a way to avoid per-feature binaries in an ecosystem without good runtime feature detection.
- Critics see it as complicating compiler work and potentially underutilizing hardware vs explicit fixed-width specializations.
GPUs, SIMT, and vector ISAs
- GPUs (SIMT) are seen as sidestepping some CPU SIMD issues through fixed-width warps and hardware-coalesced memory access, but suffering badly from branch divergence.
- Several note industry trends toward vector-length-agnostic ISAs (Arm SVE, RISC-V RVV), but others report practical slowdowns or optimization challenges with truly dynamic vector lengths.
Loop unrolling and microarchitecture
- Debate over loop unrolling: some say modern OoO cores largely hide loop overhead; others stress unrolling is still crucial to break dependency chains (especially for reductions) and fully pipeline SIMD (see the sketch after this list).
- Concerns raised about code size, uop cache pressure, and differing branch/port behavior across microarchitectures.
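A sketch of unrolling to break the dependency chain in a reduction, assuming AVX and, for brevity, n a multiple of 32: with one accumulator each add waits on the previous one; four independent accumulators let the adds overlap in the pipeline.

```
#include <immintrin.h>

// Sum reduction unrolled 4x with independent accumulators to hide add latency.
float sum_f32_unrolled(const float* x, int n) {
  __m256 a0 = _mm256_setzero_ps(), a1 = a0, a2 = a0, a3 = a0;
  for (int i = 0; i < n; i += 32) {                    // 4 x 8 floats per iteration
    a0 = _mm256_add_ps(a0, _mm256_loadu_ps(x + i));
    a1 = _mm256_add_ps(a1, _mm256_loadu_ps(x + i + 8));
    a2 = _mm256_add_ps(a2, _mm256_loadu_ps(x + i + 16));
    a3 = _mm256_add_ps(a3, _mm256_loadu_ps(x + i + 24));
  }
  __m256 acc = _mm256_add_ps(_mm256_add_ps(a0, a1), _mm256_add_ps(a2, a3));
  // Final horizontal collapse, as in the reduction sketch above.
  __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));
  s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
  return _mm_cvtss_f32(s);
}
```

The trade-off raised in the last point applies directly here: the unrolled body is larger, which costs code size and uop cache space, and the right unroll factor varies by microarchitecture.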