Understanding SIMD: Infinite complexity of trivial problems
Terminology and CPU Parallelism
- Several comments object to the term “hyperscalar”: “superscalar” already has a precise meaning (issuing multiple independent instructions per cycle), distinct from SIMD (one instruction applied to many data elements).
- Correct terminology matters because modern cores combine multiple dimensions: superscalar, out-of-order, and SIMD execution.
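The one-instruction-many-data idea can be illustrated from Python via NumPy, whose elementwise kernels dispatch to SIMD instructions on most platforms (a minimal sketch, not code from the thread):

```python
import numpy as np

# Scalar view: one add per element pair, issued one at a time.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
scalar_sum = [x + y for x, y in zip(a, b)]

# SIMD view: a single vectorized add across all lanes at once.
simd_sum = np.asarray(a) + np.asarray(b)

assert np.allclose(scalar_sum, simd_sum)
```

The superscalar and out-of-order dimensions are orthogonal to this: they decide how many such instructions the core issues per cycle, not how many data elements each one touches.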
VLIW, Explicit Data Graphs, and Scheduling
- A proposed “instruction tree API” from software to hardware is likened to VLIW and to explicit data graph execution.
- VLIW historically struggled on general-purpose CPUs (unpredictable memory, scheduling complexity), though it succeeds in niche DSPs and tight loop kernels.
CPU SIMD vs GPU / CUDA Ecosystems
- Debate on whether “just use CUDA” is superior to x86 SIMD:
- Pro: PTX acts as a stable intermediate representation, giving NVIDIA room to radically change hardware while preserving code; many CUDA codebases remain viable across GPU generations.
- Con: For peak performance, kernels still get retuned or rewritten per architecture; PTX semantics have evolved and are not perfectly stable.
- Comparison to CPUs: x86 code from 15 years ago runs but cannot magically exploit AVX-512, similar to old PTX not using tensor cores.
Intrinsics vs Higher-Level SIMD Abstractions
- Split views:
- One camp favors direct intrinsics or per-ISA implementations, arguing abstractions can’t hide real architectural differences.
- Others advocate portable SIMD libraries (e.g., C# Vector&lt;T&gt;, C++ SIMD wrappers, Rust std::simd) that:
- Provide zero-cost mappings to intrinsics.
- Allow portable arithmetic with “escape hatches” for ISA-specific operations.
- Some report abstractions occasionally underperform, forcing rewrites with intrinsics; others show cases where portable code outperforms hand-tuned intrinsics.
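The appeal of the abstraction camp can be felt even from Python, where NumPy stands in for a portable vectorized layer over hand-written loops (an illustrative sketch; the timings and names here are assumptions, not measurements from the thread):

```python
import numpy as np

x = np.random.rand(100_000).astype(np.float32)

def scalar_norm(v):
    # Element-by-element loop: what you write with no SIMD abstraction.
    total = 0.0
    for e in v:
        total += e * e
    return total ** 0.5

def vector_norm(v):
    # Portable abstraction: maps to vectorized (often SIMD) kernels.
    return float(np.sqrt(np.dot(v, v)))

# Same answer up to rounding, very different cost per element.
assert abs(scalar_norm(x) - vector_norm(x)) / vector_norm(x) < 1e-3
```

The intrinsics camp's counterpoint is that once the portable layer misses an ISA-specific trick, you end up writing the per-architecture version anyway, so the “escape hatch” matters as much as the portable core.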
Use Cases, Latency, and Memory Hierarchy
- GPUs dominate for large, throughput-oriented workloads, especially matrix multiplication and AI kernels, helped by huge register files, shared memory crossbars, and high bandwidth.
- CPUs remain preferable for:
- Low-latency, control-heavy tasks (e.g., order matching engines, small neural nets).
- Very large memory footprints where DRAM capacity and cache behavior matter.
- Apple-style unified memory reduces but does not eliminate CPU–GPU synchronization overhead.
Numerical and Implementation Details
- Discussion of using sqrt(a*b) vs sqrt(a)*sqrt(b): the author defends the latter on accuracy grounds and on SIMD hardware behavior (many roots computed in parallel at the same latency).
- Requests and follow-ups about NumPy support and corrected figure labeling.
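One concrete reason to prefer taking roots first is dynamic range: for large inputs the product a*b can overflow the floating-point type before the root is ever taken, while sqrt(a)*sqrt(b) keeps intermediates in range (a minimal NumPy sketch of that effect, not the author's exact argument):

```python
import numpy as np

a = np.float32(1e30)
b = np.float32(1e30)

# a*b overflows float32 (max ~3.4e38), so the root of the product is inf.
fused = np.sqrt(a * b)

# Roots first keep intermediates in range: sqrt(1e30) ~ 1e15,
# and 1e15 * 1e15 is representable again.
split = np.sqrt(a) * np.sqrt(b)

print(fused)  # inf
print(split)  # finite, close to 1e30
```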
- Some see DSLs / GPU-style languages (CUDA, shader-like C++) as the most ergonomic way to write SIMD; Mojo is cited as aiming to bring such capabilities via MLIR-based compilation.