Understanding SIMD: Infinite complexity of trivial problems
Terminology and CPU Parallelism
- Several comments object to the term “hyperscalar”: “superscalar” already has a precise meaning (issuing multiple independent instructions per cycle), distinct from SIMD (one instruction applied to many data elements).
- Correct terminology matters because modern cores combine multiple dimensions: superscalar, out-of-order, and SIMD execution.
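The one-instruction-many-data idea can be illustrated from Python via NumPy, whose elementwise kernels dispatch to SIMD instructions on most platforms (a minimal sketch, not code from the thread):

```python
import numpy as np

# Scalar view: one add per element pair, issued one at a time.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
scalar_sum = [x + y for x, y in zip(a, b)]

# SIMD view: a single vectorized add across all lanes at once.
simd_sum = np.asarray(a) + np.asarray(b)

assert np.allclose(scalar_sum, simd_sum)
```

The superscalar and out-of-order dimensions are orthogonal to this: they decide how many such instructions the core issues per cycle, not how many data elements each one touches.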
VLIW, Explicit Data Graphs, and Scheduling
- A proposed “instruction tree API” from software to hardware is likened to VLIW and to explicit data graph execution.
- VLIW historically struggled on general-purpose CPUs (unpredictable memory, scheduling complexity), though it succeeds in niche DSPs and tight loop kernels.
CPU SIMD vs GPU / CUDA Ecosystems
- Debate on whether “just use CUDA” is superior to x86 SIMD:
- Pro: PTX acts as a stable intermediate representation, giving NVIDIA room to radically change hardware while preserving code; many CUDA codebases remain viable across GPU generations.
- Con: For peak performance, kernels still get retuned or rewritten per architecture; PTX semantics have evolved and are not perfectly stable.
- Comparison to CPUs: x86 code from 15 years ago runs but cannot magically exploit AVX-512, similar to old PTX not using tensor cores.
Intrinsics vs Higher-Level SIMD Abstractions
- Split views:
- One camp favors direct intrinsics or per-ISA implementations, arguing abstractions can’t hide real architectural differences.
- Others advocate portable SIMD libraries (e.g., C# Vector&lt;T&gt;, C++ SIMD wrappers, Rust std::simd) that:
- Provide zero-cost mappings to intrinsics.
- Allow portable arithmetic with “escape hatches” for ISA-specific operations.
- Some report abstractions occasionally underperform, forcing rewrites with intrinsics; others show cases where portable code outperforms hand-tuned intrinsics.
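The appeal of the abstraction camp can be felt even from Python, where NumPy stands in for a portable vectorized layer over hand-written loops (an illustrative sketch; the timings and names here are assumptions, not measurements from the thread):

```python
import numpy as np

x = np.random.rand(100_000).astype(np.float32)

def scalar_norm(v):
    # Element-by-element loop: what you write with no SIMD abstraction.
    total = 0.0
    for e in v:
        total += e * e
    return total ** 0.5

def vector_norm(v):
    # Portable abstraction: maps to vectorized (often SIMD) kernels.
    return float(np.sqrt(np.dot(v, v)))

# Same answer up to rounding, very different cost per element.
assert abs(scalar_norm(x) - vector_norm(x)) / vector_norm(x) < 1e-3
```

The intrinsics camp's counterpoint is that once the portable layer misses an ISA-specific trick, you end up writing the per-architecture version anyway, so the “escape hatch” matters as much as the portable core.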
Use Cases, Latency, and Memory Hierarchy
- GPUs dominate for large, throughput-oriented workloads, especially matrix multiplication and AI kernels, helped by huge register files, shared memory crossbars, and high bandwidth.
- CPUs remain preferable for:
- Low-latency, control-heavy tasks (e.g., order matching engines, small neural nets).
- Very large memory footprints where DRAM capacity and cache behavior matter.
- Apple-style unified memory reduces but does not eliminate CPU–GPU synchronization overhead.
Numerical and Implementation Details
- Discussion of using sqrt(a*b) vs sqrt(a)*sqrt(b): the author defends the latter on accuracy grounds and on SIMD hardware behavior (many roots computed in parallel at the same latency).
- Requests and follow-ups about NumPy support and corrected figure labeling.
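One concrete reason to prefer taking roots first is dynamic range: for large inputs the product a*b can overflow the floating-point type before the root is ever taken, while sqrt(a)*sqrt(b) keeps intermediates in range (a minimal NumPy sketch of that effect, not the author's exact argument):

```python
import numpy as np

a = np.float32(1e30)
b = np.float32(1e30)

# a*b overflows float32 (max ~3.4e38), so the root of the product is inf.
fused = np.sqrt(a * b)

# Roots first keep intermediates in range: sqrt(1e30) ~ 1e15,
# and 1e15 * 1e15 is representable again.
split = np.sqrt(a) * np.sqrt(b)

print(fused)  # inf
print(split)  # finite, close to 1e30
```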
- Some see DSLs / GPU-style languages (CUDA, shader-like C++) as the most ergonomic way to write SIMD; Mojo is cited as aiming to bring such capabilities via MLIR-based compilation.