Understanding SIMD: Infinite complexity of trivial problems

Terminology and CPU Parallelism

  • Several commenters object to the term “hyperscalar”: the established term is “superscalar”, which already has a precise meaning (issuing multiple different instructions per cycle), distinct from SIMD (one instruction applied to many data elements).
  • Correct terminology matters because modern cores combine multiple dimensions: superscalar, out-of-order, and SIMD execution.

VLIW, Explicit Data Graphs, and Scheduling

  • A proposed “instruction tree API” from software to hardware is likened to VLIW and to explicit data graph execution.
  • VLIW historically struggled on general-purpose CPUs (unpredictable memory latencies, compile-time scheduling complexity), though it succeeds in niche DSPs and tight loop kernels.

CPU SIMD vs GPU / CUDA Ecosystems

  • Debate on whether “just use CUDA” is superior to x86 SIMD:
    • Pro: PTX acts as a stable intermediate representation, giving NVIDIA room to radically change hardware while preserving code; many CUDA codebases remain viable across GPU generations.
    • Con: For peak performance, kernels still get retuned or rewritten per architecture; PTX semantics have evolved and are not perfectly stable.
  • Comparison to CPUs: x86 code from 15 years ago runs but cannot magically exploit AVX-512, similar to old PTX not using tensor cores.

Intrinsics vs Higher-Level SIMD Abstractions

  • Split views:
    • One camp favors direct intrinsics or per-ISA implementations, arguing abstractions can’t hide real architectural differences.
    • Others advocate portable SIMD libraries (e.g., C# Vector<T>, C++ libraries, Rust std::simd) that:
      • Provide zero-cost mappings to intrinsics.
      • Allow portable arithmetic with “escape hatches” for ISA-specific operations.
  • Some report that abstractions occasionally underperform, forcing rewrites in raw intrinsics; others cite cases where portable code matches or beats hand-tuned intrinsics.

Use Cases, Latency, and Memory Hierarchy

  • GPUs dominate for large, throughput-oriented workloads, especially matrix multiplication and AI kernels, helped by huge register files, shared memory crossbars, and high bandwidth.
  • CPUs remain preferable for:
    • Low-latency, control-heavy tasks (e.g., order matching engines, small neural nets).
    • Very large memory footprints where DRAM capacity and cache behavior matter.
  • Apple-style unified memory reduces but does not eliminate CPU–GPU synchronization overhead.

Numerical and Implementation Details

  • Discussion of using sqrt(a*b) vs sqrt(a)*sqrt(b): the author defends the latter on accuracy grounds and on SIMD hardware behavior (square roots across all lanes complete in parallel, at the same latency as a single root).
  • Requests and follow-ups about NumPy and corrected figure labeling.
  • Some see DSLs / GPU-style languages (CUDA, shader-like C++) as the most ergonomic way to write SIMD; Mojo is cited as aiming to bring such capabilities via MLIR-based compilation.