The GPU is not always faster

When GPUs Lose to CPUs (Dot Products & Simple Ops)

  • Many commenters note that for simple, low-compute-intensity operations like dot products, CPUs often win because memory bandwidth, not FLOPs, dominates.
  • If you must move data CPU↔GPU over PCIe for a single dot product, transfer cost dwarfs GPU compute advantages.
  • Several point out that in realistic GPU workflows (e.g., deep learning), model weights are kept on the GPU and only inputs/outputs move, dramatically changing the comparison.
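The transfer-vs-compute point can be made with a back-of-envelope sketch. The bandwidth and FLOP figures below are illustrative assumptions, not measurements from the article or the thread:

```python
# Back-of-envelope: a single dot product that must cross PCIe first.
# All constants are rough, assumed figures for illustration only.
PCIE3_BW = 12e9      # ~12 GB/s effective PCIe 3.0 x16 host->device (assumed)
GPU_FLOPS = 10e12    # ~10 TFLOP/s single precision (assumed)
N = 10_000_000       # vector length

bytes_moved = 2 * N * 4          # two float32 input vectors to the device
transfer_s = bytes_moved / PCIE3_BW
compute_s = (2 * N) / GPU_FLOPS  # N multiplies + N adds

print(f"transfer: {transfer_s * 1e3:.3f} ms")
print(f"compute:  {compute_s * 1e3:.6f} ms")
```

With these assumptions the transfer takes thousands of times longer than the arithmetic, which is the commenters' point: for a one-off dot product, the PCIe copy is the benchmark.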

Compute vs Communication & Roofline Thinking

  • Discussion repeatedly references the compute/communication ratio:
    • Dot products: O(N) data and O(N) compute → memory-bound.
    • Matrix multiply: O(N²) data and O(N³) compute → high data reuse and compute-bound.
  • Roofline and hierarchical roofline models are mentioned as good mental models, including adding PCIe bandwidth as another “roof.”
  • Batch processing in ML is cited as an example of turning memory-bound work into compute-bound work.
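The compute/communication ratios above can be written out as arithmetic intensity (FLOPs per byte moved), plugged into a minimal roofline sketch. Peak-FLOP and bandwidth numbers are assumed for illustration:

```python
def arithmetic_intensity_dot(n):
    # Dot product: O(N) compute over O(N) data -> constant intensity.
    flops = 2 * n            # n multiplies + n adds
    bytes_moved = 2 * n * 4  # two float32 input vectors
    return flops / bytes_moved

def arithmetic_intensity_matmul(n):
    # Square matmul: O(N^3) compute over O(N^2) data -> intensity grows with n.
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * 4  # A and B in, C out, float32
    return flops / bytes_moved

def roofline(intensity, peak_flops=10e12, mem_bw=500e9):
    # Attainable FLOP/s = min(compute roof, bandwidth roof * intensity).
    # Adding a PCIe bandwidth term here would give the "extra roof"
    # mentioned in the discussion.
    return min(peak_flops, intensity * mem_bw)

print(arithmetic_intensity_dot(4096))     # constant, well under 1 FLOP/byte
print(arithmetic_intensity_matmul(4096))  # hundreds of FLOPs/byte
```

Batching has the same effect as growing `n` in the matmul case: more reuse per byte moved, pushing the workload under the compute roof instead of the bandwidth roof.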

Matrix Multiplication Algorithms (Strassen, FFT, BLAS)

  • Multiple comments dispute the claim that “efficient matmuls” mainly use Strassen.
  • Consensus: high-performance BLAS/CUDA libraries mostly use carefully blocked O(N³) algorithms due to memory locality and numerical stability.
  • Strassen and FFT-based matmul are acknowledged as asymptotically better but with large constants, stability issues, and awkward constraints (e.g., power-of-two sizes).
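The "carefully blocked O(N³)" approach the consensus refers to can be sketched in a few lines. This toy version only shows the loop-blocking idea behind cache-friendly matmul; real BLAS kernels add packing, vectorization, and multi-level blocking:

```python
def blocked_matmul(A, B, block=32):
    """Plain O(N^3) matmul with loop blocking for locality (toy sketch).

    Assumes square matrices given as lists of lists. Blocking keeps a
    block x block tile of each operand hot in cache while it is reused.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a = A[i][k]
                        row_b = B[k]
                        row_c = C[i]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a * row_b[j]
    return C

# tiny demo
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(blocked_matmul(A, B, block=2))  # [[19.0, 22.0], [43.0, 50.0]]
```

Note that blocking changes only the traversal order, not the arithmetic, which is why it keeps the numerical behavior of the naive algorithm; Strassen trades some of that stability for fewer multiplications.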

Hardware, Bandwidth, and Unified Memory

  • Several criticize the article’s hardware: an old PCIe 3.0 GPU with low host-device bandwidth, making it effectively a PCIe benchmark.
  • Others note that modern GPUs with PCIe 4.0/5.0, NVLink, or Grace-Hopper-style setups dramatically reduce transfer bottlenecks.
  • Unified memory / integrated GPUs (Apple M-series, Intel iGPUs, mobile) are highlighted as making GPU use more viable for small or real-time tasks by avoiding explicit copies.
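To put the interconnect criticism in perspective, here is a rough comparison of how long a fixed payload takes over different links. The per-direction bandwidths are approximate, assumed figures (real effective bandwidth depends on lane count, generation, and overhead):

```python
# Approximate effective host<->device bandwidths (assumed, not spec-exact).
links = {
    "PCIe 3.0 x16": 12e9,
    "PCIe 4.0 x16": 24e9,
    "PCIe 5.0 x16": 48e9,
    "NVLink-class": 300e9,
}
payload = 1024**3  # 1 GiB of weights/activations

for name, bw in links.items():
    print(f"{name}: {payload / bw * 1e3:6.1f} ms per GiB")
```

Unified-memory and integrated-GPU designs sidestep this table entirely: there is no explicit copy to amortize, which is why commenters see them as a better fit for small or latency-sensitive tasks.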

Practical Guidance and Critiques of the Article

  • Key lesson: only move data to GPUs when you can amortize transfer over many operations and keep subsequent computation on the device.
  • Several call the example “terrible” or “misleading,” arguing a simple CPU vs GPU+transfer comparison would have sufficed.
  • There’s also a correction that the original AVX CPU bandwidth was mis-measured and later revised downward.
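The amortization lesson can be phrased as a break-even count: how many on-device operations does a one-time transfer need to pay for itself? A minimal sketch, with hypothetical per-operation timings:

```python
def breakeven_ops(transfer_s, cpu_op_s, gpu_op_s):
    """Smallest k (as a float) such that transfer_s + k * gpu_op_s <= k * cpu_op_s.

    transfer_s: one-time cost of moving the data to the device.
    cpu_op_s / gpu_op_s: per-operation time on each side (assumed known).
    """
    if cpu_op_s <= gpu_op_s:
        return float("inf")  # the GPU never catches up
    return transfer_s / (cpu_op_s - gpu_op_s)

# Hypothetical numbers: 6.7 ms transfer, 1 ms/op on CPU, 0.1 ms/op on GPU.
print(breakeven_ops(6.7e-3, 1e-3, 1e-4))  # a handful of ops amortizes the copy
```

This is exactly the deep-learning pattern from the discussion: weights cross the bus once, then thousands of forward passes amortize that cost to nearly zero.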