The GPU is not always faster
When GPUs Lose to CPUs (Dot Products & Simple Ops)
- Many commenters note that for simple, low-compute-intensity ops like dot products, CPUs often win because memory bandwidth, not FLOPs, dominates.
- If you must move data CPU↔GPU over PCIe for a single dot product, transfer cost dwarfs GPU compute advantages.
- Several point out that in realistic GPU workflows (e.g., deep learning), model weights are kept on the GPU and only inputs/outputs move, dramatically changing the comparison.
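The transfer-cost point above can be sketched with a back-of-envelope cost model. All bandwidth figures below are illustrative assumptions (roughly PCIe 3.0 x16 for host-device, a typical desktop CPU for RAM), not measurements from the article:

```python
# Back-of-envelope: moving two large float32 vectors over PCIe for a single
# dot product vs. the CPU just streaming them from RAM.
N = 100_000_000                  # 100M-element vectors
bytes_moved = 2 * N * 4          # two float32 input vectors

PCIE3_GBS = 16e9                 # ~16 GB/s, PCIe 3.0 x16 (assumed peak)
CPU_RAM_GBS = 50e9               # ~50 GB/s CPU memory bandwidth (assumed)

pcie_transfer_s = bytes_moved / PCIE3_GBS
cpu_dot_s = bytes_moved / CPU_RAM_GBS    # dot product is bandwidth-bound

print(f"PCIe transfer alone: {pcie_transfer_s * 1e3:.1f} ms")   # ~50 ms
print(f"CPU dot product:     {cpu_dot_s * 1e3:.1f} ms")         # ~16 ms
```

Even ignoring GPU compute time entirely, the copy alone takes longer than the CPU needs to finish the whole operation.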
Compute vs Communication & Roofline Thinking
- Discussion repeatedly references the compute-to-communication ratio:
  - Dot products: O(N) data and O(N) compute → memory-bound.
  - Matrix multiply: O(N²) data and O(N³) compute → high data reuse and compute-bound.
- Roofline and hierarchical roofline models are mentioned as good mental models, including adding PCIe bandwidth as another “roof.”
- Batch processing in ML is cited as an example of turning memory-bound work into compute-bound work.
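The roofline idea above reduces to one formula: attainable throughput is `min(peak FLOP/s, bandwidth × arithmetic intensity)`. A minimal sketch, with all peak numbers assumed for illustration (a ~10 TFLOP/s GPU, ~900 GB/s device memory, PCIe as a second, lower "roof"):

```python
# Roofline: attainable FLOP/s = min(peak, bandwidth * arithmetic_intensity)
PEAK_FLOPS = 10e12    # 10 TFLOP/s GPU peak (assumed)
HBM_GBS = 900e9       # device memory bandwidth (assumed)
PCIE_GBS = 16e9       # PCIe 3.0 x16, treated as an extra roof (assumed)

def attainable(ai_flops_per_byte, bandwidth_bytes_per_s):
    return min(PEAK_FLOPS, bandwidth_bytes_per_s * ai_flops_per_byte)

# Dot product: 2N FLOPs over 8N bytes (two float32 streams) -> AI = 0.25
ai_dot = 0.25
# NxN matmul: 2N^3 FLOPs over 3N^2 float32 matrices -> AI grows with N
N = 4096
ai_mm = (2 * N**3) / (3 * N**2 * 4)     # ≈ 683 FLOPs/byte

print(f"dot on HBM:    {attainable(ai_dot, HBM_GBS) / 1e9:.0f} GFLOP/s")
print(f"dot over PCIe: {attainable(ai_dot, PCIE_GBS) / 1e9:.0f} GFLOP/s")
print(f"matmul on HBM: {attainable(ai_mm, HBM_GBS) / 1e12:.1f} TFLOP/s")
```

The dot product never gets near peak on any roof, while a large matmul's arithmetic intensity is high enough to hit the compute ceiling; batching is effectively a way of raising a workload's arithmetic intensity.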
Matrix Multiplication Algorithms (Strassen, FFT, BLAS)
- Multiple comments dispute the claim that “efficient matmuls” mainly use Strassen.
- Consensus: high-performance BLAS/CUDA libraries mostly use carefully blocked O(N³) algorithms due to memory locality and numerical stability.
- Strassen and FFT-based matmul are acknowledged as asymptotically better but with large constants, stability issues, and awkward constraints (e.g., power-of-two sizes).
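The blocked O(N³) approach the consensus describes can be sketched in a few lines. This is a toy illustration of the tiling idea only; real BLAS kernels add register blocking, packing, and SIMD microkernels, and the block size here is arbitrary:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Plain O(N^3) matmul, tiled so each pair of blocks is reused many
    times while it is cache-resident (the locality win over naive loops)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # One small block update; slicing handles ragged edges.
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 96)).astype(np.float32)
B = rng.standard_normal((96, 160)).astype(np.float32)
assert np.allclose(blocked_matmul(A, B, bs=32), A @ B, atol=1e-3)
```

Unlike Strassen, this works for any shape, keeps the straightforward numerical error behavior of the classical algorithm, and maps cleanly onto cache and register hierarchies.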
Hardware, Bandwidth, and Unified Memory
- Several criticize the article’s hardware: an old PCIe 3.0 GPU with low host-device bandwidth, making it effectively a PCIe benchmark.
- Others note that modern GPUs with PCIe 4.0/5.0, NVLink, or Grace-Hopper-style setups dramatically reduce transfer bottlenecks.
- Unified memory / integrated GPUs (Apple M-series, Intel iGPUs, mobile) are highlighted as making GPU use more viable for small or real-time tasks by avoiding explicit copies.
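The interconnect comparison above can be made concrete with approximate peak rates (all figures assumed, per-direction, and rounded; real sustained bandwidth is lower):

```python
# Rough transfer time for a 1 GiB buffer across host-device links.
# Bandwidths are assumed approximate peaks, not measurements.
links_gbs = {
    "PCIe 3.0 x16": 16,
    "PCIe 4.0 x16": 32,
    "PCIe 5.0 x16": 64,
    "NVLink (approx., per direction)": 450,
}
buf = 2**30  # 1 GiB
for name, gbs in links_gbs.items():
    ms = buf / (gbs * 1e9) * 1e3
    print(f"{name:32s} ~{ms:6.1f} ms")
# Unified-memory / integrated-GPU designs avoid this copy entirely,
# which is why they help for small or latency-sensitive tasks.
```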
Practical Guidance and Critiques of the Article
- Key lesson: only move data to GPUs when you can amortize transfer over many operations and keep subsequent computation on the device.
- Several call the example “terrible” or “misleading,” arguing a simple CPU vs GPU+transfer comparison would have sufficed.
- There’s also a correction that the original AVX CPU bandwidth was mis-measured and later revised downward.
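The amortization lesson has a simple break-even form: a one-time transfer cost is recovered once enough on-device operations follow. A sketch with purely illustrative, assumed timings:

```python
# Break-even model (all timings assumed/illustrative): pay the host-to-
# device transfer once, then run repeated operations on resident data.
transfer_s = 0.050   # one-time H2D copy of the working set (assumed)
cpu_op_s = 0.010     # per-operation CPU time (assumed)
gpu_op_s = 0.001     # per-operation GPU time, data already on device

def total_times(n_ops):
    cpu = n_ops * cpu_op_s
    gpu = transfer_s + n_ops * gpu_op_s
    return cpu, gpu

# GPU wins once n > transfer_s / (cpu_op_s - gpu_op_s)
break_even = transfer_s / (cpu_op_s - gpu_op_s)
print(f"GPU wins after ~{break_even:.1f} ops")
for n in (1, 10, 100):
    c, g = total_times(n)
    print(f"n={n:3d}: CPU {c*1e3:6.1f} ms  GPU {g*1e3:6.1f} ms")
```

With these numbers a single operation loses badly on the GPU but a hundred win easily, which is exactly the deep-learning pattern: weights stay resident and only small inputs/outputs cross the bus.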