The GPU is not always faster
When GPUs Lose to CPUs (Dot Products & Simple Ops)
- Many commenters note that for simple, low-compute-intensity ops like dot products, CPUs often win because memory bandwidth, not FLOPs, dominates.
- If you must move data CPU↔GPU over PCIe for a single dot product, transfer cost dwarfs GPU compute advantages.
- Several point out that in realistic GPU workflows (e.g., deep learning), model weights are kept on the GPU and only inputs/outputs move, dramatically changing the comparison.
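The transfer-cost point above can be sketched with a back-of-envelope cost model. All bandwidth figures below are illustrative assumptions (roughly PCIe 3.0 x16 for host-device, a typical desktop CPU for RAM), not measurements from the article:

```python
# Back-of-envelope: moving two large float32 vectors over PCIe for a single
# dot product vs. the CPU just streaming them from RAM.
N = 100_000_000                  # 100M-element vectors
bytes_moved = 2 * N * 4          # two float32 input vectors

PCIE3_GBS = 16e9                 # ~16 GB/s, PCIe 3.0 x16 (assumed peak)
CPU_RAM_GBS = 50e9               # ~50 GB/s CPU memory bandwidth (assumed)

pcie_transfer_s = bytes_moved / PCIE3_GBS
cpu_dot_s = bytes_moved / CPU_RAM_GBS    # dot product is bandwidth-bound

print(f"PCIe transfer alone: {pcie_transfer_s * 1e3:.1f} ms")   # ~50 ms
print(f"CPU dot product:     {cpu_dot_s * 1e3:.1f} ms")         # ~16 ms
```

Even ignoring GPU compute time entirely, the copy alone takes longer than the CPU needs to finish the whole operation.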
Compute vs Communication & Roofline Thinking
- Discussion repeatedly references the compute-to-communication ratio:
  - Dot products: O(N) data and O(N) compute → memory-bound.
  - Matrix multiply: O(N²) data and O(N³) compute → high data reuse and compute-bound.
- Roofline and hierarchical roofline models are mentioned as good mental models, including adding PCIe bandwidth as another “roof.”
- Batch processing in ML is cited as an example of turning memory-bound work into compute-bound work.
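The roofline idea above reduces to one formula: attainable throughput is `min(peak FLOP/s, bandwidth × arithmetic intensity)`. A minimal sketch, with all peak numbers assumed for illustration (a ~10 TFLOP/s GPU, ~900 GB/s device memory, PCIe as a second, lower "roof"):

```python
# Roofline: attainable FLOP/s = min(peak, bandwidth * arithmetic_intensity)
PEAK_FLOPS = 10e12    # 10 TFLOP/s GPU peak (assumed)
HBM_GBS = 900e9       # device memory bandwidth (assumed)
PCIE_GBS = 16e9       # PCIe 3.0 x16, treated as an extra roof (assumed)

def attainable(ai_flops_per_byte, bandwidth_bytes_per_s):
    return min(PEAK_FLOPS, bandwidth_bytes_per_s * ai_flops_per_byte)

# Dot product: 2N FLOPs over 8N bytes (two float32 streams) -> AI = 0.25
ai_dot = 0.25
# NxN matmul: 2N^3 FLOPs over 3N^2 float32 matrices -> AI grows with N
N = 4096
ai_mm = (2 * N**3) / (3 * N**2 * 4)     # ≈ 683 FLOPs/byte

print(f"dot on HBM:    {attainable(ai_dot, HBM_GBS) / 1e9:.0f} GFLOP/s")
print(f"dot over PCIe: {attainable(ai_dot, PCIE_GBS) / 1e9:.0f} GFLOP/s")
print(f"matmul on HBM: {attainable(ai_mm, HBM_GBS) / 1e12:.1f} TFLOP/s")
```

The dot product never gets near peak on any roof, while a large matmul's arithmetic intensity is high enough to hit the compute ceiling; batching is effectively a way of raising a workload's arithmetic intensity.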
Matrix Multiplication Algorithms (Strassen, FFT, BLAS)
- Multiple comments dispute the claim that “efficient matmuls” mainly use Strassen.
- Consensus: high-performance BLAS/CUDA libraries mostly use carefully blocked O(N³) algorithms due to memory locality and numerical stability.
- Strassen and FFT-based matmul are acknowledged as asymptotically better but with large constants, stability issues, and awkward constraints (e.g., power-of-two sizes).
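The blocked O(N³) approach the consensus describes can be sketched in a few lines. This is a toy illustration of the tiling idea only; real BLAS kernels add register blocking, packing, and SIMD microkernels, and the block size here is arbitrary:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Plain O(N^3) matmul, tiled so each pair of blocks is reused many
    times while it is cache-resident (the locality win over naive loops)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # One small block update; slicing handles ragged edges.
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 96)).astype(np.float32)
B = rng.standard_normal((96, 160)).astype(np.float32)
assert np.allclose(blocked_matmul(A, B, bs=32), A @ B, atol=1e-3)
```

Unlike Strassen, this works for any shape, keeps the straightforward numerical error behavior of the classical algorithm, and maps cleanly onto cache and register hierarchies.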
Hardware, Bandwidth, and Unified Memory
- Several criticize the article’s hardware: an old PCIe 3.0 GPU with low host-device bandwidth, making it effectively a PCIe benchmark.
- Others note that modern GPUs with PCIe 4.0/5.0, NVLink, or Grace-Hopper-style setups dramatically reduce transfer bottlenecks.
- Unified memory / integrated GPUs (Apple M-series, Intel iGPUs, mobile) are highlighted as making GPU use more viable for small or real-time tasks by avoiding explicit copies.
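The interconnect comparison above can be made concrete with approximate peak rates (all figures assumed, per-direction, and rounded; real sustained bandwidth is lower):

```python
# Rough transfer time for a 1 GiB buffer across host-device links.
# Bandwidths are assumed approximate peaks, not measurements.
links_gbs = {
    "PCIe 3.0 x16": 16,
    "PCIe 4.0 x16": 32,
    "PCIe 5.0 x16": 64,
    "NVLink (approx., per direction)": 450,
}
buf = 2**30  # 1 GiB
for name, gbs in links_gbs.items():
    ms = buf / (gbs * 1e9) * 1e3
    print(f"{name:32s} ~{ms:6.1f} ms")
# Unified-memory / integrated-GPU designs avoid this copy entirely,
# which is why they help for small or latency-sensitive tasks.
```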
Practical Guidance and Critiques of the Article
- Key lesson: only move data to GPUs when you can amortize transfer over many operations and keep subsequent computation on the device.
- Several call the example “terrible” or “misleading,” arguing a simple CPU vs GPU+transfer comparison would have sufficed.
- There’s also a correction that the original AVX CPU bandwidth was mis-measured and later revised downward.
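The amortization lesson has a simple break-even form: a one-time transfer cost is recovered once enough on-device operations follow. A sketch with purely illustrative, assumed timings:

```python
# Break-even model (all timings assumed/illustrative): pay the host-to-
# device transfer once, then run repeated operations on resident data.
transfer_s = 0.050   # one-time H2D copy of the working set (assumed)
cpu_op_s = 0.010     # per-operation CPU time (assumed)
gpu_op_s = 0.001     # per-operation GPU time, data already on device

def total_times(n_ops):
    cpu = n_ops * cpu_op_s
    gpu = transfer_s + n_ops * gpu_op_s
    return cpu, gpu

# GPU wins once n > transfer_s / (cpu_op_s - gpu_op_s)
break_even = transfer_s / (cpu_op_s - gpu_op_s)
print(f"GPU wins after ~{break_even:.1f} ops")
for n in (1, 10, 100):
    c, g = total_times(n)
    print(f"n={n:3d}: CPU {c*1e3:6.1f} ms  GPU {g*1e3:6.1f} ms")
```

With these numbers a single operation loses badly on the GPU but a hundred win easily, which is exactly the deep-learning pattern: weights stay resident and only small inputs/outputs cross the bus.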