AMD now has more compute on the top 500 than Nvidia
CUDA moat and ecosystem
- Several commenters debate whether Nvidia’s CUDA advantage will erode in 2–5 years.
- One side argues CUDA is a deep stack built up over 17 years: C/C++/Fortran front ends, PTX, rich libraries (cuBLAS, cuDNN, etc.), IDE integration, profilers, and relatively stable drivers. Competing efforts (AMD, Intel, Apple, Khronos/OpenCL) are seen as having failed to match the full ecosystem.
- Others say most AI users touch CUDA only through higher‑level frameworks (PyTorch, TensorFlow, JAX, Triton), and that typical LLM workloads exercise only a small slice of CUDA, which makes porting more tractable (see the PyTorch sketch after this list).
- Some claim the “CUDA moat” is overrated for AI: TPUs succeeded without CUDA, on the strength of price/performance and framework support.
- There is mention of emerging efforts (HIP, ZLUDA, SCALE, ROCm, Apple MPS) that aim to reduce lock‑in.
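As a concrete illustration of the framework-level point above, here is a minimal PyTorch sketch; the specific ops and sizes are arbitrary, and the portability claim (ROCm builds reuse the torch.cuda namespace, Apple exposes an mps backend) reflects standard PyTorch behavior rather than anything stated in the thread:

```python
# Minimal sketch of framework-level portability: plain PyTorch code that never
# calls CUDA directly. On an Nvidia build this dispatches to CUDA/cuBLAS/cuDNN,
# on an AMD ROCm build the same torch.cuda namespace maps to HIP/ROCm libraries,
# and on Apple Silicon it can fall back to the MPS backend.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():          # true on both CUDA and ROCm builds
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Metal Performance Shaders
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
dtype = torch.float16 if device.type != "cpu" else torch.float32

# A tiny transformer-ish workload: batched matmul + softmax, the kind of op
# that dominates LLM inference. No vendor-specific API appears anywhere.
q = torch.randn(8, 128, 64, device=device, dtype=dtype)
k = torch.randn(8, 128, 64, device=device, dtype=dtype)
attn = torch.softmax(q @ k.transpose(-1, -2) / 64**0.5, dim=-1)
print(device, attn.shape)
```

Nothing vendor-specific appears in the script; the vendor choice happens when the PyTorch build is installed, which is why commenters argue the moat is thinner at this layer.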
AMD vs Nvidia hardware and performance
- Opinions are split:
  - Critics say AMD lags badly in low‑level performance, collectives (AllReduce/AllGather; see the sketch below this list), and practical MFU (model FLOPs utilization) for transformers; cloud pricing for the MI300X is cited as worse than for the H100.
  - Others counter that AMD already powers major inference workloads (e.g., large LLMs) and that hyperscalers are buying billions of dollars’ worth of AMD GPUs.
- Nvidia is viewed as optimizing aggressively for low‑precision ML (FP16/FP8/INT8), while AMD’s MI300A is noted as strong in FP64 for traditional HPC.
- Some see an emerging gap: Nvidia prioritizes ML margins; HPC users fear both Nvidia and AMD are de‑emphasizing high‑precision FP64.
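To make the collectives point concrete, a minimal torch.distributed sketch of AllReduce/AllGather; the launch command and filename in the comments are hypothetical, and the backend note (the "nccl" name maps to NCCL on CUDA builds and RCCL on ROCm builds) is standard PyTorch behavior, not a claim from the thread:

```python
# Minimal sketch of the collectives the thread is arguing about. In data-parallel
# training every rank combines its gradients with an AllReduce.
# Run with e.g.: torchrun --nproc_per_node=<num_gpus> collectives_demo.py  (hypothetical filename)
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # NCCL on CUDA builds, RCCL on ROCm builds
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Stand-in for a gradient shard: every rank contributes its own tensor.
    grad = torch.full((1024,), float(rank), device="cuda")

    # AllReduce: every rank ends up with the elementwise sum across ranks.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    # AllGather: every rank ends up with a copy of every other rank's tensor.
    world = dist.get_world_size()
    gathered = [torch.empty_like(grad) for _ in range(world)]
    dist.all_gather(gathered, grad)

    if rank == 0:
        print("world size", world, "reduced first element", grad[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The performance argument in the thread is about how efficiently the vendor libraries underneath these two calls use the interconnect, not about the calls themselves.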
Top500 list vs real‑world clusters
- Multiple comments argue that the biggest AI clusters (Google, Meta, Microsoft, Amazon, xAI, etc.) are larger and more capable than Top500 entries but never submit results, so the Top500 is an incomplete picture.
- Others defend Top500 and LINPACK as a long‑running, simple baseline for public comparison, even if it over‑represents government/academic workloads and under‑represents cloud AI.
- Some see this AMD “win” as partly a story of public‑sector and HPC buyers being priced out of, or supply‑constrained on, Nvidia hardware.
Metrics, power, and economics
- FLOPS units (giga/tera/peta/exa) are explained and compared with the throughput of consumer CPUs/GPUs (a worked sketch follows after this list).
- Several argue that at large scale, power density and cooling, not GPU purchase price, are often the real bottleneck, though there is disagreement over how strongly those costs scale with power draw.
- There is debate over whether a 30% slower but much cheaper GPU is attractive, given power and space constraints.
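A rough, purely illustrative Python sketch of the unit conversions and the “slower but cheaper” arithmetic; every price, FLOP/s figure, and wattage below is a made‑up placeholder, not a number from the discussion:

```python
# Back-of-the-envelope sketch: FLOPS units and the "30% slower but cheaper" question.
GIGA, TERA, PETA, EXA = 1e9, 1e12, 1e15, 1e18

# One exaFLOP/s expressed in the smaller units quoted for consumer hardware.
print(EXA / TERA, "teraFLOP/s per exaFLOP/s")   # 1,000,000
print(EXA / GIGA, "gigaFLOP/s per exaFLOP/s")   # 1,000,000,000

def value(flops_per_s, price_usd, power_w):
    """FLOP/s per dollar and FLOP/s per watt -- the two ratios the thread argues over."""
    return flops_per_s / price_usd, flops_per_s / power_w

# Hypothetical accelerators: the second is 30% slower but half the price at the same power.
fast = value(flops_per_s=1000 * TERA, price_usd=30_000, power_w=700)
slow = value(flops_per_s=700 * TERA,  price_usd=15_000, power_w=700)

print("per-dollar advantage of the cheaper part: %.2fx" % (slow[0] / fast[0]))  # ~1.40x
print("per-watt advantage of the faster part:    %.2fx" % (fast[1] / slow[1]))  # ~1.43x
# If the datacenter is power- or space-limited rather than budget-limited,
# the per-watt column dominates -- which is the crux of the debate.
```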