AMD's MI300X Outperforms Nvidia's H100 for LLM Inference
Benchmark methodology & fairness
- Many commenters see the benchmarks as marketing for AMD and TensorWave rather than neutral tests.
- Critiques include: tiny 128-token inputs, FP16 only (no FP8/INT8), vLLM as the only engine (not faster ones like TensorRT-LLM or lmdeploy), and MK1 Flywheel applied on AMD but not on Nvidia.
- The comparison uses 1×MI300X (tp=1) vs 2×H100 (tp=2) and then doubles AMD throughput to “simulate” two GPUs; several commenters consider this misleading.
- Others argue that the setup is at least explicit and reproducible if you have access to both systems.
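To make the "doubling" objection concrete, here is a minimal sketch of the two ways such numbers can be put on a common footing: dividing a measured multi-GPU result by its GPU count, or multiplying a single-GPU result by a GPU count. The function names, the `scaling_efficiency` parameter, and all throughput values are hypothetical and are not taken from the benchmark; the point is only that doubling assumes perfect linear scaling, whereas per-GPU normalization uses measured numbers on both sides.

```python
def per_gpu_throughput(total_tokens_per_s: float, num_gpus: int) -> float:
    """Normalize a measured aggregate throughput to a single GPU."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return total_tokens_per_s / num_gpus


def extrapolated_throughput(single_gpu_tokens_per_s: float, num_gpus: int,
                            scaling_efficiency: float = 1.0) -> float:
    """Estimate multi-GPU throughput from a single-GPU measurement.

    scaling_efficiency=1.0 reproduces the "just double it" assumption;
    any value below 1.0 models scaling losses (hypothetical parameter).
    """
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    return single_gpu_tokens_per_s * num_gpus * scaling_efficiency


# Hypothetical numbers for illustration only, NOT from the benchmark.
mi300x_tp1_tokens_per_s = 600.0   # one MI300X, tp=1
h100_tp2_tokens_per_s = 1000.0    # two H100s, tp=2

# Option A: compare per GPU, using only measured values.
h100_per_gpu = per_gpu_throughput(h100_tp2_tokens_per_s, 2)

# Option B: extrapolate the single MI300X to two GPUs (the contested step).
mi300x_doubled = extrapolated_throughput(mi300x_tp1_tokens_per_s, 2)

print(f"H100 per GPU:            {h100_per_gpu:.1f} tokens/s")
print(f"MI300X x2 (extrapolated): {mi300x_doubled:.1f} tokens/s")
```

Whether option B is fair depends on how the two GPUs would actually be deployed; two independent replicas and one tp=2 instance are different configurations, so the extrapolation should be stated explicitly wherever it is used.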
Workload realism (context lengths, batching)
- 128 input tokens is seen as unrepresentative. Real workloads include long system prompts, chat history, and document inputs.
- Suggested “real” benchmarks: 512+ tokens, with distributions like 50/500/2000/50k input tokens and ~300 output tokens, and multiple context-window sizes (512–8192+).
- Some question odd scaling behavior in the results, such as AMD throughput nearly doubling from batch size 1 to 2 on MoE models, which others say shouldn’t happen.
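The mixed-length workload commenters ask for could be sketched roughly as follows. The input lengths (50/500/2000/50k) and the ~300-token output come from the suggestions above; the equal sampling weights, the function names, and the fixed seed are assumptions made only for illustration.

```python
import random

# Input lengths suggested in the discussion.
INPUT_LENGTHS = [50, 500, 2000, 50_000]
# Assumption: equal weights. A real workload would use measured traffic shares.
WEIGHTS = [0.25, 0.25, 0.25, 0.25]
# Roughly 300 output tokens per request, as suggested in the discussion.
OUTPUT_TOKENS = 300


def sample_workload(num_requests: int, seed: int = 0) -> list[tuple[int, int]]:
    """Return (input_tokens, output_tokens) pairs for a synthetic request mix."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    inputs = rng.choices(INPUT_LENGTHS, weights=WEIGHTS, k=num_requests)
    return [(n_in, OUTPUT_TOKENS) for n_in in inputs]


def summarize(workload: list[tuple[int, int]]) -> dict[str, float]:
    """Aggregate token counts, useful for reporting alongside throughput."""
    total_in = sum(n_in for n_in, _ in workload)
    total_out = sum(n_out for _, n_out in workload)
    return {
        "requests": len(workload),
        "total_input_tokens": total_in,
        "total_output_tokens": total_out,
        "mean_input_tokens": total_in / len(workload),
    }


workload = sample_workload(1000)
print(summarize(workload))
```

Reporting the token totals next to the throughput figure makes it clear how far a given run is from the 128-token setup that commenters object to.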
Performance, VRAM, and cost
- MI300X has more VRAM and memory bandwidth per GPU, letting it host large models without tensor parallelism, which is a real advantage.
- Others note Nvidia’s older H100 still gets within ~33% despite far fewer transistors and less memory, implying AMD’s raw hardware advantage does not translate into a decisive lead.
- Perf/$ and perf/W figures are repeatedly requested but mostly missing; commenters consider electricity a small fraction of total cost compared to GPU rental prices.
- Cloud pricing for MI300X and H100 appears similar per hour; real street prices and availability remain unclear.
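The perf/$ and perf/W figures that commenters request are simple to derive once throughput, rental price, and power draw are known. The sketch below shows the arithmetic; every input value is a hypothetical placeholder rather than a measured or quoted figure, and the function names are invented for this example.

```python
def tokens_per_dollar(tokens_per_s: float, price_per_hour: float) -> float:
    """Tokens generated per dollar of GPU rental."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_s * 3600 / price_per_hour


def tokens_per_joule(tokens_per_s: float, power_watts: float) -> float:
    """Tokens generated per joule (1 W = 1 J/s)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_s / power_watts


def energy_cost_share(power_kw: float, electricity_price_per_kwh: float,
                      rental_price_per_hour: float) -> float:
    """Fraction of the hourly rental price that electricity would account for."""
    if rental_price_per_hour <= 0:
        raise ValueError("rental_price_per_hour must be positive")
    return power_kw * electricity_price_per_kwh / rental_price_per_hour


# Hypothetical placeholders, NOT real prices or measurements.
throughput = 1000.0        # tokens/s
rental_price = 2.0         # dollars per GPU-hour
power_kw = 0.75            # average draw in kW
electricity_price = 0.10   # dollars per kWh

print(f"tokens per dollar: {tokens_per_dollar(throughput, rental_price):,.0f}")
print(f"tokens per joule:  {tokens_per_joule(throughput, power_kw * 1000):.3f}")
print(f"energy share of rental price: "
      f"{energy_cost_share(power_kw, electricity_price, rental_price):.1%}")
```

With placeholder values in this range, electricity comes out as a few percent of the rental price, which is consistent with the commenters' view that rental cost dominates; real inputs would be needed to draw any conclusion about either GPU.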
Software stack: CUDA vs ROCm and others
- Broad agreement that CUDA and its libraries (such as cuDNN) are a major Nvidia moat; nearly all ML tooling is optimized for it.
- ROCm historically had poor UX, driver issues, and short-lived GPU support, but some say it has improved and PyTorch on AMD now “just works” for them.
- Debate over hyperscalers: some claim they use their own retargetable stacks and don’t depend on CUDA source; others counter that they still rely heavily on Nvidia’s lower-level stack.
Market dynamics, competition, and bubble talk
- Many welcome AMD’s progress as essential competitive pressure on Nvidia’s very high margins and quasi‑monopoly.
- Others argue Nvidia’s dominance is not just ecosystem lock‑in but also better microarchitecture and long-term AI focus.
- There’s extensive side discussion on whether current AI demand is a bubble, how long Nvidia’s lead will last, and whether future AMD generations (MI325/MI350) and other vendors will erode Nvidia’s position, especially for inference.
- Several note that training remains Nvidia’s strongest area; MI300X is seen as more competitive today on certain inference workloads than on end-to-end training.