Making AMD GPUs competitive for LLM inference (2023)

Memory vs. Compute Bottlenecks

  • Many commenters say LLM inference, especially the token‑generation (decode) phase dominated by GEMV operations, is strongly memory‑bandwidth bound; tensor/matrix cores matter less in that phase.
  • Others argue production inference with large batch sizes and continuous batching is still heavily compute‑bound, especially in the prefill phase.
  • Consensus: GPU VRAM bandwidth (often ~1 TB/s or more) dwarfs what CPUs + CXL/PCIe can offer; CXL links in the tens–hundreds of GB/s are seen as inadequate for high‑end LLMs.
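The bandwidth argument above can be made concrete with a back‑of‑the‑envelope roofline estimate. This is a sketch under the common simplifying assumption that every generated token requires streaming all model weights from memory once; the function name and specific numbers are illustrative, not from the thread.

```python
# Roofline-style upper bound on decode throughput, assuming each generated
# token streams all model weights from memory exactly once (ignores KV cache
# traffic and compute time, so real throughput will be lower).

def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on tokens/s for the decode phase."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A 7B model at 4-bit quantization (~0.5 bytes/param) on a ~1 TB/s GPU:
print(round(decode_tokens_per_sec(7, 0.5, 1000)))  # ~286 tokens/s ceiling

# The same model fed over a hypothetical 64 GB/s CXL/PCIe link:
print(round(decode_tokens_per_sec(7, 0.5, 64)))    # ~18 tokens/s ceiling
```

The two orders of magnitude between VRAM and interconnect bandwidth is why commenters dismiss CXL/PCIe offload for high‑end decode workloads.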

AMD vs. Nvidia Hardware Characteristics

  • AMD consumer GPUs (RDNA3) lack dedicated units like Nvidia’s “tensor cores” but do have WMMA matrix‑multiply instructions; data‑center parts (CDNA, e.g., MI300) have dedicated “matrix cores” with MFMA instructions.
  • Bandwidth numbers: several GPUs (RTX 3090/3090 Ti/4090, Radeon VII, 7900 XTX) are around 1 TB/s, but Nvidia generally achieves higher real‑world efficiency.
  • Data‑center AMD (CDNA, e.g., MI300X) is distinct from consumer RDNA; performance work on RDNA doesn’t transfer directly. A unified “UDNA” is mentioned as future.
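The “real‑world efficiency” point can be quantified by working backwards from a measured decode rate to the fraction of peak bandwidth actually achieved. A minimal sketch, again assuming each token streams all weights once; the 100 tok/s figure is hypothetical, not a benchmark from the thread.

```python
# Estimate what fraction of a card's peak memory bandwidth a measured decode
# rate implies, assuming each generated token reads all weights once.

def bandwidth_efficiency(tokens_per_sec: float, params_billion: float,
                         bytes_per_param: float, peak_gb_s: float) -> float:
    """Achieved GB/s implied by the decode rate, divided by peak GB/s."""
    achieved_gb_s = tokens_per_sec * params_billion * bytes_per_param
    return achieved_gb_s / peak_gb_s

# Hypothetical: 100 tok/s on a 7B 8-bit model with a 960 GB/s card
print(f"{bandwidth_efficiency(100, 7, 1.0, 960):.0%}")  # ~73% of peak
```

Comparing this ratio across cards with similar paper bandwidth (e.g., 3090 vs. 7900 XTX) is one way commenters distinguish hardware limits from software maturity.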

Software Ecosystem & CUDA Lock‑In

  • CUDA is widely perceived as mature and “just works” relative to ROCm and Intel’s stacks, which multiple commenters found fragile or poorly coordinated.
  • Some report repeated failed attempts to use AMD GPUs for serious ML, citing crashes, unstable multi‑GPU, or lack of timely driver fixes.
  • Others note progress: ROCm on RDNA3, WSL support, and improved inference engines (e.g., MLC‑LLM, vLLM, llama.cpp ports).
  • There is interest in open or alternative stacks (D3D, Vulkan, SYCL, CUDA compatibility layers) but major frameworks still orient around CUDA and, secondarily, ROCm.

Practical LLM Inference on AMD

  • MLC‑LLM on RDNA3 is reported as very fast for some models, sometimes beating llama.cpp’s ROCm backend on the same card, though with more rigid quantization and compilation requirements.
  • vLLM now supports AMD (including GGUF and some Radeons) but has large startup/compile times for big models, making it less attractive for local use.
  • Some claim MI300X can match or beat H100 in specific inference setups; others state AMD multi‑GPU systems remain unreliable compared to Nvidia.

Local Inference & Hardware Buying Decisions

  • Frequent recommendation: used RTX 3090/3090 Ti/4090 as the “sweet spot” for local LLMs (24 GB VRAM, strong bandwidth, CUDA ecosystem).
  • AMD and older cards (Radeon VII / Pro VII) are noted as interesting for bandwidth or FP64‑heavy workloads, but generally lag in ease of use and tooling.
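The 24 GB “sweet spot” claim comes down to simple capacity arithmetic: weight bytes scale with parameter count times bits per weight, plus headroom for KV cache and activations. A sketch of that check; the 1.2× overhead factor is an assumption for illustration, not a measured value.

```python
# Rough check of whether a quantized model fits a card's VRAM.
# The 1.2x overhead factor (KV cache, activations, runtime buffers) is an
# assumed rule of thumb, not a measured value.

def fits_in_vram(params_billion: float, bits_per_weight: int,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead <= vram_gb

print(fits_in_vram(13, 8, 24))   # 13B at 8-bit on a 24 GB card -> True
print(fits_in_vram(70, 4, 24))   # 70B at 4-bit needs ~35 GB -> False
```

By this arithmetic a single 24 GB card comfortably runs 13B‑class models at 8‑bit and 30B‑class at 4‑bit, which matches the recommendations in the thread.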

Market Structure, Competition, and Outlook

  • Multiple startups and projects aim to make AMD (and other accelerators) viable to weaken Nvidia’s dominance.
  • Opinions diverge sharply on AMD’s prospects: some see chronic underinvestment and poor execution in software; others point to AMD’s past CPU innovations and argue the company was historically resource‑constrained but is improving.
  • Concern is expressed about Nvidia’s de‑facto dominance and potential antitrust scrutiny, but there is skepticism regulators will act effectively.