Making AMD GPUs competitive for LLM inference (2023)

Memory vs. Compute Bottlenecks

  • Many commenters say LLM inference, especially the token‑generation (decode) phase dominated by GEMV operations, is strongly memory‑bandwidth bound; tensor/matrix cores matter less in that phase.
  • Others argue production inference with large batch sizes and continuous batching is still heavily compute‑bound, especially in the prefill phase.
  • Consensus: GPU VRAM bandwidth (often ~1 TB/s or more) dwarfs what CPUs + CXL/PCIe can offer; CXL links in the tens–hundreds of GB/s are seen as inadequate for high‑end LLMs.
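The bandwidth argument above can be made concrete with a back‑of‑the‑envelope roofline estimate. This is a sketch under the common simplifying assumption that every generated token requires streaming all model weights from memory once; the function name and specific numbers are illustrative, not from the thread.

```python
# Roofline-style upper bound on decode throughput, assuming each generated
# token streams all model weights from memory exactly once (ignores KV cache
# traffic and compute time, so real throughput will be lower).

def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on tokens/s for the decode phase."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A 7B model at 4-bit quantization (~0.5 bytes/param) on a ~1 TB/s GPU:
print(round(decode_tokens_per_sec(7, 0.5, 1000)))  # ~286 tokens/s ceiling

# The same model fed over a hypothetical 64 GB/s CXL/PCIe link:
print(round(decode_tokens_per_sec(7, 0.5, 64)))    # ~18 tokens/s ceiling
```

The two orders of magnitude between VRAM and interconnect bandwidth is why commenters dismiss CXL/PCIe offload for high‑end decode workloads.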

AMD vs. Nvidia Hardware Characteristics

  • AMD consumer GPUs (RDNA3) lack dedicated units like Nvidia’s “tensor cores” but do have WMMA matrix‑multiply instructions; data‑center parts (CDNA, e.g., MI300) have dedicated “matrix cores” with MFMA instructions.
  • Bandwidth numbers: several GPUs (RTX 3090/3090 Ti/4090, Radeon VII, 7900 XTX) are around 1 TB/s, but Nvidia generally achieves higher real‑world efficiency.
  • Data‑center AMD (CDNA, e.g., MI300X) is distinct from consumer RDNA; performance work on RDNA doesn’t transfer directly. A unified “UDNA” is mentioned as future.
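The “real‑world efficiency” point can be quantified by working backwards from a measured decode rate to the fraction of peak bandwidth actually achieved. A minimal sketch, again assuming each token streams all weights once; the 100 tok/s figure is hypothetical, not a benchmark from the thread.

```python
# Estimate what fraction of a card's peak memory bandwidth a measured decode
# rate implies, assuming each generated token reads all weights once.

def bandwidth_efficiency(tokens_per_sec: float, params_billion: float,
                         bytes_per_param: float, peak_gb_s: float) -> float:
    """Achieved GB/s implied by the decode rate, divided by peak GB/s."""
    achieved_gb_s = tokens_per_sec * params_billion * bytes_per_param
    return achieved_gb_s / peak_gb_s

# Hypothetical: 100 tok/s on a 7B 8-bit model with a 960 GB/s card
print(f"{bandwidth_efficiency(100, 7, 1.0, 960):.0%}")  # ~73% of peak
```

Comparing this ratio across cards with similar paper bandwidth (e.g., 3090 vs. 7900 XTX) is one way commenters distinguish hardware limits from software maturity.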

Software Ecosystem & CUDA Lock‑In

  • CUDA is widely perceived as mature and “just works” relative to ROCm and Intel’s stacks, which multiple commenters found fragile or poorly coordinated.
  • Some report repeated failed attempts to use AMD GPUs for serious ML, citing crashes, unstable multi‑GPU, or lack of timely driver fixes.
  • Others note progress: ROCm on RDNA3, WSL support, and improved inference engines (e.g., MLC‑LLM, vLLM, llama.cpp ports).
  • There is interest in open or alternative stacks (D3D, Vulkan, SYCL, CUDA compatibility layers) but major frameworks still orient around CUDA and, secondarily, ROCm.

Practical LLM Inference on AMD

  • MLC‑LLM on RDNA3 is reported as very fast for some models, sometimes beating llama.cpp’s ROCm backend on the same card, though with more rigid quantization and compilation requirements.
  • vLLM now supports AMD (including GGUF and some Radeons) but has large startup/compile times for big models, making it less attractive for local use.
  • Some claim MI300X can match or beat H100 in specific inference setups; others state AMD multi‑GPU systems remain unreliable compared to Nvidia.

Local Inference & Hardware Buying Decisions

  • Frequent recommendation: used RTX 3090/3090 Ti/4090 as the “sweet spot” for local LLMs (24 GB VRAM, strong bandwidth, CUDA ecosystem).
  • AMD and older cards (Radeon VII / Pro VII) are noted as interesting for bandwidth or FP64‑heavy workloads, but generally lag in ease of use and tooling.
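The 24 GB “sweet spot” claim comes down to simple capacity arithmetic: weight bytes scale with parameter count times bits per weight, plus headroom for KV cache and activations. A sketch of that check; the 1.2× overhead factor is an assumption for illustration, not a measured value.

```python
# Rough check of whether a quantized model fits a card's VRAM.
# The 1.2x overhead factor (KV cache, activations, runtime buffers) is an
# assumed rule of thumb, not a measured value.

def fits_in_vram(params_billion: float, bits_per_weight: int,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead <= vram_gb

print(fits_in_vram(13, 8, 24))   # 13B at 8-bit on a 24 GB card -> True
print(fits_in_vram(70, 4, 24))   # 70B at 4-bit needs ~35 GB -> False
```

By this arithmetic a single 24 GB card comfortably runs 13B‑class models at 8‑bit and 30B‑class at 4‑bit, which matches the recommendations in the thread.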

Market Structure, Competition, and Outlook

  • Multiple startups and projects aim to make AMD (and other accelerators) viable to weaken Nvidia’s dominance.
  • Opinions diverge sharply on AMD’s prospects: some see chronic underinvestment and poor execution in software; others point to AMD’s past CPU innovations and argue the company was historically resource‑constrained but is improving.
  • Concern is expressed about Nvidia’s de‑facto dominance and potential antitrust scrutiny, but there is skepticism regulators will act effectively.