2024-06-29

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

CUDA vs. AMD Software Ecosystem

One side argues CUDA is a deep moat: huge, complex surface area; PyTorch/TF “can be” portable, but ROCm backends lag far behind Nvidia’s in maturity and coverage.
Others counter that much of the value is higher in the stack (PyTorch, HF, vLLM), and if ROCm is “good enough” for those, Nvidia’s moat weakens, especially with better $/perf.
Debate over AMD’s strategy of mirroring CUDA via HIP: some see it as pragmatic, others as risky “yoking” to a competitor’s runtime.

Hardware, Chiplets, and Interconnects

Some see AMD’s chiplet + advanced packaging strength (HBM, 3D cache, APUs) as a long‑term cost and yield advantage vs. Nvidia’s large monolithic dies.
Counter‑arguments note Nvidia is already using advanced packaging and HBM in data‑center parts and has a major advantage with high‑speed interconnects (Mellanox/InfiniBand), though others expect emerging “Ultra Ethernet”‑style approaches to erode that.
AMD’s MI300X is praised for fitting a 70B model on a single GPU, where Nvidia needs two H100s, and for strong perf/$ potential if the software catches up.
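The single-GPU claim can be checked with a rough memory estimate. The sketch below is an illustration, not something from the thread: it assumes FP16 weights at 2 bytes per parameter (about 140 GB for 70B parameters; the thread's own napkin math uses ≈128 GB, which leads to the same conclusion), the published HBM capacities of 192 GB for MI300X and 80 GB for H100, and it ignores KV cache and activation memory, which only add to the footprint.

```python
import math

GB = 1e9  # decimal gigabytes, matching vendor capacity figures


def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone (no KV cache, no activations)."""
    return n_params * bytes_per_param / GB


def gpus_needed(footprint_gb: float, hbm_per_gpu_gb: float) -> int:
    """Smallest number of GPUs whose combined HBM can hold the weights."""
    return math.ceil(footprint_gb / hbm_per_gpu_gb)


# Assumptions: 70B parameters, FP16 (2 bytes each).
weights_gb = weight_footprint_gb(70e9, 2)

# Assumptions: published HBM capacities per GPU.
mi300x_gpus = gpus_needed(weights_gb, 192)  # MI300X: 192 GB
h100_gpus = gpus_needed(weights_gb, 80)     # H100: 80 GB

print(f"weights: {weights_gb:.0f} GB")
print(f"MI300X GPUs needed: {mi300x_gpus}")
print(f"H100 GPUs needed:   {h100_gpus}")
```

In practice the KV cache grows with batch size and sequence length, so real deployments need headroom beyond this weights-only estimate.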

LLM Inference Stacks: vLLM vs TensorRT-LLM / LMDeploy

vLLM is seen as easier to use and widely deployed, but several benchmarks show TensorRT-LLM and LMDeploy 2–3× faster in some settings.
Others report more modest gaps (≈10%) in internal tests and highlight that benchmark details (sequence lengths, batch sizes, quantization schemes, caching) can drastically change results.
GEMM tuning via AMD’s rocBLAS autotuning tool is credited for the gains in the article, but some wish for more technical detail.

Benchmark Results and Skepticism

Multiple commenters scrutinize the MI300X numbers, especially:
- Llama‑2‑70B, batch size 1, 256‑token prompt + 256‑token output reportedly completed in ~1.63s (~314 tok/s).
- Napkin math using model size (≈128 GB FP16) and HBM bandwidth (5.3 TB/s) suggests this should be impossible if the full weights are read once per generated token.
Concerns:
- Throughput figures appear to exceed theoretical bandwidth limits.
- 70B models seem only ~2× slower than 7B despite having roughly 10× the parameters, which doesn’t fit intuition.
- Some spreadsheet columns (throughput metrics) don’t fully add up.
Others note that Docker images and configs are published and encourage independent replication, but until then several label the results “sketchy” or at least unclear.
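The napkin math commenters describe can be reproduced in a few lines. This is a sketch using only the figures quoted in the thread (≈128 GB of FP16 weights, 5.3 TB/s HBM bandwidth, 256 + 256 tokens in ~1.63s), under the simplifying assumption that batch-size-1 decoding reads the full weights once per generated token and is limited purely by memory bandwidth. The variable names are illustrative.

```python
# Figures quoted in the discussion.
WEIGHT_BYTES = 128e9        # ~128 GB of FP16 weights
HBM_BYTES_PER_S = 5.3e12    # 5.3 TB/s HBM bandwidth
PROMPT_TOKENS = 256
OUTPUT_TOKENS = 256
REPORTED_SECONDS = 1.63

# Upper bound on decode speed if every generated token requires
# streaming all weights from HBM once (batch size 1, no speculation).
bound_tok_per_s = HBM_BYTES_PER_S / WEIGHT_BYTES

# Minimum time to generate the output tokens at that bound,
# ignoring prompt processing entirely.
min_decode_seconds = OUTPUT_TOKENS / bound_tok_per_s

# Rates implied by the reported wall-clock time.
reported_output_tok_per_s = OUTPUT_TOKENS / REPORTED_SECONDS
reported_total_tok_per_s = (PROMPT_TOKENS + OUTPUT_TOKENS) / REPORTED_SECONDS

print(f"bandwidth bound:       {bound_tok_per_s:.1f} tok/s")
print(f"min time for output:   {min_decode_seconds:.2f} s")
print(f"reported (output):     {reported_output_tok_per_s:.0f} tok/s")
print(f"reported (prompt+out): {reported_total_tok_per_s:.0f} tok/s")
```

Under these assumptions the bound is roughly 41 tok/s, so generating 256 tokens should take at least about 6 seconds, several times the reported 1.63s. The ~314 tok/s figure is what you get when prompt and output tokens are both counted against the reported time, which may explain part of the confusion about the throughput columns.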

Perceptions of AMD vs Nvidia Trajectory

Evidence cited that AMD is winning some large deals (hyperscalers, national labs), though Nvidia still massively dominates revenue and shipment volume.
View that if hardware $/perf is even ~10% better and software reaches ~90% of Nvidia, big customers will justify extra engineering to support AMD.
Strong skepticism from others who stress that closing the software gap likely requires sustained, massive investment and time.

Article Quality and Possible LLM Authorship

Several readers feel the blog post reads like LLM‑generated marketing: repetitive phrasing, generic “GPT‑isms,” and lack of specific technical insight.
Concern that heavy LLM drafting without clear verification undermines trust in both prose and technical claims.
Author‑adjacent comments suggest non‑native English plus LLM polishing as a possible explanation, and emphasize that the real value is in the released containers and data, not the prose.
