2024-06-25

Testing AMD's Giant MI300X


Hardware performance and pricing

Commenters see MI300X as very strong hardware, especially on memory bandwidth and LLM inference, with some specific weaknesses (e.g., global atomics).
It appears competitive with NVIDIA’s H100 for inference; commenters are less sure about training and about newer NVIDIA parts (H200/B100).
Rumored pricing of roughly $15–20k (actual prices are effectively under NDA), combined with 192 GB of VRAM, leads many to think AMD can significantly undercut NVIDIA while remaining profitable.

CUDA moat and software ecosystem

The dominant theme is that CUDA and NVIDIA’s software stack are the real moat.
Many say CUDA’s maturity, stability, tooling, and huge ecosystem make switching costly; AMD’s repeated API resets in the past have damaged trust.
For most ML users, high-level frameworks (PyTorch, JAX, TensorRT-LLM, vLLM) hide CUDA, but those frameworks themselves tend to be CUDA-first.

ROCm and AMD software quality

ROCm is widely described as immature, flaky, or “spotty,” especially on consumer GPUs; it works better on carefully supported HPC setups.
Some report successful training on AMD with PyTorch, while others recount severe instability and poor tooling.
Several posts argue AMD historically treated software as a cost center, with leadership heavily hardware-focused and defensive about software weaknesses.

Strategy and market positioning

One view: AMD correctly targets hyperscalers and HPC, where customers can afford engineers to patch around software gaps and care most about total cost of ownership (TCO).
Opposing view: ignoring hobbyists and academia, and not flooding the market with cheaper, large-VRAM cards, cedes mindshare and deepens CUDA lock-in.
Commenters also debate whether AMD’s financial history (near-bankruptcy about a decade ago) justifies the slow software buildup.

Ecosystem, evangelism, and developer access

Multiple commenters stress the need for better evangelism, easy trial access (e.g., “Colab-like” with minimal bureaucracy), and clear compatibility matrices.
A hosting provider loaning MI300X systems for benchmarking and free compute is praised as a good way to bootstrap a developer flywheel.
There is interest in CUDA-to-AMD translation layers (HIPIFY, ZLUDA) and portable programming models (SYCL, Triton, Modular), but support for them is still uneven.

Use cases, benchmarks, and limits

LLM inference discussion centers on bandwidth-limited tokens/sec; a simple bandwidth-bound estimate gives ~37 tokens/s for LLaMA 3 70B on a single MI300X, with higher aggregate throughput under batching.
Some explore potential in production rendering and unified CPU–GPU memory (MI300A APU), as well as home/desktop setups, but VRAM and cost remain major constraints.
Several call for more realistic multi-GPU, high-batch, concurrent-request benchmarks for LLM inference.
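
The bandwidth-bound estimate mentioned above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration and rests on assumptions not stated in the discussion summary: 16-bit weights (2 bytes per parameter), 70 billion parameters, and a peak memory bandwidth of 5.3 TB/s for the MI300X. It also ignores KV-cache traffic, compute limits, and achievable (as opposed to peak) bandwidth, so it should be read as an upper bound rather than a measured result. The function name is hypothetical.

```python
def bandwidth_bound_tokens_per_sec(params_billion, bytes_per_param,
                                   bandwidth_tb_s, batch_size=1):
    """Rough upper bound on decode throughput for a memory-bound LLM.

    Assumes every decode step streams all model weights from HBM once,
    and that one step yields one token per sequence in the batch.
    KV-cache reads and compute limits are deliberately ignored.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    steps_per_sec = (bandwidth_tb_s * 1e12) / weight_bytes
    return steps_per_sec * batch_size


# Assumed inputs: 70B parameters, FP16 weights, 5.3 TB/s peak bandwidth.
single = bandwidth_bound_tokens_per_sec(70, 2, 5.3)
batched = bandwidth_bound_tokens_per_sec(70, 2, 5.3, batch_size=8)

print(f"single stream: {single:.1f} tokens/s")
print(f"batch of 8 (aggregate, idealized): {batched:.1f} tokens/s")
```

Under these assumptions the single-stream figure lands between 37 and 38 tokens/s, in line with the estimate quoted in the discussion. The batched number scales linearly only because the model treats weight reads as the sole cost; in practice KV-cache traffic and compute eventually cap the gain, which is part of why commenters ask for high-batch benchmarks.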