Testing AMD's Giant MI300X
Hardware performance and pricing
- Commenters see MI300X as very strong hardware, especially on memory bandwidth and LLM inference, with some specific weaknesses (e.g., global atomics).
- It appears competitive with NVIDIA’s H100 for inference, though less clearly so for training or against newer parts (H200/B100).
- Rumored pricing (~$15–20k, though actual prices are under NDA in practice) plus large VRAM (192 GB) leads many to think AMD can significantly undercut NVIDIA while remaining profitable.
CUDA moat and software ecosystem
- The dominant theme is that CUDA and NVIDIA’s software stack are the real moat.
- Many say CUDA’s maturity, stability, tooling, and huge ecosystem make switching costly; AMD’s repeated API resets in the past have damaged trust.
- For most ML users, high-level frameworks (PyTorch, JAX, TensorRT-LLM, vLLM) hide CUDA, but those frameworks themselves tend to be CUDA-first.
ROCm and AMD software quality
- ROCm is widely described as immature, flaky, or “spotty,” especially on consumer GPUs; it works better on carefully supported HPC setups.
- Some report successful training on AMD with PyTorch, others recount severe instability and poor tooling.
- Several posts argue AMD historically treated software as a cost center, with leadership heavily hardware-focused and defensive about software weaknesses.
Strategy and market positioning
- One view: AMD correctly targets hyperscalers and HPC, where customers can afford engineers to patch around software gaps and care most about TCO.
- Opposing view: ignoring hobbyists/academia and not flooding the market with cheaper, large-VRAM cards cedes mindshare and deepens CUDA lock-in.
- Commenters debate whether AMD’s financial history (near-bankruptcy a decade ago) justifies the slow software buildup.
Ecosystem, evangelism, and developer access
- Multiple commenters stress the need for better evangelism, easy trial access (e.g., “Colab-like” with minimal bureaucracy), and clear compatibility matrices.
- A hosting provider loaning MI300X systems for benchmarking and free compute is praised as a good way to bootstrap a developer flywheel.
- There is interest in CUDA-to-AMD translation layers (HIPIFY, ZLUDA) and portable models (SYCL, Triton, Modular), but support is still uneven.
Use cases, benchmarks, and limits
- LLM inference discussion centers on bandwidth-limited tokens/sec; a simple bandwidth-bound estimate gives ~37 tokens/s for LLaMA 3 70B on a single MI300X, with higher aggregate throughput under batching.
- Some explore potential uses in production rendering, unified CPU–GPU memory (MI300A APU), and home/desktop setups, but VRAM and cost remain major constraints.
- Several call for more realistic multi-GPU, high-batch, concurrent-request benchmarks for LLM inference.
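
The ~37 tokens/s figure for LLaMA 3 70B is consistent with a simple bandwidth-bound estimate. The sketch below is one way to arrive at a number of that size; it is not taken from the thread. It assumes FP16 weights (2 bytes per parameter), AMD's published peak memory bandwidth of 5.3 TB/s for MI300X, and that each decoding step reads every weight once. It ignores KV-cache traffic, compute limits, and the gap between peak and achievable bandwidth, so the result is best read as an upper bound. The function and constant names are hypothetical.

```python
# Bandwidth-bound decoding estimate (illustrative sketch, not a benchmark).
# Assumptions: FP16 weights, peak HBM bandwidth, weights read once per step.

MI300X_PEAK_BANDWIDTH_BYTES_PER_S = 5.3e12  # assumed: AMD's published peak, 5.3 TB/s
LLAMA3_70B_PARAMS = 70e9                    # nominal parameter count
FP16_BYTES_PER_PARAM = 2


def bandwidth_bound_tokens_per_s(
    params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Upper bound on aggregate tokens/s when decoding is memory-bound.

    Each decoding step streams all weights from memory once and produces
    one token per sequence in the batch, so aggregate throughput scales
    linearly with batch size in this idealized model.
    """
    weight_bytes = params * bytes_per_param
    steps_per_s = bandwidth_bytes_per_s / weight_bytes
    return steps_per_s * batch_size


if __name__ == "__main__":
    for batch in (1, 8):
        rate = bandwidth_bound_tokens_per_s(
            LLAMA3_70B_PARAMS,
            FP16_BYTES_PER_PARAM,
            MI300X_PEAK_BANDWIDTH_BYTES_PER_S,
            batch_size=batch,
        )
        print(f"batch={batch}: ~{rate:.1f} tokens/s (idealized upper bound)")
```

With a batch size of 1 this gives roughly 37.9 tokens/s. The linear scaling with batch size holds only while decoding stays memory-bound; at larger batches, KV-cache reads and compute become the limit, which is why commenters ask for realistic high-batch benchmarks.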