Testing AMD's Giant MI300X

Hardware performance and pricing

  • Commenters see MI300X as very strong hardware, especially on memory bandwidth and LLM inference, with some specific weaknesses (e.g., global atomics).
  • It appears competitive with NVIDIA’s H100 for inference, though not clearly for training or versus newer parts (H200/B100).
  • Rumored pricing of ~$15–20k (actual prices are typically under NDA), combined with large VRAM (192 GB), leads many to think AMD can significantly undercut NVIDIA while remaining profitable.

CUDA moat and software ecosystem

  • The dominant theme is that CUDA and NVIDIA’s software stack are the real moat.
  • Many say CUDA’s maturity, stability, tooling, and huge ecosystem make switching costly; AMD’s repeated API resets in the past have damaged trust.
  • For most ML users, high-level frameworks (PyTorch, JAX, TensorRT-LLM, vLLM) hide CUDA, but those frameworks themselves tend to be CUDA-first.

ROCm and AMD software quality

  • ROCm is widely described as immature, flaky, or “spotty,” especially on consumer GPUs; it works better on carefully supported HPC setups.
  • Some report successful training on AMD with PyTorch, others recount severe instability and poor tooling.
  • Several posts argue AMD historically treated software as a cost center, with leadership heavily hardware-focused and defensive about software weaknesses.

Strategy and market positioning

  • One view: AMD correctly targets hyperscalers and HPC, where customers can afford engineers to patch around software gaps and care most about TCO.
  • Opposing view: ignoring hobbyists/academia and not flooding the market with cheaper, large-VRAM cards cedes mindshare and deepens CUDA lock-in.
  • Commenters debate whether AMD’s financial history (near-bankruptcy a decade ago) justifies the slow software buildup.

Ecosystem, evangelism, and developer access

  • Multiple commenters stress the need for better evangelism, easy trial access (e.g., “Colab-like” with minimal bureaucracy), and clear compatibility matrices.
  • A hosting provider loaning MI300X systems for benchmarking and free compute is praised as a good way to bootstrap a developer flywheel.
  • There is interest in CUDA-to-AMD translation layers (HIPIFY, ZLUDA) and portable models (SYCL, Triton, Modular), but support is still uneven.

Use cases, benchmarks, and limits

  • LLM inference discussion centers on bandwidth-limited tokens/sec: a simple bandwidth-bound model gives ~37 tokens/s for LLaMA 3 70B on a single MI300X, with higher aggregate throughput under batching.
  • Some explore potential in production rendering and unified CPU–GPU memory (MI300A APU), as well as home/desktop setups, but VRAM and cost remain major constraints.
  • Several call for more realistic multi-GPU, high-batch, concurrent-request benchmarks for LLM inference.
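
The bandwidth-limited estimate mentioned above can be sketched in a few lines. This is a back-of-the-envelope illustration, not the commenters' exact calculation: it assumes FP16 weights (2 bytes per parameter), the MI300X's advertised peak memory bandwidth of 5.3 TB/s, and that every generated token requires reading all weights once. It ignores KV-cache traffic, kernel overhead, and compute limits, and all function names are hypothetical.

```python
# Back-of-the-envelope, bandwidth-bound decode estimate.
# Assumptions (not from the discussion): FP16 weights, peak bandwidth fully
# achieved, KV-cache reads and compute limits ignored.

MI300X_BANDWIDTH_TB_S = 5.3   # advertised peak HBM bandwidth
MI300X_VRAM_GB = 192          # capacity cited in the discussion


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Size of the model weights in GB (1e9 params * bytes each)."""
    return params_billion * bytes_per_param


def tokens_per_sec(params_billion: float,
                   bytes_per_param: float = 2.0,
                   bandwidth_tb_s: float = MI300X_BANDWIDTH_TB_S,
                   batch_size: int = 1) -> float:
    """Upper bound on decode throughput when memory bandwidth is the limit.

    Each decode step streams all weights once and yields one token per
    sequence in the batch, so aggregate throughput scales with batch size
    (only until compute or KV-cache traffic becomes the bottleneck, which
    this sketch does not model).
    """
    steps_per_sec = (bandwidth_tb_s * 1000.0) / weights_gb(params_billion, bytes_per_param)
    return steps_per_sec * batch_size


if __name__ == "__main__":
    size = weights_gb(70, 2.0)
    print(f"LLaMA 3 70B FP16 weights: {size:.0f} GB "
          f"(fits in {MI300X_VRAM_GB} GB: {size <= MI300X_VRAM_GB})")
    print(f"batch=1: {tokens_per_sec(70):.1f} tokens/s")
    print(f"batch=8: {tokens_per_sec(70, batch_size=8):.1f} tokens/s (idealized)")
```

Under these assumptions a single stream lands in the high 30s of tokens per second, in line with the ~37 tokens/s figure quoted in the thread. The batched number is an idealized ceiling, which is part of why commenters ask for measured high-batch, concurrent-request benchmarks.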