CUDA Moat Still Alive
Local inference and consumer hardware
- Several commenters want benchmarks for consumer‑grade inference, e.g., llama.cpp/ollama on low‑power systems for Home Assistant (STT/TTS/LLM).
- Jetson‑class devices and the new Home Assistant Voice box are mentioned as promising, but typical showcase builds use power‑hungry GPUs (dual 3090s), seen as impractical for home use.
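The practicality argument above is largely about energy efficiency. As a rough illustration, setups can be compared on tokens per joule; the figures below are hypothetical placeholders, not measurements of any real system.

```python
# Rough tokens-per-joule comparison for local LLM inference setups.
# All numbers are hypothetical placeholders for illustration, not benchmarks.

def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: generated tokens per joule of power draw."""
    return tokens_per_second / watts

# Hypothetical: a dual-3090 rig vs. a low-power Jetson-class board.
dual_3090 = tokens_per_joule(tokens_per_second=60.0, watts=700.0)
jetson = tokens_per_joule(tokens_per_second=8.0, watts=25.0)

print(f"dual 3090s:   {dual_3090:.3f} tok/J")
print(f"Jetson-class: {jetson:.3f} tok/J")
```

With these placeholder numbers the low-power board wins on efficiency even while losing badly on raw throughput, which is the trade-off the Home Assistant commenters care about.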
ROCm, drivers, and AMD consumer GPU experience
- Multiple reports of ROCm on consumer AMD GPUs being painful or unusable, especially on Linux: limited officially supported SKUs, slow enablement, and frequent breakage across versions.
- Some note that newer high‑end cards (e.g., Navi 21/31) can work with PyTorch on Linux, but support lagged by years and remains fragile.
- Complaints focus on lack of forward compatibility, dropped support for relatively recent GPUs, and the perception that AMD’s software is “not serious” compared to CUDA.
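The kind of environment probing these reports describe can be sketched as a check of whether an installed PyTorch build targets ROCm at all: ROCm builds set `torch.version.hip` (it is `None` on CUDA/CPU builds) and expose HIP devices through the usual `torch.cuda` namespace. The guards below keep the sketch safe to run on any install, including machines without PyTorch.

```python
# Probe whether the installed PyTorch build targets ROCm/HIP.
# Safe on any machine: guards handle missing torch and non-ROCm builds.
import importlib.util

def rocm_status() -> str:
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    hip_version = getattr(torch.version, "hip", None)
    if hip_version is None:
        return "torch present, but not a ROCm build"
    # ROCm builds reuse the torch.cuda API for HIP devices.
    if torch.cuda.is_available():
        return f"ROCm {hip_version}, {torch.cuda.device_count()} device(s) visible"
    return f"ROCm {hip_version} build, but no usable GPU (unsupported SKU or driver?)"

print(rocm_status())
```

The last branch is the failure mode most complaints describe: a ROCm build installed correctly, yet no device usable because the specific consumer SKU is not on the supported list.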
Training on MI300X vs NVIDIA H100/H200
- The benchmark article is read as showing MI300X can be fast and cost‑competitive only after heavy tuning and direct help from AMD engineers over months.
- Many see this as unusable for most teams; hyperscalers may tolerate custom stacks, but smaller users won’t.
- Some argue this is still progress versus earlier crash‑prone states; others counter it’s far from production‑ready training.
CUDA moat vs hardware vs software
- Debate over whether CUDA’s advantage is mainly software ecosystem, PTX compatibility, and tooling, or superior NVIDIA hardware design.
- One side stresses NVIDIA’s long‑term binary and source compatibility and “just works” experience as the real moat.
- Another side claims AMD’s underlying hardware is weaker and that software alone can’t close the gap; others dispute that, pointing to matmul inefficiencies as fixable engineering, not fundamental limits.
GEMM/matmul performance and technical issues
- Many focus on AMD’s poor GEMM/matmul utilization versus theoretical peaks, especially compared to NVIDIA.
- Discussion covers algorithm selection heuristics, tiling, cache/SRAM use, and kernel tuning; NVIDIA is seen as having invested deeply in variants and heuristics, AMD far less.
- Recent ROCm BLAS tuning PRs are noted, but seen as years late given AI’s timeline.
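The utilization gap under discussion comes from simple FLOP accounting: a GEMM multiplying an M×K matrix by a K×N matrix costs 2·M·N·K floating-point operations, and dividing achieved FLOP/s by the hardware's theoretical peak gives the utilization figure commenters compare across vendors. A minimal sketch (the run time and peak below are illustrative placeholders, not vendor specs or measurements):

```python
# Compute GEMM utilization: achieved FLOP/s as a fraction of theoretical peak.
# The timing and peak figures used below are illustrative placeholders.

def gemm_utilization(m: int, n: int, k: int,
                     seconds: float, peak_tflops: float) -> float:
    flops = 2.0 * m * n * k              # multiply + add count for C = A @ B
    achieved_tflops = flops / seconds / 1e12
    return achieved_tflops / peak_tflops

# Hypothetical run: an 8192^3 GEMM finishing in 1.5 ms
# on an accelerator with a 1000 TFLOP/s theoretical peak.
util = gemm_utilization(8192, 8192, 8192, seconds=1.5e-3, peak_tflops=1000.0)
print(f"utilization: {util:.1%}")   # → utilization: 73.3%
```

Tiling, cache/SRAM blocking, and per-shape kernel selection are exactly the levers that move this ratio toward 1.0, which is why the thread treats them as engineering investment rather than a hardware limit.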
Ecosystem, networking, and standards
- AMD’s architectural split (CDNA vs RDNA), lack of a PTX‑like virtual ISA, fragmented drivers, and fast deprecation are criticized as strategic mistakes; there is mention of a planned unified “UDNA.”
- Networking debate: some call InfiniBand an established standard, while others say it's effectively a proprietary NVIDIA/Mellanox domain now; Ultra Ethernet is viewed either as sensible competition or needless reinvention. Its status remains unclear and contested.
Business, culture, and hiring
- Several see AMD’s problems as cultural and organizational: under‑investment in software, lack of internal hardware access for their own engineers, slow hiring pipelines, and management downplaying software issues.
- Others think AMD’s current focus is on large data‑center customers who can afford to co‑develop software; consumer/“out‑of‑box” quality is deprioritized.
- Financially, some see an opportunity if AMD fixes software; others predict NVIDIA will keep pulling ahead before AMD’s stack matures.