2024-07-15

Run CUDA, unmodified, on AMD GPUs

Project scope & promise

SCALE is a proprietary compiler that takes unmodified CUDA C++ (including host APIs and many device features) and targets AMD GPUs, with targets such as gfx900 (Vega), gfx10xx, and gfx11xx (RDNA).
It compiles CUDA source directly to AMD targets rather than using “emulation” or binary translation, and it behaves like a drop‑in nvcc replacement for many projects.
Inline PTX is handled by translating PTX blocks into LLVM IR early in compilation, then compiling that IR to AMD code; this avoids hand-writing AMD assembly and lets the compiler's optimizations apply to the translated code.
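The PTX-to-IR step described above can be illustrated with a toy sketch. It is not SCALE's implementation: `lower_ptx_line` is a hypothetical helper, the opcode table covers only three 32-bit integer instructions, and the output is LLVM-IR-style text rather than IR built through the LLVM API. It assumes inline-asm operands of the form %0, %1, %2 and renames them to %r0, %r1, %r2.

```cpp
#include <map>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Toy lowering of one PTX-style instruction to LLVM-IR-style text.
// Hypothetical helper for illustration only; a real compiler handles
// types, predicates, memory spaces, and far more opcodes.
std::optional<std::string> lower_ptx_line(const std::string& ptx) {
    // PTX mnemonic -> LLVM IR opcode (32-bit integer arithmetic only).
    static const std::map<std::string, std::string> ops = {
        {"add.s32", "add"}, {"sub.s32", "sub"}, {"mul.lo.s32", "mul"}};

    // Turn ',' and ';' into spaces so a stream can tokenize the line.
    std::string cleaned;
    for (char c : ptx) cleaned += (c == ',' || c == ';') ? ' ' : c;

    std::istringstream in(cleaned);
    std::string opcode;
    in >> opcode;
    std::vector<std::string> operands;
    for (std::string tok; in >> tok;) operands.push_back(tok);

    // Reject anything outside the tiny supported subset.
    auto it = ops.find(opcode);
    if (it == ops.end() || operands.size() != 3) return std::nullopt;

    // Map placeholders such as %0 to named IR values such as %r0.
    std::vector<std::string> vals;
    for (const auto& o : operands) {
        if (o.size() < 2 || o[0] != '%') return std::nullopt;
        vals.push_back("%r" + o.substr(1));
    }
    return vals[0] + " = " + it->second + " i32 " + vals[1] + ", " + vals[2];
}
```

Calling `lower_ptx_line("add.s32 %0, %1, %2;")` yields the string `%r0 = add i32 %r1, %r2`, while an unsupported instruction returns an empty optional. Once the PTX is in IR form, it can flow through the same optimization passes as the surrounding code, which is the benefit the summary points to.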

Current limitations & open questions

Tensor-core / MMA, TMA, advanced matrix ops, and tensor-heavy kernels (e.g., FlashAttention) are not fully supported yet or are in active development; performance is expected to lag Nvidia where AMD hardware is weaker.
Some CUDA libraries/APIs are missing or partial (e.g., cuBLASLt, NVTX, some 128‑bit atomics, bfloat16 headers; cuDNN wrappers not clearly feature‑complete).
Behavior with complex, hardware-tuned CUDA kernels, inline PTX tricks, and NCCL / multi‑GPU comms is unclear or expected to need more work.
Benchmarks are not yet published; some users report early tests revealing gaps vs. existing HIP/ROCm paths.

Legal and IP concerns

Authors claim a clean-room implementation based on public APIs and trial‑and‑error with open CUDA code.
Commenters debate whether Nvidia could still litigate (e.g., via SDK EULAs or discovery pressure); others note that API reimplementation and wrapper libraries around ROCm should be legally safer.
cuDNN/cuBLAS EULAs restrict use to Nvidia GPUs, but SCALE does not ship or run those binaries; it reimplements APIs or forwards to AMD libraries.
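The "reimplement the API, forward to another library" pattern mentioned above can be sketched in plain C++. Everything here is hypothetical: `shim::saxpy` stands in for an entry point an application was written against, and `backend::saxpy` stands in for a vendor routine on the target platform (implemented on the CPU here so the example runs anywhere). No real CUDA or ROCm signatures are reproduced.

```cpp
// Hypothetical stand-in for a vendor BLAS routine on the target platform.
// A real shim would call into an actual GPU library here instead.
namespace backend {
enum class Status { Ok, InvalidValue };

Status saxpy(int n, float alpha, const float* x, float* y) {
    if (n < 0 || !x || !y) return Status::InvalidValue;
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];  // y = alpha * x + y
    return Status::Ok;
}
}  // namespace backend

// Hypothetical shim: exposes the entry point the application expects and
// forwards to the backend, translating arguments and status codes.
namespace shim {
enum Status { SUCCESS = 0, INVALID_VALUE = 1 };

Status saxpy(int n, const float* alpha, const float* x, float* y) {
    if (!alpha) return INVALID_VALUE;  // shim-level argument check
    backend::Status s = backend::saxpy(n, *alpha, x, y);
    return s == backend::Status::Ok ? SUCCESS : INVALID_VALUE;
}
}  // namespace shim
```

The application links against the shim's symbols only; none of the original vendor's binaries are shipped or executed, which is the distinction the EULA argument rests on.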

Open source vs. proprietary

Many commenters want this to be FOSS for longevity, auditability, and ecosystem health; suggestions include “delayed open source.”
Others argue proprietary is reasonable given potential value (e.g., to AMD/Intel, or via acquisition).
Comparison with ZLUDA (open-source PTX/CUDA-on-AMD) arises; ZLUDA lacks key deep‑learning libraries, while SCALE is also incomplete but moving faster and more tightly integrated.

AMD vs. Nvidia & ecosystem strategy

Strong sentiment that AMD underinvested in software (ROCm, HIP, MIOpen, tooling), ceding AI to Nvidia’s CUDA ecosystem.
Some think AMD should back projects like SCALE; others argue AMD should push open standards (OpenCL, SYCL, “raw C++” on GPUs) rather than deepen CUDA’s dominance.
Skepticism that any compatibility layer can fully match Nvidia’s rapidly evolving, tightly integrated stack (CUDA + cuDNN/cuBLAS + NCCL + networking + systems).

Related topics