Run CUDA, unmodified, on AMD GPUs
Project scope & promise
- SCALE is a proprietary compiler that takes unmodified CUDA C++ (including host APIs and many device features) and targets AMD GPUs (targets such as gfx900, gfx10xx, and gfx11xx, i.e. roughly Vega- through RDNA-generation parts).
- It is source-to-target compilation, not “emulation” or binary translation; behaves like a drop-in nvcc replacement for many projects.
- Inline PTX is handled by translating PTX blocks into LLVM IR early, then compiling forward to AMD code; this avoids writing AMD asm directly and lets optimizations apply.
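For context, the kind of source the bullets above refer to looks roughly like the sketch below: ordinary CUDA C++ with a small inline PTX block. The names `ptx_add` and `add_kernel` are hypothetical, chosen only for illustration, and whether SCALE accepts this exact snippet has not been verified here; it is simply standard CUDA of the sort an nvcc-compatible compiler would be expected to take as-is.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device helper using inline PTX: a 32-bit signed add (add.s32).
// A compiler that lowers PTX to LLVM IR would see this as a plain integer add.
__device__ int ptx_add(int a, int b) {
    int result;
    asm("add.s32 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}

// Hypothetical kernel: element-wise sum of two arrays via the PTX helper.
__global__ void add_kernel(const int* x, const int* y, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = ptx_add(x[i], y[i]);
    }
}

int main() {
    const int n = 256;
    int hx[n], hy[n], hout[n];
    for (int i = 0; i < n; ++i) {
        hx[i] = i;
        hy[i] = 2 * i;
    }

    int *dx = nullptr, *dy = nullptr, *dout = nullptr;
    cudaMalloc(&dx, n * sizeof(int));
    cudaMalloc(&dy, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));
    cudaMemcpy(dx, hx, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(int), cudaMemcpyHostToDevice);

    add_kernel<<<(n + 127) / 128, 128>>>(dx, dy, dout, n);

    // The blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(hout, dout, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Each element should equal i + 2*i.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hout[i] != 3 * i) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dout);
    return mismatches == 0 ? 0 : 1;
}
```

The inline `asm` statement is the part that makes such code awkward for source-level porting tools, which is why the PTX-to-LLVM-IR approach drew attention in the discussion.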
Current limitations & open questions
- Tensor-core / MMA, TMA, advanced matrix ops, and tensor-heavy kernels (e.g., FlashAttention) are not fully supported yet or are in active development; performance will lag Nvidia where hardware is weaker.
- Some CUDA libraries/APIs are missing or partial (e.g., cuBLASLt, NVTX, some 128‑bit atomics, bfloat16 headers; cuDNN wrappers not clearly feature‑complete).
- Behavior with complex, hardware-tuned CUDA kernels, inline PTX tricks, and NCCL / multi‑GPU comms is unclear or expected to be more work.
- Benchmarks are not yet published; some users report early tests revealing gaps vs. existing HIP/ROCm paths.
Legal and IP concerns
- Authors claim a clean-room implementation based on public APIs and trial‑and‑error with open CUDA code.
- Commenters debate whether Nvidia could still litigate (e.g., via SDK EULAs or discovery pressure); others note that API reimplementation and wrapper libraries around ROCm should be legally safer.
- cuDNN/cuBLAS EULAs restrict use to Nvidia GPUs, but SCALE does not ship or run those binaries; it reimplements APIs or forwards to AMD libraries.
Open source vs. proprietary
- Many commenters want this to be FOSS for longevity, auditability, and ecosystem health; suggestions include “delayed open source.”
- Others argue proprietary is reasonable given potential value (e.g., to AMD/Intel, or via acquisition).
- Comparisons with ZLUDA (open-source PTX/CUDA-on-AMD) arise; ZLUDA lacks key deep-learning libraries, while SCALE is also incomplete but is seen as moving faster and being more integrated.
AMD vs. Nvidia & ecosystem strategy
- Strong sentiment that AMD underinvested in software (ROCm, HIP, MIOpen, tooling), ceding AI to Nvidia’s CUDA ecosystem.
- Some think AMD should back projects like SCALE; others argue AMD should instead push open standards (OpenCL, SYCL, “raw C++” on GPUs) rather than deepen CUDA’s dominance.
- Skepticism that any compatibility layer can fully match Nvidia’s rapidly evolving, tightly integrated stack (CUDA + cuDNN/cuBLAS + NCCL + networking + systems).