Run CUDA, unmodified, on AMD GPUs

Project scope & promise

  • SCALE is a proprietary compiler that takes unmodified CUDA C++ (including host APIs and many device features) and targets AMD GPUs (architectures such as gfx900, gfx10xx, and gfx11xx, which span the Vega and RDNA generations).
  • It performs source-to-target compilation rather than “emulation” or binary translation, and for many projects it behaves like a drop‑in nvcc replacement.
  • Inline PTX is handled by translating PTX blocks into LLVM IR early in the pipeline, then compiling forward to AMD code; this avoids hand-writing AMD assembly and lets the usual optimizations apply.

Current limitations & open questions

  • Tensor-core / MMA, TMA, advanced matrix ops, and tensor-heavy kernels (e.g., FlashAttention) are either not fully supported or still in active development; performance is expected to lag Nvidia wherever the AMD hardware is weaker.
  • Some CUDA libraries/APIs are missing or partial (e.g., cuBLASLt, NVTX, some 128‑bit atomics, bfloat16 headers; cuDNN wrappers not clearly feature‑complete).
  • Behavior with complex, hardware-tuned CUDA kernels, inline PTX tricks, and NCCL / multi‑GPU comms is unclear, and these cases are expected to need more work.
  • No benchmarks have been published yet; some users report that early tests reveal gaps vs. existing HIP/ROCm paths.

Legal and IP concerns

  • Authors claim a clean-room implementation based on public APIs and trial‑and‑error with open CUDA code.
  • Commenters debate whether Nvidia could still litigate (e.g., via SDK EULAs or discovery pressure); others note that reimplementing APIs and writing wrapper libraries around ROCm should be legally safer.
  • cuDNN/cuBLAS EULAs restrict use to Nvidia GPUs, but SCALE does not ship or run those binaries; it reimplements APIs or forwards to AMD libraries.
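
The "reimplement the API, forward to an AMD library" pattern mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not SCALE's actual code: `shimSaxpy`, `shimStatus_t`, and the `vendor::saxpy` backend are hypothetical names invented for this example, and the backend runs on the CPU so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a vendor (AMD-side) library routine.
// A real backend would launch GPU work; this one runs on the CPU.
namespace vendor {
enum class Status { Ok, InvalidValue };

// Computes y = alpha * x + y over n elements.
Status saxpy(std::size_t n, float alpha, const float* x, float* y) {
    if (x == nullptr || y == nullptr) return Status::InvalidValue;
    for (std::size_t i = 0; i < n; ++i) y[i] = alpha * x[i] + y[i];
    return Status::Ok;
}
}  // namespace vendor

// Hypothetical CUDA-style status codes that existing callers would check.
enum shimStatus_t { SHIM_STATUS_SUCCESS = 0, SHIM_STATUS_INVALID_VALUE = 1 };

// Translate backend status codes into the codes callers expect.
static shimStatus_t toShimStatus(vendor::Status s) {
    return s == vendor::Status::Ok ? SHIM_STATUS_SUCCESS
                                   : SHIM_STATUS_INVALID_VALUE;
}

// Hypothetical CUDA-style entry point: it keeps the call shape that
// existing source code already uses, validates the arguments, and
// forwards the work to the vendor backend. No original binary is involved.
shimStatus_t shimSaxpy(int n, const float* alpha, const float* x, float* y) {
    if (n < 0 || alpha == nullptr) return SHIM_STATUS_INVALID_VALUE;
    return toShimStatus(
        vendor::saxpy(static_cast<std::size_t>(n), *alpha, x, y));
}
```

The point of the design is that calling code only ever sees the shim's signature and status codes, so the library behind it can be swapped without touching that code; whether a given wrapper is feature-complete is a separate question.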

Open source vs. proprietary

  • Many commenters want this to be FOSS for longevity, auditability, and ecosystem health; suggestions include “delayed open source.”
  • Others argue proprietary is reasonable given potential value (e.g., to AMD/Intel, or via acquisition).
  • Comparisons with ZLUDA (open-source PTX/CUDA-on-AMD) come up: ZLUDA lacks key deep‑learning libraries, while SCALE is also incomplete but is seen as moving faster and being more integrated.

AMD vs. Nvidia & ecosystem strategy

  • Strong sentiment that AMD underinvested in software (ROCm, HIP, MIOpen, tooling), ceding AI to Nvidia’s CUDA ecosystem.
  • Some think AMD should back projects like SCALE; others argue AMD should instead push open standards (OpenCL, SYCL, “raw C++” on GPUs) rather than deepen CUDA’s dominance.
  • Skepticism that any compatibility layer can fully match Nvidia’s rapidly evolving, tightly integrated stack (CUDA + cuDNN/cuBLAS + NCCL + networking + systems).