Apple's MLX adding CUDA support

What the PR Actually Does

  • Adds a CUDA backend for MLX, targeting Linux with CUDA 12 and SM 7.0+ GPUs.
  • It’s not CUDA on Apple Silicon, and not a reimplementation of the CUDA API.
  • Intended use: write MLX code on a Mac (Metal/Apple Silicon), run it on Nvidia clusters/supercomputers via CUDA.
  • Early testing shows that mlx-cuda wheels are already published (currently for Python 3.12 only).

Why This Matters

  • Makes MLX a more serious competitor to PyTorch/JAX by giving it access to mainstream Nvidia infrastructure.
  • Improves developer experience for Mac users: prototype locally on Apple hardware, deploy at scale on Nvidia.
  • Some speculate this could slightly increase overall AI capacity if it eases use of existing clusters.
  • Others stress this does not threaten Nvidia; abstraction layers typically still land on Nvidia GPUs in production, which reinforces Nvidia’s position.

Unified Memory & Performance Discussion

  • MLX leans on unified memory; CUDA’s “Unified Memory” is implemented via page migration and on-demand faulting, not physically shared RAM.
  • On Apple Silicon, CPU and GPU truly share physical memory; on most CUDA systems, data must still be moved, just hidden by the runtime.
  • Several commenters note that CUDA Unified Memory can cause severe page-fault stalls unless data is manually prefetched (e.g., via cudaMemPrefetchAsync), especially for ML training; performance is highly workload-dependent.
  • High-end Nvidia setups (Grace Hopper, NVLink, Jetson) offer tighter CPU–GPU memory integration, but behavior and speed still differ from Apple’s UMA.

Legal / IP and CUDA Compatibility

  • The thread repeatedly clarifies: this PR does not reimplement CUDA APIs, so copyright/API issues aren’t directly engaged.
  • Google v. Oracle is cited as important precedent for reimplementing APIs under fair use, but people caution that the ruling is narrow and legally nuanced.
  • Multiple comments emphasize that CUDA is an ecosystem (compilers, libraries, tools, debuggers, profilers), not “just an API”, and cloning it fully would be enormously difficult and expensive, even aside from IP questions.

Broader Ecosystem & Apple Strategy

  • Some hope this is a step toward MLX as a vendor-neutral layer; others see it simply as Apple making its stack usable in Nvidia-centric research environments.
  • There is frustration that open standards (OpenCL, Khronos) failed to counter CUDA, with some blame placed on Apple for abandoning OpenCL just as demand rose.
  • Debate continues over Apple’s AI strategy, the lack of Nvidia support on Macs, and whether Apple will ever ship datacenter hardware or Nvidia-based solutions; no consensus emerged, and the thread offers no concrete evidence either way.