Nvidia adds native Python support to CUDA

Scope of the Announcement and Existing Stack

  • Discussion clarifies that the current cuda-python package is mainly Cython bindings to the CUDA driver/runtime APIs (the style sketched after this list); the “native Python” story is really about newer pieces:
    • cuda-core (“Pythonic” CUDA runtime)
    • nvmath-python (Python access to NVIDIA’s math libraries)
    • Upcoming cuTile and a new Tile IR with driver-level JIT.
  • cuTile is described as Nvidia’s answer to OpenAI Triton: write GPU kernels in a Pythonic DSL that JITs to hardware-specific code.
  • Some argue the article is mostly marketing; others point to GTC talks and tweets showing genuinely new Python-first abstractions not yet fully released.
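  • For reference, a minimal sketch of the existing low-level binding style that the thread contrasts with the newer “Pythonic” layer (assuming the cuda-python driver bindings; the names and checks here are illustrative, not taken from the article):

      # Thin wrappers over the CUDA driver API: every call returns a status
      # code that the caller must check, much like the underlying C API.
      from cuda import cuda  # provided by the cuda-python package

      (err,) = cuda.cuInit(0)
      assert err == cuda.CUresult.CUDA_SUCCESS

      err, device = cuda.cuDeviceGet(0)
      assert err == cuda.CUresult.CUDA_SUCCESS

      err, name = cuda.cuDeviceGetName(128, device)
      assert err == cuda.CUresult.CUDA_SUCCESS
      print(name.split(b"\x00")[0].decode())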

Ease of Use, Demos, and Correct Benchmarking

  • One user’s CuPy demo (matrix add) shows ~4× GPU speedup over CPU, but others note:
    • It’s a toy microbenchmark, likely not representative.
    • Correct GPU timing should use CUDA event APIs rather than time.time() plus ad‑hoc synchronize() calls (see the sketch after this list).
    • Including host-to-device transfer time and avoiding unnecessary synchronization are both crucial for realistic benchmarks.
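  • A hedged sketch of the event-based timing the commenters recommend, using CuPy (array sizes and iteration counts are illustrative, not from the demo); host↔device transfers would be timed separately or included deliberately, per the point above:

      import cupy as cp

      a = cp.random.rand(4096, 4096, dtype=cp.float32)
      b = cp.random.rand(4096, 4096, dtype=cp.float32)

      _ = a + b                        # warm-up so compilation/allocation is not timed
      cp.cuda.Device().synchronize()

      start, stop = cp.cuda.Event(), cp.cuda.Event()
      start.record()
      for _ in range(100):             # enqueue many launches...
          c = a + b
      stop.record()
      stop.synchronize()               # ...and wait only once at the end

      ms = cp.cuda.get_elapsed_time(start, stop) / 100
      print(f"{ms:.3f} ms per add (device time only)")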

Asynchrony and Programming Model

  • Explanation that CUDA kernel launches are asynchronous and ordered via “streams”: you typically enqueue many operations and then synchronize once (illustrated after this list).
  • Several comments argue mapping GPU async to language-level async/await is a bad fit, because coroutines tend to encourage early synchronization and kill throughput.
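  • A small illustration of that model with CuPy streams (illustrative only): many asynchronous launches are queued in order, with a single synchronization point at the end.

      import cupy as cp

      stream = cp.cuda.Stream(non_blocking=True)
      x = cp.random.rand(1 << 20, dtype=cp.float32)

      with stream:
          # Each of these returns immediately; the work is ordered on `stream`.
          y = cp.sin(x)
          z = y * y
          total = z.sum()

      stream.synchronize()     # one sync point; calling float(total) earlier
      print(float(total))      # would have stalled the whole pipeline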

Relation to CuPy, Numba, JAX, Triton, etc.

  • CuPy, Numba, JAX, Taichi, Triton, tinygrad already enable Python-on-GPU in various forms.
  • New value is:
    • First-party Nvidia support and tighter integration (e.g., nvJitLink, Tile IR).
    • Python-first kernel authoring (cuTile) instead of C++-in-strings or external compilers (the current string-based pattern is sketched after this list).
  • Some want to see head‑to‑head benchmarks vs CuPy/JAX/Triton before getting excited.
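  • For contrast, the prevailing “C++-in-strings” pattern that cuTile aims to replace, shown here via CuPy’s RawKernel (the kernel and launch parameters are illustrative):

      import cupy as cp

      add = cp.RawKernel(r'''
      extern "C" __global__
      void add(const float* a, const float* b, float* out, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) out[i] = a[i] + b[i];
      }
      ''', 'add')

      n = 1 << 20
      a = cp.random.rand(n, dtype=cp.float32)
      b = cp.random.rand(n, dtype=cp.float32)
      out = cp.empty_like(a)

      threads = 256
      blocks = (n + threads - 1) // threads
      add((blocks,), (threads,), (a, b, out, cp.int32(n)))   # grid, block, args

      assert cp.allclose(out, a + b)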

Vendor Lock-in, AMD, and Portability

  • Concern that Tile IR widens the gap for reimplementations like ZLUDA and for AMD tooling, increasing Nvidia lock-in.
  • Others note AMD already has HIP, ROCm, and Triton support; their main problems are maturity, tooling, and delivery, not language bindings per se.
  • Question whether AMD could mirror the Python API; consensus is they could in theory, but historically haven’t executed well.

Rust, C, and Other Language Perspectives

  • Interest in Rust–CUDA (projects like rust-cuda, cudarc, Burn), but current support is seen as immature or fragile.
  • Debate over CUDA’s C++-centric design; some wish for a strict C variant for simpler interop.
  • Separate thread on shader languages like Slang as a candidate for general GPU compute.

Python’s Role and Broader Reflections

  • Many see this as further cementing Python as the “lingua franca” for numeric and ML work.
  • Side discussion on why Python dominates (ecosystem, ML/AI, teaching) vs its downsides (performance, packaging, dynamic typing).
  • Some hope for more general CPU–GPU abstractions (e.g., Modular’s Mojo); others argue CPUs and GPUs are too different for a truly unified model.