Nvidia adds native Python support to CUDA

Scope of the Announcement and Existing Stack

  • Discussion clarifies that the current cuda-python package is mainly Cython bindings to the CUDA driver/runtime APIs (the style sketched after this list); the “native Python” story is really about newer pieces:
    • cuda-core (“Pythonic” CUDA runtime)
    • nvmath-python (Python access to NVIDIA’s math libraries)
    • Upcoming cuTile and a new Tile IR with driver-level JIT.
  • cuTile is described as Nvidia’s answer to OpenAI Triton: write GPU kernels in a Pythonic DSL that JITs to hardware-specific code.
  • Some argue the article is mostly marketing; others point to GTC talks and tweets showing genuinely new Python-first abstractions not yet fully released.
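  • For reference, a minimal sketch of the existing low-level binding style that the thread contrasts with the newer “Pythonic” layer (assuming the cuda-python driver bindings; the names and checks here are illustrative, not taken from the article):

      # Thin wrappers over the CUDA driver API: every call returns a status
      # code that the caller must check, much like the underlying C API.
      from cuda import cuda  # provided by the cuda-python package

      (err,) = cuda.cuInit(0)
      assert err == cuda.CUresult.CUDA_SUCCESS

      err, device = cuda.cuDeviceGet(0)
      assert err == cuda.CUresult.CUDA_SUCCESS

      err, name = cuda.cuDeviceGetName(128, device)
      assert err == cuda.CUresult.CUDA_SUCCESS
      print(name.split(b"\x00")[0].decode())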

Ease of Use, Demos, and Correct Benchmarking

  • One user’s CuPy demo (matrix add) shows ~4× GPU speedup over CPU, but others note:
    • It’s a toy microbenchmark, likely not representative.
    • Correct GPU timing should use CUDA event APIs rather than time.time() plus ad‑hoc synchronize() calls (see the sketch after this list).
    • Including host-to-device transfer time and avoiding unnecessary synchronization are both crucial for realistic benchmarks.
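  • A hedged sketch of the event-based timing the commenters recommend, using CuPy (array sizes and iteration counts are illustrative, not from the demo); host↔device transfers would be timed separately or included deliberately, per the point above:

      import cupy as cp

      a = cp.random.rand(4096, 4096, dtype=cp.float32)
      b = cp.random.rand(4096, 4096, dtype=cp.float32)

      _ = a + b                        # warm-up so compilation/allocation is not timed
      cp.cuda.Device().synchronize()

      start, stop = cp.cuda.Event(), cp.cuda.Event()
      start.record()
      for _ in range(100):             # enqueue many launches...
          c = a + b
      stop.record()
      stop.synchronize()               # ...and wait only once at the end

      ms = cp.cuda.get_elapsed_time(start, stop) / 100
      print(f"{ms:.3f} ms per add (device time only)")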

Asynchrony and Programming Model

  • Explanation that CUDA kernel launches are asynchronous and ordered via “streams”: you typically enqueue many operations and then synchronize once (illustrated after this list).
  • Several comments argue mapping GPU async to language-level async/await is a bad fit, because coroutines tend to encourage early synchronization and kill throughput.
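  • A small illustration of that model with CuPy streams (illustrative only): many asynchronous launches are queued in order, with a single synchronization point at the end.

      import cupy as cp

      stream = cp.cuda.Stream(non_blocking=True)
      x = cp.random.rand(1 << 20, dtype=cp.float32)

      with stream:
          # Each of these returns immediately; the work is ordered on `stream`.
          y = cp.sin(x)
          z = y * y
          total = z.sum()

      stream.synchronize()     # one sync point; calling float(total) earlier
      print(float(total))      # would have stalled the whole pipeline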

Relation to CuPy, Numba, JAX, Triton, etc.

  • CuPy, Numba, JAX, Taichi, Triton, tinygrad already enable Python-on-GPU in various forms.
  • New value is:
    • First-party Nvidia support and tighter integration (e.g., nvJitLink, Tile IR).
    • Python-first kernel authoring (cuTile) instead of C++-in-strings or external compilers (the current string-based pattern is sketched after this list).
  • Some want to see head‑to‑head benchmarks vs CuPy/JAX/Triton before getting excited.
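  • For contrast, the prevailing “C++-in-strings” pattern that cuTile aims to replace, shown here via CuPy’s RawKernel (the kernel and launch parameters are illustrative):

      import cupy as cp

      add = cp.RawKernel(r'''
      extern "C" __global__
      void add(const float* a, const float* b, float* out, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) out[i] = a[i] + b[i];
      }
      ''', 'add')

      n = 1 << 20
      a = cp.random.rand(n, dtype=cp.float32)
      b = cp.random.rand(n, dtype=cp.float32)
      out = cp.empty_like(a)

      threads = 256
      blocks = (n + threads - 1) // threads
      add((blocks,), (threads,), (a, b, out, cp.int32(n)))   # grid, block, args

      assert cp.allclose(out, a + b)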

Vendor Lock-in, AMD, and Portability

  • Concern that Tile IR widens the gap for reimplementations like ZLUDA and for AMD tooling, increasing Nvidia lock-in.
  • Others note AMD already has HIP, ROCm, and Triton support; their main problems are maturity, tooling, and delivery, not language bindings per se.
  • Question whether AMD could mirror the Python API; consensus is they could in theory, but historically haven’t executed well.

Rust, C, and Other Language Perspectives

  • Interest in Rust–CUDA (projects like rust-cuda, cudarc, Burn), but current support is seen as immature or fragile.
  • Debate over CUDA’s C++-centric design; some wish for a strict C variant for simpler interop.
  • Separate thread on shader languages like Slang as a candidate for general GPU compute.

Python’s Role and Broader Reflections

  • Many see this as further cementing Python as the “lingua franca” for numeric and ML work.
  • Side discussion on why Python dominates (ecosystem, ML/AI, teaching) vs its downsides (performance, packaging, dynamic typing).
  • Some hope for more general CPU–GPU abstractions (e.g., Modular’s Mojo); others argue CPUs and GPUs are too different for a truly unified model.