Nvidia adds native Python support to CUDA
Scope of the Announcement and Existing Stack
- Discussion clarifies that the current `cuda-python` package is mainly Cython bindings to the CUDA runtime/CUB; the "native Python" story is really about newer pieces: `cuda-core` (a "Pythonic" CUDA runtime), `nvmath-python` (NVMath), and the upcoming cuTile plus a new Tile IR with driver-level JIT.
- cuTile is described as Nvidia’s answer to OpenAI Triton: write GPU kernels in a Pythonic DSL that JITs to hardware-specific code.
- Some argue the article is mostly marketing; others point to GTC talks and tweets showing genuinely new Python-first abstractions not yet fully released.
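cuTile itself is not yet public, so no real example exists; for comparison, the Triton style it is being measured against looks roughly like the sketch below (a standard Triton vector-add kernel, requiring a CUDA GPU and the `torch`/`triton` packages):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile of the vectors.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                      # guard the ragged final tile
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 1 << 20
x = torch.rand(n, device="cuda")
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(n, 1024),)           # one program per 1024-element tile
add_kernel[grid](x, y, out, n, BLOCK=1024)
assert torch.allclose(out, x + y)
```

The appeal of this model, and presumably of cuTile, is that the kernel body is plain Python that a driver-level JIT specializes per GPU, rather than C++ source embedded in strings.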
Ease of Use, Demos, and Correct Benchmarking
- One user’s CuPy demo (matrix add) shows ~4× GPU speedup over CPU, but others note:
- It’s a toy microbenchmark, likely not representative.
- Correct GPU timing should use CUDA event APIs, not `time.time()` plus ad-hoc `synchronize()`.
- Including data transfer time and avoiding unnecessary synchronization is crucial for realistic benchmarks.
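The event-based timing the commenters recommend can be sketched with CuPy as follows (a minimal harness, assuming a CUDA GPU and the `cupy` package; `gpu_time` is a hypothetical helper name):

```python
import cupy as cp

def gpu_time(fn, *args, n_iters=100):
    """Time a GPU operation with CUDA events, i.e. on the device's own clock."""
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    fn(*args)                            # warm-up: JIT/allocation stays outside timing
    cp.cuda.Stream.null.synchronize()
    start.record()
    for _ in range(n_iters):
        fn(*args)                        # launches are async; nothing blocks here
    end.record()
    end.synchronize()                    # wait once, at the very end
    return cp.cuda.get_elapsed_time(start, end) / n_iters  # milliseconds

a = cp.random.random((4096, 4096), dtype=cp.float32)
b = cp.random.random((4096, 4096), dtype=cp.float32)
print(f"matrix add: {gpu_time(cp.add, a, b):.3f} ms/iter")
```

Timing with `time.time()` around an async launch measures only the enqueue cost unless you synchronize, and synchronizing inside the loop measures launch latency rather than throughput; events avoid both traps.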
Asynchrony and Programming Model
- Explanation that CUDA launches are asynchronous and ordered via “streams”; you typically enqueue many operations then synchronize once.
- Several comments argue mapping GPU async to language-level `async`/`await` is a bad fit, because coroutines tend to encourage early synchronization and kill throughput.
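The enqueue-many-then-sync-once pattern looks like this in CuPy (a sketch assuming a CUDA GPU and the `cupy` package):

```python
import cupy as cp

x = cp.random.random(1 << 20, dtype=cp.float32)
stream = cp.cuda.Stream(non_blocking=True)
with stream:
    # Every call below only *enqueues* work on the stream; the host
    # thread never blocks, and the GPU runs the ops in stream order.
    y = cp.sin(x)
    y = y * 2.0
    total = cp.sum(y)        # still a device-side handle, not a Python float
stream.synchronize()         # one synchronization, after the whole pipeline
print(float(total))          # the device-to-host copy happens only here
```

An `await` on each operation would instead force a round-trip to the host between launches, which is exactly the early synchronization the comments warn against.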
Relation to CuPy, Numba, JAX, Triton, etc.
- CuPy, Numba, JAX, Taichi, Triton, tinygrad already enable Python-on-GPU in various forms.
- New value is:
- First-party Nvidia support and tighter integration (e.g., nvJitLink, Tile IR).
- Python-first kernel authoring (cuTile) instead of C++-in-strings or external compilers.
- Some want to see head‑to‑head benchmarks vs CuPy/JAX/Triton before getting excited.
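As a point of reference for what already exists, Numba has long allowed CUDA kernels written directly in Python (a minimal sketch, assuming a CUDA GPU and the `numba` package):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)              # global thread index across the whole grid
    if i < out.size:              # guard against the ragged final block
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.empty_like(a)
threads = 256
blocks = (n + threads - 1) // threads
add_kernel[blocks, threads](a, b, out)   # Numba copies NumPy arrays to/from device
assert np.allclose(out, a + b)
```

The open question raised in the thread is whether Nvidia's first-party stack will beat this kind of third-party tooling on performance, not just on integration.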
Vendor Lock-in, AMD, and Portability
- Concern that Tile IR widens the gap for reimplementations like ZLUDA and for AMD tooling, increasing Nvidia lock-in.
- Others note AMD already has HIP, ROCm, and Triton support; their main problems are maturity, tooling, and delivery, not language bindings per se.
- Question whether AMD could mirror the Python API; consensus is they could in theory, but historically haven’t executed well.
Rust, C, and Other Language Perspectives
- Interest in Rust–CUDA (projects like `rust-cuda`, `cudarc`, Burn), but current support is seen as immature or fragile.
- Debate over CUDA's C++-centric design; some wish for a strict C variant for simpler interop.
- Separate thread on shader languages like Slang as a candidate for general GPU compute.
Python’s Role and Broader Reflections
- Many see this as further cementing Python as the “lingua franca” for numeric and ML work.
- Side discussion on why Python dominates (ecosystem, ML/AI, teaching) vs its downsides (performance, packaging, dynamic typing).
- Some hope for more general CPU–GPU abstractions (e.g., Mojo, Modular); others argue CPUs and GPUs are too different for a truly unified model.