CuPy: NumPy and SciPy for GPU

CuPy as (Almost) Drop‑in NumPy/SciPy for GPU

  • Several users report big speedups with minimal code changes, sometimes by literally writing import cupy as np.
  • Examples: radar signal processing from ~30s to ~1s; quantum eigenvalue computations far faster than optimized MKL; some users claim ~1000× speedups for heavy linear algebra / FFT workloads.
  • Works best when data stays on the GPU and most of the work is in large, standard operations (matmul, FFTs, eigensolvers); a sketch of this pattern follows this list.
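
As an illustration, here is a minimal sketch of that pattern (the FFT pipeline and array sizes are our invention, not from the thread): the array is created on the GPU, and only a single scalar crosses the bus at the end.

```python
import cupy as cp  # the CuPy calls below mirror the NumPy API one-for-one

# Allocate directly on the GPU; no host array ever exists.
x = cp.random.standard_normal((4096, 4096), dtype=cp.float32)

spectrum = cp.fft.fft2(x)        # runs on cuFFT
power = cp.abs(spectrum) ** 2    # elementwise kernels, still on the GPU
peak = float(power.max())        # only this scalar is copied back to the host
```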

Performance Caveats & Data Movement

  • Multiple comments stress that PCIe / memory transfer costs can dominate if data moves CPU↔GPU frequently.
  • “Drop‑in” can be misleading: same API doesn’t mean same performance profile; algorithms often need redesign around data flow.
  • Some note that newer platforms with faster PCIe generations and coherent memory improve the situation, but transfer costs remain workload‑dependent; the two data‑flow patterns are contrasted in the sketch below.
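
A hedged sketch of that contrast (loop counts and shapes are ours): the explicit cp.asarray / cp.asnumpy calls are the host↔device copies whose cost the comments warn about.

```python
import numpy as np
import cupy as cp

x_host = np.random.standard_normal((2048, 2048)).astype(np.float32)

# Pattern 1: a round trip per iteration -- PCIe traffic can swamp kernel time.
for _ in range(100):
    x_dev = cp.asarray(x_host)                    # host -> device copy
    x_host = cp.asnumpy(cp.fft.fft2(x_dev).real)  # device -> host copy

# Pattern 2: copy once, keep intermediates on the GPU, copy back once.
x_dev = cp.asarray(x_host)
for _ in range(100):
    x_dev = cp.fft.fft2(x_dev).real
result = cp.asnumpy(x_dev)
```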

Comparisons: JAX, PyTorch, Numba, Others

  • PyTorch: widely used as a NumPy‑like tensor library; easy CPU/GPU switching; good for ML and general linear algebra.
  • JAX: NumPy and partial SciPy API, auto‑diff, multi‑RHS solvers, pytrees; critiques include slower compile times, sharp edges, weak Windows support, and concern over Google’s long‑term commitment.
  • CuPy vs JAX: CuPy is closer to CUDA, considered more mature by some, and supports in‑place mutation and custom kernels (RawKernel, JIT), but it lacks automatic differentiation.
  • Numba: highlighted as an alternative for writing GPU kernels as decorated Python functions, optionally with explicit type signatures; confirmed to support NVIDIA GPUs (see the sketch after this list).
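
For comparison, a minimal Numba CUDA kernel along the lines that comment describes (the saxpy example and all names are ours; it assumes numba and a CUDA‑capable GPU):

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)        # flat global thread index
    if i < out.size:        # guard: the grid may overshoot the array
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy[blocks, threads](np.float32(2.0), x, y, out)  # arrays are copied to/from the GPU
```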

Ecosystem, Interop, and Standards

  • CuPy participates in the Python Array API standard alongside NumPy and PyTorch, enabling backend‑agnostic code via array-api-compat (a sketch follows this list).
  • scikit‑learn already uses the Array API to run on multiple backends, including CuPy.
  • Low‑level memoryview is mentioned as a native Python way to interoperate without importing NumPy.
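
To illustrate the backend‑agnostic style these bullets describe, a small sketch using array-api-compat (the softmax function is our example; array_namespace is the library's entry point):

```python
from array_api_compat import array_namespace

def softmax(x):
    """Works unchanged on NumPy, CuPy, or PyTorch arrays."""
    xp = array_namespace(x)  # resolve the array's own namespace
    e = xp.exp(x - xp.max(x, axis=-1, keepdims=True))
    return e / xp.sum(e, axis=-1, keepdims=True)

import numpy as np
print(softmax(np.array([1.0, 2.0, 3.0])))  # pass a cupy array to run on the GPU
```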

Installation, CUDA/ROCm, and Tooling

  • Installation can be tricky due to the CUDA/driver/version compatibility matrix; many rely on Docker or Conda (a post‑install sanity check is sketched after this list).
  • Conda‑forge provides CUDA toolkit components; CuPy has separate wheels per CUDA version (e.g., cupy-cuda12x).
  • A CuPy maintainer emphasizes small binary size, minimal dependencies, broad platform support, and willingness to help with install issues.
  • AMD: CuPy supports ROCm‑capable GPUs, but official ROCm hardware list is narrow; community Debian/Ubuntu packages reportedly enable more AMD GPUs (with caveats).
  • Alternatives and related tools: cuDF (Pandas‑like on GPU via RAPIDS), Dask and Polars‑on‑GPU for dataframes; Intel’s scikit‑learn‑intelex for Intel GPU/CPU offload.
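
As a quick post‑install sanity check (assuming a CUDA wheel such as cupy-cuda12x; these are standard CuPy calls):

```python
import cupy as cp

print(cp.cuda.runtime.runtimeGetVersion())  # CUDA runtime version CuPy links against
print(cp.cuda.runtime.getDeviceCount())     # number of visible GPUs
cp.show_config()                            # full build/driver/library summary
```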

Custom Kernels and Lower‑Level Control

  • CuPy is praised as an easy bridge to custom CUDA kernels, written either as raw C++ source or in JIT‑compiled Python syntax; a RawKernel sketch follows this list.
  • A C++ CUDA wrapper library is presented as giving more explicit control over memory and contexts, at the cost of verbosity.
  • Trade‑off noted: CuPy favors productivity and brevity; low‑level wrappers favor explicit control and predictability.
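
A minimal RawKernel sketch of that bridge (the squaring kernel is our example; RawKernel and the (grid, block, args) launch syntax are CuPy's documented API):

```python
import cupy as cp

square = cp.RawKernel(r'''
extern "C" __global__
void square(const float* x, float* y, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = x[i] * x[i];
}
''', 'square')

x = cp.arange(1024, dtype=cp.float32)
y = cp.empty_like(x)
square((4,), (256,), (x, y, x.size))  # 4 blocks x 256 threads covers all 1024 elements
```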

General Sentiment

  • Many are enthusiastic about CuPy’s practicality, speedups, and maturity.
  • Others favor JAX or PyTorch for auto‑diff, unified CPU/GPU code, or larger communities.
  • Consensus: CuPy is a strong option for GPU‑accelerated NumPy/SciPy when you don’t need gradients, but careful attention to data locality and hardware setup is essential.