CuPy: NumPy and SciPy for GPU

CuPy as (Almost) Drop‑in NumPy/SciPy for GPU

  • Several users report big speedups with minimal code changes, sometimes by literally writing import cupy as np.
  • Examples: radar signal processing from ~30s to ~1s; quantum eigenvalue computations far faster than optimized MKL; some users claim ~1000× speedups for heavy linear algebra / FFT workloads.
  • Works best when data stays on the GPU and most of the work is in large, standard operations (matmul, FFTs, eigensolvers); a sketch of this pattern follows this list.
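
As an illustration, here is a minimal sketch of that pattern (the FFT pipeline and array sizes are our invention, not from the thread): the array is created on the GPU, and only a single scalar crosses the bus at the end.

```python
import cupy as cp  # the CuPy calls below mirror the NumPy API one-for-one

# Allocate directly on the GPU; no host array ever exists.
x = cp.random.standard_normal((4096, 4096), dtype=cp.float32)

spectrum = cp.fft.fft2(x)        # runs on cuFFT
power = cp.abs(spectrum) ** 2    # elementwise kernels, still on the GPU
peak = float(power.max())        # only this scalar is copied back to the host
```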

Performance Caveats & Data Movement

  • Multiple comments stress that PCIe / memory transfer costs can dominate if data moves CPU↔GPU frequently.
  • “Drop‑in” can be misleading: same API doesn’t mean same performance profile; algorithms often need redesign around data flow.
  • Some note that newer platforms with faster PCIe generations and coherent memory improve the situation, but transfer costs remain workload‑dependent; the two data‑flow patterns are contrasted in the sketch below.
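
A hedged sketch of that contrast (loop counts and shapes are ours): the explicit cp.asarray / cp.asnumpy calls are the host↔device copies whose cost the comments warn about.

```python
import numpy as np
import cupy as cp

x_host = np.random.standard_normal((2048, 2048)).astype(np.float32)

# Pattern 1: a round trip per iteration -- PCIe traffic can swamp kernel time.
for _ in range(100):
    x_dev = cp.asarray(x_host)                    # host -> device copy
    x_host = cp.asnumpy(cp.fft.fft2(x_dev).real)  # device -> host copy

# Pattern 2: copy once, keep intermediates on the GPU, copy back once.
x_dev = cp.asarray(x_host)
for _ in range(100):
    x_dev = cp.fft.fft2(x_dev).real
result = cp.asnumpy(x_dev)
```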

Comparisons: JAX, PyTorch, Numba, Others

  • PyTorch: widely used as a NumPy‑like tensor library; easy CPU/GPU switching; good for ML and general linear algebra.
  • JAX: NumPy and partial SciPy API, auto‑diff, multi‑RHS solvers, pytrees; critiques include slower compile times, sharp edges, weak Windows support, and concern over Google’s long‑term commitment.
  • CuPy vs JAX: CuPy is closer to CUDA, considered more mature by some, and supports in‑place mutation and custom kernels (RawKernel, JIT), but it lacks automatic differentiation.
  • Numba: highlighted as an alternative for writing GPU kernels as decorated Python functions, optionally with explicit type signatures; confirmed to support NVIDIA GPUs (see the sketch after this list).
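
For comparison, a minimal Numba CUDA kernel along the lines that comment describes (the saxpy example and all names are ours; it assumes numba and a CUDA‑capable GPU):

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)        # flat global thread index
    if i < out.size:        # guard: the grid may overshoot the array
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy[blocks, threads](np.float32(2.0), x, y, out)  # arrays are copied to/from the GPU
```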

Ecosystem, Interop, and Standards

  • CuPy participates in the Python Array API standard alongside NumPy and PyTorch, enabling backend‑agnostic code via array-api-compat (a sketch follows this list).
  • scikit‑learn already uses the Array API to run on multiple backends, including CuPy.
  • Low‑level memoryview is mentioned as a native Python way to interoperate without importing NumPy.
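
To illustrate the backend‑agnostic style these bullets describe, a small sketch using array-api-compat (the softmax function is our example; array_namespace is the library's entry point):

```python
from array_api_compat import array_namespace

def softmax(x):
    """Works unchanged on NumPy, CuPy, or PyTorch arrays."""
    xp = array_namespace(x)  # resolve the array's own namespace
    e = xp.exp(x - xp.max(x, axis=-1, keepdims=True))
    return e / xp.sum(e, axis=-1, keepdims=True)

import numpy as np
print(softmax(np.array([1.0, 2.0, 3.0])))  # pass a cupy array to run on the GPU
```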

Installation, CUDA/ROCm, and Tooling

  • Installation can be tricky due to the CUDA/driver/version compatibility matrix; many rely on Docker or Conda (a post‑install sanity check is sketched after this list).
  • Conda‑forge provides CUDA toolkit components; CuPy has separate wheels per CUDA version (e.g., cupy-cuda12x).
  • A CuPy maintainer emphasizes small binary size, minimal dependencies, broad platform support, and willingness to help with install issues.
  • AMD: CuPy supports ROCm‑capable GPUs, but official ROCm hardware list is narrow; community Debian/Ubuntu packages reportedly enable more AMD GPUs (with caveats).
  • Alternatives and related tools: cuDF (Pandas‑like on GPU via RAPIDS), Dask and Polars‑on‑GPU for dataframes; Intel’s scikit‑learn‑intelex for Intel GPU/CPU offload.
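
As a quick post‑install sanity check (assuming a CUDA wheel such as cupy-cuda12x; these are standard CuPy calls):

```python
import cupy as cp

print(cp.cuda.runtime.runtimeGetVersion())  # CUDA runtime version CuPy links against
print(cp.cuda.runtime.getDeviceCount())     # number of visible GPUs
cp.show_config()                            # full build/driver/library summary
```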

Custom Kernels and Lower‑Level Control

  • CuPy is praised as an easy bridge to custom CUDA kernels, written either as raw C++ source or in JIT‑compiled Python syntax; a RawKernel sketch follows this list.
  • A C++ CUDA wrapper library is presented as giving more explicit control over memory and contexts, at the cost of verbosity.
  • Trade‑off noted: CuPy favors productivity and brevity; low‑level wrappers favor explicit control and predictability.
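
A minimal RawKernel sketch of that bridge (the squaring kernel is our example; RawKernel and the (grid, block, args) launch syntax are CuPy's documented API):

```python
import cupy as cp

square = cp.RawKernel(r'''
extern "C" __global__
void square(const float* x, float* y, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = x[i] * x[i];
}
''', 'square')

x = cp.arange(1024, dtype=cp.float32)
y = cp.empty_like(x)
square((4,), (256,), (x, y, x.size))  # 4 blocks x 256 threads covers all 1024 elements
```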

General Sentiment

  • Many are enthusiastic about CuPy’s practicality, speedups, and maturity.
  • Others favor JAX or PyTorch for auto‑diff, unified CPU/GPU code, or larger communities.
  • Consensus: CuPy is a strong option for GPU‑accelerated NumPy/SciPy when you don’t need gradients, but careful attention to data locality and hardware setup is essential.