CuPy: NumPy and SciPy for GPU
CuPy as (Almost) Drop‑in NumPy/SciPy for GPU
- Several users report big speedups with minimal code changes, sometimes literally `import cupy as np`.
- Examples: radar signal processing dropped from ~30 s to ~1 s; quantum eigenvalue computations ran far faster than with optimized MKL; some users claim ~1000× speedups for heavy linear algebra / FFT workloads.
- Works best when data stays on the GPU and most of the work is in large, standard ops (matmul, FFTs, eigensolvers); a minimal sketch follows this list.
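A minimal sketch of the drop-in style (the array sizes and the eigensolver workload are illustrative, not from the thread):

```python
import cupy as cp

# Plain NumPy-style code, but every operation runs on the GPU.
a = cp.random.rand(2000, 2000, dtype=cp.float32)
b = a @ a.T                   # large matmul dispatched to cuBLAS
w = cp.linalg.eigvalsh(b)     # symmetric eigensolver, also on the GPU
spectrum = cp.asnumpy(w)      # one explicit copy back to the host
```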
Performance Caveats & Data Movement
- Multiple comments stress that PCIe / memory transfer costs can dominate if data moves CPU↔GPU frequently (contrasted in the sketch after this list).
- “Drop‑in” can be misleading: the same API does not imply the same performance profile; algorithms often need to be redesigned around data flow.
- Some note that newer PCIe generations and coherent-memory platforms improve the situation, but transfer costs remain workload‑dependent.
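To make the transfer-cost point concrete, a sketch contrasting a round-trip-heavy loop with a device-resident one (the FFT workload is hypothetical):

```python
import numpy as np
import cupy as cp

x_cpu = np.random.rand(4096, 4096).astype(np.float32)

# Anti-pattern: every iteration pays for a host->device and device->host copy.
for _ in range(10):
    y = cp.asnumpy(cp.fft.fft2(cp.asarray(x_cpu)))

# Better: transfer once, keep intermediates on the GPU, copy back once.
x_gpu = cp.asarray(x_cpu)
for _ in range(10):
    y_gpu = cp.fft.fft2(x_gpu)    # stays on the device
y = cp.asnumpy(y_gpu)
```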
Comparisons: JAX, PyTorch, Numba, Others
- PyTorch: widely used as a NumPy‑like tensor library; easy CPU/GPU switching; good for ML and general linear algebra.
- JAX: NumPy‑compatible API with partial SciPy coverage, auto‑diff, multi‑RHS solvers, pytrees; critiques include slow compile times, sharp edges, weak Windows support, and concern over Google’s long‑term commitment.
- CuPy vs JAX: CuPy is closer to CUDA, considered more mature by some, and supports in‑place mutation and custom kernels (`RawKernel`, JIT), but lacks automatic differentiation.
- Numba: highlighted as an alternative for writing GPU kernels in Python with type hints; confirmed to support NVIDIA GPUs (see the sketch after this list).
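A sketch of the Numba route (the kernel and launch configuration are illustrative); Numba kernels can also consume CuPy arrays directly through the CUDA array interface:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)              # global thread index
    if i < x.size:
        x[i] += 1.0

arr = np.zeros(1024, dtype=np.float32)
d_arr = cuda.to_device(arr)       # explicit host->device copy
blocks = (arr.size + 255) // 256
add_one[blocks, 256](d_arr)       # launch: blocks x 256 threads
result = d_arr.copy_to_host()
```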
Ecosystem, Interop, and Standards
- CuPy participates in the Python Array API standard alongside NumPy and PyTorch, enabling backend‑agnostic code via `array-api-compat` (sketched after this list).
- scikit‑learn already uses the Array API to run on multiple backends, including CuPy.
- Low‑level `memoryview` is mentioned as a native Python way to interoperate, via the buffer protocol, without importing NumPy (also sketched below).
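A sketch of backend-agnostic code via `array-api-compat` (the `normalize` helper is a made-up example):

```python
import numpy as np
import cupy as cp
from array_api_compat import array_namespace

def normalize(x):
    # Resolve the namespace (numpy, cupy, torch, ...) from the input array.
    xp = array_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)

normalize(np.arange(10.0))   # runs on the CPU via NumPy
normalize(cp.arange(10.0))   # same code, runs on the GPU via CuPy
```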
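And the `memoryview` point, sketched with only the standard library (the buffer contents are illustrative):

```python
from array import array

buf = array('d', [1.0, 2.0, 3.0])   # contiguous buffer of C doubles
view = memoryview(buf)              # zero-copy view via the buffer protocol
print(view.format, view.nbytes)     # 'd' 24
# Any consumer of the buffer protocol (e.g. np.frombuffer(view)) can wrap
# this without copying; producing it required no NumPy import at all.
```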
Installation, CUDA/ROCm, and Tooling
- Installation can be tricky due to the CUDA/driver/version matrix; many rely on Docker or Conda (a quick sanity check is sketched after this list).
- Conda‑forge provides CUDA toolkit components; CuPy has separate wheels per CUDA version (e.g., `cupy-cuda12x`).
- A CuPy maintainer emphasizes small binary size, minimal dependencies, broad platform support, and willingness to help with install issues.
- AMD: CuPy supports ROCm‑capable GPUs, but official ROCm hardware list is narrow; community Debian/Ubuntu packages reportedly enable more AMD GPUs (with caveats).
- Alternatives and related tools: cuDF (Pandas‑like on GPU via RAPIDS), Dask and Polars‑on‑GPU for dataframes; Intel’s scikit‑learn‑intelex for Intel GPU/CPU offload.
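When an install misbehaves, a quick sanity check from Python can narrow things down (a sketch; the printed values are illustrative):

```python
import cupy as cp

print(cp.cuda.runtime.runtimeGetVersion())   # e.g. 12020 for CUDA 12.2
print(cp.cuda.runtime.getDeviceCount())      # number of visible GPUs
cp.show_config()                             # build, driver, and library summary
```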
Custom Kernels and Lower‑Level Control
- CuPy is praised as an easy bridge to custom CUDA kernels, written either in CUDA C++ or in JIT’ed Python syntax (see the `RawKernel` sketch after this list).
- A C++ CUDA wrapper library is presented as giving more explicit control over memory and contexts, at the cost of verbosity.
- Trade‑off noted: CuPy favors productivity and brevity; low‑level wrappers favor explicit control and predictability.
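A minimal `RawKernel` sketch (the saxpy kernel is illustrative, not taken from the discussion):

```python
import cupy as cp

# CUDA C++ source, compiled on first launch via NVRTC.
saxpy = cp.RawKernel(r'''
extern "C" __global__
void saxpy(float a, const float* x, const float* y, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}
''', 'saxpy')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)
threads = 256
blocks = (n + threads - 1) // threads
# Scalars must be passed as fixed-width types matching the C signature.
saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, out, cp.int32(n)))
```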
General Sentiment
- Many are enthusiastic about CuPy’s practicality, speedups, and maturity.
- Others favor JAX or PyTorch for auto‑diff, unified CPU/GPU code, or larger communities.
- Consensus: CuPy is a strong option for GPU‑accelerated NumPy/SciPy when you don’t need gradients, but careful attention to data locality and hardware setup is essential.