Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system
Impact of US Tech Restrictions on China
- Many see US export controls on GPUs and fab tools as having backfired: they forced China to optimize around constraints, spurring efficiency innovations like Alibaba’s pooling.
- Others argue controls still “work” by keeping China about a generation behind in areas like jet engines and CPUs, even if China compensates with larger clusters and more power.
- Several note that China’s own recent import ban on Nvidia chips shows the split is now mutual and likely irreversible.
Competing AI Ecosystems and “West vs China”
- Some welcome a bifurcated AI stack (the US and its allies vs. China) as a live A/B test that could accelerate global progress, provided the competition stays non‑destructive.
- There’s debate over Chinese LLMs:
  - Pro side: models like Qwen, DeepSeek, Kimi, and GLM are "good enough" for most tasks, much cheaper, and have caught up despite embargoes.
  - Skeptic side: they are valued mainly for efficiency rather than absolute quality; most "serious work" still uses GPT/Gemini/Claude, and benchmarks place Chinese models below the state of the art.
- Trend concerns: both US and Chinese labs are moving away from open weights; some Chinese flagships (e.g. certain Qwen/Huawei models) remain closed.
IP, “Western” Identity, and Immigration
- Heated argument over whether China's rise rests mostly on "stolen Western IP" or on genuine innovation; counterexamples are offered, including the US's own history of state‑backed IP theft.
- A long subthread debates what "Western" actually means (geography, culture, wealth, alliances) and how the term can function as a dog whistle.
- Several argue the US's real strategic edge is attracting global talent; anti‑immigrant politics are seen as self‑sabotaging when competing with China's much larger population.
Alibaba’s GPU Pooling System (Technical Discussion)
- Core issue: many “cold” models got dedicated GPUs but served only ~1.35% of requests, consuming ~17.7% of a 30k‑GPU cluster.
- The paper claims that token‑level scheduling and multi‑model sharing cut the GPUs needed for a subset of unpopular models from 1,192 to 213 H20s, an ~82% reduction.
- Commenters clarify that the 82% applies only to that subset; naively scaling it to the full fleet suggests a more modest overall saving (~6–18% depending on assumptions; see the back‑of‑envelope sketch after this list).
- Techniques involve:
  - Packing multiple LLMs onto shared GPUs, spanning 1.8–7B models as well as 32–72B models served with tensor parallelism.
  - Keeping models resident in GPU memory to avoid multi‑second load times and expensive Ray/NCCL initialization.
  - Scheduling tokens across models so that latency SLOs are respected while utilization is maximized (see the scheduler sketch after this list).
- Some characterize the result as "stopping doing something stupid" (dedicating whole GPUs to rarely used models) but acknowledge it is still a meaningful cost win.
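A back‑of‑envelope check of those figures, using only the numbers quoted above; extrapolating the measured subset to the whole fleet is an assumption made here for illustration, not a result reported in the paper:

```python
# Back-of-envelope check of the figures quoted in the thread.
subset_before = 1_192   # H20s dedicated to the measured subset of unpopular models
subset_after = 213      # H20s after pooling
subset_saving = 1 - subset_after / subset_before
print(f"saving on the measured subset: {subset_saving:.1%}")  # ~82.1%

fleet_size = 30_000     # approximate cluster size mentioned in the thread
cold_share = 0.177      # fraction of the cluster serving "cold" models

# Optimistic assumption: every cold-model GPU sees the same ~82% saving.
print(f"fleet-wide, optimistic: {cold_share * subset_saving:.1%}")  # ~14.5%

# Conservative assumption: only the measured subset actually shrinks.
print(f"fleet-wide, conservative: {(subset_before - subset_after) / fleet_size:.1%}")  # ~3.3%
```

The spread between the optimistic and conservative readings shows why commenters' fleet‑wide estimates vary so much with the chosen assumptions.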
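The scheduling bullet above is only a one‑line description, and the paper's actual algorithm is not reproduced in the thread. The sketch below is a hypothetical illustration of the general idea of token‑level scheduling across co‑resident models: an earliest‑deadline‑first loop that, at each decode step, serves the request with the least SLO slack, so rarely used models can share a GPU without blowing their latency targets. All names here (`TokenLevelScheduler`, `per_token_slo_s`, the model labels) are invented for the example.

```python
"""Hypothetical sketch of token-level scheduling across co-resident models.

This is NOT Alibaba's implementation. It only illustrates the idea described
in the thread: several models stay resident on one GPU, and a scheduler
decides, token by token, which model's request to step next so that
per-request latency SLOs are respected while the GPU stays busy.
"""
import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    deadline: float                    # absolute time by which the next token is due
    model: str = field(compare=False)  # which co-resident model serves this request
    tokens_left: int = field(compare=False)


class TokenLevelScheduler:
    """Earliest-deadline-first scheduling at token granularity (illustrative)."""

    def __init__(self, per_token_slo_s: float) -> None:
        self.per_token_slo_s = per_token_slo_s
        self.queue: list[Request] = []  # min-heap ordered by deadline (least slack first)

    def submit(self, model: str, tokens: int) -> None:
        now = time.monotonic()
        heapq.heappush(self.queue, Request(now + self.per_token_slo_s, model, tokens))

    def step(self) -> str | None:
        """Run one decode step for the most urgent request; return the model used."""
        if not self.queue:
            return None
        req = heapq.heappop(self.queue)
        # A real system would run one decode iteration for `req.model` on the GPU
        # here; this sketch only does the bookkeeping for the produced token.
        req.tokens_left -= 1
        if req.tokens_left > 0:
            req.deadline = time.monotonic() + self.per_token_slo_s
            heapq.heappush(self.queue, req)
        return req.model


if __name__ == "__main__":
    sched = TokenLevelScheduler(per_token_slo_s=0.05)
    sched.submit("hot-7b", tokens=3)    # stand-in for a popular model
    sched.submit("cold-32b", tokens=2)  # stand-in for a rarely used co-resident model
    while (model := sched.step()) is not None:
        print(f"decode step served by {model}")
```

A production system additionally has to manage KV‑cache memory, batching, and tensor‑parallel placement; the point of the sketch is only that once several models stay resident, deciding which one gets the next decode step becomes an ordinary scheduling problem.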
Broader Implications
- Several note this undercuts the “just buy more GPUs” mindset and illustrates how software and scheduling can materially reduce Nvidia demand.
- Others question scalability to very large models and whether such optimizations materially dent the broader GPU/AI investment boom.