Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system
Impact of US Tech Restrictions on China
- Many see US export controls on GPUs and fab tools as having backfired: they forced China to optimize around constraints, spurring efficiency innovations like Alibaba’s pooling.
- Others argue controls still “work” by keeping China about a generation behind in areas like jet engines and CPUs, even if China compensates with larger clusters and more power.
- Several note that China’s own recent import ban on Nvidia chips shows the split is now mutual and likely irreversible.
Competing AI Ecosystems and “West vs China”
- Some welcome a bifurcated AI stack (the US and its allies vs. China) as a live A/B test that could accelerate global progress, provided the competition stays non‑destructive.
- There’s debate over Chinese LLMs:
  - Pro side: models like Qwen, DeepSeek, Kimi, and GLM are "good enough" for most tasks, much cheaper, and have caught up despite embargoes.
  - Skeptic side: they are valued mainly for efficiency rather than absolute quality; most "serious work" still uses GPT/Gemini/Claude, and benchmarks place Chinese models below the state of the art.
- Trend concerns: both US and Chinese labs are moving away from open weights; some Chinese flagships (e.g. certain Qwen/Huawei models) remain closed.
IP, “Western” Identity, and Immigration
- Heated argument over whether China's rise rests mostly on "stolen Western IP" or on genuine innovation; counterexamples are offered, including the US's own history of state‑backed IP theft.
- A long subthread debates what "Western" actually means (geography, culture, wealth, alliances) and how the term can function as a dog whistle.
- Several argue the US's real strategic edge is attracting global talent; anti‑immigrant politics are seen as self‑sabotaging when competing with China's much larger population.
Alibaba’s GPU Pooling System (Technical Discussion)
- Core issue: many “cold” models got dedicated GPUs but served only ~1.35% of requests, consuming ~17.7% of a 30k‑GPU cluster.
- The paper claims that token‑level scheduling and multi‑model sharing cut the GPUs needed for a subset of unpopular models from 1,192 to 213 H20s, an ~82% reduction.
- Commenters clarify that the 82% applies only to that subset; naively scaling it to the full fleet suggests a more modest overall saving (~6–18% depending on assumptions; see the back‑of‑envelope sketch after this list).
- Techniques involve:
  - Packing multiple LLMs onto shared GPUs, spanning 1.8–7B models as well as 32–72B models served with tensor parallelism.
  - Keeping models resident in GPU memory to avoid multi‑second load times and expensive Ray/NCCL initialization.
  - Scheduling tokens across models so that latency SLOs are respected while utilization is maximized (see the scheduler sketch after this list).
- Some characterize the result as "stopping doing something stupid" (dedicating whole GPUs to rarely used models) but acknowledge it is still a meaningful cost win.
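A back‑of‑envelope check of those figures, using only the numbers quoted above; extrapolating the measured subset to the whole fleet is an assumption made here for illustration, not a result reported in the paper:

```python
# Back-of-envelope check of the figures quoted in the thread.
subset_before = 1_192   # H20s dedicated to the measured subset of unpopular models
subset_after = 213      # H20s after pooling
subset_saving = 1 - subset_after / subset_before
print(f"saving on the measured subset: {subset_saving:.1%}")  # ~82.1%

fleet_size = 30_000     # approximate cluster size mentioned in the thread
cold_share = 0.177      # fraction of the cluster serving "cold" models

# Optimistic assumption: every cold-model GPU sees the same ~82% saving.
print(f"fleet-wide, optimistic: {cold_share * subset_saving:.1%}")  # ~14.5%

# Conservative assumption: only the measured subset actually shrinks.
print(f"fleet-wide, conservative: {(subset_before - subset_after) / fleet_size:.1%}")  # ~3.3%
```

The spread between the optimistic and conservative readings shows why commenters' fleet‑wide estimates vary so much with the chosen assumptions.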
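The scheduling bullet above is only a one‑line description, and the paper's actual algorithm is not reproduced in the thread. The sketch below is a hypothetical illustration of the general idea of token‑level scheduling across co‑resident models: an earliest‑deadline‑first loop that, at each decode step, serves the request with the least SLO slack, so rarely used models can share a GPU without blowing their latency targets. All names here (`TokenLevelScheduler`, `per_token_slo_s`, the model labels) are invented for the example.

```python
"""Hypothetical sketch of token-level scheduling across co-resident models.

This is NOT Alibaba's implementation. It only illustrates the idea described
in the thread: several models stay resident on one GPU, and a scheduler
decides, token by token, which model's request to step next so that
per-request latency SLOs are respected while the GPU stays busy.
"""
import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    deadline: float                    # absolute time by which the next token is due
    model: str = field(compare=False)  # which co-resident model serves this request
    tokens_left: int = field(compare=False)


class TokenLevelScheduler:
    """Earliest-deadline-first scheduling at token granularity (illustrative)."""

    def __init__(self, per_token_slo_s: float) -> None:
        self.per_token_slo_s = per_token_slo_s
        self.queue: list[Request] = []  # min-heap ordered by deadline (least slack first)

    def submit(self, model: str, tokens: int) -> None:
        now = time.monotonic()
        heapq.heappush(self.queue, Request(now + self.per_token_slo_s, model, tokens))

    def step(self) -> str | None:
        """Run one decode step for the most urgent request; return the model used."""
        if not self.queue:
            return None
        req = heapq.heappop(self.queue)
        # A real system would run one decode iteration for `req.model` on the GPU
        # here; this sketch only does the bookkeeping for the produced token.
        req.tokens_left -= 1
        if req.tokens_left > 0:
            req.deadline = time.monotonic() + self.per_token_slo_s
            heapq.heappush(self.queue, req)
        return req.model


if __name__ == "__main__":
    sched = TokenLevelScheduler(per_token_slo_s=0.05)
    sched.submit("hot-7b", tokens=3)    # stand-in for a popular model
    sched.submit("cold-32b", tokens=2)  # stand-in for a rarely used co-resident model
    while (model := sched.step()) is not None:
        print(f"decode step served by {model}")
```

A production system additionally has to manage KV‑cache memory, batching, and tensor‑parallel placement; the point of the sketch is only that once several models stay resident, deciding which one gets the next decode step becomes an ordinary scheduling problem.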
Broader Implications
- Several note this undercuts the “just buy more GPUs” mindset and illustrates how software and scheduling can materially reduce Nvidia demand.
- Others question scalability to very large models and whether such optimizations materially dent the broader GPU/AI investment boom.