Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system

Impact of US Tech Restrictions on China

  • Many see US export controls on GPUs and fab tools as having backfired: they forced China to optimize around constraints, spurring efficiency innovations like Alibaba’s pooling.
  • Others argue controls still “work” by keeping China about a generation behind in areas like jet engines and CPUs, even if China compensates with larger clusters and more power.
  • Several note that China’s own recent import ban on Nvidia chips shows the split is now mutual and likely irreversible.

Competing AI Ecosystems and “West vs China”

  • Some welcome a bifurcated AI stack (US-plus-allies vs China) as a live A/B test that could accelerate global progress, provided competition stays non-destructive.
  • There’s debate over Chinese LLMs:
    • Pro side: models like Qwen, DeepSeek, Kimi, GLM are “good enough” for most tasks, much cheaper, and have caught up despite embargoes.
    • Skeptic side: they’re valued mainly for efficiency, not absolute quality; most “serious work” still uses GPT/Gemini/Claude; benchmarks place Chinese models below state of the art.
  • Trend concerns: both US and Chinese labs are moving away from open weights; some Chinese flagships (e.g. certain Qwen/Huawei models) remain closed.

IP, “Western” Identity, and Immigration

  • Heated argument over whether China’s rise is mostly “stolen Western IP” vs genuine innovation; counter‑examples are offered, including historic US state‑backed IP theft.
  • Long subthread debates what “Western” means (geography, culture, wealth, alliances) and how the term can be a dog whistle.
  • Several argue the US’ real strategic edge is attracting global talent; anti‑immigrant politics are seen as self‑sabotaging when competing with China’s much larger population.

Alibaba’s GPU Pooling System (Technical Discussion)

  • Core issue: many “cold” models each got dedicated GPUs yet together served only ~1.35% of requests while consuming ~17.7% of a 30k‑GPU cluster.
  • The paper claims token‑level scheduling and multi‑model sharing cut the GPUs serving a subset of unpopular models from 1,192 to 213 H20s (an ~82% reduction).
  • Commenters clarify that the 82% applies only to that subset; naive scaling to the full fleet suggests a more modest overall saving of roughly 6–18%, depending on assumptions (see the back‑of‑envelope sketch after this list).
  • Techniques involve (illustrated by the scheduler sketch after this list):
    • Packing multiple LLMs per GPU, spanning 1.8–7B models and 32–72B models served with tensor parallelism.
    • Keeping models resident to avoid multi‑second load times and expensive Ray/NCCL initialization.
    • Scheduling tokens across models to respect latency SLOs while maximizing utilization.
  • Some characterize the result as merely “stopping doing something stupid” (dedicating GPUs to rarely used models), though still a meaningful cost win.
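
As a back‑of‑envelope check on the scaling caveat above, the sketch below plugs in the figures quoted in the thread (30k‑GPU fleet, ~17.7% cold‑model share, 1,192 → 213 H20s for the measured subset); the two extrapolation scenarios are illustrative assumptions, not from the paper.

```python
# Figures quoted in the thread; the extrapolation scenarios are illustrative.
fleet = 30_000              # total GPUs in the cluster
cold_share = 0.177          # fraction of the fleet serving "cold" models
before, after = 1_192, 213  # H20s for the measured subset of models

subset_reduction = 1 - after / before        # ~82%, the headline number

# Conservative reading: only the measured subset benefits.
low = (before - after) / fleet               # ~3.3% of the fleet

# Generous reading: the 82% reduction extends to all cold-model GPUs.
high = cold_share * subset_reduction         # ~14.5% of the fleet

print(f"subset reduction: {subset_reduction:.1%}")           # 82.1%
print(f"implied fleet-wide saving: {low:.1%} to {high:.1%}")  # 3.3% to 14.5%
```

Depending on how broadly the reduction is assumed to generalize, the implied fleet‑wide saving runs from a few percent to the mid‑teens, the same ballpark as the ~6–18% estimates above.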
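The thread describes the mechanism only at a high level, so the following is a minimal sketch of the general idea rather than the paper's actual scheduler: greedily pack models into a GPU's VRAM so they stay resident, then interleave decode steps across them by earliest SLO deadline. All names, sizes, and latency figures (`Model`, `Request`, `step`, the 96 GB budget, the 100 ms per‑token SLO) are hypothetical.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Model:
    """A model candidate for co-residency on one GPU (sizes illustrative)."""
    name: str
    vram_gb: float

@dataclass(order=True)
class Request:
    deadline: float                       # absolute time the next token is due (SLO)
    tokens_left: int = field(compare=False)

VRAM_BUDGET_GB = 96.0  # assumed usable memory on one GPU

def pack(models):
    """Greedy first-fit: keep as many models resident as fit in VRAM, so
    none of them pays the multi-second reload / framework re-init cost."""
    resident, used = [], 0.0
    for m in sorted(models, key=lambda m: m.vram_gb, reverse=True):
        if used + m.vram_gb <= VRAM_BUDGET_GB:
            resident.append(m)
            used += m.vram_gb
    return resident

def step(queues, batch_size=8):
    """One token-level scheduling step: pick the resident model whose most
    urgent request has the earliest deadline (EDF across models), run one
    decode iteration for a batch of its requests, re-queue unfinished ones."""
    ready = {m: q for m, q in queues.items() if q}
    if not ready:
        return None
    model = min(ready, key=lambda m: ready[m][0].deadline)
    q = queues[model]
    batch = [heapq.heappop(q) for _ in range(min(batch_size, len(q)))]
    for r in batch:                       # pretend one token was decoded per request
        r.tokens_left -= 1
        if r.tokens_left > 0:
            r.deadline = time.monotonic() + 0.100   # next-token SLO: 100 ms
            heapq.heappush(q, r)
    return model

# Toy usage: two models share one GPU; a request is served token by token.
models = pack([Model("model-a-7b", 16.0), Model("model-b-32b", 70.0)])
queues = {m: [] for m in models}
heapq.heappush(queues[models[0]], Request(time.monotonic() + 0.100, tokens_left=4))
while step(queues):
    pass
```

The real system also has to handle prefill vs decode phases, KV‑cache memory, and tensor‑parallel sharding of the 32–72B models across GPUs; the sketch only shows the packing‑plus‑deadline‑scheduling shape of the idea.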

Broader Implications

  • Several note this undercuts the “just buy more GPUs” mindset and illustrates how software and scheduling can materially reduce Nvidia demand.
  • Others question scalability to very large models and whether such optimizations meaningfully dent the broader GPU/AI investment boom.