MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

Model capabilities & speed claims

  • MiMo-v2.5-Pro-UltraSpeed is claimed to decode at >1000 tokens/s as a ~1T-parameter MoE model running on a “standard” 8‑GPU node.
  • Many see this as a step change for interactivity, especially for coding agents and real‑time/voice use cases.
  • Some argue speed is “not game‑changing” since LLM latency is often not the main bottleneck.
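
Both sides of this argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the 2000-token turn, the 50 tokens/s baseline, and the 30 s of non-LLM work (tool calls, tests, network) are assumed numbers, not figures from the thread; only the 1000 tokens/s figure comes from the headline claim.

```python
def turn_seconds(output_tokens: int, tokens_per_s: float, other_s: float = 0.0) -> float:
    """Wall-clock time for one agent turn: decode time plus all non-LLM work."""
    return output_tokens / tokens_per_s + other_s


# Hypothetical workload: 2000 generated tokens per agent turn.
TOKENS = 2000
SLOW = 50.0    # tokens/s, an assumed baseline for comparison
FAST = 1000.0  # tokens/s, the headline claim

# Case 1: the turn is dominated by decoding (no tool time).
decode_bound = turn_seconds(TOKENS, SLOW) / turn_seconds(TOKENS, FAST)

# Case 2: each turn also spends an assumed 30 s on tools, tests, or I/O.
tool_bound = turn_seconds(TOKENS, SLOW, 30.0) / turn_seconds(TOKENS, FAST, 30.0)

print(f"decode-bound speedup: {decode_bound:.2f}x")
print(f"tool-bound speedup:   {tool_bound:.2f}x")
```

Under these assumptions the end-to-end speedup is 20x when decoding dominates but only about 2.2x once the fixed 30 s is included, which is the Amdahl's-law shape behind the "not the main bottleneck" objection.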

Technical approach

  • Key ingredients mentioned: FP4 quantization (MXFP4) selectively applied to MoE experts, DFlash speculative decoding, persistent CUDA kernels, tiled/overlapped processing, and TileRT’s “megakernel” decode.
  • Architecture: a hybrid of sliding-window and full attention plus a sparse MoE with routed experts; the design is explicitly optimized for bandwidth and latency.
  • Commenters note that similar ideas exist, but integrating them into a 1T model at this speed is seen as notable.
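
For readers unfamiliar with speculative decoding, here is a minimal sketch of the generic greedy draft-and-verify idea. It is not DFlash and says nothing about how MiMo implements it; `target_model` and `draft_model` are hypothetical toy functions (the draft even calls the target to fake a "mostly right" cheap model), and the block size `k` is an arbitrary choice. The point it illustrates is that the output matches plain greedy decoding with the large model while needing fewer large-model verification passes.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)


def target_model(prefix: List[int]) -> int:
    """Toy stand-in for the large, slow model (hypothetical)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    """Toy stand-in for a cheap draft model that is wrong on every 4th position."""
    tok = target_model(prefix)
    return tok if len(prefix) % 4 != 0 else (tok + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # 1. The draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2. The target checks the proposal. A real system scores all positions
        #    in one batched forward pass; here that pass is simulated by a loop.
        passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                ctx.append(expected)  # first mismatch: keep the target's token, stop
                break
            ctx.append(tok)           # match: accept the drafted token
        else:
            ctx.append(target(ctx))   # all k accepted: the pass yields a bonus token
        out = ctx
    return out[len(prompt):len(prompt) + n], passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 32)
tokens, passes = speculative_decode(target_model, draft_model, prompt, 32)
print(tokens == baseline, passes)
```

Every pass produces at least one target-approved token, so the worst case is no slower in pass count than plain decoding; the gain depends entirely on how often the draft agrees with the target.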

Pricing & economics

  • Base MiMo pricing is considered very cheap; UltraSpeed costs ~3× more but is still viewed as competitive with US labs.
  • Commenters debate whether Chinese providers are subsidized, benefit from cheaper energy and infrastructure, or are simply more aggressive about optimization.
  • Some think US labs focus less on efficiency and more on monetization and regulatory moats.

Comparisons with other models

  • Compared frequently to DeepSeek, Kimi, GLM, Qwen, Gemini, GPT, Claude, Cerebras‑hosted models, and ultra‑fast demos like Taalas’ “chatjimmy.”
  • Several users say MiMo 2.5 Pro (regular speed) is a top open‑weights model for agentic coding; others report hallucinations or weaker coding than DeepSeek Pro.
  • Open Chinese models are praised for cost and speed, but US frontier models are still considered stronger on difficult reasoning.

Censorship, alignment, and geopolitics

  • Multiple commenters test Chinese models with sensitive prompts (e.g., Tiananmen, Taiwan) and note which ones answer factually and which deflect.
  • They contrast this with US models, which refuse guidance on weapons or “armed resistance” but generally do not deny historical facts.
  • Some argue Chinese LLM censorship is easier to strip from open weights; others worry more about US models’ safety filters blocking normal workflows.

Developer workflows & productivity

  • Fast models change how people use agents: near‑real‑time refactors, CI‑driven bug‑hunting, and continuous code iteration without leaving “flow.”
  • Others feel AI makes work less satisfying, shifting it from craftsmanship to prompt‑driven “slot machines,” and foresee more low‑quality “slop” software.

Access, gating & ecosystem concerns

  • UltraSpeed access is limited and invite‑only; users complain about sign‑up friction, regional locks, and “not available in your region” errors.
  • Some worry selective access to very high throughput will entrench large players and distort competition.