MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second
Model capabilities & speed claims
- MiMo-v2.5-Pro-UltraSpeed claims >1000 tokens/s on a ~1T-parameter MoE model using an 8‑GPU “standard” node.
- Many see this as a step change for interactivity, especially for coding agents and real‑time/voice use cases.
- Some argue speed is “not game‑changing” since LLM latency is often not the main bottleneck.
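Both sides of the speed argument can be put into rough numbers. The sketch below is a back-of-the-envelope calculation, not a benchmark: the 1000 tokens/s figure is the claim quoted above, while the 50 tokens/s baseline, the 2000-token response, and the 30-second tool time are illustrative assumptions, and the function names are hypothetical.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely on decoding the response."""
    return output_tokens / tokens_per_second


def agent_step_seconds(output_tokens: int, tokens_per_second: float,
                       tool_seconds: float) -> float:
    """One agent step: decode the response, then wait on tools (tests, builds)."""
    return generation_seconds(output_tokens, tokens_per_second) + tool_seconds


if __name__ == "__main__":
    tokens = 2000          # assumed response length
    fast, slow = 1000, 50  # claimed UltraSpeed rate vs. an assumed baseline

    # Decode-only view: the case where speed feels like a step change.
    print(generation_seconds(tokens, fast))  # 2.0
    print(generation_seconds(tokens, slow))  # 40.0

    # With 30 s of assumed tool time per step, the end-to-end gain shrinks.
    print(agent_step_seconds(tokens, fast, 30.0))  # 32.0
    print(agent_step_seconds(tokens, slow, 30.0))  # 70.0
```

Under these assumptions, decoding alone is 20× faster, but a step dominated by tool time improves only from 70 s to 32 s, which is the gap between the "step change" and "not game-changing" views.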
Technical approach
- Key ingredients mentioned: FP4 quantization (MXFP4) selectively applied to MoE experts, DFlash speculative decoding, persistent CUDA kernels, tiled/overlapped processing, and TileRT’s “megakernel” decode.
- The architecture is a hybrid of sliding-window and full attention with a sparse MoE of routed experts; the design is explicitly optimized for bandwidth and latency.
- Commenters note that similar ideas exist, but integrating them into a 1T model at this speed is seen as notable.
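Of the ingredients listed above, MXFP4 is the easiest to illustrate in isolation. The thread gives no implementation details, so the following is only a minimal sketch of the general microscaling layout (32-element blocks that share one power-of-two scale, with each element stored as a 4-bit E2M1 value). It is not MiMo's or TileRT's actual kernel code; the function names are hypothetical and tie-breaking in rounding is simplified.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements sharing one scale
E2M1_EMAX = 2     # exponent of the largest element value: 6 = 1.5 * 2**2


def quantize_block(block):
    """Return (shared_exponent, element_values) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Pick a power-of-two scale so the largest magnitude lands near the top
    # of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for x in block:
        # Clamp to the largest representable magnitude, then snap to the
        # nearest grid point (ties go to the smaller value in this sketch).
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        elements.append(math.copysign(nearest, x))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


def bits_per_weight():
    """Storage cost: 4 bits per element plus one 8-bit scale per block."""
    return (4 * BLOCK_SIZE + 8) / BLOCK_SIZE


if __name__ == "__main__":
    weights = [6.0, 3.0, -1.5, 0.5] + [0.0] * 28
    exp, q = quantize_block(weights)
    print(exp, dequantize_block(exp, q)[:4])  # 0 [6.0, 3.0, -1.5, 0.5]
    print(bits_per_weight())                  # 4.25
```

At 4.25 bits per weight instead of 16 for a 16-bit format, each expert weight costs roughly 3.8× fewer bytes to read, which is consistent with the thread's description of a design optimized for bandwidth.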
Pricing & economics
- Base MiMo pricing is considered very cheap; UltraSpeed costs roughly 3× as much but is still viewed as competitive with US labs.
- Commenters debate whether Chinese providers are subsidized, benefit from cheaper energy and infrastructure, or are simply more aggressive about optimization.
- Some think US labs focus less on efficiency and more on monetization and regulatory moats.
Comparisons with other models
- Compared frequently to DeepSeek, Kimi, GLM, Qwen, Gemini, GPT, Claude, Cerebras‑hosted models, and ultra‑fast demos like Taalas’ “chatjimmy.”
- Several users say MiMo 2.5 Pro (regular speed) is a top open‑weights agentic coding model; others report hallucinations or weaker coding than DeepSeek Pro.
- Open Chinese models are praised for cost and speed, but US frontier models are still considered stronger on difficult reasoning.
Censorship, alignment, and geopolitics
- Multiple commenters test Chinese models with sensitive prompts (e.g., Tiananmen, Taiwan) and note which models answer factually and which deflect.
- Commenters contrast this with US models, which refuse guidance on weapons or “armed resistance” but generally do not deny historical facts.
- Some argue Chinese LLM censorship is easier to strip from open weights; others worry more about US models’ safety filters blocking normal workflows.
Developer workflows & productivity
- Fast models change how people use agents: near‑real‑time refactors, CI‑driven bug‑hunting, and continuous code iteration without leaving “flow.”
- Others feel AI makes work less satisfying, shifting from craftsmanship to prompt‑driven “slot machines,” and foresee more low‑quality “slop” software.
Access, gating & ecosystem concerns
- UltraSpeed access is limited/invite‑only; users complain about sign‑up friction, regional locks, and “not available in your region” errors.
- Some worry selective access to very high throughput will entrench large players and distort competition.