2026-06-08

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second


Model capabilities & speed claims

MiMo-v2.5-Pro-UltraSpeed claims >1000 tokens/s on a ~1T-parameter MoE model using an 8‑GPU “standard” node.
Many see this as a step change for interactivity, especially for coding agents and real‑time/voice use cases.
Some argue speed is “not game‑changing” since LLM latency is often not the main bottleneck.
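Both sides of this argument can be put in rough numbers. The sketch below is a back-of-the-envelope calculation, not a benchmark: the 1000 tokens/s rate is the claim summarized above, while the 50 tokens/s baseline, the response length, and the 30 s of non-LLM time per agent step (tests, builds, tool calls) are illustrative assumptions. The function names are hypothetical.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (ignores prefill and network)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def agent_step_seconds(num_tokens: int, tokens_per_second: float,
                       tool_seconds: float) -> float:
    """One agent step: model output plus time spent outside the model."""
    return generation_seconds(num_tokens, tokens_per_second) + tool_seconds


TOKENS = 2_000         # assumed response length
FAST = 1000.0          # rate claimed in the summary above
SLOW = 50.0            # assumed baseline, not from the source
TOOL_SECONDS = 30.0    # assumed time for tests/builds per step

pure_speedup = generation_seconds(TOKENS, SLOW) / generation_seconds(TOKENS, FAST)
step_speedup = (agent_step_seconds(TOKENS, SLOW, TOOL_SECONDS)
                / agent_step_seconds(TOKENS, FAST, TOOL_SECONDS))

print(f"generation only: {pure_speedup:.1f}x faster")
print(f"full agent step: {step_speedup:.2f}x faster")
```

Under these assumptions, generation alone drops from 40 s to 2 s, but a full step only goes from 70 s to 32 s, because the tool time is unchanged. How much the speedup matters depends on how much of the loop is actually spent waiting on the model.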

Technical approach

Key ingredients mentioned: FP4 quantization (MXFP4) selectively applied to MoE experts, DFlash speculative decoding, persistent CUDA kernels, tiled/overlapped processing, and TileRT’s “megakernel” decode.
The architecture combines a sliding-window and full-attention hybrid with a sparse MoE that uses routed experts; the design is explicitly optimized for bandwidth and latency.
Commenters note that similar ideas exist, but integrating them into a 1T model at this speed is seen as notable.
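To make the FP4 ingredient concrete, here is a minimal sketch of block-scaled 4-bit quantization in the style of the MXFP4 format: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. It follows the general microscaling convention as commonly described and is not MiMo's or TileRT's implementation; the function names are hypothetical, and the tie-breaking rule is simplified.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements sharing one scale in MXFP4
E2M1_EMAX = 2     # largest E2M1 exponent: 6.0 = 1.5 * 2**2


def quantize_block(block):
    """Return (shared_exp, values) where each value is a signed E2M1 number."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # floor(log2(max_abs)) via frexp, which avoids rounding issues in log2.
    shared_exp = (math.frexp(max_abs)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    values = []
    for x in block:
        mag = min(abs(x) / scale, FP4_E2M1_VALUES[-1])  # saturate at 6.0
        # Simplification: ties resolve to the smaller magnitude here,
        # whereas real kernels typically round to nearest even.
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
        values.append(math.copysign(nearest, x))
    return shared_exp, values


def dequantize_block(shared_exp, values):
    """Reconstruct approximate floats from the shared exponent and E2M1 values."""
    scale = 2.0 ** shared_exp
    return [v * scale for v in values]


# Storage cost: 4 bits per element plus one 8-bit scale per 32 elements.
bits_per_weight = 4 + 8 / BLOCK_SIZE

block = [12.0, -6.0, 1.0, 2.4] + [0.0] * (BLOCK_SIZE - 4)
exp, q = quantize_block(block)
restored = dequantize_block(exp, q)
print(exp, restored[:4], bits_per_weight)
```

At 4.25 bits per weight, expert matrices take roughly a quarter of the memory traffic of 16-bit weights, which is consistent with the summary's point that the design targets bandwidth. The cost is rounding error, visible here in the fourth element.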

Pricing & economics

Base MiMo pricing is considered very cheap; UltraSpeed costs ~3× more but is still viewed as competitive with US labs.
Commenters debate whether Chinese providers are subsidized, benefit from cheaper energy/infrastructure, or are simply more aggressive about optimization.
Some think US labs focus less on efficiency and more on monetization and regulatory moats.

Comparisons with other models

Compared frequently to DeepSeek, Kimi, GLM, Qwen, Gemini, GPT, Claude, Cerebras‑hosted models, and ultra‑fast demos like Taalas’ “chatjimmy.”
Several users say MiMo 2.5 Pro (regular speed) is a top open‑weights agentic coding model; others report hallucinations or weaker coding vs DeepSeek Pro.
Open Chinese models are praised for cost and speed, but US frontier models are still considered stronger on difficult reasoning.

Censorship, alignment, and geopolitics

Multiple commenters test Chinese models with sensitive prompts (e.g., Tiananmen, Taiwan) and note which ones answer factually and which deflect.
Commenters contrast this with US models, which refuse guidance on weapons or “armed resistance” but generally do not deny historical facts.
Some argue Chinese LLM censorship is easier to strip from open weights; others worry more about US models’ safety filters blocking normal workflows.

Developer workflows & productivity

Fast models change how people use agents: near‑real‑time refactors, CI‑driven bug‑hunting, and continuous code iteration without leaving “flow.”
Others feel AI makes work less satisfying, shifting from craftsmanship to prompt‑driven “slot machines,” and foresee more low‑quality “slop” software.

Access, gating & ecosystem concerns

UltraSpeed access is limited/invite‑only; users complain about sign‑up friction, regional locks, and “not available in your region” errors.
Some worry selective access to very high throughput will entrench large players and distort competition.