Mercury 2: Fast reasoning LLM powered by diffusion

Perceived benefits of speed

  • Many see 1k–4k tokens/s as unlocking new interaction patterns: multi-shot prompting, nudging, and fast agent loops where extra internal calls are “free” from the user’s perspective.
  • Speed is framed as a new dimension of quality (“intelligence per second”), especially for workflows where iteration speed matters more than peak intelligence.
  • Faster models let more of a fixed latency budget be spent on reasoning, potentially raising effective quality.
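The latency-budget point can be made concrete with back-of-the-envelope arithmetic. A minimal sketch, where the token rates and budget are illustrative assumptions, not measured numbers:

```python
# How many hidden "reasoning" tokens fit in a fixed user-facing latency
# budget at different generation speeds. All numbers are illustrative.

def reasoning_tokens(budget_s: float, tokens_per_s: float, answer_tokens: int) -> int:
    """Tokens left for hidden reasoning after emitting the visible answer."""
    total = int(budget_s * tokens_per_s)
    return max(0, total - answer_tokens)

budget = 2.0   # seconds the user is willing to wait
answer = 200   # visible answer length in tokens

slow = reasoning_tokens(budget, 50, answer)     # typical AR decode speed
fast = reasoning_tokens(budget, 2000, answer)   # claimed diffusion-class speed

print(slow, fast)  # 0 3800
```

At 50 tok/s the entire budget is consumed by the visible answer; at 2,000 tok/s the same budget leaves thousands of tokens for internal reasoning, which is the "intelligence per second" argument in miniature.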

Candidate use cases

  • Agentic work: multi-model arbitration, synthesis, parallel reasoning, code agents that explore multiple solution paths, validate via tools/tests, and surface only vetted options.
  • Everyday UX: spell-check, touch-keyboard disambiguation, syntax highlighting, database query planning, PDF-to-markdown parsing – replacing many small heuristic systems.
  • Coding: autocomplete, inline edits, fast “draft” models feeding slower AR “judge” models; edit-style tasks (e.g., Mercury Edit, Morph Fast Apply–like flows).
  • Voice: could reduce “thinking silence” if time-to-first-token is low enough; some see this as potentially game-changing for natural turn-taking.
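The draft-model-feeding-judge-model pattern above can be sketched in a few lines. This is a hypothetical pipeline, not Mercury's API: `call_model` is an injected stand-in for any chat-completion client, and the model names are placeholders.

```python
# Sketch of a draft-then-judge pipeline: a fast model proposes several
# candidates in parallel, a slower model arbitrates. The callable
# `call_model(model, prompt)` and both model names are assumptions.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def draft_then_judge(task: str,
                     call_model: Callable[[str, str], str],
                     n_drafts: int = 4) -> str:
    # Fan out draft calls concurrently; user-perceived latency is roughly
    # one draft call, so the extra internal calls feel "free".
    with ThreadPoolExecutor(max_workers=n_drafts) as pool:
        drafts = list(pool.map(
            lambda i: call_model("fast-draft-model", f"{task}\n(variant {i})"),
            range(n_drafts),
        ))
    # The slow AR judge sees all candidates once and picks the best.
    numbered = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    return call_model(
        "slow-judge-model",
        f"Task: {task}\nCandidates:\n{numbered}\nReturn the best candidate verbatim.",
    )
```

Injecting `call_model` keeps the sketch provider-agnostic; in practice the judge step would also run validation tools or tests before surfacing a result.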

Quality vs speed tradeoffs

  • Mercury 2 is described as roughly in the “fast agent” tier (Haiku 4.5 / GPT-mini class): strong for common coding and tool use, not frontier-level reasoning.
  • Debate over whether a faster but weaker model beats a slower, smarter one for real tasks; interest in benchmarks on end-to-end agent performance, not just static evals.
  • Some report it feels on par with good open models for math/engineering, others note failures on simple tests (car wash scenario, seahorse/snail emoji) and odd reasoning loops.

Views on diffusion LLMs

  • Split sentiment: some are underwhelmed and note diffusion systems have often trailed the quality/price Pareto frontier; others argue Mercury delivers roughly 5× the speed at equal quality, shifting the speed–quality frontier.
  • Several note text diffusion is far less mature than AR approaches; with comparable investment it might surpass AR in multiple dimensions.
  • Concerns about closed weights and sparse technical details limiting broader research and progress.

Technical questions and open problems

  • Questions about KV-cache analogs, block diffusion, dynamic block length, and how sequential dependencies are handled when generating in parallel.
  • Curiosity about theoretical links between transformers, diffusion, and flow matching, and whether one formulation can be mapped onto the other.
  • Open questions on scaling limits: could diffusion reach Opus-class intelligence while retaining speed?
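The parallel-generation question can be made concrete with a toy cost model: AR decoding needs one serial forward pass per token, while block diffusion denoises a whole block per pass. The block size and step count below are illustrative assumptions, and real systems differ in per-pass cost and in how many denoising steps a given quality requires.

```python
# Toy cost model contrasting autoregressive decoding with block-diffusion
# decoding. Purely illustrative; parameters are assumptions.

import math

def ar_forward_passes(n_tokens: int) -> int:
    # One forward pass per token: the KV cache makes each pass cheap,
    # but the passes are strictly serial.
    return n_tokens

def block_diffusion_forward_passes(n_tokens: int, block: int, steps: int) -> int:
    # Each block of `block` tokens is denoised in `steps` passes that are
    # parallel across the block; blocks still proceed left-to-right, which
    # is where a KV-cache analog for completed blocks would matter.
    return math.ceil(n_tokens / block) * steps

print(ar_forward_passes(1024))                       # 1024 serial passes
print(block_diffusion_forward_passes(1024, 32, 8))   # 256 passes
```

The open problems in the list map directly onto the parameters: dynamic block length varies `block` per step, and handling sequential dependencies is about keeping quality high while `steps` stays small.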

Product, demo, and ecosystem feedback

  • Early users hit overloaded servers, queue latency, and cryptic errors, making it hard to feel the promised speed.
  • Requests for: server-side rendering so agents can read the site, OpenRouter support, a public status page, clearer max-output-token limits, visible reasoning traces, and better web-search behavior.
  • Some report UI performance issues (heavy animations) and intermittent chat reliability.
  • Others praise the “unbelievably fast” feel when it works and like instant follow-up questions for exploratory/PKM workflows.

Hardware and future outlook

  • Strong interest in pairing diffusion LLMs with specialized hardware (Cerebras, Groq-like systems, Taalas chips) and speculation about orders-of-magnitude speedups.
  • General sense that both algorithmic and hardware advances are still early; debate over whether extra compute should buy more speed at current intelligence or push model capability further.