Mercury 2: Fast reasoning LLM powered by diffusion
Perceived benefits of speed
- Many see 1k–4k tokens/s as unlocking new interaction patterns: multi-shot prompting, nudging, and fast agent loops where extra internal calls are “free” from the user’s perspective.
- Speed is framed as a new dimension of quality (“intelligence per second”), especially for workflows where iteration speed matters more than peak intelligence.
- Faster models let more of a fixed latency budget be spent on reasoning, potentially raising effective quality.
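The latency-budget point above is simple arithmetic and can be made concrete. A minimal sketch, with illustrative throughput and budget numbers that are assumptions rather than measured figures:

```python
def reasoning_tokens(latency_budget_s: float, tokens_per_s: float) -> int:
    """Tokens a model can spend on internal reasoning within a fixed latency budget."""
    return int(latency_budget_s * tokens_per_s)

# Within the same 2-second budget, a 1000 tok/s model can "think" 10x longer
# than a 100 tok/s one before the user notices any extra delay.
fast = reasoning_tokens(2.0, 1000.0)  # 2000 tokens of reasoning
slow = reasoning_tokens(2.0, 100.0)   # 200 tokens in the same budget
```

The claim is that those extra reasoning tokens, not raw per-token quality, can raise effective output quality.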
Candidate use cases
- Agentic work: multi-model arbitration, synthesis, parallel reasoning, code agents that explore multiple solution paths, validate via tools/tests, and surface only vetted options.
- Everyday UX: spell-check, touch-keyboard disambiguation, syntax highlighting, database query planning, PDF-to-markdown parsing, potentially replacing many small heuristic systems.
- Coding: autocomplete, inline edits, fast “draft” models feeding slower autoregressive (AR) “judge” models; edit-style tasks (e.g., Mercury Edit, Morph Fast Apply–like flows).
- Voice: could reduce “thinking silence” if time-to-first-token is low enough; some see this as potentially game-changing for natural turn-taking.
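The “explore multiple solution paths, validate via tools/tests, surface only vetted options” pattern from the agentic bullet can be sketched in a few lines. This is a hypothetical illustration: `generate_candidate` and `run_tests` are stand-ins for a fast-model call and a test-suite run, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_candidate(task: str, seed: int) -> str:
    # Stand-in for a fast-model call with varied sampling (temperature/seed).
    return f"{task}::patch-{seed}"

def run_tests(candidate: str) -> bool:
    # Stand-in for running the project's test suite; here, even-seeded
    # candidates "pass" so the example is deterministic.
    return int(candidate.rsplit("-", 1)[1]) % 2 == 0

def vetted_options(task: str, n: int = 8) -> list[str]:
    # Fan out n candidate solutions in parallel, keep only those that pass tests.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate_candidate(task, s), range(n)))
    return [c for c in candidates if run_tests(c)]
```

At 1k–4k tokens/s, fanning out several candidates like this can still land inside a single user-perceived turn, which is why the extra internal calls read as “free”.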
Quality vs speed tradeoffs
- Mercury 2 is described as roughly in the “fast agent” tier (Haiku 4.5 / GPT-mini class): strong for common coding and tool use, not frontier-level reasoning.
- Debate over whether a faster but weaker model beats a slower, smarter one for real tasks; interest in benchmarks on end-to-end agent performance, not just static evals.
- Some report it feels on par with good open models for math/engineering, others note failures on simple tests (car wash scenario, seahorse/snail emoji) and odd reasoning loops.
Views on diffusion LLMs
- Split sentiment: some are underwhelmed and note diffusion systems have often trailed the quality/price Pareto frontier; others argue Mercury has shifted the speed–quality frontier by ~5× at equal quality.
- Several note text diffusion is far less mature than transformers; with comparable investment it might surpass AR in multiple dimensions.
- Concerns about closed weights and sparse technical details limiting broader research and progress.
Technical questions and open problems
- Questions about KV-cache analogs, block diffusion, dynamic block length, and how sequential dependencies are handled when generating in parallel.
- Curiosity about theory links between transformers, diffusion, flow matching, and whether one can be fitted to the other.
- Open questions on scaling limits: could diffusion reach Opus-class intelligence while retaining speed?
Product, demo, and ecosystem feedback
- Early users hit overloaded servers, queue latency, and cryptic errors, making it hard to feel the promised speed.
- Requests for: server-side rendering so agents can read the site, OpenRouter support, a public status page, clearer max-output-token limits, visible reasoning traces, and better web-search behavior.
- Some report UI performance issues (heavy animations) and intermittent chat reliability.
- Others praise the “unbelievably fast” feel when it works and like instant follow-up questions for exploratory/PKM workflows.
Hardware and future outlook
- Strong interest in pairing diffusion LLMs with specialized hardware (Cerebras, Groq-like systems, Taalas chips) and speculation about orders-of-magnitude speedups.
- Discussion that algorithmic and hardware advances are still early; debate over whether extra compute should go to more speed at current intelligence or to pushing model capability further.