Mercury 2: Fast reasoning LLM powered by diffusion

Perceived benefits of speed

  • Many see 1k–4k tokens/s as unlocking new interaction patterns: multi-shot prompting, nudging, and fast agent loops where extra internal calls are “free” from the user’s perspective.
  • Speed is framed as a new dimension of quality (“intelligence per second”), especially for workflows where iteration speed matters more than peak intelligence.
  • Faster models let more of a fixed latency budget be spent on reasoning, potentially raising effective quality.
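The latency-budget point can be made concrete with back-of-the-envelope arithmetic. A minimal sketch, where the token rates and budget are illustrative assumptions, not measured numbers:

```python
# How many hidden "reasoning" tokens fit in a fixed user-facing latency
# budget at different generation speeds. All numbers are illustrative.

def reasoning_tokens(budget_s: float, tokens_per_s: float, answer_tokens: int) -> int:
    """Tokens left for hidden reasoning after emitting the visible answer."""
    total = int(budget_s * tokens_per_s)
    return max(0, total - answer_tokens)

budget = 2.0   # seconds the user is willing to wait
answer = 200   # visible answer length in tokens

slow = reasoning_tokens(budget, 50, answer)     # typical AR decode speed
fast = reasoning_tokens(budget, 2000, answer)   # claimed diffusion-class speed

print(slow, fast)  # 0 3800
```

At 50 tok/s the entire budget is consumed by the visible answer; at 2,000 tok/s the same budget leaves thousands of tokens for internal reasoning, which is the "intelligence per second" argument in miniature.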

Candidate use cases

  • Agentic work: multi-model arbitration, synthesis, parallel reasoning, code agents that explore multiple solution paths, validate via tools/tests, and surface only vetted options.
  • Everyday UX: spell-check, touch-keyboard disambiguation, syntax highlighting, database query planning, PDF-to-markdown parsing – replacing many small heuristic systems.
  • Coding: autocomplete, inline edits, fast “draft” models feeding slower AR “judge” models; edit-style tasks (e.g., Mercury Edit, Morph Fast Apply–like flows).
  • Voice: could reduce “thinking silence” if time-to-first-token is low enough; some see this as potentially game-changing for natural turn-taking.
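The draft-model-feeding-judge-model pattern above can be sketched in a few lines. This is a hypothetical pipeline, not Mercury's API: `call_model` is an injected stand-in for any chat-completion client, and the model names are placeholders.

```python
# Sketch of a draft-then-judge pipeline: a fast model proposes several
# candidates in parallel, a slower model arbitrates. The callable
# `call_model(model, prompt)` and both model names are assumptions.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def draft_then_judge(task: str,
                     call_model: Callable[[str, str], str],
                     n_drafts: int = 4) -> str:
    # Fan out draft calls concurrently; user-perceived latency is roughly
    # one draft call, so the extra internal calls feel "free".
    with ThreadPoolExecutor(max_workers=n_drafts) as pool:
        drafts = list(pool.map(
            lambda i: call_model("fast-draft-model", f"{task}\n(variant {i})"),
            range(n_drafts),
        ))
    # The slow AR judge sees all candidates once and picks the best.
    numbered = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    return call_model(
        "slow-judge-model",
        f"Task: {task}\nCandidates:\n{numbered}\nReturn the best candidate verbatim.",
    )
```

Injecting `call_model` keeps the sketch provider-agnostic; in practice the judge step would also run validation tools or tests before surfacing a result.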

Quality vs speed tradeoffs

  • Mercury 2 is described as roughly in the “fast agent” tier (Haiku 4.5 / GPT-mini class): strong for common coding and tool use, not frontier-level reasoning.
  • Debate over whether a faster but weaker model beats a slower, smarter one for real tasks; interest in benchmarks on end-to-end agent performance, not just static evals.
  • Some report it feels on par with good open models for math/engineering, others note failures on simple tests (car wash scenario, seahorse/snail emoji) and odd reasoning loops.

Views on diffusion LLMs

  • Split sentiment: some are underwhelmed and note diffusion systems have often trailed the quality/price Pareto frontier; others argue Mercury delivers roughly 5× the speed at equal quality, shifting the speed–quality frontier.
  • Several note text diffusion is far less mature than AR approaches; with comparable investment it might surpass AR in multiple dimensions.
  • Concerns about closed weights and sparse technical details limiting broader research and progress.

Technical questions and open problems

  • Questions about KV-cache analogs, block diffusion, dynamic block length, and how sequential dependencies are handled when generating in parallel.
  • Curiosity about theoretical links between transformers, diffusion, and flow matching, and whether one formulation can be mapped onto the other.
  • Open questions on scaling limits: could diffusion reach Opus-class intelligence while retaining speed?
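The parallel-generation question can be made concrete with a toy cost model: AR decoding needs one serial forward pass per token, while block diffusion denoises a whole block per pass. The block size and step count below are illustrative assumptions, and real systems differ in per-pass cost and in how many denoising steps a given quality requires.

```python
# Toy cost model contrasting autoregressive decoding with block-diffusion
# decoding. Purely illustrative; parameters are assumptions.

import math

def ar_forward_passes(n_tokens: int) -> int:
    # One forward pass per token: the KV cache makes each pass cheap,
    # but the passes are strictly serial.
    return n_tokens

def block_diffusion_forward_passes(n_tokens: int, block: int, steps: int) -> int:
    # Each block of `block` tokens is denoised in `steps` passes that are
    # parallel across the block; blocks still proceed left-to-right, which
    # is where a KV-cache analog for completed blocks would matter.
    return math.ceil(n_tokens / block) * steps

print(ar_forward_passes(1024))                       # 1024 serial passes
print(block_diffusion_forward_passes(1024, 32, 8))   # 256 passes
```

The open problems in the list map directly onto the parameters: dynamic block length varies `block` per step, and handling sequential dependencies is about keeping quality high while `steps` stays small.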

Product, demo, and ecosystem feedback

  • Early users hit overloaded servers, queue latency, and cryptic errors, making it hard to feel the promised speed.
  • Requests for: server-side rendering so agents can read the site, OpenRouter support, a public status page, clearer max-output-token limits, visible reasoning traces, and better web-search behavior.
  • Some report UI performance issues (heavy animations) and intermittent chat reliability.
  • Others praise the “unbelievably fast” feel when it works and like instant follow-up questions for exploratory/PKM workflows.

Hardware and future outlook

  • Strong interest in pairing diffusion LLMs with specialized hardware (Cerebras, Groq-like systems, Taalas chips) and speculation about orders-of-magnitude speedups.
  • General sense that both algorithmic and hardware advances are still early; debate over whether extra compute should buy more speed at current intelligence or push model capability further.