Cerebras launches Qwen3-235B, achieving 1.5k tokens per second

Model, quantization & context confusion

  • Thread clarifies this is mainly an infrastructure / serving milestone, not a brand-new model architecture.
  • Uncertainty over whether Cerebras is serving Qwen3-235B quantized; past statements suggested they do not quantize, unlike some competitors, but nothing definitive is provided here.
  • Multiple people are confused by the overlapping Qwen model variants (235B, Coder/405B, “no-reasoning” vs. reasoning modes, A22B, etc.); the “A22B” suffix denotes roughly 22B active parameters per token in the mixture-of-experts design.
  • Context length is messy: the base model has a ~32K-token native context, while Cerebras advertises 131K via RoPE extension methods such as YaRN (see the arithmetic sketch after this list). The press release, tweets, the OpenRouter listing, and the pricing docs appear inconsistent (32K, 40K, 64K, 131K), and free vs. paid tiers differ.
  • One commenter notes theoretical 262K / 2M-token limits under more aggressive scaling and accuses Cerebras of not serving the model’s true maximum context.
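
For intuition on where the thread’s context figures come from, here is a rough arithmetic sketch in Python. The 32K base and the factor-4 extension to 131K follow Qwen’s published YaRN guidance; the larger factors are the speculative numbers commenters floated, not anything Cerebras has confirmed.

    # Rough arithmetic behind the context-length figures in the thread.
    # Qwen3's native window is 32,768 tokens; YaRN-style RoPE scaling
    # multiplies it by a chosen factor (Qwen documents factor 4 -> 131,072).
    NATIVE_CONTEXT = 32_768

    for factor in (1, 4, 8, 64):
        print(f"factor {factor:>2}: {NATIVE_CONTEXT * factor:>9,} tokens")

    # factor  1:    32,768 tokens  (native)
    # factor  4:   131,072 tokens  (what Cerebras advertises)
    # factor  8:   262,144 tokens  (the commenter's "theoretical" limit)
    # factor 64: 2,097,152 tokens  (the 2M claim; quality this far out is unproven)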

Speed, use cases & coding agents

  • 1,000–1,500 tokens/s is seen as transformative for agent loops, code iteration, and “time compression” (a rough arithmetic sketch follows this list). Many want the newer Qwen3-Coder hosted at similar speeds.
  • Some report API incompatibilities with existing agents (tool-call formatting, non-OpenAI-compliant behavior), making integration harder than with other providers.
  • There’s debate over deliberate “thinking” tokens: extra reasoning can improve output quality, but commenters report it also tends to loosen constraint-following and derail tasks.
  • People explore using a fast model behind the scenes to compact context for other LLMs (a sketch follows below), but say good production implementations are still rare.
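
To make the “time compression” claim concrete, here is a minimal sketch comparing wall-clock generation time for an iterative agent loop at a typical GPU-provider speed versus the reported Cerebras rate. The iteration count and token budget are illustrative assumptions, not measurements from the thread.

    # Back-of-envelope: how generation speed compresses an agent loop.
    # Assumed workload: 30 iterations, ~2,000 generated tokens each (illustrative).
    ITERATIONS = 30
    TOKENS_PER_STEP = 2_000

    for label, tok_per_s in [("typical GPU API", 60), ("Cerebras (reported)", 1_500)]:
        total_s = ITERATIONS * TOKENS_PER_STEP / tok_per_s
        print(f"{label:<20} {total_s / 60:5.1f} min of pure generation")

    # typical GPU API       16.7 min of pure generation
    # Cerebras (reported)    0.7 min of pure generation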
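
And a minimal sketch of the context-compaction idea: a fast, cheap model summarizes older turns before a slower model sees them. The endpoint URL, model id, and turn threshold are placeholder assumptions; as the thread notes, robust production versions of this are still rare.

    # Sketch: use a fast model to compact conversation history for another LLM.
    # Assumes an OpenAI-compatible endpoint; URL and model id are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")

    def compact(history: list[dict], keep_recent: int = 10) -> list[dict]:
        """Summarize all but the most recent turns with the fast model."""
        if len(history) <= keep_recent:
            return history
        older, recent = history[:-keep_recent], history[-keep_recent:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
        summary = client.chat.completions.create(
            model="qwen-3-235b-a22b",  # placeholder model id
            messages=[{"role": "user",
                       "content": "Summarize this conversation, preserving all facts, "
                                  "decisions, and open tasks:\n" + transcript}],
        ).choices[0].message.content
        return [{"role": "system", "content": f"Earlier turns, summarized: {summary}"}] + recent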

Hardware architecture, economics & energy

  • A large subthread debates whether a 235B model with 131K context can realistically be served from SRAM alone; initial back-of-envelope numbers (tens of wafers, >$100M) are later challenged (see the memory sketch after this list).
  • Others explain that Cerebras pairs large on-chip SRAM with external memory (MemoryX), sparse weights, and weight streaming; SRAM serves as working memory, not as the total parameter store.
  • There’s contention over actual wafer/system pricing, profitability at current API rates, and whether this is effectively VC-subsidized.
  • Energy use per query is asked about but remains unanswered; only rough system TDP guesses are mentioned.
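
A back-of-envelope version of the SRAM debate, for orientation. The ~44 GB on-wafer SRAM figure is Cerebras’s published WSE-3 spec; the precision options are illustrative assumptions, and the KV cache at 131K context adds more on top.

    # Rough memory math behind "can 235B + 131K context fit in SRAM alone?"
    PARAMS = 235e9
    WSE3_SRAM_GB = 44  # Cerebras's published on-wafer SRAM for the WSE-3

    for label, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
        weights_gb = PARAMS * bytes_per_param / 1e9
        wafers = weights_gb / WSE3_SRAM_GB
        print(f"{label}: {weights_gb:,.0f} GB of weights ~ {wafers:.0f} wafers of SRAM")

    # FP16: 470 GB of weights ~ 11 wafers of SRAM
    # FP8:  235 GB of weights ~ 5 wafers of SRAM
    # KV cache at 131K context adds more still (size depends on layer/head dims),
    # which is why the thread points to MemoryX streaming: SRAM holds the working
    # set, not the entire parameter store.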

Model quality, censorship & competition

  • Early anecdotal feedback: very fast but not yet matching top-tier models for creative writing or coding; some find outputs repetitive or “over-deterministic.”
  • Qwen is praised as one of the strongest open-weight families, but described as heavily censored on sensitive topics (e.g., events in China), similar in spirit to other models’ safety filters.
  • Many see a fast Qwen3-Coder on Cerebras as a potential cheaper, faster rival to leading proprietary models, especially for IDE-integrated coding.

Tooling, workflows & broader implications

  • Users share setups for routing tools (Claude Code, Aider, IDEs) through proxies like OpenRouter or litellm to reach Cerebras endpoints (a minimal routing sketch follows this list); reports are mixed but generally enthusiastic about the speed.
  • Some think near-instant LLMs will shift development toward highly interactive IDE workflows; others foresee spawning many parallel agent branches and post-hoc review instead.
  • There’s speculation that if this level of speed becomes widespread, compiler performance, inference hardware design, and even numeric formats could become new optimization frontiers.
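
As an example of the routing setups described above, a minimal litellm sketch; the model ids are placeholders, and the provider prefix follows litellm’s routing convention (OpenRouter works the same way with an "openrouter/" prefix).

    # Sketch: routing an OpenAI-style call through litellm to a Cerebras-hosted model.
    # Model ids are placeholders; set the provider's API key via env var or argument.
    import litellm

    response = litellm.completion(
        model="cerebras/qwen-3-235b-a22b",  # or "openrouter/qwen/qwen3-235b-a22b"
        messages=[{"role": "user", "content": "Refactor this function to be pure: ..."}],
    )
    print(response.choices[0].message.content)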