Cerebras launches Qwen3-235B, achieving 1.5k tokens per second
Model, quantization & context confusion
- Thread clarifies this is mainly an infrastructure / serving milestone, not a brand-new model architecture.
- Uncertainty over whether Cerebras is serving Qwen3-235B quantized; past statements suggested they do not quantize, unlike some competitors, but nothing definitive is provided here.
- Multiple people are confused by overlapping Qwen model variants (Qwen3-235B-A22B, the larger 480B Qwen3-Coder, "thinking" vs. non-thinking releases, and what the A22B active-parameter suffix means).
- Context length is messy: the base model has a ~32K native context, while Cerebras advertises 131K via RoPE/YaRN-style scaling (a config sketch follows this list). The PR, tweets, OpenRouter, and pricing docs appear mutually inconsistent (32K, 40K, 64K, 131K), and free vs. paid tiers differ.
- One commenter cites theoretical limits of 262K (or even 2M) tokens under more aggressive scaling and accuses Cerebras of not serving the model's true maximum context.
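
To make the scaling claims concrete, here is a minimal sketch, assuming the YaRN recipe the Qwen3 model card describes, of how a `rope_scaling` entry maps the native 32K window onto the advertised figures. Exact key names vary across `transformers`/vLLM versions, so treat this as illustrative:

```python
# YaRN-style context extension as documented for Qwen3. A factor of 4
# stretches the native 32,768-token RoPE window to 131,072; the 262K
# figure mentioned in the thread corresponds to factor=8.
NATIVE_CTX = 32_768

def yarn_rope_scaling(target_ctx: int) -> dict:
    """Build the `rope_scaling` stanza a serving stack would need
    to reach `target_ctx` tokens."""
    factor = target_ctx / NATIVE_CTX
    return {
        "rope_type": "yarn",                       # YaRN interpolation
        "factor": factor,                          # 4.0 -> 131K, 8.0 -> 262K
        "original_max_position_embeddings": NATIVE_CTX,
    }

print(yarn_rope_scaling(131_072))
# {'rope_type': 'yarn', 'factor': 4.0, 'original_max_position_embeddings': 32768}
```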
Speed, use cases & coding agents
- 1,000–1,500 tokens/s is seen as transformative for agent loops, code iteration, and "time compression" (see the arithmetic sketch after this list). Many want the newer Qwen3-Coder hosted at similar speeds.
- Some report API incompatibilities with existing agents (tool-call formatting, non-OpenAI-compliant behavior) that make integration harder than with other providers; a defensive-parsing sketch follows this list.
- There’s debate over deliberate "thinking" tokens: extra reasoning can improve output quality, but commenters say it also tends to loosen instruction-following and derail tasks.
- People explore using a fast model behind the scenes to compact context for other LLMs (sketched after this list), but say good production implementations are still rare.
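
To ground the "time compression" point above, a toy calculation; the loop shape and the 50 tok/s baseline are illustrative assumptions, not figures from the thread:

```python
# Illustrative arithmetic for the "time compression" claim. The agent-loop
# shape (20 steps x 2,000 generated tokens) and the 50 tok/s baseline are
# assumptions, not measurements.
steps, tokens_per_step = 20, 2_000

for label, tps in [("typical GPU endpoint", 50), ("Cerebras (claimed)", 1_500)]:
    seconds = steps * tokens_per_step / tps
    print(f"{label:>22}: {seconds/60:5.1f} min of pure generation")

# typical GPU endpoint:  13.3 min of pure generation
#   Cerebras (claimed):   0.4 min of pure generation
```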
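On the tool-call incompatibilities: most agents expect the OpenAI chat-completions shape, where `function.arguments` arrives as a JSON string. A hypothetical defensive check like the one below shows the fields that reportedly break with non-compliant providers:

```python
import json

# Minimal check for the OpenAI-style tool-call shape most agents expect.
# Providers the thread calls "non-compliant" typically break one of these
# fields (e.g. arguments that are not a valid JSON string).
def validate_tool_calls(message: dict) -> list[dict]:
    calls = []
    for tc in message.get("tool_calls") or []:
        fn = tc.get("function", {})
        try:
            args = json.loads(fn.get("arguments", "{}"))  # must parse as JSON
        except json.JSONDecodeError as exc:
            raise ValueError(f"malformed arguments in {fn.get('name')}: {exc}")
        calls.append({"id": tc["id"], "name": fn["name"], "arguments": args})
    return calls
```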
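And a minimal sketch of the context-compaction pattern, assuming an OpenAI-compatible Cerebras endpoint; the base URL and model id are placeholders, not confirmed identifiers:

```python
from openai import OpenAI

# Sketch of the compaction pattern discussed: a fast model summarizes older
# turns so the primary model sees only a short digest plus recent messages.
fast = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")  # assumed endpoint

def compact(history: list[dict], keep_last: int = 4) -> list[dict]:
    old, recent = history[:-keep_last], history[-keep_last:]
    if not old:
        return history
    digest = fast.chat.completions.create(
        model="qwen-3-235b-a22b",  # placeholder model id
        messages=[{"role": "user",
                   "content": "Summarize this conversation for an assistant "
                              "that will continue it:\n" + str(old)}],
    ).choices[0].message.content
    return [{"role": "system", "content": f"Earlier context: {digest}"}] + recent
```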
Hardware architecture, economics & energy
- A large subthread debates whether a 235B model with 131K context can realistically be served from SRAM alone; the initial back-of-envelope numbers (tens of wafers, >$100M of hardware) are later challenged (napkin math below).
- Others explain Cerebras uses large on-chip SRAM plus external memory (MemoryX), sparse weights, and streaming; SRAM is working memory, not total parameter store.
- There’s contention over actual wafer/system pricing, profitability at current API rates, and whether this is effectively VC-subsidized.
- Energy use per query is asked about but goes unanswered; only rough system-TDP guesses are offered (a unit conversion from TDP to per-query energy follows this list).
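
The napkin math behind the SRAM debate, using the one public number (44 GB of on-chip SRAM per WSE-3 wafer) plus an assumption about precision:

```python
# Back-of-envelope for the SRAM subthread. The 44 GB SRAM figure is a
# public WSE-3 spec; the rest is assumption-laden napkin math.
params = 235e9
weight_bytes = params * 2                 # FP16, i.e. no quantization
sram_per_wafer = 44e9                     # WSE-3 on-chip SRAM, bytes

print(f"wafers just to hold FP16 weights: {weight_bytes / sram_per_wafer:.1f}")
# ~10.7 wafers; KV cache for 131K-token contexts adds more per concurrent user.

# The counterpoint in the thread: MemoryX streams weights from external
# memory, so on-wafer SRAM only needs to hold the working set, not all
# ~470 GB of parameters at once.
```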
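For the unanswered energy question, the only honest move is unit conversion: an assumed ~23 kW system draw (a rough public figure for CS-class systems) at the claimed speed, ignoring batching entirely:

```python
# Converts a TDP guess into per-query energy. The 23 kW draw is assumed,
# and a real deployment batches many queries, so actual per-query energy
# could be far lower.
system_power_w = 23_000          # assumed whole-system draw
tokens, tps = 1_000, 1_500       # a 1,000-token reply at the claimed speed

joules = system_power_w * tokens / tps
print(f"{joules/3600:.1f} Wh per 1,000-token reply")   # ~4.3 Wh
```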
Model quality, censorship & competition
- Early anecdotal feedback: very fast but not yet matching top-tier models for creative writing or coding; some find outputs repetitive or “over-deterministic.”
- Qwen is praised as one of the strongest open-weight families, but described as heavily censored on sensitive topics (e.g., events in China), similar in spirit to other models’ safety filters.
- Many see a fast Qwen3-Coder on Cerebras as a potential cheaper, faster rival to leading proprietary models, especially for IDE-integrated coding.
Tooling, workflows & broader implications
- Users share setups for routing tools (Claude Code, Aider, IDEs) through proxies such as OpenRouter or litellm to hit Cerebras endpoints (example after this list); reports are mixed but generally enthusiastic about the speed.
- Some think near-instant LLMs will shift development toward highly interactive IDE workflows; others foresee spawning many parallel agent branches and post-hoc review instead.
- There’s speculation that if this level of speed becomes widespread, compiler performance, inference hardware design, and even number formats could become new optimization frontiers.
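
A sketch of the proxy-routing setup described above, using the stock `openai` client against OpenRouter; the model slug and provider-pinning payload follow OpenRouter's documented routing options but should be treated as assumptions:

```python
from openai import OpenAI

# Any OpenAI-compatible tool can point at OpenRouter and request the
# Cerebras-hosted variant of the model.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key="sk-or-...")                   # OpenRouter key

resp = client.chat.completions.create(
    model="qwen/qwen3-235b-a22b",                      # assumed OpenRouter slug
    extra_body={"provider": {"order": ["Cerebras"]}},  # pin the fast host
    messages=[{"role": "user", "content": "Refactor this function..."}],
)
print(resp.choices[0].message.content)
```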