Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

Performance & Latency Claims

  • Cerebras reports ~969 tokens/s on Llama 3.1 405B in bf16 at effectively batch size 1, i.e., single-stream decode speed rather than batched aggregate throughput, which many commenters find “wild” compared to typical GPU setups.
  • Multiple people running 70B/405B on 8×H100 nodes report struggling to exceed ~80–100 tok/s per request; the 1,500–2,500 tok/s figures some cite for 8×H100 reflect aggregate throughput under heavy optimization and batching, not single-stream speed.
  • Disagreement on whether GPU inference throughput scales well across multiple GPUs: one side argues batch-1 decode is memory-bandwidth-bound, so extra GPUs barely help per-request speed; the other counters that throughput scales fine once there are enough concurrent users to batch (see the roofline sketch after this list).
  • Several commenters caution that Cerebras’ numbers are likely under ideal, dedicated conditions; real-world latency will depend heavily on queuing and utilization.
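
A rough roofline estimate makes the bandwidth argument concrete: at batch size 1, every decoded token must stream all of the model’s weights through memory, so tokens/s is bounded above by aggregate memory bandwidth divided by weight bytes. A minimal sketch, assuming the published H100 SXM HBM3 bandwidth (~3.35 TB/s per GPU) and ignoring KV-cache reads and interconnect overhead:

```python
# Roofline ceiling for batch-1 decode: every token reads all weights once,
# so tok/s <= aggregate_memory_bandwidth / weight_bytes.
H100_HBM_BW = 3.35e12   # bytes/s per H100 SXM (published HBM3 spec)
N_GPUS = 8

def roofline_tok_s(n_params: float, bytes_per_param: int, bw: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed."""
    return bw / (n_params * bytes_per_param)

agg_bw = H100_HBM_BW * N_GPUS   # ~26.8 TB/s across the node

for name, n_params in [("70B", 70e9), ("405B", 405e9)]:
    bf16 = roofline_tok_s(n_params, 2, agg_bw)
    fp8 = roofline_tok_s(n_params, 1, agg_bw)
    print(f"{name}: <= {bf16:.0f} tok/s (bf16), <= {fp8:.0f} tok/s (fp8)")

# 70B:  <= 191 tok/s (bf16), <= 383 tok/s (fp8)
# 405B: <= 33 tok/s (bf16),  <= 66 tok/s (fp8)
```

On this estimate the ~80–100 tok/s people see on 70B sits under its ~191 tok/s bf16 ceiling, and 405B in bf16 cannot exceed ~33 tok/s on such a node at all (810 GB of weights do not even fit in 8×80 GB of HBM without quantization). Batching lifts aggregate throughput because one weight read is amortized across many requests, which reconciles the 1,500–2,500 tok/s figures with the slow single-stream numbers.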

Cerebras Architecture & Engineering

  • Uses wafer-scale integration: a single chip spanning nearly an entire wafer, with on the order of 900,000 cores and massive on-chip SRAM (44 GB per wafer, ~21 PB/s aggregate on-chip bandwidth).
  • No HBM; off-chip memory bandwidth is quoted at around 125–150 GB/s. The speed comes almost entirely from keeping weights resident in on-chip SRAM instead of streaming them from external memory (see the sketch after this list).
  • Defect handling via routing around bad cores, with a small percentage of spare cores; reported near-100% effective yield.
  • Individual systems pull on the order of 15–23 kW and are large, water-cooled “engine blocks.”
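
The SRAM argument is easy to quantify: compare how long one token’s weight read takes from each memory tier. A minimal sketch, assuming the 810 GB of bf16 weights are sharded evenly across ~19 wafers and taking the quoted 21 PB/s on-chip and 150 GB/s off-chip figures at face value:

```python
# Per-token weight-read time for one wafer's shard, SRAM vs off-chip memory.
WEIGHT_BYTES = 405e9 * 2          # bf16 weights: 810 GB total
N_WAFERS = 19                     # commenters' low-end estimate
shard = WEIGHT_BYTES / N_WAFERS   # ~42.6 GB resident per wafer

ON_CHIP_BW = 21e15                # ~21 PB/s aggregate SRAM bandwidth per wafer
OFF_CHIP_BW = 150e9               # ~150 GB/s quoted off-chip bandwidth

t_sram = shard / ON_CHIP_BW       # ~2.0 microseconds per token
t_ext = shard / OFF_CHIP_BW       # ~0.28 seconds per token

print(f"SRAM-resident: {t_sram * 1e6:.1f} us/token -> ceiling ~{1 / t_sram:,.0f} tok/s")
print(f"Off-chip:      {t_ext:.2f} s/token -> ceiling ~{1 / t_ext:.1f} tok/s")
# SRAM-resident: 2.0 us/token -> ceiling ~492,593 tok/s
# Off-chip:      0.28 s/token -> ceiling ~3.5 tok/s
```

The SRAM ceiling is far above the delivered 969 tok/s, so the observed number is presumably bounded by compute and wafer-to-wafer communication rather than memory bandwidth, while streaming weights from off-chip memory would cap the whole system at a few tokens per second.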

Scale, Cost & Practicality

  • To hold the 810 GB of 405B bf16 weights plus KV cache, commenters estimate ~19–22 wafers/systems, implying roughly 20 racks, ~0.5 MW of power, and around $30M in capital cost at current pricing (the arithmetic is sketched after this list).
  • This leads to debate on cost-per-token vs large GPU clusters; some think Cerebras is not dramatically cheaper, just different.
  • Many doubt wafer-scale systems will ever reach consumer-level prices, though some speculate costs could fall over years.
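
The estimates follow from a few lines of arithmetic, reproduced below. The Llama 3.1 405B configuration used for the KV-cache term (126 layers, 8 KV heads of dimension 128) is the published architecture; the per-system price is an assumption back-solved from the quoted ~$30M total, not a published figure.

```python
# Back-of-envelope sizing for serving Llama 3.1 405B in bf16 from on-chip SRAM.
BYTES_BF16 = 2
weights = 405e9 * BYTES_BF16                  # 810 GB of weights

# KV cache per token: K and V, per layer, per KV head (GQA), in bf16.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128      # published Llama 3.1 405B config
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16  # ~0.5 MB/token
kv_cache = kv_per_token * 128_000             # one full 128k context: ~66 GB

SRAM_PER_WAFER = 44e9
wafers = (weights + kv_cache) / SRAM_PER_WAFER  # ~19.9, inside the 19-22 range

N_SYSTEMS = 20                                # round up to whole systems
POWER_KW = 23                                 # high-end per-system draw
PRICE_PER_SYSTEM = 1.5e6                      # assumed $/system (hypothetical)

print(f"KV cache per token: {kv_per_token / 1e6:.2f} MB")
print(f"Wafers needed: ~{wafers:.1f}")
print(f"Power: ~{N_SYSTEMS * POWER_KW / 1000:.2f} MW, "
      f"capital: ~${N_SYSTEMS * PRICE_PER_SYSTEM / 1e6:.0f}M")
# KV cache per token: 0.52 MB
# Wafers needed: ~19.9
# Power: ~0.46 MW, capital: ~$30M
```

Holding KV cache for more concurrent sequences pushes the count toward the top of the 19–22 range, which presumably explains the spread in the estimates.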

Comparisons to Other Hardware

  • Compared with Nvidia H100s and AMD Instinct MI-series GPUs: GPUs are general-purpose and widely deployable, while Cerebras hardware is highly specialized and accessed through a tightly controlled API.
  • Groq is frequently mentioned as the other “speed” contender, but is seen as less competitive on 405B-scale models and constrained by capacity.
  • AMD’s MI325X is cited as a strong upcoming GPU option, with large HBM capacity and bandwidth as its main selling points.

Use Cases, Impact & Future Directions

  • Fast 405B inference is seen as enabling much heavier chain-of-thought, tool use, and multi-agent systems, and possibly real-time or near-real-time interactive applications (video-like interfaces, robotics, complex automation).
  • Some argue model accuracy is now “good enough” and latency is the main bottleneck for many applications.
  • Others question how broadly such ultra-fast, ultra-expensive setups will be deployed, suggesting they may stay niche for workloads that truly demand minimal latency (e.g., certain financial or high-value interactive systems).