Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

Performance & Latency Claims

  • Cerebras reports ~969 tokens/s on Llama 3.1 405B in bf16 at effectively batch size 1, i.e., single-stream decode speed rather than batched aggregate throughput, which many commenters find “wild” compared to typical GPU setups.
  • Multiple people running 70B/405B on 8×H100 nodes report struggling to exceed ~80–100 tok/s per request; the 1,500–2,500 tok/s figures some cite for 8×H100 reflect aggregate throughput under heavy optimization and batching, not single-stream speed.
  • Disagreement on whether GPU inference throughput scales well across multiple GPUs: one side argues batch-1 decode is memory-bandwidth-bound, so extra GPUs barely help per-request speed; the other counters that throughput scales fine once there are enough concurrent users to batch (see the roofline sketch after this list).
  • Several commenters caution that Cerebras’ numbers are likely under ideal, dedicated conditions; real-world latency will depend heavily on queuing and utilization.
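
A rough roofline estimate makes the bandwidth argument concrete: at batch size 1, every decoded token must stream all of the model’s weights through memory, so tokens/s is bounded above by aggregate memory bandwidth divided by weight bytes. A minimal sketch, assuming the published H100 SXM HBM3 bandwidth (~3.35 TB/s per GPU) and ignoring KV-cache reads and interconnect overhead:

```python
# Roofline ceiling for batch-1 decode: every token reads all weights once,
# so tok/s <= aggregate_memory_bandwidth / weight_bytes.
H100_HBM_BW = 3.35e12   # bytes/s per H100 SXM (published HBM3 spec)
N_GPUS = 8

def roofline_tok_s(n_params: float, bytes_per_param: int, bw: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed."""
    return bw / (n_params * bytes_per_param)

agg_bw = H100_HBM_BW * N_GPUS   # ~26.8 TB/s across the node

for name, n_params in [("70B", 70e9), ("405B", 405e9)]:
    bf16 = roofline_tok_s(n_params, 2, agg_bw)
    fp8 = roofline_tok_s(n_params, 1, agg_bw)
    print(f"{name}: <= {bf16:.0f} tok/s (bf16), <= {fp8:.0f} tok/s (fp8)")

# 70B:  <= 191 tok/s (bf16), <= 383 tok/s (fp8)
# 405B: <= 33 tok/s (bf16),  <= 66 tok/s (fp8)
```

On this estimate the ~80–100 tok/s people see on 70B sits under its ~191 tok/s bf16 ceiling, and 405B in bf16 cannot exceed ~33 tok/s on such a node at all (810 GB of weights do not even fit in 8×80 GB of HBM without quantization). Batching lifts aggregate throughput because one weight read is amortized across many requests, which reconciles the 1,500–2,500 tok/s figures with the slow single-stream numbers.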

Cerebras Architecture & Engineering

  • Uses wafer-scale integration: a single chip spanning nearly an entire wafer, with on the order of 900,000 cores and massive on-chip SRAM (44 GB per wafer, ~21 PB/s aggregate on-chip bandwidth).
  • No HBM; off-chip memory bandwidth is quoted at around 125–150 GB/s. The speed comes almost entirely from keeping weights resident in on-chip SRAM instead of streaming them from external memory (see the sketch after this list).
  • Defect handling via routing around bad cores, with a small percentage of spare cores; reported near-100% effective yield.
  • Individual systems pull on the order of 15–23 kW and are large, water-cooled “engine blocks.”
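
The SRAM argument is easy to quantify: compare how long one token’s weight read takes from each memory tier. A minimal sketch, assuming the 810 GB of bf16 weights are sharded evenly across ~19 wafers and taking the quoted 21 PB/s on-chip and 150 GB/s off-chip figures at face value:

```python
# Per-token weight-read time for one wafer's shard, SRAM vs off-chip memory.
WEIGHT_BYTES = 405e9 * 2          # bf16 weights: 810 GB total
N_WAFERS = 19                     # commenters' low-end estimate
shard = WEIGHT_BYTES / N_WAFERS   # ~42.6 GB resident per wafer

ON_CHIP_BW = 21e15                # ~21 PB/s aggregate SRAM bandwidth per wafer
OFF_CHIP_BW = 150e9               # ~150 GB/s quoted off-chip bandwidth

t_sram = shard / ON_CHIP_BW       # ~2.0 microseconds per token
t_ext = shard / OFF_CHIP_BW       # ~0.28 seconds per token

print(f"SRAM-resident: {t_sram * 1e6:.1f} us/token -> ceiling ~{1 / t_sram:,.0f} tok/s")
print(f"Off-chip:      {t_ext:.2f} s/token -> ceiling ~{1 / t_ext:.1f} tok/s")
# SRAM-resident: 2.0 us/token -> ceiling ~492,593 tok/s
# Off-chip:      0.28 s/token -> ceiling ~3.5 tok/s
```

The SRAM ceiling is far above the delivered 969 tok/s, so the observed number is presumably bounded by compute and wafer-to-wafer communication rather than memory bandwidth, while streaming weights from off-chip memory would cap the whole system at a few tokens per second.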

Scale, Cost & Practicality

  • To hold the 810 GB of 405B bf16 weights plus KV cache, commenters estimate ~19–22 wafers/systems, implying roughly 20 racks, ~0.5 MW of power, and around $30M in capital cost at current pricing (the arithmetic is sketched after this list).
  • This leads to debate on cost-per-token vs large GPU clusters; some think Cerebras is not dramatically cheaper, just different.
  • Many doubt wafer-scale systems will ever reach consumer-level prices, though some speculate costs could fall over years.
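
The estimates follow from a few lines of arithmetic, reproduced below. The Llama 3.1 405B configuration used for the KV-cache term (126 layers, 8 KV heads of dimension 128) is the published architecture; the per-system price is an assumption back-solved from the quoted ~$30M total, not a published figure.

```python
# Back-of-envelope sizing for serving Llama 3.1 405B in bf16 from on-chip SRAM.
BYTES_BF16 = 2
weights = 405e9 * BYTES_BF16                  # 810 GB of weights

# KV cache per token: K and V, per layer, per KV head (GQA), in bf16.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128      # published Llama 3.1 405B config
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16  # ~0.5 MB/token
kv_cache = kv_per_token * 128_000             # one full 128k context: ~66 GB

SRAM_PER_WAFER = 44e9
wafers = (weights + kv_cache) / SRAM_PER_WAFER  # ~19.9, inside the 19-22 range

N_SYSTEMS = 20                                # round up to whole systems
POWER_KW = 23                                 # high-end per-system draw
PRICE_PER_SYSTEM = 1.5e6                      # assumed $/system (hypothetical)

print(f"KV cache per token: {kv_per_token / 1e6:.2f} MB")
print(f"Wafers needed: ~{wafers:.1f}")
print(f"Power: ~{N_SYSTEMS * POWER_KW / 1000:.2f} MW, "
      f"capital: ~${N_SYSTEMS * PRICE_PER_SYSTEM / 1e6:.0f}M")
# KV cache per token: 0.52 MB
# Wafers needed: ~19.9
# Power: ~0.46 MW, capital: ~$30M
```

Holding KV cache for more concurrent sequences pushes the count toward the top of the 19–22 range, which presumably explains the spread in the estimates.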

Comparisons to Other Hardware

  • Compared with Nvidia H100s and AMD Instinct MI-series GPUs: GPUs are general-purpose and widely deployable, while Cerebras hardware is highly specialized and accessed through a tightly controlled API.
  • Groq is frequently mentioned as the other “speed” contender, but is seen as less competitive on 405B-scale models and constrained by capacity.
  • AMD’s MI325X is cited as a strong upcoming GPU option, with large HBM capacity and bandwidth as its main selling points.

Use Cases, Impact & Future Directions

  • Fast 405B inference is seen as enabling much heavier chain-of-thought, tool use, and multi-agent systems, and possibly real-time or near-real-time interactive applications (video-like interfaces, robotics, complex automation).
  • Some argue model accuracy is now “good enough” and latency is the main bottleneck for many applications.
  • Others question how broadly such ultra-fast, ultra-expensive setups will be deployed, suggesting they may stay niche for workloads that truly demand minimal latency (e.g., certain financial or high-value interactive systems).