Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs
Model integration & ecosystem
- Several commenters were surprised by how much “massaging” is needed: models are not plug‑and‑play, especially with frameworks like TensorRT‑LLM (TRT‑LLM).
- GPT‑OSS is architecturally conventional, but its new “Harmony” conversation format makes it a special case until tooling catches up.
- TRT‑LLM is described as usually the fastest option on NVIDIA GPUs but also the hardest to set up, brittle, and often behind on new model architectures; vLLM is seen as easier and “flawless” for many setups.
- Some note that for GPT‑OSS there was explicit coordination to ensure day‑1 support in inference engines.
Speculative decoding explanations
- Multiple comments unpack speculative decoding: a small “draft” model proposes several tokens; the large model validates them in a single forward pass.
- Key point: decoding is memory‑bandwidth bound, so verifying several proposed tokens in one prefill‑style forward pass costs little more than generating a single token serially.
- Savings depend on the small model being fast and often correct; if it’s wrong too often, you lose speed and just burn extra memory for two models.
- Both models must behave similarly for the best gains; otherwise too many draft tokens are rejected and the speed advantage collapses (a toy sketch of the accept/verify loop follows this list).
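A toy sketch of that accept/verify loop, purely illustrative: the two stand‑in functions below play the roles of the draft and target models over a tiny integer “vocabulary,” and the Python loop that checks drafted tokens stands in for the single batched forward pass a real engine would run.

```python
# Toy sketch of greedy speculative decoding over a tiny integer "vocabulary".
# draft_next / target_next are stand-ins for the small and large models; a real
# engine would score all drafted positions in one batched forward pass.

def draft_next(ctx):
    # Small, fast draft model: cheap guess at the next token.
    return (sum(ctx) + 1) % 7

def target_next(ctx):
    # Large target model: the output we actually trust. It mostly agrees with
    # the draft, but diverges at some positions to exercise the reject path.
    return (sum(ctx) + (1 if len(ctx) % 5 else 2)) % 7

def speculative_step(ctx, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(ctx + drafted))

    # 2) Target model verifies the drafted tokens; accept the longest prefix
    #    where its greedy choice agrees, then emit one token of its own.
    accepted = []
    for tok in drafted:
        expected = target_next(ctx + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)   # first mismatch: keep target's token, stop
            break
    else:
        accepted.append(target_next(ctx + accepted))  # all accepted: bonus token

    return accepted  # 1 to k+1 committed tokens per target-model pass

context = [0]
for _ in range(6):
    context += speculative_step(context)
print(context)
```

Each step commits between 1 and k+1 tokens per target‑model pass, which is where the speedup comes from when the draft model agrees often enough.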
Performance on consumer hardware
- People report GPT‑OSS‑20B and even 120B running on consumer setups with quantization and offloading: e.g., 20B on mid‑range GPUs, 120B partly in system RAM via llama.cpp/LM Studio.
- Token rates shared: ~20–50 tok/s for 120B on CPU‑heavy boxes or mixed CPU+mid‑GPU setups; ~150 tok/s for 20B on newer high‑end GPUs; ~60 dropping to ~30 tok/s on Apple Silicon as context grows.
- Context length slowdown is debated: some emphasize that attention compute grows quadratically with context size; others stress that per‑token generation is dominated by memory bandwidth (see the back‑of‑envelope estimate below).
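A back‑of‑envelope version of the bandwidth argument. All numbers below (active‑parameter count, bits per weight, bandwidth figures) are illustrative assumptions, not measurements from the thread: per decoded token the hardware must stream roughly the active weights once, so tokens per second is bounded above by bandwidth divided by bytes touched per token.

```python
# Back-of-envelope decode ceiling: per generated token the hardware streams
# roughly the model's active weights (plus KV cache) from memory once, so
# tokens/s <= bandwidth / bytes_per_token. All figures are illustrative.

def decode_ceiling_tok_s(active_params_billions, bits_per_weight,
                         bandwidth_gb_s, kv_bytes_per_token=0):
    bytes_per_token = (active_params_billions * 1e9 * bits_per_weight / 8
                       + kv_bytes_per_token)
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical MoE model with ~5B active parameters at ~4.25 bits/weight
# (MXFP4-style), ignoring the KV cache, on hardware with assumed bandwidths:
for label, bw in [("dual-channel DDR5 (~80 GB/s)", 80),
                  ("high-end consumer GPU (~1000 GB/s)", 1000),
                  ("datacenter GPU (~3350 GB/s)", 3350)]:
    print(f"{label}: ~{decode_ceiling_tok_s(5, 4.25, bw):.0f} tok/s ceiling")
```

This is only an upper bound: it ignores prefill compute, attention over a growing KV cache, and offloading overheads, which is why long contexts and CPU offload pull real‑world numbers well below it.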
Datacenter GPUs, cost, and naming
- H100s are described as “widely available” in the sense of rentable, not affordable to own; several point out you can rent them for a few dollars/hour.
- Debate over whether it’s meaningful to call data‑center accelerators “GPUs” and how to distinguish them from consumer “graphics cards.”
- Some argue many cheap consumer GPUs in aggregate could exceed one H100 on raw compute, but interconnect limits and product segmentation (e.g., the removal of NVLink from consumer cards) prevent easy clustering.
Accuracy, tools, and local agents
- Some users find GPT‑OSS models easy to run but unimpressive in factual accuracy; others insist LLMs should be paired with tools/RAG for factual queries rather than trusted directly (a minimal retrieval sketch follows this list).
- Offline use highlights how much value now comes from tools: web search, MCPs, and coding agents degrade significantly without connectivity.
- There is interest in fully local agentic coding on modest GPUs, but VRAM and model size remain the main constraints.
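A minimal sketch of the “pair the model with retrieval” pattern mentioned above. The corpus, the naive keyword‑overlap scoring, and the `local_llm` placeholder are all assumptions for illustration; the point is only that factual claims come from retrieved text placed in the prompt rather than from the model’s weights.

```python
# Minimal retrieval-before-generation sketch: pull relevant snippets from a
# local corpus into the prompt so factual claims are grounded in retrieved
# text rather than the model's parametric memory. Keyword overlap stands in
# for a real embedding index; local_llm is a placeholder for whatever local
# inference server is actually used.

CORPUS = [
    "GPT-OSS-120B is a mixture-of-experts model released with open weights.",
    "llama.cpp can offload layers to system RAM when VRAM runs out.",
    "Speculative decoding pairs a small draft model with a larger verifier.",
]

def retrieve(question, corpus, k=2):
    q_words = set(question.lower().split())
    return sorted(corpus,
                  key=lambda doc: -len(q_words & set(doc.lower().split())))[:k]

def local_llm(prompt):
    # Placeholder: in practice this would call a local inference endpoint.
    return f"[answer generated from a {len(prompt)}-character grounded prompt]"

def answer(question):
    snippets = retrieve(question, CORPUS)
    prompt = ("Answer using only the context below; say 'unknown' otherwise.\n\n"
              + "\n".join(f"- {s}" for s in snippets)
              + f"\n\nQuestion: {question}")
    return local_llm(prompt)

print(answer("How does speculative decoding work?"))
```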
Open‑source, alignment, and politics
- One thread links GPT‑OSS to US policy goals about “open‑source AI” and “protecting American values,” raising concern about models as vehicles for particular ideological worldviews.
- Views diverge on whether aligning models to a “Western” or “American” worldview is desirable, dangerous, or inherently contested; some worry about partisan RLHF swings over time.