Furiosa: 3.5x efficiency over H100s

Practical usability & ecosystem lock‑in

  • People ask how usable this is for typical non‑AI orgs and whether it locks them into a narrow ecosystem.
  • It’s compared to AWS Trainium/Inferentia: somewhat niche but still adopted by “normal” companies.
  • Major concern: how many models are supported out of the box versus hand‑ported by the vendor one at a time (the cited Llama 3.1‑8B example is called “dated”).
  • A separate Furiosa post shows gpt‑oss‑120B running, which reassures some but doesn’t dispel worries about a limited, curated model set.
  • Memory (48GB per card, 8 cards per box) is seen as tight for large, batched open models (a rough capacity sketch follows this list). Networking is also questioned for data‑center use.
  • Several commenters say interest depends entirely on price, delivery timeline, and ability to drop into a standard air‑cooled rack.
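
To make the memory concern concrete, here is a rough capacity sketch. The 8 × 48 GB figure comes from the thread; the model size, quantization, context length, and attention geometry below are illustrative assumptions, not Furiosa specs or numbers from the discussion.

```python
# Back-of-the-envelope: how far does 8 x 48 GB = 384 GB per box go for a
# large open model served with batched requests? Every model/config number
# below is an illustrative assumption, not a figure from the article.

GB = 1e9  # decimal gigabytes, matching the "48GB per card" figure

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory for a dense model."""
    return params_billions * 1e9 * bytes_per_param / GB

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float) -> float:
    """Approximate KV-cache memory for one sequence at full context (x2 for K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GB

box_gb = 8 * 48  # 8 cards x 48 GB, as described in the thread

# Hypothetical ~120B-parameter dense model, 8-bit weights, fp16 KV cache,
# 32k context, grouped-query attention geometry chosen purely for illustration.
w = weights_gb(120, 1.0)
kv_per_seq = kv_cache_gb(n_layers=96, n_kv_heads=8, head_dim=128,
                         context_len=32_768, bytes_per_elem=2.0)
headroom = box_gb - w
print(f"weights ~{w:.0f} GB, KV cache ~{kv_per_seq:.1f} GB/seq, "
      f"headroom for ~{int(headroom / kv_per_seq)} full-context sequences")
```

Under these assumptions a single box holds a ~120B‑class model comfortably but leaves room for only a couple dozen full‑context sequences, and a 405B‑class model at 8‑bit would not fit at all without sharding across boxes.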

Benchmarks, performance & power framing

  • The headline “3.5x over H100” draws scrutiny because it compares to 3× H100 PCIe, not the more common 8× H100 SXM or newer GB200/B200 setups.
  • The vendor defines a rack as 15kW, which makes 3× H100 look like <10% of rack power; some find this assumption unrealistic.
  • One reader computes ~86 tok/s per Furiosa chip vs ~2390 tok/s per H100 on one workload, concluding raw performance is worse; others note the chip is sold on efficiency (tokens per watt) and TCO, not peak speed (the arithmetic is sketched after this list).
  • There is confusion between latency and throughput in the comparison, and no clear, apples‑to‑apples tokens/W chart vs modern Nvidia parts.
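
A minimal sketch of the arithmetic behind the two bullets above. The tok/s figures are the ones quoted in the thread for a single workload; the per‑card power numbers are assumed nameplate TDPs (roughly 350 W for an H100 PCIe, 150 W for an RNGD card), not measured draw; and because the quoted Furiosa figure may be a latency‑style number rather than aggregate throughput, the tokens‑per‑watt line reconstructs the disputed math rather than settling it.

```python
# Reconstructing the back-of-the-envelope math being argued about above.
# tok/s figures: as quoted by a commenter for one workload.
# Watt figures: assumed nameplate TDPs, not measured wall power.

H100_PCIE_W = 350   # assumed TDP of an H100 PCIe card
RNGD_W = 150        # assumed TDP of a Furiosa RNGD card
RACK_W = 15_000     # the vendor's rack power definition, per the thread

# Rack-power framing: 3x H100 PCIe uses only a small slice of a 15 kW rack.
h100_trio_w = 3 * H100_PCIE_W
print(f"3x H100 PCIe ~{h100_trio_w} W = {h100_trio_w / RACK_W:.0%} of a 15 kW rack")

# Per-chip framing: raw throughput and a naive tokens-per-watt ratio.
for name, tok_s, watts in [("Furiosa RNGD", 86, RNGD_W),
                           ("H100 PCIe", 2390, H100_PCIE_W)]:
    print(f"{name:12s} {tok_s:5d} tok/s  ~{tok_s / watts:5.2f} tok/s per W")
```

Taken at face value these numbers favor the H100 on both axes, which is exactly why commenters ask for a clean, like‑for‑like tokens/W chart instead of mixed latency and throughput figures.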

Inference vs training & workload focus

  • Some lose interest when they realize it’s inference‑only; others argue inference will dominate future LLM costs and a 3× efficiency gain is significant.
  • A counterpoint: many AI clusters today are still primarily used for training, with LLM inference a minority of GPU usage.
  • One commenter notes that focusing on massive LLMs may be too narrow a bet; many AI applications aren’t giant chatbots.

Power, cooling & AI economics

  • Several posts tie Furiosa’s positioning (efficient, air‑cooled inference) to broader worries about Nvidia’s multi‑kW GPUs forcing expensive, specialized datacenters.
  • There’s extended debate over whether current AI capex is sustainable: huge spend commitments, circular vendor relationships, and bubble analogies (railroads, OC‑768, crypto).
  • Some argue labs are capacity‑constrained and profitable on inference; others think the whole stack only works while investors subsidize training and free usage.

Competition, TPUs & Nvidia’s moat

  • Comparisons are made with TPUs (efficient but hard to program) and other inference‑first chips (Groq, Cerebras, Etched).
  • Consensus: Nvidia’s advantages are software maturity, developer ecosystem, networking, supply chain, and control of HBM capacity.
  • Skeptics predict many specialized inference startups will fail for familiar reasons: fragile assumptions about workloads, compiler/runtime “magic” that never arrives, and underestimating memory bandwidth as the real bottleneck (see the bandwidth‑bound sketch below).
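
For readers unfamiliar with the bandwidth argument, the sketch below shows the usual roofline‑style reasoning: at small batch sizes, generating each token means streaming roughly the entire weight set from memory, so decode speed is capped by bandwidth rather than FLOPs. Both the model size and the bandwidth figures are illustrative assumptions, not specs from the article.

```python
# Why memory bandwidth tends to be the real bottleneck for LLM decode:
# at small batch, each generated token must stream ~all weights from memory,
# so per-stream throughput is capped near bandwidth / model size.
# Both the model size and the bandwidth figures are illustrative assumptions.

def decode_ceiling_tok_s(model_weight_gb: float, hbm_bw_gb_s: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode throughput."""
    return hbm_bw_gb_s / model_weight_gb

model_weight_gb = 70  # e.g. a 70B-parameter model in 8-bit weights

for label, bw in [("1.5 TB/s part", 1500), ("3.35 TB/s part", 3350)]:
    ceiling = decode_ceiling_tok_s(model_weight_gb, bw)
    print(f"{label}: <= {ceiling:.0f} tok/s per stream, regardless of FLOPs")
```

Batching amortizes those weight reads across many requests, which is why aggregate throughput figures run far above this per‑stream ceiling and why runtime maturity and KV‑cache handling matter as much as peak FLOPs.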

Website & presentation issues

  • Multiple people cannot read the blog because it demands WebGL; they criticize this for a text article and note it even breaks on relatively new iPhones.
  • Workarounds like browser reader mode are mentioned; some speculate it’s just “glitter” for investors rather than a user‑first design.