Furiosa: 3.5x efficiency over H100s

Practical usability & ecosystem lock‑in

  • People ask how usable this is for typical non‑AI orgs and whether it locks them into a narrow ecosystem.
  • It’s compared to AWS Trainium/Inferentia: somewhat niche but still adopted by “normal” companies.
  • Major concern: how many models are supported out of the box versus hand‑ported by the vendor one at a time (the cited Llama 3.1‑8B example is called “dated”).
  • A separate Furiosa post shows gpt‑oss‑120B running, which reassures some but doesn’t dispel worries about a limited, curated model set.
  • Memory (48GB per card, 8 cards per box) is seen as tight for large, batched open models (a rough capacity sketch follows this list). Networking is also questioned for data‑center use.
  • Several commenters say interest depends entirely on price, delivery timeline, and ability to drop into a standard air‑cooled rack.
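
To make the memory concern concrete, here is a rough capacity sketch. The 8 × 48 GB figure comes from the thread; the model size, quantization, context length, and attention geometry below are illustrative assumptions, not Furiosa specs or numbers from the discussion.

```python
# Back-of-the-envelope: how far does 8 x 48 GB = 384 GB per box go for a
# large open model served with batched requests? Every model/config number
# below is an illustrative assumption, not a figure from the article.

GB = 1e9  # decimal gigabytes, matching the "48GB per card" figure

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory for a dense model."""
    return params_billions * 1e9 * bytes_per_param / GB

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float) -> float:
    """Approximate KV-cache memory for one sequence at full context (x2 for K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GB

box_gb = 8 * 48  # 8 cards x 48 GB, as described in the thread

# Hypothetical ~120B-parameter dense model, 8-bit weights, fp16 KV cache,
# 32k context, grouped-query attention geometry chosen purely for illustration.
w = weights_gb(120, 1.0)
kv_per_seq = kv_cache_gb(n_layers=96, n_kv_heads=8, head_dim=128,
                         context_len=32_768, bytes_per_elem=2.0)
headroom = box_gb - w
print(f"weights ~{w:.0f} GB, KV cache ~{kv_per_seq:.1f} GB/seq, "
      f"headroom for ~{int(headroom / kv_per_seq)} full-context sequences")
```

Under these assumptions a single box holds a ~120B‑class model comfortably but leaves room for only a couple dozen full‑context sequences, and a 405B‑class model at 8‑bit would not fit at all without sharding across boxes.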

Benchmarks, performance & power framing

  • The headline “3.5x over H100” draws scrutiny because it compares to 3× H100 PCIe, not the more common 8× H100 SXM or newer GB200/B200 setups.
  • The vendor defines a rack as 15kW, which makes 3× H100 look like <10% of rack power; some find this assumption unrealistic.
  • One reader computes ~86 tok/s per Furiosa chip vs ~2390 tok/s per H100 on one workload, concluding raw performance is worse; others note the chip is sold on efficiency (tokens per watt) and TCO, not peak speed (the arithmetic is sketched after this list).
  • There is confusion between latency and throughput in the comparison, and no clear, apples‑to‑apples tokens/W chart vs modern Nvidia parts.
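
A minimal sketch of the arithmetic behind the two bullets above. The tok/s figures are the ones quoted in the thread for a single workload; the per‑card power numbers are assumed nameplate TDPs (roughly 350 W for an H100 PCIe, 150 W for an RNGD card), not measured draw; and because the quoted Furiosa figure may be a latency‑style number rather than aggregate throughput, the tokens‑per‑watt line reconstructs the disputed math rather than settling it.

```python
# Reconstructing the back-of-the-envelope math being argued about above.
# tok/s figures: as quoted by a commenter for one workload.
# Watt figures: assumed nameplate TDPs, not measured wall power.

H100_PCIE_W = 350   # assumed TDP of an H100 PCIe card
RNGD_W = 150        # assumed TDP of a Furiosa RNGD card
RACK_W = 15_000     # the vendor's rack power definition, per the thread

# Rack-power framing: 3x H100 PCIe uses only a small slice of a 15 kW rack.
h100_trio_w = 3 * H100_PCIE_W
print(f"3x H100 PCIe ~{h100_trio_w} W = {h100_trio_w / RACK_W:.0%} of a 15 kW rack")

# Per-chip framing: raw throughput and a naive tokens-per-watt ratio.
for name, tok_s, watts in [("Furiosa RNGD", 86, RNGD_W),
                           ("H100 PCIe", 2390, H100_PCIE_W)]:
    print(f"{name:12s} {tok_s:5d} tok/s  ~{tok_s / watts:5.2f} tok/s per W")
```

Taken at face value these numbers favor the H100 on both axes, which is exactly why commenters ask for a clean, like‑for‑like tokens/W chart instead of mixed latency and throughput figures.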

Inference vs training & workload focus

  • Some lose interest when they realize it’s inference‑only; others argue inference will dominate future LLM costs and a 3× efficiency gain is significant.
  • A counterpoint: many AI clusters today are still primarily used for training, with LLM inference a minority of GPU usage.
  • One commenter notes that focusing on massive LLMs may be too narrow a bet; many AI applications aren’t giant chatbots.

Power, cooling & AI economics

  • Several posts tie Furiosa’s positioning (efficient, air‑cooled inference) to broader worries about Nvidia’s multi‑kW GPUs forcing expensive, specialized datacenters.
  • There’s extended debate over whether current AI capex is sustainable: huge spend commitments, circular vendor relationships, and bubble analogies (railroads, OC‑768, crypto).
  • Some argue labs are capacity‑constrained and profitable on inference; others think the whole stack only works while investors subsidize training and free usage.

Competition, TPUs & Nvidia’s moat

  • Comparisons are made with TPUs (efficient but hard to program) and other inference‑first chips (Groq, Cerebras, Etched).
  • Consensus: Nvidia’s advantages are software maturity, developer ecosystem, networking, supply chain, and control of HBM capacity.
  • Skeptics predict many specialized inference startups will fail for familiar reasons: fragile assumptions about workloads, compiler/runtime “magic” that never arrives, and underestimating memory bandwidth as the real bottleneck (see the bandwidth‑bound sketch below).
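
For readers unfamiliar with the bandwidth argument, the sketch below shows the usual roofline‑style reasoning: at small batch sizes, generating each token means streaming roughly the entire weight set from memory, so decode speed is capped by bandwidth rather than FLOPs. Both the model size and the bandwidth figures are illustrative assumptions, not specs from the article.

```python
# Why memory bandwidth tends to be the real bottleneck for LLM decode:
# at small batch, each generated token must stream ~all weights from memory,
# so per-stream throughput is capped near bandwidth / model size.
# Both the model size and the bandwidth figures are illustrative assumptions.

def decode_ceiling_tok_s(model_weight_gb: float, hbm_bw_gb_s: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode throughput."""
    return hbm_bw_gb_s / model_weight_gb

model_weight_gb = 70  # e.g. a 70B-parameter model in 8-bit weights

for label, bw in [("1.5 TB/s part", 1500), ("3.35 TB/s part", 3350)]:
    ceiling = decode_ceiling_tok_s(model_weight_gb, bw)
    print(f"{label}: <= {ceiling:.0f} tok/s per stream, regardless of FLOPs")
```

Batching amortizes those weight reads across many requests, which is why aggregate throughput figures run far above this per‑stream ceiling and why runtime maturity and KV‑cache handling matter as much as peak FLOPs.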

Website & presentation issues

  • Multiple people cannot read the blog because it demands WebGL; they criticize this for a text article and note it even breaks on relatively new iPhones.
  • Workarounds like browser reader mode are mentioned; some speculate it’s just “glitter” for investors rather than a user‑first design.