Ironwood: The first Google TPU for the age of inference

Benchmarking and Marketing Claims

  • Many commenters criticize the blog for “silly games” with benchmarks:
    • Comparing Ironwood’s FP8 flops to architectures without FP8 hardware support.
    • Claiming >24× El Capitan performance by comparing Ironwood’s FP8 flops against El Capitan’s FP64 flops, which are not comparable (see the rough arithmetic after this list); some argue El Capitan may actually be faster on a like-for-like FP8 basis.
    • Using the entire El Capitan machine as a comparison point and talking about an “El Capitan pod,” which doesn’t exist.
  • Others defend focusing on FP8 since that’s what end users want for ML, but several people say the choices feel designed to impress non-technical executives rather than serious buyers.
  • Some note Google also omits clear comparisons to Nvidia GPUs or recent TPU generations, which makes the messaging look defensive rather than confident.
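
For context, a rough reconstruction of where the headline number appears to come from, assuming the publicly cited figures (about 42.5 exaflops of FP8 for a full Ironwood pod and roughly 1.7 exaflops of FP64 for El Capitan):

    42.5 EFLOPS (FP8, Ironwood pod) ÷ 1.7 EFLOPS (FP64, El Capitan) ≈ 25×

A like-for-like comparison would use El Capitan’s FP8 peak; its MI300A accelerators run FP8 at many times their FP64 rate, so the ratio shrinks sharply once precisions match, which is the core of the criticism above.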

Software, Ecosystem, and Lock-In

  • Multiple people argue that the TPU software stack and developer experience matter more than raw flops:
    • Today the stack revolves heavily around XLA/JAX/TensorFlow and out-of-tree drivers (a minimal JAX sketch follows this list).
    • Without serious improvements, usage is expected to remain limited to Google and a handful of large partners.
  • There is concern about cloud-only access and vendor lock-in: TPUs are tightly bound to Google Cloud, unlike Nvidia GPUs, which are widely available to buy and run anywhere.
  • A minority respond that for big buyers TCO (performance-per-dollar including power and operations) dominates, and “walled garden” concerns matter less than cost.
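
As a concrete illustration of that XLA/JAX-centric path, below is a minimal sketch using standard public JAX APIs (nothing Ironwood-specific is assumed): one jitted function is compiled by XLA for whichever backend is attached, which is both the appeal and the source of the lock-in concern.

```python
# Minimal sketch of the JAX/XLA path that TPU usage currently centers on.
# jax.jit traces the function and hands it to XLA, which emits code for whatever
# backend is attached -- a TPU on a Cloud TPU VM, otherwise GPU or CPU.
import jax
import jax.numpy as jnp

@jax.jit
def dense_layer(x, w, b):
    # One dense layer; XLA fuses the matmul, bias add, and activation.
    return jax.nn.relu(x @ w + b)

key = jax.random.PRNGKey(0)
kx, kw = jax.random.split(key)
x = jax.random.normal(kx, (8, 512), dtype=jnp.bfloat16)
w = jax.random.normal(kw, (512, 256), dtype=jnp.bfloat16)
b = jnp.zeros((256,), dtype=jnp.bfloat16)

print(jax.devices())               # e.g. [TpuDevice(...)] on a TPU VM, [CpuDevice(...)] locally
print(dense_layer(x, w, b).shape)  # (8, 256)
```

The thread’s point is that this works well if you already live inside JAX/XLA, but there is nothing comparable to CUDA’s breadth of third-party libraries and tooling outside it.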

TPUs vs GPUs and Other ASICs

  • TPUs and other AI ASICs (Cerebras, Groq, AWS Inferentia/Trainium, AMD MI series, Microsoft MAIA) are seen as part of a specialization trend as Moore’s law slows.
  • Several comments distinguish:
    • GPUs: very strong for training, but less efficient for large-scale inference because off-chip memory bandwidth becomes the bottleneck (see the back-of-the-envelope sketch after this list).
    • TPUs/other ASICs: aim to optimize inference via low-precision math, high memory bandwidth, and tightly integrated interconnect fabrics.
  • Whether inference or continuous retraining/fine-tuning will dominate long-term compute remains an unresolved debate.
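
A back-of-the-envelope sketch of the memory-bound argument referenced above; all hardware and model numbers are illustrative placeholders, not vendor specifications:

```python
# Why batch-1 LLM decoding tends to be limited by memory bandwidth rather than
# compute: each generated token touches every weight once, so the arithmetic
# intensity (FLOPs per byte moved) is tiny compared with what the matrix units
# could sustain. All figures are illustrative.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of off-chip memory traffic."""
    return flops / bytes_moved

params = 70e9                   # hypothetical 70B-parameter model
flops_per_token = 2 * params    # roughly one multiply-add per parameter per token
bytes_per_token = params * 1    # every weight streamed once at 1 byte (8-bit)

workload_ai = arithmetic_intensity(flops_per_token, bytes_per_token)  # ~2 FLOPs/byte

# A hypothetical accelerator with 1000 TFLOPS of low-precision compute and 4 TB/s
# of HBM bandwidth only keeps its math units busy above ~250 FLOPs/byte, so at
# batch 1 the memory system, not the compute, sets the decode speed.
machine_balance = 1000e12 / 4e12

print(f"workload: {workload_ai:.0f} FLOPs/byte vs machine balance: {machine_balance:.0f} FLOPs/byte")
```

Larger batch sizes raise the workload’s arithmetic intensity (the same weights are reused across requests), which is one reason inference economics at scale look so different from training.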

“First for Inference” and TPU History

  • People point out that the original TPU was inference-only and later there was a v4i (“i” for inference), so calling Ironwood “the first TPU for inference” is seen as factually wrong or marketing spin.
  • Former insiders clarify early TPUs were more like co-processors and were rethought multiple times as CNNs, RNNs, and transformers rose; Ironwood is framed as tuned for modern inference plus embeddings.

Access, Pricing, and Who Benefits

  • Ironwood will be available only via Google Cloud; individuals cannot buy the chips.
  • Some see this as a teaser for investors and large cloud customers rather than something for ordinary developers.
  • A few argue that even if one never uses TPUs, competition should pressure Nvidia GPU cloud pricing down.
  • Others are cynical: unless it translates into noticeably cheaper Gemini/API prices, it feels like internal self-congratulation.

Architecture, Efficiency, and Specialization

  • Discussion touches on:
    • FP8 vs FP64 hardware complexity and why ML can tolerate very low precision (see the quantization sketch after this list).
    • 3D torus networking and liquid cooling in Google’s AI data centers; both are claimed to improve efficiency, though details of the “AI data centers” remain fuzzy.
    • High HBM bandwidth numbers, but still behind Nvidia GB200 on paper.
  • Specialized TPUs are said to be poor fits for non-matrix workloads; Google already uses separate ASICs for video transcoding.
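
To make the low-precision point concrete, here is a minimal quantization sketch. It assumes the standard jnp.float8_e4m3fn dtype and a simple per-tensor scale; production FP8 pipelines use more careful calibration, so treat the numbers as illustrative:

```python
# Illustrative sketch of why ML tolerates very low precision: weights are quantized
# to FP8 with a per-tensor scale, the matmul stays in float32 (mirroring the
# quantize -> multiply -> high-precision accumulate pattern), and the resulting
# activation error is small relative to the noise networks already absorb.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
kx, kw = jax.random.split(key)
x = jax.random.normal(kx, (16, 1024), dtype=jnp.float32)
w = jax.random.normal(kw, (1024, 1024), dtype=jnp.float32) * 0.02  # typical init scale

# Per-tensor scale so the weights span the FP8 E4M3 range (max finite value 448).
scale = jnp.max(jnp.abs(w)) / 448.0
w_fp8 = (w / scale).astype(jnp.float8_e4m3fn)   # 4 exponent bits, 3 mantissa bits
w_deq = w_fp8.astype(jnp.float32) * scale       # dequantize for the comparison matmul

rel_err = jnp.linalg.norm(x @ w_deq - x @ w) / jnp.linalg.norm(x @ w)
print(f"relative activation error from FP8 weights: {float(rel_err):.2%}")  # typically a couple of percent
```

An FP8 multiplier is also far smaller and cheaper in energy than an FP64 unit, which is the hardware side of the complexity point above.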

Coral, Edge, and Consumer Hopes

  • Some hoped the announcement would lead to updated, cheap edge TPUs (like Coral) for homelabs and local ML, but the Coral line is widely perceived as abandoned.
  • Overall sentiment: Ironwood is impressive technically, but its relevance is mostly at hyperscale, not personal computing.