Intel's $475M error: the silicon behind the Pentium division bug
FPU architecture and performance tricks
- Discussion of the Pentium FPU as a “stack machine” at the ISA level but really 8 registers with renaming underneath.
- fxch acts like a cheap rename: it can issue in the secondary pipe and takes 1 cycle, enabling dense scheduling of fadd/fmul.
- Constraints: fmul can only issue every other cycle, and one operand must be the top of stack (TOS), leading to complex fxch patterns, especially across loops.
- Compilers of the time varied: some did good stack scheduling; others spilled or overused fxch.
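The stack-machine constraint and the role of fxch can be sketched with a toy model. This is purely illustrative — the class, method names, and values below are invented for the example, and the real Pentium pairing rules are far more involved — but it shows why independent multiply chains need fxch swaps to keep bringing the right value to the top of stack:

```python
class X87Stack:
    """Toy model of an x87-style register stack (not the real pipeline)."""
    def __init__(self):
        self.regs = []          # regs[0] plays the role of st0 (TOS)

    def fld(self, value):       # push a value; it becomes the new TOS
        self.regs.insert(0, value)

    def fxch(self, i):          # swap st0 with st(i) -- the "cheap rename"
        self.regs[0], self.regs[i] = self.regs[i], self.regs[0]

    def fmul(self, value):      # st0 = st0 * value (one operand must be TOS)
        self.regs[0] *= value

    def faddp(self):            # st1 = st1 + st0, then pop
        self.regs[1] += self.regs[0]
        self.regs.pop(0)

# Compute a*b + c*d with a=2, b=3, c=4, d=5: two independent products
# are interleaved, and fxch keeps rotating the chain that needs work
# into the TOS position.
s = X87Stack()
s.fld(2.0)      # a -> st0
s.fld(4.0)      # c -> st0, a -> st1
s.fxch(1)       # bring a back to TOS
s.fmul(3.0)     # st0 = a*b
s.fxch(1)       # bring c back to TOS
s.fmul(5.0)     # st0 = c*d
s.faddp()       # st0 = a*b + c*d
print(s.regs[0])
```

Because fxch was effectively free on the Pentium, sequences like the fxch/fmul alternation above let a compiler keep both pipes busy despite the TOS-only operand rule.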
Reverse engineering and microcode
- Microcode/ROM contents can be extracted from high-quality die photos using automated tools, but delayering and clarity are hard.
- The real challenge is understanding the encoded micro-ops; early CPUs are better documented than later ones.
- Reverse-engineering work is being done with a metallurgical microscope at home; optical resolution is nearing its limits for Pentium-scale geometries.
FDIV implementation, bug, and table design
- Many readers appreciated the detailed explanation of how floating-point division is built from repeated integer-like steps and lookup tables.
- Several comments focus on why unused lookup entries weren’t simply filled with 2 from the start.
- Explanations offered: “zero” was treated as a normal value rather than a “don’t care”; table generation and PLA optimization were likely split across teams; and once the PLA was “small enough,” optimization may have stopped.
- The later fix (filling all undefined entries with 2) both removed edge cases and simplified the hardware.
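The "repeated integer-like steps" idea can be shown with the simplest hardware division scheme, radix-2 restoring division: one quotient bit per shift-and-subtract iteration. Note this is a simplified sketch for illustration — the actual Pentium used radix-4 SRT, which produces two bits per step by looking up a quotient digit in a table indexed by the top bits of the partial remainder and divisor, and it was missing entries in that table that caused the bug:

```python
def divide_bits(dividend, divisor, n_bits):
    """Radix-2 restoring division on integers: each iteration decides one
    quotient bit by trial subtraction of a shifted divisor. Hardware FDIV
    elaborates on this loop (the Pentium's radix-4 SRT variant retires
    two quotient bits per cycle via a lookup table)."""
    q, rem = 0, dividend
    for i in range(n_bits - 1, -1, -1):
        if rem >= (divisor << i):       # does the shifted divisor fit?
            rem -= (divisor << i)       # subtract it from the remainder
            q |= (1 << i)               # and record a 1 quotient bit
    return q, rem

print(divide_bits(100, 7, 8))   # quotient and remainder of 100 / 7
```

A mantissa divide is this same loop applied to fixed-point significands; the table-driven radix-4 version just replaces the one-bit trial subtraction with a table lookup that can also emit negative digits.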
Error rates, real-world impact, and user perception
- Intel’s claim: astronomically rare per-user error rate, comparable to DRAM bit flips.
- IBM’s analysis: for a heavily used spreadsheet, an individual user might hit it every few weeks.
- Some argue IBM’s scenario is unrealistic because spreadsheets often recompute the same stable values; others think IBM’s framing was misleading marketing.
- Broader point: “1 in a billion” can be frequent at scale (large systems / many users), and averages can hide that a few users are hit constantly.
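The "rare per operation, frequent at scale" point is just arithmetic. The workload and fleet sizes below are assumptions chosen for illustration, not Intel's or IBM's actual figures:

```python
# Back-of-envelope: how a tiny per-operation error rate scales.
# All inputs are illustrative assumptions.
error_rate = 1 / 9e9        # "1 in 9 billion divides" style claim
divides_per_day = 1e9       # assumed heavy numerical workload per user
users = 1_000_000           # assumed installed base

errors_per_user_day = error_rate * divides_per_day
fleet_errors_per_day = errors_per_user_day * users

print(f"{errors_per_user_day:.3f} expected errors per user per day")
print(f"{fleet_errors_per_day:,.0f} expected errors per day fleet-wide")
```

The same rate that looks negligible per user (roughly one error every nine days under these assumptions) produces errors continuously across the fleet — and if the workload is concentrated, a few heavy users absorb most of them.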
Intel’s response, QA, and trust
- Commenters view the initial “doesn’t even qualify as errata” stance as wild for a CPU doing incorrect arithmetic, regardless of rarity.
- The incident is seen as both a PR disaster and, paradoxically, a long-term brand amplifier.
- Several recall having to ship workarounds or detection code, pushing Intel’s problem onto developers.
- Discussion that Intel later invested heavily in verification, then allegedly cut verification staff to move faster, with some linking this to more recent reliability issues.
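The detection code developers had to ship typically boiled down to dividing known trigger operands and checking the result. The operand pair below (4195835 / 3145727) is the widely circulated test case for the bug; on a flawed Pentium the quotient is wrong in the fourth decimal digit, so reconstructing the dividend misses by about 256:

```python
def has_fdiv_bug():
    """Self-test using the well-known FDIV trigger operands.
    On a flawed Pentium, 4195835/3145727 comes back as ~1.33373
    instead of ~1.33382, leaving a residual of roughly 256."""
    x, y = 4195835.0, 3145727.0
    residual = x - (x / y) * y
    return abs(residual) > 1.0   # ~256 on a buggy chip, ~0 otherwise

print(has_fdiv_bug())
```

Runtime libraries that couldn't assume fixed silicon ran a check like this at startup and, if it tripped, routed every division through a software workaround — exactly the burden commenters describe being pushed onto developers.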
Comparisons to other companies and support models
- Comparisons with Amazon and Apple quietly replacing defective devices highlight how strong support infrastructure can contain reputational damage.
- Others note that for high-value infrastructure products (like CPUs in corporate fleets), quiet consumer-style replacement isn’t as simple; vendor contracts and large IT deployments complicate responses.
Broader tangents: strategy, GPUs, and mobile
- Some argue Intel’s truly huge errors were strategic: neglecting GPUs and missing mobile/SoC opportunities (e.g., selling off XScale, declining early smartphone chips).
- Debate over whether ISA (x86 vs ARM) or business focus and culture are the main reasons Intel lagged in low-power markets.
- Mixed views on Intel iGPUs: praised as “good enough and solid” for everyday Linux use by some; others report frequent GPU hangs and see decades of underinvestment.
Attitudes toward numerical correctness
- Several comments stress that users rarely check results; even visibly wrong outputs can go unnoticed without domain intuition.
- Nevertheless, in finance, science, and engineering, silent arithmetic errors are considered unacceptable, regardless of how infrequently they occur.