The Pentium contains a complicated circuit to multiply by three
Pentium’s ×3 Multiplier and Radix‑8 Design
- Discussion centers on how the Pentium FPU implements a radix‑8 Booth multiplier, where each 3‑bit “digit” of the multiplier selects a multiple in {‑4x, ‑3x, ‑2x, ‑1x, 0, 1x, 2x, 3x, 4x}.
- Shifts give ×2 and ×4 “for free”; Booth recoding turns ×7 into 8x–x and ×6 into 8x–2x by bumping the next digit’s value, so ×3 is the only “hard” multiple.
- A dedicated ×3 circuit (about 9000 transistors) precomputes 3x once; that value can then be routed into any partial‑product term without additional adders.
- Clarifications: the radix‑8 scheme processes multiple bits in parallel; the overall multiplier is fully pipelined (one result per cycle, multi‑cycle latency), not “3 bits per cycle” in a serial sense.
- Negation and sign extension in the Booth terms are handled via bitwise inversion plus carry‑in tricks inside the adder tree, rather than separate adders.
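The recoding and selection described above can be sketched in Python. This is an illustrative model, not the Pentium's actual logic: the digit extraction uses a simple carry‑based recoding into the −4…+4 range, and Python's unary minus stands in for the inversion‑plus‑carry‑in negation done in hardware.

```python
def radix8_digits(m):
    """Recode a non-negative multiplier into radix-8 signed digits in -4..+4.
    Each 3-bit group plus an incoming carry gives a value in 0..8; values
    above 4 are replaced by (value - 8) with a carry into the next group."""
    digits = []
    carry = 0
    while m or carry:
        d = (m & 7) + carry
        m >>= 3
        if d > 4:
            d -= 8          # e.g. x7 becomes -1 here plus +1 in the next digit
            carry = 1
        else:
            carry = 0
        digits.append(d)
    return digits

def multiply(x, m):
    """Multiply using only the digit multiples. Every |digit|*x is a shift of
    x (x1, x2, x4) or the single precomputed 3x -- no per-digit adders.
    Real hardware negates via bit inversion plus carry-in; we just use '-'."""
    triple = 3 * x                                   # the one "hard" multiple
    table = {0: 0, 1: x, 2: x << 1, 3: triple, 4: x << 2}
    acc = 0
    for i, d in enumerate(radix8_digits(m)):
        pp = table[abs(d)]
        acc += (pp if d >= 0 else -pp) << (3 * i)    # digit weight is 8^i
    return acc
```

Note how a multiplier group of 7 recodes to −1 with a carry bumping the next digit, matching the 8x−x trick above, and how only the ±3 digits ever touch the shared `triple` value.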
Use Cases and Other Architectures
- The ×3 hardware is part of the floating‑point unit, operating on 64‑bit significands of x86 80‑bit extended precision. It is not used by integer LEA/addressing (which only scales by 1,2,4,8).
- Several older designs are compared:
  - MIPS line (R3000→R4400→R4200→R4300→R10000) shows a progression from iterative radix‑8 units to wide, pipelined adder arrays, with tradeoffs in power, area, and latency.
  - Earlier CPUs and arcade hardware did multi‑cycle shift‑and‑add multiplies; one example used repeated 1‑bit steps over 24 bits.
  - Datapath width differences (e.g., 4‑bit Z80 vs >64‑bit Pentium FPU) highlight why a small FPU subcircuit can exceed an entire 1970s CPU in transistor count.
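The 1‑bit‑per‑step scheme mentioned above can be sketched as a toy model (not any specific chip's datapath): one conditional add and two shifts per "cycle", repeated once per multiplier bit.

```python
def shift_add_mul24(a, b):
    """Multiply two unsigned values that fit in 24 bits, one bit per 'cycle',
    the way older multi-cycle hardware did: test the multiplier LSB,
    conditionally add the (shifted) multiplicand, then shift both operands."""
    acc = 0
    for _ in range(24):     # fixed 24 iterations, one per multiplier bit
        if b & 1:
            acc += a
        a <<= 1
        b >>= 1
    return acc
```

The contrast with the radix‑8 design is direct: 24 serial steps here versus a handful of wide partial products summed in parallel there.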
Performance Growth, Moore’s Law, and Software Bloat
- The 9000‑transistor ×3 block versus a whole Z80 is used to illustrate explosive complexity growth from 1970s microprocessors to 1990s FPUs.
- Commenters debate whether hardware scaling is “at its limits”:
  - One side: practical limits of silicon/physics and enormously expensive fabs imply slower effective progress.
  - Others stress that transistor counts and absolute performance gains per generation are still huge; confusion between Moore’s law (density) and Dennard scaling (frequency/power) is noted.
- There is a long exchange on absolute vs percentage gains: even smaller percentage increases now represent more raw capability than the dramatic percentage jumps of early decades.
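The absolute‑vs‑percentage point can be made concrete with a worked example. The figures below are made up purely for illustration, not real benchmark data:

```python
# Made-up illustrative figures, not measurements of any real hardware.
early_old, early_new = 100, 200          # early generation: a doubling
modern_old, modern_new = 10_000, 13_000  # modern generation: "only" +30%

early_pct = (early_new - early_old) / early_old * 100      # 100.0 percent
modern_pct = (modern_new - modern_old) / modern_old * 100  # 30.0 percent

early_abs = early_new - early_old      # 100 units of new capability
modern_abs = modern_new - modern_old   # 3000 units: far more raw capability
```

The smaller percentage jump delivers thirty times the absolute gain, which is the crux of the exchange.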
Wirth’s Law, Developer Time, and User Time
- The multiplier example prompts reflection that massive hardware gains encouraged bloated, inefficient software.
- Wirth’s law (software gets slower faster than hardware gets faster) is cited; several argue current bloat now outpaces hardware improvements.
- Tradeoffs are framed economically:
  - Startups optimize for developer speed and accept 100× slower code if it validates a product.
  - Industrial or embedded contexts justify spending engineering time to save machine time and even human time (e.g., boot‑time optimizations “saving lives”).
- Some blame capitalism for externalizing user time/energy costs; others emphasize context‑dependent engineering goals rather than a single “correct” style.
Limits and Future Directions
- 3D transistor structures (FinFET, gate‑all‑around) are mentioned as ways the industry extended Moore‑style scaling, but also as one‑time “extra dimensions” with thermal constraints.
- Quantum computing is debated:
  - Sceptical view: beyond factoring and simulating quantum systems, few clear, proven advantages; huge constant‑factor overheads.
  - Optimistic view: long‑term potential in linear algebra, search, ML, logistics, and secure quantum links—though timelines are acknowledged as far beyond current commercial planning.
Miscellaneous Technical Clarifications
- Multiple questions dig into why you can’t just do 3x = 2x + x per partial product, or derive 3x from 6x by shifting: the answer is that every radix‑8 partial product must be generated with only shifts and negations plus a single shared ×3 source, so no extra per‑term adders are needed.
- Pipeline timing is discussed: multiple adders can be traversed in one cycle as long as total combinational delay fits; a single adder does not automatically imply one clock of latency.
- There is some side discussion about the 80286’s performance and descriptor compatibility with the 80386, with claims and counterclaims based on historical OS behavior.
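The pipeline‑timing point above can be illustrated with a back‑of‑the‑envelope delay budget. All numbers are hypothetical, chosen only to show the reasoning, not taken from any Pentium datasheet:

```python
# Hypothetical delay figures, purely for illustration.
clock_period_ns = 15.0   # e.g. a 66 MHz clock
adder_stage_ns = 4.0     # one carry-save adder stage
routing_ns = 2.0         # wiring and mux overhead between stages

# Two adder stages back-to-back still take one clock if their total
# combinational delay fits within the period; latency per stage is set
# by the clock, not by how many adders the signal traverses.
total_delay_ns = 2 * adder_stage_ns + routing_ns
fits_in_one_cycle = total_delay_ns <= clock_period_ns
```

This is why "one adder" and "one cycle" are independent claims: the clock period, not the adder count, determines the latency of a pipeline stage.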