The Pentium contains a complicated circuit to multiply by three
Pentium’s ×3 Multiplier and Radix‑8 Design
- Discussion centers on how the Pentium FPU implements a radix‑8 Booth multiplier, where each 3‑bit “digit” of the multiplier selects a multiple in {‑4x, ‑3x, ‑2x, ‑1x, 0, 1x, 2x, 3x, 4x}.
- Shifts give ×2 and ×4 “for free”; Booth recoding turns ×7 into 8x–x and ×6 into 8x–2x by bumping the next digit’s value, so ×3 is the only “hard” multiple.
- A dedicated ×3 circuit (about 9000 transistors) precomputes 3x once; that value can then be routed into any partial‑product term without additional adders.
- Clarifications: the radix‑8 scheme processes multiple bits in parallel; the overall multiplier is fully pipelined (one result per cycle, multi‑cycle latency), not “3 bits per cycle” in a serial sense.
- Negation and sign extension in the Booth terms are handled via bitwise inversion plus carry‑in tricks inside the adder tree, rather than separate adders.
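The recoding and selection described above can be sketched in Python. This is an illustrative model, not the Pentium's actual logic: the digit extraction uses a simple carry‑based recoding into the −4…+4 range, and Python's unary minus stands in for the inversion‑plus‑carry‑in negation done in hardware.

```python
def radix8_digits(m):
    """Recode a non-negative multiplier into radix-8 signed digits in -4..+4.
    Each 3-bit group plus an incoming carry gives a value in 0..8; values
    above 4 are replaced by (value - 8) with a carry into the next group."""
    digits = []
    carry = 0
    while m or carry:
        d = (m & 7) + carry
        m >>= 3
        if d > 4:
            d -= 8          # e.g. x7 becomes -1 here plus +1 in the next digit
            carry = 1
        else:
            carry = 0
        digits.append(d)
    return digits

def multiply(x, m):
    """Multiply using only the digit multiples. Every |digit|*x is a shift of
    x (x1, x2, x4) or the single precomputed 3x -- no per-digit adders.
    Real hardware negates via bit inversion plus carry-in; we just use '-'."""
    triple = 3 * x                                   # the one "hard" multiple
    table = {0: 0, 1: x, 2: x << 1, 3: triple, 4: x << 2}
    acc = 0
    for i, d in enumerate(radix8_digits(m)):
        pp = table[abs(d)]
        acc += (pp if d >= 0 else -pp) << (3 * i)    # digit weight is 8^i
    return acc
```

Note how a multiplier group of 7 recodes to −1 with a carry bumping the next digit, matching the 8x−x trick above, and how only the ±3 digits ever touch the shared `triple` value.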
Use Cases and Other Architectures
- The ×3 hardware is part of the floating‑point unit, operating on 64‑bit significands of x86 80‑bit extended precision. It is not used by integer LEA/addressing (which only scales by 1,2,4,8).
- Several older designs are compared:
  - MIPS line (R3000→R4400→R4200→R4300→R10000) shows a progression from iterative radix‑8 units to wide, pipelined adder arrays, with tradeoffs in power, area, and latency.
  - Earlier CPUs and arcade hardware did multi‑cycle shift‑and‑add multiplies; one example used repeated 1‑bit steps over 24 bits.
  - Datapath width differences (e.g., 4‑bit Z80 vs >64‑bit Pentium FPU) highlight why a small FPU subcircuit can exceed an entire 1970s CPU in transistor count.
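The 1‑bit‑per‑step scheme mentioned above can be sketched as a toy model (not any specific chip's datapath): one conditional add and two shifts per "cycle", repeated once per multiplier bit.

```python
def shift_add_mul24(a, b):
    """Multiply two unsigned values that fit in 24 bits, one bit per 'cycle',
    the way older multi-cycle hardware did: test the multiplier LSB,
    conditionally add the (shifted) multiplicand, then shift both operands."""
    acc = 0
    for _ in range(24):     # fixed 24 iterations, one per multiplier bit
        if b & 1:
            acc += a
        a <<= 1
        b >>= 1
    return acc
```

The contrast with the radix‑8 design is direct: 24 serial steps here versus a handful of wide partial products summed in parallel there.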
Performance Growth, Moore’s Law, and Software Bloat
- The 9000‑transistor ×3 block versus a whole Z80 is used to illustrate explosive complexity growth from 1970s microprocessors to 1990s FPUs.
- Commenters debate whether hardware scaling is “at its limits”:
  - One side: practical limits of silicon/physics and enormously expensive fabs imply slower effective progress.
  - Others stress that transistor counts and absolute performance gains per generation are still huge; confusion between Moore’s law (density) and Dennard scaling (frequency/power) is noted.
- There is a long exchange on absolute vs percentage gains: even smaller percentage increases now represent more raw capability than the dramatic percentage jumps of early decades.
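The absolute‑vs‑percentage point can be made concrete with a worked example. The figures below are made up purely for illustration, not real benchmark data:

```python
# Made-up illustrative figures, not measurements of any real hardware.
early_old, early_new = 100, 200          # early generation: a doubling
modern_old, modern_new = 10_000, 13_000  # modern generation: "only" +30%

early_pct = (early_new - early_old) / early_old * 100      # 100.0 percent
modern_pct = (modern_new - modern_old) / modern_old * 100  # 30.0 percent

early_abs = early_new - early_old      # 100 units of new capability
modern_abs = modern_new - modern_old   # 3000 units: far more raw capability
```

The smaller percentage jump delivers thirty times the absolute gain, which is the crux of the exchange.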
Wirth’s Law, Developer Time, and User Time
- The multiplier example prompts reflection that massive hardware gains encouraged bloated, inefficient software.
- Wirth’s law (software gets slower faster than hardware gets faster) is cited; several argue current bloat now outpaces hardware improvements.
- Tradeoffs are framed economically:
  - Startups optimize for developer speed and accept 100× slower code if it validates a product.
  - Industrial or embedded contexts justify spending engineering time to save machine time and even human time (e.g., boot‑time optimizations “saving lives”).
- Some blame capitalism for externalizing user time/energy costs; others emphasize context‑dependent engineering goals rather than a single “correct” style.
Limits and Future Directions
- 3D transistor structures (FinFET, gate‑all‑around) are mentioned as ways the industry extended Moore‑style scaling, but also as one‑time “extra dimensions” with thermal constraints.
- Quantum computing is debated:
  - Sceptical view: beyond factoring and simulating quantum systems, few clear, proven advantages; huge constant‑factor overheads.
  - Optimistic view: long‑term potential in linear algebra, search, ML, logistics, and secure quantum links—though timelines are acknowledged as far beyond current commercial planning.
Miscellaneous Technical Clarifications
- Multiple questions dig into why you can’t just do 3x = 2x + x per partial product, or derive 3x from 6x by shifting: the answer is that every radix‑8 partial product must be generated with only shifts and negations plus a single shared ×3 source, so no extra per‑term adders are needed.
- Pipeline timing is discussed: multiple adders can be traversed in one cycle as long as total combinational delay fits; a single adder does not automatically imply one clock of latency.
- There is some side discussion about the 80286’s performance and descriptor compatibility with the 80386, with claims and counterclaims based on historical OS behavior.
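The pipeline‑timing point above can be illustrated with a back‑of‑the‑envelope delay budget. All numbers are hypothetical, chosen only to show the reasoning, not taken from any Pentium datasheet:

```python
# Hypothetical delay figures, purely for illustration.
clock_period_ns = 15.0   # e.g. a 66 MHz clock
adder_stage_ns = 4.0     # one carry-save adder stage
routing_ns = 2.0         # wiring and mux overhead between stages

# Two adder stages back-to-back still take one clock if their total
# combinational delay fits within the period; latency per stage is set
# by the clock, not by how many adders the signal traverses.
total_delay_ns = 2 * adder_stage_ns + routing_ns
fits_in_one_cycle = total_delay_ns <= clock_period_ns
```

This is why "one adder" and "one cycle" are independent claims: the clock period, not the adder count, determines the latency of a pipeline stage.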