The Pentium contains a complicated circuit to multiply by three

Pentium’s ×3 Multiplier and Radix‑8 Design

  • Discussion centers on how the Pentium FPU implements a radix‑8 Booth multiplier, where each 3‑bit “digit” of the multiplier selects a multiple in {‑4x, ‑3x, ‑2x, ‑1x, 0, 1x, 2x, 3x, 4x}.
  • Shifts give ×2 and ×4 “for free”; Booth recoding folds the high digits into negative multiples plus a carry into the next digit (×7 becomes 8x−x, ×6 becomes 8x−2x, ×5 becomes 8x−3x), so ±3x is the only multiple that needs an actual addition.
  • A dedicated ×3 circuit (about 9000 transistors) precomputes 3x once, then that value can be routed into any partial-product term without additional adders.
  • Clarifications: the radix‑8 scheme processes multiple bits in parallel; the overall multiplier is fully pipelined (one result per cycle, multi‑cycle latency), not “3 bits per cycle” in a serial sense.
  • Negation and sign extension in the Booth terms are handled via bitwise inversion plus carry‑in tricks inside the adder tree, rather than separate adders.
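The recoding and multiple-selection described above can be sketched in software. This is a toy model with invented function names, not the Pentium's circuit: the hardware does the selection with muxes and feeds all partial products into a parallel adder tree, while this sketch just loops. The point it demonstrates is that every partial product is a shift or negation of x or of a single precomputed 3x.

```python
def booth_radix8_digits(m, nbits):
    """Recode an unsigned multiplier into radix-8 Booth digits in -4..+4.
    Raw digits 5, 6, 7 are folded to -3, -2, -1, with a carry of 8
    pushed into the next digit."""
    digits, carry = [], 0
    for i in range(0, nbits, 3):
        d = ((m >> i) & 0b111) + carry
        if d > 4:
            d -= 8       # e.g. 7 -> -1, 6 -> -2, 5 -> -3
            carry = 1    # ...and carry 8x into the next digit
        else:
            carry = 0
        digits.append(d)
    if carry:
        digits.append(1)
    return digits

def multiply_radix8(x, m, nbits):
    """Multiply using only shifts, negations, and one shared 3x."""
    x3 = (x << 1) + x                              # the one "hard" multiple
    select = {0: 0, 1: x, 2: x << 1, 3: x3, 4: x << 2}
    total = 0
    for i, d in enumerate(booth_radix8_digits(m, nbits)):
        term = select[abs(d)]
        total += (-term if d < 0 else term) << (3 * i)
    return total
```

Note that digit 5 maps to −3, which still draws on the shared 3x value; that is why ±3x is the one multiple the hardware must precompute.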

Use Cases and Other Architectures

  • The ×3 hardware is part of the floating‑point unit, operating on 64‑bit significands of x86 80‑bit extended precision. It is not used by integer LEA/addressing (which only scales by 1,2,4,8).
  • Several older designs are compared:
    • MIPS line (R3000→R4400→R4200→R4300→R10000) shows a progression from iterative radix‑8 units to wide, pipelined adder arrays, with tradeoffs in power, area, and latency.
    • Earlier CPUs and arcade hardware did multi‑cycle shift‑and‑add multiplies; one example used repeated 1‑bit steps over 24 bits.
    • Datapath width differences (e.g., the Z80’s 4‑bit ALU vs the Pentium FPU’s >64‑bit datapath) highlight why a small FPU subcircuit can exceed an entire 1970s CPU in transistor count.
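The multi‑cycle shift‑and‑add approach mentioned above can be sketched as follows (a toy model, not any specific chip's implementation):

```python
def shift_add_multiply(a, b, width=24):
    """One add (or skip) per multiplier bit: a 'width'-bit multiply
    takes 'width' sequential steps, versus a single pass through a
    parallel adder tree in a Pentium-style radix-8 design."""
    product = 0
    for _ in range(width):
        if b & 1:          # add the shifted multiplicand when the bit is set
            product += a
        a <<= 1
        b >>= 1
    return product
```

At 1 bit per step, a 24‑bit operand costs 24 cycles; radix‑8 recoding retires 3 bits per digit, and evaluating all digits in parallel removes the serial loop entirely.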

Performance Growth, Moore’s Law, and Software Bloat

  • The 9000‑transistor ×3 block versus a whole Z80 is used to illustrate explosive complexity growth from 1970s microprocessors to 1990s FPUs.
  • Commenters debate whether hardware scaling is “at its limits”:
    • One side: practical limits of silicon/physics and enormously expensive fabs imply slower effective progress.
    • Others stress that transistor counts and absolute performance gains per generation are still huge; confusion between Moore’s law (density) and Dennard scaling (frequency/power) is noted.
  • There is a long exchange on absolute vs percentage gains: even smaller percentage increases now represent more raw capability than the dramatic percentage jumps of early decades.

Wirth’s Law, Developer Time, and User Time

  • The multiplier example prompts reflection that massive hardware gains encouraged bloated, inefficient software.
  • Wirth’s law (software gets slower faster than hardware gets faster) is cited; several argue current bloat now outpaces hardware improvements.
  • Tradeoffs are framed economically:
    • Startups optimize for developer speed and accept 100× slower code if it validates a product.
    • Industrial or embedded contexts justify spending engineering time to save machine time and even human time (e.g., boot‑time optimizations “saving lives”).
  • Some blame capitalism for externalizing user time/energy costs; others emphasize context‑dependent engineering goals rather than a single “correct” style.

Limits and Future Directions

  • 3D transistor structures (FinFET, gate‑all‑around) are mentioned as ways the industry extended Moore‑style scaling, but also as one‑time “extra dimensions” with thermal constraints.
  • Quantum computing is debated:
    • Sceptical view: beyond factoring and simulating quantum systems, few clear, proven advantages; huge constant‑factor overheads.
    • Optimistic view: long‑term potential in linear algebra, search, ML, logistics, and secure quantum links—though timelines are acknowledged as far beyond current commercial planning.

Miscellaneous Technical Clarifications

  • Multiple questions dig into why you can’t just do 3x = 2x + x or derive 3x from 6x by shifting: the answer is that 2x + x is exactly what the dedicated ×3 circuit computes, but only once per multiplication; every radix‑8 partial product must then be formed from shifts/negations of x or that shared 3x, so no per‑term adders are needed ahead of the main adder tree.
  • Pipeline timing is discussed: multiple adders can be traversed in one cycle as long as total combinational delay fits; a single adder does not automatically imply one clock of latency.
  • There is some side discussion about the 80286’s performance and descriptor compatibility with the 80386, with claims and counterclaims based on historical OS behavior.
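The inversion‑plus‑carry‑in trick mentioned in the Booth discussion boils down to the two’s‑complement identity −t = ~t + 1: subtraction is folded into the adder by inverting the term and asserting carry‑in. A minimal sketch, with an invented function name and a fixed 64‑bit width chosen for illustration:

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1  # wrap results to a fixed word size

def add_term(acc, term, negate):
    """Fold subtraction into the adder itself: rather than a separate
    negation stage, invert the term's bits and assert carry-in,
    using the two's-complement identity -t == ~t + 1."""
    if negate:
        return (acc + (~term & MASK) + 1) & MASK
    return (acc + term) & MASK
```

In hardware the “+ 1” costs nothing extra, because every adder row in the tree already has a carry‑in input to spare.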