Testing AMD's Bergamo: Zen 4c

Zen 4c / Bergamo Characteristics

  • Zen 4c cores have similar per-cycle performance to Zen 4 but:
    • Less L3 cache per core.
    • Lower maximum clock: well below desktop Ryzen, but not much lower than high-core-count server Zen 4 parts.
    • Commenters suggest slightly better performance per watt, but the magnitude is unclear.
  • Bergamo uses 8 compute chiplets; roadmaps and speculation point to a follow-on part with 12 chiplets and 192 Zen 5c cores.
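
The "less L3 per core" point can be made concrete with a little arithmetic. The sketch below assumes AMD's published top-end configurations (12 chiplets of 8 cores for Zen 4 "Genoa", 8 chiplets of 16 cores for Bergamo, 32 MiB of L3 per chiplet in both); those per-chiplet figures are not stated in the discussion, and the Package class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """Illustrative model of a chiplet-based CPU package (hypothetical helper)."""
    name: str
    chiplets: int
    cores_per_chiplet: int
    l3_mib_per_chiplet: int

    @property
    def cores(self) -> int:
        return self.chiplets * self.cores_per_chiplet

    @property
    def l3_mib_per_core(self) -> float:
        return self.l3_mib_per_chiplet / self.cores_per_chiplet


# Assumed figures from AMD's public specs for the top SKUs, not from the discussion.
genoa = Package("Genoa (Zen 4)", chiplets=12, cores_per_chiplet=8, l3_mib_per_chiplet=32)
bergamo = Package("Bergamo (Zen 4c)", chiplets=8, cores_per_chiplet=16, l3_mib_per_chiplet=32)

for p in (genoa, bergamo):
    print(f"{p.name}: {p.cores} cores, {p.l3_mib_per_core:.1f} MiB L3 per core")

# The rumored follow-on (12 chiplets, 192 cores) implies the same chiplet density.
followon_cores_per_chiplet = 192 // 12
print(f"Follow-on part: {followon_cores_per_chiplet} cores per chiplet")
```

Under these assumptions Bergamo has more cores but half the L3 per core, and the 192-core follow-on would keep 16 cores per chiplet.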

Caches, SRAM, and Coherency

  • SRAM (especially cache) is seen as the primary scaling bottleneck: more cores imply more cache area.
  • 3D V-Cache extends shared L3, but L1/L2 remain per core.
  • Chiplet-based designs can scale cores and cache capacity, but:
    • Cache coherency across many dies is complex.
    • NUMA “cliffs” and memory latency can dominate as core counts rise.

Memory Models: x86 vs ARM

  • x86 has a comparatively strong memory consistency model (total store order), which makes multithreaded programming easier but imposes higher synchronization and coherency costs.
  • ARM offers weaker ordering guarantees; it can scale better but demands more careful software.
  • Some argue these semantics may eventually limit how far x86 can scale in-package.

ISA and Decoder Complexity Debate

  • One view: ISA choice barely matters; x86 decode occupies a small die/power budget (single-digit percent).
  • Counterview: variable-length x86 instructions make wide, multi-instruction decode and branch prediction significantly more complex and power-hungry than fixed-width ARM/RISC-V.
  • There is discussion of uop caches, instruction density, and complex multiplexing logic; no consensus on how decisive this is for future many-core designs.
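
The counterview's core claim is that instruction boundaries are known up front with fixed-width encodings but must be discovered with variable-length ones. The toy sketch below illustrates only that dependency. The encoding is hypothetical (the low two bits of an instruction's first byte give its length minus one) and is not real x86, and both function names are made up for illustration.

```python
def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """Every instruction start is known from the address alone,
    so a wide decoder can work on all slots independently."""
    return list(range(0, len(code), width))


def variable_length_starts(code: bytes) -> list[int]:
    """Each start depends on the length of the previous instruction,
    so boundaries are found by walking the bytes in order.

    Toy encoding (an assumption, not x86): low 2 bits of the first
    byte encode (length - 1), giving lengths of 1 to 4 bytes.
    """
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += (code[pc] & 0b11) + 1
    return starts


fixed_code = bytes(12)  # three 4-byte instructions
toy_code = bytes([0x01, 0xAA, 0x00, 0x03, 0xBB, 0xCC, 0xDD, 0x02, 0xEE, 0xFF])

print(fixed_width_starts(fixed_code))
print(variable_length_starts(toy_code))
```

Hardware does not literally walk bytes one at a time; the serial dependency is typically hidden by guessing lengths at many byte offsets at once and by caching decoded uops, which is where the multiplexing logic and uop caches mentioned above come in. How much that costs in practice is exactly what the thread disagrees on.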

Many-Core Futures and Software Limits

  • Some see Bergamo-style “core spam” as the natural future: many simpler cores to hide memory latency, echoing earlier ideas like Sun Niagara and research projects.
  • Others stress that single-thread performance remains crucial; in many real workloads, most cores sit idle.
  • Amdahl’s law and the difficulty of writing correct, scalable multithreaded code are repeatedly cited.
  • Modern languages (Go, Rust, functional / reactive approaches) help but do not solve parallelism for typical applications.
  • Removing bottlenecks like Python’s GIL is seen as necessary but far from sufficient for effective use of 100+ cores.
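
Amdahl's law puts a number on this limit: if a fraction p of a program's work can run in parallel, the speedup on n cores is 1 / ((1 - p) + p / n). A minimal sketch follows; the function name and the sample fractions are illustrative choices, not figures from the discussion.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law, ignoring synchronization
    and memory-system overheads (so real results are lower)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# 128 cores matches a full Bergamo package; the fractions are arbitrary examples.
for p in (0.50, 0.90, 0.99):
    print(f"p = {p:.2f}: {amdahl_speedup(p, 128):.1f}x on 128 cores")
```

Even with 90% of the work parallelized, 128 cores give only a little over 9x, and a program that is 99% parallel reaches about 56x. This is why removing a single bottleneck such as the GIL does not by itself make 100+ cores useful.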

Networking, Caches, and Near-Memory Compute

  • Some speculate about “RAM-less” or RAM-minimized servers that keep hot code and state entirely in cache, streaming data from NIC to CPU and back.
  • Existing technologies (e.g., NIC-to-cache DMA, DPDK) already approximate “process in L3” models.
  • There is interest in pushing more compute to or near the NIC, but current solutions (e.g., FPGA-based smart NICs) are expensive and specialized.