2024-06-22

Testing AMD's Bergamo: Zen 4c

Zen 4c / Bergamo Characteristics

Zen 4c cores have similar per-cycle performance to Zen 4 but:
- Less L3 cache per core.
- Lower maximum clock, especially vs desktop Ryzen; not much lower than dense server Zen 4.
- Performance per watt is reportedly slightly better, but the magnitude is unclear.
Bergamo uses 8 compute chiplets; roadmap speculation mentions a follow-on part with 12 chiplets and 192 Zen 5c cores.
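
As a rough illustration of the L3-per-core point above, the sketch below computes cache per core and cores per package from per-chiplet figures. The figures are assumptions taken from AMD's published specifications, not from the discussion itself (8 cores and 32 MiB of L3 per Zen 4 chiplet; 16 cores and 2 x 16 MiB of L3 per Zen 4c chiplet), and the class and function names are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chiplet:
    """Per-chiplet figures; values below are assumed from public AMD specs."""
    name: str
    cores: int
    l3_mib: int  # total L3 on the chiplet, in MiB

    def l3_per_core_mib(self) -> float:
        # Average L3 available per core on this chiplet.
        return self.l3_mib / self.cores


# Assumed figures (not stated in the discussion).
ZEN4 = Chiplet("Zen 4", cores=8, l3_mib=32)
ZEN4C = Chiplet("Zen 4c", cores=16, l3_mib=32)


def package_cores(chiplet: Chiplet, chiplet_count: int) -> int:
    # Total cores in a package built from identical chiplets.
    return chiplet.cores * chiplet_count


if __name__ == "__main__":
    for c in (ZEN4, ZEN4C):
        print(f"{c.name}: {c.l3_per_core_mib():.1f} MiB L3 per core")
    print("Bergamo cores:", package_cores(ZEN4C, 8))
```

Under these assumptions Zen 4c gets half the L3 per core of Zen 4, and 8 chiplets of 16 cores give 128 cores. A 12-chiplet part with the same 16 cores per chiplet would reach the 192 cores mentioned for the follow-on.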

Caches, SRAM, and Coherency

SRAM (especially cache) is seen as the primary scaling bottleneck: more cores imply more cache area.
3D V-Cache extends shared L3, but L1/L2 remain per core.
Chiplet-based designs can scale cores and cache capacity, but:
- Cache coherency across many dies is complex.
- NUMA “cliffs” and memory latency can dominate as core counts rise.

Memory Models: x86 vs ARM

x86 has a strong memory consistency model (roughly total store ordering), making multithreaded programming easier but imposing higher synchronization and coherency costs.
ARM offers weaker ordering guarantees; it can scale better but demands more careful software.
Some argue these semantics may eventually limit how far x86 can scale in-package.

ISA and Decoder Complexity Debate

One view: ISA choice barely matters; x86 decode accounts for only a small share of die area and power (single-digit percent).
Counterview: variable-length x86 instructions make wide, multi-instruction decode and branch prediction significantly more complex and power-hungry than fixed-width ARM/RISC-V.
There is discussion of uop caches, instruction density, and complex multiplexing logic; no consensus on how decisive this is for future many-core designs.
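
The counterview above rests on a dependency that a toy model can show. This is a minimal sketch, not real x86 or ARM decoding: the variable-length encoding here is hypothetical (the low two bits of an instruction's first byte give its length minus one), and the function names are invented for illustration. With fixed-width instructions every start offset is known up front; with variable-length instructions each start depends on the length of the previous instruction.

```python
def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    # Every boundary is i * width, so all slots can be located
    # (and in hardware, decoded) independently of one another.
    return list(range(0, len(code), width))


def variable_length_starts(code: bytes) -> list[int]:
    # Hypothetical toy encoding, NOT real x86: the low 2 bits of the
    # first byte encode (length - 1), so lengths range from 1 to 4.
    starts = []
    pos = 0
    while pos < len(code):
        starts.append(pos)
        length = (code[pos] & 0x03) + 1
        pos += length  # the next start depends on this instruction's length
    return starts


if __name__ == "__main__":
    stream = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC, 0x03, 1, 2, 3])
    print(variable_length_starts(stream))
    print(fixed_width_starts(bytes(12)))
```

The serial chain in the second function is what wide decoders have to break, for example by guessing lengths at many byte offsets at once and selecting the valid ones afterwards. How much that costs in practice is exactly the point on which the discussion reaches no consensus.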

Many-Core Futures and Software Limits

Some see Bergamo-style “core spam” as the natural future: many simpler cores to hide memory latency, echoing earlier ideas like Sun Niagara and research projects.
Others stress single-thread performance remains crucial; in many real workloads, most cores sit idle.
Amdahl’s law and the difficulty of writing correct, scalable multithreaded code are repeatedly cited.
Modern languages (Go, Rust, functional / reactive approaches) help but do not solve parallelism for typical applications.
Removing bottlenecks like Python’s GIL is seen as necessary but far from sufficient for effective use of 100+ cores.
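
The Amdahl's law argument cited above can be made concrete with a short calculation. This is a sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of work that parallelizes and n is the core count; the function name and the example fractions are illustrative and do not come from the discussion.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; real workloads need measurement.
    for p in (0.50, 0.90, 0.95, 0.99):
        print(f"p={p:.2f}: {amdahl_speedup(p, 128):.1f}x on 128 cores")
```

Even with 95% of the work parallel, 128 cores yield only about a 17x speedup under this model, and no core count can exceed 20x. That is the arithmetic behind the claim that single-thread performance still matters.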

Networking, Caches, and Near-Memory Compute

Some speculate about “RAM-less” or RAM-minimized servers that keep hot code and state entirely in cache, streaming data from NIC to CPU and back.
Existing technologies (e.g., NIC-to-cache DMA, DPDK) already approximate “process in L3” models.
There is interest in pushing more compute to or near the NIC, but current solutions (e.g., FPGA-based smart NICs) are expensive and specialized.
