Testing AMD's Bergamo: Zen 4c
Zen 4c / Bergamo Characteristics
- Zen 4c cores deliver per-cycle performance similar to Zen 4, but with:
  - Less L3 cache per core.
  - A lower maximum clock, especially compared with desktop Ryzen; the gap to high-core-count server Zen 4 parts is small.
  - Possibly slightly better performance per watt, though the magnitude is unclear.
- Bergamo uses 8 compute chiplets; speculation and roadmap references point to a follow-on part with 12 chiplets and 192 Zen 5c cores.
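To put rough numbers on the "less L3 per core" point, here is a small back-of-the-envelope sketch. The core counts and L3 sizes for the two shipping parts are assumptions taken from public spec listings (a 128-core Bergamo part with 256 MB of L3, and a 96-core Zen 4 server part with 12 chiplets and 384 MB); they are not stated in the discussion. Only the chiplet counts and the 192-core follow-on figure come from the thread.

```python
# Assumed figures (public spec listings, not from the discussion) for the two
# shipping parts; the follow-on entry uses only the numbers quoted in the thread.
PARTS = {
    "Bergamo (Zen 4c)": {"chiplets": 8, "cores": 128, "l3_mb": 256},
    "Zen 4 server, 96 cores": {"chiplets": 12, "cores": 96, "l3_mb": 384},
    # L3 size for the follow-on part is not stated anywhere in the thread.
    "Follow-on (Zen 5c)": {"chiplets": 12, "cores": 192, "l3_mb": None},
}


def cores_per_chiplet(part):
    """Cores on each compute chiplet."""
    return part["cores"] / part["chiplets"]


def l3_per_core_mb(part):
    """L3 capacity per core in MB, or None when the L3 size is unknown."""
    if part["l3_mb"] is None:
        return None
    return part["l3_mb"] / part["cores"]


if __name__ == "__main__":
    for name, part in PARTS.items():
        l3 = l3_per_core_mb(part)
        l3_text = "unknown" if l3 is None else f"{l3:.1f} MB"
        print(f"{name}: {cores_per_chiplet(part):.0f} cores/chiplet, "
              f"L3 per core: {l3_text}")
```

Under these assumed figures, Bergamo ends up with half the L3 per core of the 96-core Zen 4 part, and the rumored follow-on keeps the same 16 cores per chiplet.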
Caches, SRAM, and Coherency
- SRAM (especially cache) is seen as the primary scaling bottleneck: more cores imply more cache area.
- 3D V-Cache extends shared L3, but L1/L2 remain per core.
- Chiplet-based designs can scale cores and cache capacity, but:
  - Cache coherency across many dies is complex.
  - NUMA “cliffs” and memory latency can dominate as core counts rise.
Memory Models: x86 vs ARM
- x86 has a strong memory consistency model (total store order), which makes multithreaded programming easier but imposes higher synchronization and coherency costs.
- ARM offers weaker ordering guarantees; it can scale better but demands more careful software.
- Some argue these semantics may eventually limit how far x86 can scale in-package.
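The practical difference between the two models can be seen in the classic "message passing" litmus test: one thread writes data and then sets a flag, another reads the flag and then the data. The sketch below is a deliberately simplified toy, not a faithful model of either architecture: it treats a memory model as a set of same-thread reorderings that are permitted, then enumerates every interleaving. The names `SC`, `TSO_LIKE`, `WEAK`, `thread_orders`, and `outcomes` are made up for this illustration.

```python
from itertools import permutations

# Each operation is (kind, address, value-or-register).
MP_WRITER = [("W", "data", 1), ("W", "flag", 1)]
MP_READER = [("R", "flag", "r1"), ("R", "data", "r2")]

# A toy "memory model": which (earlier, later) pairs of operations in one
# thread, to different addresses, may execute out of program order.
SC = set()                     # sequential consistency: no reordering
TSO_LIKE = {("W", "R")}        # x86-like: a store may be delayed past a later load
WEAK = {("W", "W"), ("W", "R"), ("R", "W"), ("R", "R")}  # ARM-like, no barriers


def thread_orders(ops, relaxed):
    """Yield every execution order of one thread that the toy model permits."""
    for perm in permutations(range(len(ops))):
        allowed = True
        for pos, i in enumerate(perm):
            for j in perm[pos + 1:]:
                if j < i:  # op j is earlier in program order but runs after op i
                    first, second = ops[j], ops[i]
                    if (first[0], second[0]) not in relaxed or first[1] == second[1]:
                        allowed = False
        if allowed:
            yield [ops[i] for i in perm]


def interleavings(a, b):
    """Yield every merge of two sequences that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(t0, t1, relaxed):
    """Return the set of (r1, r2) results reachable under the toy model."""
    results = set()
    for o0 in thread_orders(t0, relaxed):
        for o1 in thread_orders(t1, relaxed):
            for trace in interleavings(o0, o1):
                mem, regs = {}, {}
                for kind, addr, x in trace:
                    if kind == "W":
                        mem[addr] = x
                    else:
                        regs[x] = mem.get(addr, 0)  # memory starts zeroed
                results.add((regs["r1"], regs["r2"]))
    return results


if __name__ == "__main__":
    for name, model in [("SC", SC), ("TSO-like", TSO_LIKE), ("weak", WEAK)]:
        seen = outcomes(MP_WRITER, MP_READER, model)
        print(name, "flag seen but data stale:", (1, 0) in seen)
```

In this toy, the "flag set but data stale" result `(1, 0)` is unreachable under the SC and TSO-like settings and reachable under the weak one, which is why code written for x86 may need explicit barriers or acquire/release operations on ARM. Real hardware has further subtleties (store forwarding, non-atomic propagation between cores) that the sketch ignores.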
ISA and Decoder Complexity Debate
- One view: ISA choice barely matters; x86 decode takes only a small share of die area and power (single-digit percent).
- Counterview: variable-length x86 instructions make wide, multi-instruction decode and branch prediction significantly more complex and power-hungry than fixed-width ARM/RISC-V.
- There is discussion of uop caches, instruction density, and complex multiplexing logic; no consensus on how decisive this is for future many-core designs.
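The core of the counterview can be shown with a toy example. The variable-length encoding below is hypothetical (the low two bits of an instruction's first byte give its length, 1 to 4 bytes) and is far simpler than real x86; it only illustrates why locating instruction starts is a serial chain for variable-length code but trivially parallel for fixed-width code.

```python
def fixed_width_boundaries(code, width=4):
    """Instruction start offsets for a fixed-width ISA.

    Every start is known up front, so a wide decoder can look at
    all of them independently.
    """
    return list(range(0, len(code), width))


def variable_length_boundaries(code):
    """Instruction start offsets for a hypothetical variable-length ISA.

    Toy rule (an assumption for this sketch, not x86): the low two bits of
    the first byte encode the instruction length minus one.
    Each start depends on the length of the previous instruction.
    """
    boundaries = []
    offset = 0
    while offset < len(code):
        boundaries.append(offset)
        offset += (code[offset] & 0b11) + 1
    return boundaries


if __name__ == "__main__":
    stream = bytes([0b00, 0b01, 0xFF, 0b10, 0, 0, 0b11, 0, 0, 0])
    print(variable_length_boundaries(stream))
    print(fixed_width_boundaries(bytes(12)))
```

Hardware generally avoids the serial walk by speculatively decoding lengths at many byte offsets at once and then selecting the valid starts, which is where the multiplexing logic mentioned in the thread comes in; a uop cache sidesteps the problem for code that has already been decoded.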
Many-Core Futures and Software Limits
- Some see Bergamo-style “core spam” as the natural future: many simpler cores to hide memory latency, echoing earlier ideas like Sun Niagara and research projects.
- Others stress single-thread performance remains crucial; in many real workloads, most cores sit idle.
- Amdahl’s law and the difficulty of writing correct, scalable multithreaded code are repeatedly cited.
- Modern languages (Go, Rust, functional / reactive approaches) help but do not solve parallelism for typical applications.
- Removing bottlenecks like Python’s GIL is seen as necessary but far from sufficient for effective use of 100+ cores.
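Amdahl's law makes the scaling concern concrete. If a fraction p of a program's work can be parallelized, the speedup on n cores is 1 / ((1 - p) + p / n). A minimal sketch follows; the function name and the sample fractions are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores when `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for p in (0.50, 0.90, 0.95, 0.99):
        print(f"p={p:.2f}: 128 cores -> {amdahl_speedup(p, 128):.1f}x, "
              f"upper bound -> {1.0 / (1.0 - p):.0f}x")
```

Even with 95% of the work parallelized, 128 cores yield a speedup of only about 17x, and no core count can push it past 20x. This ignores synchronization and memory-bandwidth overheads, which typically make real scaling worse.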
Networking, Caches, and Near-Memory Compute
- Some speculate about “RAM-less” or RAM-minimized servers that keep hot code and state entirely in cache, streaming data from NIC to CPU and back.
- Existing technologies (e.g., NIC-to-cache DMA, DPDK) already approximate “process in L3” models.
- There is interest in pushing more compute to or near the NIC, but current solutions (e.g., FPGA-based smart NICs) are expensive and specialized.