AMD's Turin: 5th Gen EPYC Launched

Core counts, gaming, and software limits

  • Commenters joke that 64 cores is now “low,” but note most high-end gaming rigs still use ~16 cores (or mixed performance/efficiency cores).
  • Civilization VI is repeatedly cited as CPU-bound yet poorly parallelized: benchmarks show low overall CPU utilization even on 16-core chips, suggesting bottlenecks in memory, locking, or single-threaded logic.
  • Several argue Civ’s slowness is mainly bad engine design rather than hardware limits; others caution that some logic is inherently hard to parallelize and often gated by a “master” thread.
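The "master thread" argument is essentially Amdahl's law: if most of the turn logic is serialized, extra cores add little and measured utilization stays low. A minimal sketch, assuming a hypothetical 40% parallel fraction (not a measured figure for Civilization VI):

```python
# Amdahl's law: serial work caps the benefit of extra cores.
# The 40% parallel fraction below is an illustrative assumption.

def speedup(parallel_fraction: float, cores: int) -> float:
    """Overall speedup when only part of the work scales with cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

s = speedup(0.4, 16)     # only ~1.6x faster on a 16-core chip
utilization = s / 16     # ~10% average CPU utilization, matching benchmarks
```

This also explains the paradox of a "CPU-bound" game showing low CPU utilization: the cores spend most of their time waiting on the serial portion.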

Turin SKUs, cache monsters, and licensing

  • The 16‑core EPYC with 512 MB L3 cache draws lots of attention. Consensus: it targets workloads where software is licensed per core (Oracle, Windows Server, SQL Server, CFD, MATLAB, Abaqus, VMware, etc.) and/or is very cache sensitive or single-threaded.
  • Topology is unusual: same silicon as high‑core‑count parts, but most cores are disabled to maximize cache per core. Inter‑chiplet latency is high, so it’s great for many independent jobs, poor for tightly coupled multithreaded work.
  • Some wonder about using huge L3 as directly addressable RAM or DRAM-less systems; others note modern AMD firmware paths and DMA make this impractical.
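The per-core-licensing argument is easy to quantify. A sketch using the 16-core/512 MB figures above; the 128-core comparison point and the license price are hypothetical placeholders, not vendor quotes:

```python
# Why a cache-heavy, low-core-count SKU can win on total cost of ownership.
LICENSE_PER_CORE = 10_000  # $/core/year -- hypothetical placeholder price

def annual_license(cores: int) -> int:
    """Per-core-licensed software bills scale linearly with enabled cores."""
    return cores * LICENSE_PER_CORE

l3_mb = 512
cache_per_core_16 = l3_mb / 16     # 32 MB of L3 per core on the 16-core part
cache_per_core_128 = l3_mb / 128   # 4 MB/core if all cores were enabled
license_savings = annual_license(128) - annual_license(16)  # $1,120,000/yr
```

With the same silicon, disabling cores cuts the licensing bill eightfold while multiplying cache per core, which is exactly the niche these SKUs target.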

Power, thermals, and density

  • TDP spans ~125–500 W, with the biggest 128/192‑core SKUs at 500 W and cache-heavy 16‑core parts around 320 W.
  • Commenters argue these are manageable in servers due to large package area and aggressive cooling; power density is lower than desktop CPUs.
  • Power per thread (~1–2 W) is seen as a major advantage for datacenter operating costs.
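The ~1–2 W per thread figure follows directly from the SKU numbers above, assuming SMT is enabled (2 threads per core):

```python
def watts_per_thread(tdp_w: float, cores: int, threads_per_core: int = 2) -> float:
    """Package TDP spread across hardware threads (SMT assumed on)."""
    return tdp_w / (cores * threads_per_core)

dense_192 = watts_per_thread(500, 192)  # ~1.3 W/thread on the 192-core SKU
big_128 = watts_per_thread(500, 128)    # ~1.95 W/thread on the 128-core SKU
```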

Used EPYC, homelabs, and platform quirks

  • Early-gen EPYC is described as “cheap but not great”: weaker per-core performance, NUMA complexity, and old process nodes; modern consumer CPUs can beat them in compute and power efficiency.
  • Motherboards remain expensive and some used EPYC chips are vendor-locked via security fusing, limiting reuse.
  • Memory often dominates total system cost more than CPUs.

Single big machines vs clusters and cloud

  • Many believe modern high-core servers can replace “big data” clusters for a large share of workloads, citing large speedups when moving from Spark clusters to single-node engines like DuckDB.
  • Serialization, shuffles, and network overhead are blamed for distributed inefficiency; others note that resilience and operational simplicity still justify clusters and cloud for some use cases.
  • Some predict bare-metal hosting of such CPUs (e.g., at popular providers) could undercut expensive cloud setups for many services.
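The serialization point can be made concrete: a distributed shuffle must encode, ship, and decode every row it moves, while a single-node engine just passes references between operators. A toy sketch, with `pickle` standing in for a real wire format:

```python
import pickle

rows = [(i, f"user-{i}", i * 0.5) for i in range(100_000)]

# Single node: downstream operators reuse the same objects; no copy is made.
local_view = rows

# Distributed shuffle: every row is serialized, transferred, deserialized.
wire = pickle.dumps(rows)
remote_view = pickle.loads(wire)

assert local_view is rows    # zero-cost handoff within one process
assert remote_view == rows   # same data, paid for in CPU time and bytes
overhead_bytes = len(wire)   # several MB of pure transfer cost per stage
```

Multiply that encode/decode tax by every shuffle stage in a pipeline and the single-node speedups reported in the thread become less surprising.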

ARM competition and performance per dollar

  • A linked review comparing Turin Dense 192‑core to AmpereOne 192‑core reports:
    • Turin ~1.6× higher performance,
    • Ampere ~1.2× better energy efficiency,
    • Ampere ~1.7× better performance per dollar (at list prices and for that specific SKU pairing).
  • Others stress EPYC’s better perf/W and the ability to discount x86 heavily off MSRP; also that AMD’s highest-density SKU isn’t its best perf/$ part.
  • There’s excitement about future ARM server chips (Ampere’s next gen, Nuvia-derived Qualcomm parts, hyperscaler in-house designs), with this era framed as a “golden age” of server CPUs compared to past Intel-only dominance.
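The three ratios reported in the review are mutually consistent; combining them gives the implied list-price gap for that specific SKU pairing:

```python
perf_ratio = 1.6             # Turin performance vs. AmpereOne
perf_per_dollar_ratio = 1.7  # AmpereOne perf/$ vs. Turin

# perf/$ advantage = (perf_A / price_A) / (perf_T / price_T)
#                  = (1 / perf_ratio) * (price_T / price_A)
# => implied list-price ratio (Turin / Ampere):
implied_price_ratio = perf_ratio * perf_per_dollar_ratio  # ~2.7x
```

A ~2.7× list-price premium is also why commenters note that heavy x86 discounting off MSRP can flip the perf/$ conclusion.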

LLMs, GPUs vs CPUs, and memory

  • Multiple back-of-the-envelope calculations compare Turin Dense’s AVX‑512 throughput to H100 GPUs; estimates suggest CPUs still trail GPUs by roughly an order of magnitude or more in raw half-precision compute, and GPUs retain a larger memory-bandwidth edge.
  • Some note that LLM throughput is largely limited by time to stream the model from RAM per token, reinforcing GPU advantages with high-bandwidth memory.
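In the same back-of-envelope spirit, here is one way those estimates come out. The clock speed, FMA pipe count, and the GPU figures are illustrative assumptions, not measured numbers:

```python
# --- Raw FP16 compute: 192-core Turin Dense vs. H100 (all assumed figures) ---
cores, clock_ghz = 192, 3.0   # assumed sustained all-core clock
lanes = 512 // 16             # FP16 lanes per 512-bit vector
fma_pipes = 2                 # assumed 512-bit FMA units per core
cpu_tflops = cores * clock_ghz * lanes * 2 * fma_pipes / 1000  # FMA = 2 FLOPs
h100_tflops = 990             # dense FP16 tensor-core figure from the datasheet
compute_gap = h100_tflops / cpu_tflops  # ~13x: "an order of magnitude or more"

# --- Token throughput bound by streaming model weights each token ---
model_gb = 140                # e.g. a 70B-parameter model at FP16
cpu_bw_gb_s = 576             # 12 channels x DDR5-6000 x 8 bytes/transfer
gpu_bw_gb_s = 3350            # H100 SXM HBM3 bandwidth
cpu_tokens_s = cpu_bw_gb_s / model_gb  # ~4 tokens/s ceiling
gpu_tokens_s = gpu_bw_gb_s / model_gb  # ~24 tokens/s ceiling
```

The second half is the thread's memory-bandwidth point: when every token requires streaming the full weight set from memory, tokens/s is capped at bandwidth divided by model size, regardless of compute.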

Memory speeds and ECC

  • Discussion confirms these platforms use ECC DDR5; stated 6000 MT/s figures refer to server ECC memory in specific DIMM-per-channel configurations.
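For reference, the theoretical bandwidth those figures imply (each DDR5 channel carries a 64-bit data bus; ECC bits travel on extra lines and add no data bandwidth):

```python
mt_per_s = 6000          # DDR5-6000: 6000 mega-transfers per second
bytes_per_transfer = 8   # 64-bit data bus per channel
channel_gb_s = mt_per_s * bytes_per_transfer / 1000  # 48 GB/s per channel
socket_gb_s = channel_gb_s * 12                      # 576 GB/s per 12-channel socket
```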

Historical context

  • Several comments contrast today’s ~400-core dual-socket servers with early dual-core servers from the mid‑2000s and earlier multi-core experiments, highlighting how far core counts and threading have scaled even though single-core speed has improved by only a few times.