%CPU utilization is a lie

Hyperthreading, “Cores”, and Terminology

  • Several comments criticize treating a 12-core/24-thread CPU as “24 cores”; OSes and clouds expose “vCPUs” that map 1:1 to hardware threads, which can mislead people into assuming performance scales linearly with thread count (a topology-inspection sketch follows this list).
  • Analogies (two chefs/one stove, 2‑ply toilet paper) emphasize that SMT threads share execution units and are not equivalent to full cores.
  • Some note real, observable differences between SMT siblings and separate cores (e.g., TLB flush effects, shared caches, memory bandwidth).
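
  As a concrete illustration of the vCPU-vs-core distinction, here is a minimal Linux-only sketch that groups logical CPUs by the physical core they share; the sysfs paths are an assumption about common kernel layouts, not something taken from the thread:

      from pathlib import Path

      def physical_cores():
          """Group logical CPUs ("vCPUs") by the SMT siblings sharing one core."""
          cores = {}
          for cpu_dir in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
              siblings_file = cpu_dir / "topology" / "thread_siblings_list"
              if not siblings_file.exists():
                  continue  # offline CPU or topology not exposed
              key = siblings_file.read_text().strip()  # e.g. "0,12" or "0-1"
              cores.setdefault(key, []).append(cpu_dir.name)
          return cores

      for siblings, cpus in physical_cores().items():
          print(f"physical core shared by logical CPUs {siblings}: {cpus}")

  On a 12-core/24-thread part this prints twelve groups of two logical CPUs, half the number the OS’s CPU count suggests.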

When Hyperthreading Helps or Hurts

  • Impact is heavily workload‑ and architecture‑dependent.
    • Database and multi-user/IO-bound systems often see ~10–20% or more throughput gains, sometimes even at moderate utilization.
    • HPC and tightly vectorized, memory‑bandwidth‑bound workloads often see little or negative benefit; disabling SMT can simplify tuning (see the SMT-control sketch after this list).
  • SMT can interact with thermal limits and turbo behavior but usually doesn’t dominate power draw; core counts and vector units matter more.
  • There’s debate over architectures: AMD SMT is said to behave “closer to a full core” in some Zen generations, IBM POWER leans heavily on many-way SMT, while Intel’s HT often delivers smaller incremental gains.
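
  For the point about disabling SMT, a minimal sketch assuming the Linux sysfs SMT interface (/sys/devices/system/cpu/smt, present on reasonably recent kernels); the exact set of accepted control values can vary by platform:

      from pathlib import Path

      SMT_DIR = Path("/sys/devices/system/cpu/smt")  # assumed Linux interface

      def smt_active() -> bool:
          # "1" means sibling hardware threads are currently online.
          return (SMT_DIR / "active").read_text().strip() == "1"

      def disable_smt() -> None:
          # Commonly accepted values include "on", "off", "forceoff"; needs root.
          (SMT_DIR / "control").write_text("off")

      print("SMT active:", smt_active())
      # disable_smt()  # uncomment to take sibling threads offline (root required)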

CPU Utilization as a Misleading Metric

  • Many point out utilization is formally “fraction of time not idle,” not “fraction of maximum useful work.” That’s well-defined but often misinterpreted.
  • Non-linearities from shared caches, memory bandwidth, interconnects, spinlocks, and frequency scaling mean 60% vs 80% utilization can correspond to dramatically different latency.
  • Typical 1–60s averaging windows hide the 10–100ms bursts that actually drive latency SLOs. Some advocate measuring short-window p99/p100 CPU usage instead (a sampling sketch follows this list).
  • Power draw and temperature, or instructions-per-cycle (IPC), sometimes correlate better with “real” work than %CPU alone, but are themselves non-linear and hard to interpret.
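
  To make the short-window measurement idea concrete, a minimal Linux-only sketch that samples the aggregate busy fraction from /proc/stat and reports the mean and p99 across samples; the 100 ms window, 10 s duration, and p99 choice are illustrative, not prescribed by the thread:

      import time

      def cpu_times():
          # First line of /proc/stat: "cpu user nice system idle iowait irq ..."
          with open("/proc/stat") as f:
              fields = [int(x) for x in f.readline().split()[1:]]
          idle = fields[3] + fields[4]  # idle + iowait
          return idle, sum(fields)

      def sample_busy(window_s=0.1, samples=100):
          busy = []
          idle_prev, total_prev = cpu_times()
          for _ in range(samples):
              time.sleep(window_s)
              idle_now, total_now = cpu_times()
              d_total = (total_now - total_prev) or 1
              busy.append(1.0 - (idle_now - idle_prev) / d_total)
              idle_prev, total_prev = idle_now, total_now
          return sorted(busy)

      busy = sample_busy()
      print(f"mean={sum(busy) / len(busy):.1%}  p99={busy[int(0.99 * (len(busy) - 1))]:.1%}")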

Queueing Theory and Capacity Planning

  • Multiple commenters connect this to classic queueing theory: above roughly 60% utilization, queueing delay grows quickly, and around 80% it can explode, depending on the workload (a worked M/M/1 example follows this list).
  • Some SREs treat 40–60% average CPU as “effectively full” for latency-sensitive systems, scaling out before utilization climbs any higher. Others argue IO‑bound apps can safely run hotter.
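
  A worked example of that knee, using the textbook M/M/1 result T = S / (1 − ρ), where S is the mean service time and ρ the utilization; real services are not M/M/1 queues, so treat the numbers as an intuition pump rather than a capacity model:

      def residence_multiplier(rho: float) -> float:
          # Mean time in an M/M/1 system, expressed as a multiple of service time.
          return 1.0 / (1.0 - rho)

      for rho in (0.5, 0.6, 0.8, 0.9, 0.95):
          print(f"utilization {rho:.0%}: mean latency ≈ {residence_multiplier(rho):.1f}x service time")

  The multiplier is 2.5x at 60% utilization, 5x at 80%, and 10x at 90%, which is the “explosion” the commenters describe.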

Benchmarks, Tooling, and Alternatives

  • stress-ng is noted to be designed to max out individual components rather than mimic real applications; real workloads (nginx, memcached, databases) often show “hockey stick” degradation near saturation.
  • Suggested tools/metrics: perf/ftrace for stalls and IPC, load average and run queue length, queue depth, RPS/latency, power usage, GPU FLOPs vs theoretical peak, etc. (a load/run-queue sketch follows this list).
  • Some argue utilization remains a useful “semi-crude” indicator when combined with business metrics (latency, RPS) and proper load testing.
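
  A minimal Linux-only sketch for two of the suggested signals, load average and instantaneous run-queue length, read from /proc/loadavg; note that Linux load averages also count tasks in uninterruptible (D) sleep, so they are not a pure CPU signal:

      def load_and_runqueue():
          # /proc/loadavg: "<1min> <5min> <15min> <running>/<total> <last pid>"
          with open("/proc/loadavg") as f:
              one, five, fifteen, running_total, _ = f.read().split()
          runnable = int(running_total.split("/")[0]) - 1  # exclude this reader
          return (float(one), float(five), float(fifteen)), max(runnable, 0)

      loadavg, runq = load_and_runqueue()
      print(f"load averages (1/5/15 min): {loadavg}, runnable tasks right now: {runq}")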

Other Themes

  • OS accounting mostly counts time a task was scheduled on a CPU; busy-waiting and memory stalls therefore still show as “busy” (a spin-vs-sleep demo follows this list).
  • Hyperthreading is disabled by default in some security-focused OSes; SMT also interacts with per-core licensing.
  • Several note that both CPU % and memory reporting in mainstream OS tools are simplistic and often misunderstood, yet still widely relied upon.
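
  To illustrate the “scheduled time, not useful work” point, a small self-contained sketch: a busy-wait and a blocking sleep take the same wall-clock time, but only the busy-wait accrues CPU time and would show a core as roughly 100% busy in top:

      import os
      import time

      def cpu_seconds():
          t = os.times()
          return t.user + t.system  # CPU time charged to this process so far

      # Busy-wait for ~1 s: no useful work, yet a full core shows as "busy".
      wall0, cpu0 = time.monotonic(), cpu_seconds()
      while time.monotonic() - wall0 < 1.0:
          pass
      print(f"spin : wall={time.monotonic() - wall0:.2f}s  cpu={cpu_seconds() - cpu0:.2f}s")

      # Blocking sleep for ~1 s: same wall time, near-zero CPU time.
      wall0, cpu0 = time.monotonic(), cpu_seconds()
      time.sleep(1.0)
      print(f"sleep: wall={time.monotonic() - wall0:.2f}s  cpu={cpu_seconds() - cpu0:.2f}s")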