3,200% CPU Utilization

Non-thread-safe collections and race conditions

  • Many commenters note this failure pattern is common: using non-thread-safe data structures (Java TreeMap, HashMap, .NET Dictionary) from multiple threads leads to bizarre bugs, including infinite loops.
  • Java docs explicitly state TreeMap is not synchronized; using it concurrently violates its contract regardless of what specific symptom appears.
  • ConcurrentModificationException is only a best-effort check for structural modification during iteration; it does not detect unsynchronized put calls from multiple threads, as in the story.
  • Several people recall similar production incidents: corrupt hash chains, corrupt dictionaries, and hard-to-debug livelocks.
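
As a minimal sketch of the safe alternatives to a shared TreeMap, the following compares Collections.synchronizedSortedMap with ConcurrentSkipListMap under concurrent writers. The class name SortedMapSharing, the fill helper, and the thread and key counts are hypothetical choices for illustration and do not come from the story.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class SortedMapSharing {
    // Hypothetical helper: starts `threads` writers, each inserting `perThread`
    // distinct keys into `map`, and returns the size after all writers finish.
    static int fill(SortedMap<Integer, Integer> map, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread; // disjoint key range per writer
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        // Option 1: a lock-based wrapper around TreeMap.
        SortedMap<Integer, Integer> wrapped =
                Collections.synchronizedSortedMap(new TreeMap<>());
        // Option 2: a sorted map designed for concurrent access.
        SortedMap<Integer, Integer> skipList = new ConcurrentSkipListMap<>();

        System.out.println(fill(wrapped, 8, 10_000));  // prints 80000
        System.out.println(fill(skipList, 8, 10_000)); // prints 80000

        // Passing a bare `new TreeMap<>()` instead would reproduce the contract
        // violation discussed above: the result is undefined and the call may
        // never return.
    }
}
```

Both options make each individual put safe; neither says anything about sequences of operations.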

From correctness bugs to performance catastrophes

  • Commenters highlight that race conditions don’t just corrupt data or deadlock; they can create cycles in internal structures and spin loops that peg all cores.
  • Others add that even without corruption, races can trigger redundant work (same job done many times, one result kept), manifesting as huge slowdowns.
  • Multiple anecdotes mention “can barely ssh in” situations when compute or I/O is saturated by pathological workloads.
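
The "same job done many times, one result kept" pattern can be sketched with a cache. This is an illustration under assumptions: the class name ComputeOnce, the key "report", and the uppercase stand-in for an expensive job are all hypothetical. The racy check-then-put variant appears only in a comment; the runnable code uses ConcurrentHashMap.computeIfAbsent, which applies the function at most once per key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class ComputeOnce {
    // Returns how many times the (stand-in) expensive job ran when `threads`
    // callers request the same key at roughly the same moment.
    static int countComputations(int threads) {
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        AtomicInteger computations = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all callers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // Racy alternative (not used here): every caller that sees
                // null would redo the work, and only the last put survives.
                //   if (cache.get(key) == null) cache.put(key, expensive(key));
                cache.computeIfAbsent("report", key -> {
                    computations.incrementAndGet();
                    return key.toUpperCase(); // hypothetical expensive job
                });
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return computations.get();
    }

    public static void main(String[] args) {
        System.out.println(countComputations(16)); // prints 1
    }
}
```

With the commented-out variant, the count could be anywhere from 1 to the number of callers, which is how a race shows up as a slowdown rather than as corruption.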

Concurrency models and language/tool support

  • Discussion compares approaches: Java/C#/C++ with manual locking; Rust’s “fearless concurrency” and ownership; Go’s channels and race detector; STM and actors; immutable data structures.
  • Consensus: concurrency primitives and “thread-safe” collections help but do not remove the need to reason about higher-level invariants and multi-operation transactions.
  • Examples: checking size() then indexing, or keeping two collections in sync, are still unsafe even with concurrent containers.
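
One way to handle both examples is to guard the whole multi-step operation with a single lock owned by the enclosing object. The sketch below assumes a hypothetical Registry class that keeps a list and an index map in step; none of these names come from the discussion.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Registry {
    private final Object lock = new Object();
    private final List<String> order = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    // Both collections change inside one critical section, so no other thread
    // can observe a name that is in `order` but missing from `positions`.
    public boolean add(String name) {
        synchronized (lock) {
            if (positions.containsKey(name)) {
                return false;
            }
            positions.put(name, order.size());
            order.add(name);
            return true;
        }
    }

    // The size check and the index lookup happen under the same lock, so the
    // list cannot shrink or grow between the two steps.
    public String last() {
        synchronized (lock) {
            return order.isEmpty() ? null : order.get(order.size() - 1);
        }
    }

    // Returns the insertion position of `name`, or -1 if it is unknown.
    public int positionOf(String name) {
        synchronized (lock) {
            Integer p = positions.get(name);
            return p == null ? -1 : p;
        }
    }
}
```

Swapping the two fields for concurrent containers would not remove the need for the lock, because the invariant spans both of them.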

Critique of the specific fixes

  • Wrapping TreeMap in Collections.synchronizedMap or swapping to a concurrent map makes only individual operations safe; multi-step sequences of operations on the owning object may still be racy.
  • The “track visited nodes to break cycles” idea is seen as a mitigation, not a real fix: the collection remains broken under races and may fail in other ways or future JDK versions.
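
To illustrate the first critique, here is a sketch of a compound operation on a synchronized wrapper. The BoundedLog class and its eviction policy are hypothetical. It relies on the documented convention that callers may synchronize on the map returned by Collections.synchronizedSortedMap to make a sequence of calls atomic.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

public class BoundedLog {
    private final SortedMap<Long, String> entries =
            Collections.synchronizedSortedMap(new TreeMap<>());
    private final int capacity;

    public BoundedLog(int capacity) {
        this.capacity = capacity;
    }

    // put, size, firstKey, and remove form one logical step. Each call is
    // individually synchronized by the wrapper, but without the outer block
    // another thread could run between them and break the size bound.
    public void record(long timestamp, String message) {
        synchronized (entries) {
            entries.put(timestamp, message);
            while (entries.size() > capacity) {
                entries.remove(entries.firstKey());
            }
        }
    }

    public int size() {
        return entries.size(); // a single operation: the wrapper is enough
    }

    // isEmpty followed by firstKey is again a two-step sequence.
    public Long oldest() {
        synchronized (entries) {
            return entries.isEmpty() ? null : entries.firstKey();
        }
    }
}
```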

Culture: warnings, tests, and maintenance

  • One thread debates whether “every warning/strange behavior must be fixed”: some argue strongly yes (otherwise you lose your mental model), others stress cost–benefit and project size.
  • Many advocate “warnings as errors” and keeping the codebase at zero warnings; others recount failed clean-up efforts with little visible ROI.
  • Another long subthread contrasts tests vs understanding: tests can’t prove correctness (especially under concurrency), but missing tests for known bugs is seen as a smell.

Operational aspects: CPU metrics and access

  • Some complain that CPU utilization is reported as a per-core sum that can exceed 100% rather than normalized to the whole machine; others like the summed convention because it makes single-thread bottlenecks easy to spot.
  • Suggestions for maintaining ssh access under load include cgroups/systemd resource reservations, CPU pinning, and prioritizing sshd over heavy workloads.
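
The two reporting conventions differ only by a division by the core count, as this small sketch shows. The class name CpuPercent and the 32-core machine are assumptions for illustration; the title figure of 3,200% is the only number taken from the source.

```java
public class CpuPercent {
    // Converts a per-core-summed percentage into a whole-machine percentage
    // by dividing by the number of logical cores.
    static double normalized(double summedPercent, int logicalCores) {
        if (logicalCores <= 0) {
            throw new IllegalArgumentException("logicalCores must be positive");
        }
        return summedPercent / logicalCores;
    }

    // Fewest fully busy cores that could account for the summed figure.
    static int minBusyCores(double summedPercent) {
        return (int) Math.ceil(summedPercent / 100.0);
    }

    public static void main(String[] args) {
        // Assumed machine: 32 logical cores.
        System.out.println(normalized(3200.0, 32)); // prints 100.0
        System.out.println(minBusyCores(3200.0));   // prints 32
        // One spinning thread on the same machine: obvious as 100% summed,
        // easy to overlook once normalized.
        System.out.println(normalized(100.0, 32));  // prints 3.125
    }
}
```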