2025-02-28

3,200% CPU Utilization

Non-thread-safe collections and race conditions

Many commenters note this failure pattern is common: using non-thread-safe data structures (Java TreeMap, HashMap, .NET Dictionary) from multiple threads leads to bizarre bugs, including infinite loops.
The Java documentation explicitly states that TreeMap is not synchronized; using it concurrently violates its contract regardless of which specific symptom appears.
ConcurrentModificationException is only a best-effort check raised by fail-fast iterators when a collection is structurally modified during iteration; it does not detect concurrent put calls from multiple threads like those in the story.
Several people recall similar production incidents: corrupt hash chains, corrupt dictionaries, and hard-to-debug livelocks.
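
To illustrate the point about ConcurrentModificationException, here is a minimal single-threaded sketch. The class and method names (FailFastDemo, putDuringIterationThrows) are hypothetical and not taken from the story. It shows that the exception comes from the iterator's modification check, so it can fire with no second thread involved, and by the same token nothing guarantees it fires when two threads call put at once.

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.TreeMap;

class FailFastDemo {
    /** Returns true if a put during iteration, on one thread, trips the fail-fast check. */
    static boolean putDuringIterationThrows() {
        Map<Integer, String> map = new TreeMap<>();
        map.put(1, "a");
        map.put(2, "b");
        map.put(3, "c");
        try {
            for (Integer key : map.keySet()) {
                // Adding a new key is a structural modification made while
                // the keySet iterator is still live.
                map.put(key + 100, "new");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // Raised by the iterator on its next step, not by put itself.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + putDuringIterationThrows());
    }
}
```

Two threads calling put with no iterator in play never reach this check at all, which is consistent with the silent corruption commenters describe.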

From correctness bugs to performance catastrophes

Commenters highlight that race conditions don’t just corrupt data or deadlock; they can create cycles in internal structures and spin loops that peg all cores.
Others add that even without corruption, races can trigger redundant work (same job done many times, one result kept), manifesting as huge slowdowns.
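
The redundant-work case can be sketched as follows. This is an illustration under assumptions, not code from the discussion: ComputeOnceDemo, runWorkers, and the "report" key are hypothetical names. A check-then-put cache (containsKey followed by put) lets several threads compute the same value; ConcurrentHashMap.computeIfAbsent instead runs the mapping function at most once per key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class ComputeOnceDemo {
    /** Starts `threads` workers that all request the same key; returns how often the value was computed. */
    static int runWorkers(int threads) {
        ConcurrentHashMap<String, Integer> cache = new ConcurrentHashMap<>();
        AtomicInteger computations = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers together to provoke contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // Atomic per key: the function runs at most once even under contention.
                cache.computeIfAbsent("report", k -> {
                    computations.incrementAndGet(); // stands in for the expensive job
                    return k.length();
                });
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return computations.get();
    }

    public static void main(String[] args) {
        System.out.println("computations: " + runWorkers(8));
    }
}
```

The mapping function runs while the map holds an internal lock for that key's bin, so it should stay short and must not update the same map.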
Multiple anecdotes mention “can barely ssh in” situations when compute or I/O is saturated by pathological workloads.

Concurrency models and language/tool support

Discussion compares approaches: Java/C#/C++ with manual locking; Rust’s “fearless concurrency” and ownership; Go’s channels and race detector; STM and actors; immutable data structures.
Consensus: concurrency primitives and “thread-safe” collections help but do not remove the need to reason about higher-level invariants and multi-operation transactions.
Examples: checking size() then indexing, or keeping two collections in sync, are still unsafe even with concurrent containers.
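
The size()-then-index example can be sketched like this. LastElement, lastUnsafe, and lastSafe are hypothetical names, and the sketch assumes the list was created with Collections.synchronizedList and holds no null elements. Each call on the wrapper is individually synchronized, but the pair is not, so the safe version holds the wrapper's monitor across both calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class LastElement {
    // Racy even on a synchronized list: another thread may remove an
    // element between size() and get(), causing IndexOutOfBoundsException.
    static <T> T lastUnsafe(List<T> list) {
        return list.get(list.size() - 1);
    }

    // Holds the wrapper's own monitor across both calls, so the list
    // cannot change between the size check and the read.
    static <T> Optional<T> lastSafe(List<T> synchronizedList) {
        synchronized (synchronizedList) {
            int n = synchronizedList.size();
            return n == 0 ? Optional.empty() : Optional.of(synchronizedList.get(n - 1));
        }
    }

    public static void main(String[] args) {
        List<String> jobs = Collections.synchronizedList(new ArrayList<>(List.of("a", "b", "c")));
        System.out.println(lastSafe(jobs));
    }
}
```

This only works because every other writer also goes through the same wrapper; code that touches the backing ArrayList directly bypasses the lock.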

Critique of the specific fixes

Wrapping the TreeMap in Collections.synchronizedMap, or swapping it for a concurrent map, makes only individual operations safe; sequences of operations on the owning object can still race.
The “track visited nodes to break cycles” idea is seen as a mitigation, not a real fix: the collection remains broken under races and may fail in other ways or future JDK versions.
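
One way to address the "sequences of operations" concern is to put the lock in the owning object instead of the collection. The sketch below is an assumption-laden illustration, not the fix from the story: Ledger, deposit, and stress are hypothetical names. A plain TreeMap and a derived total are updated together under one private lock, so the invariant between them is never visible half-updated.

```java
import java.util.TreeMap;

class Ledger {
    private final Object lock = new Object();
    private final TreeMap<String, Long> balances = new TreeMap<>();
    private long total; // invariant: equals the sum of all values in balances

    void deposit(String account, long amount) {
        synchronized (lock) {
            // Both updates happen under the same lock, as one unit.
            balances.merge(account, amount, Long::sum);
            total += amount;
        }
    }

    long total() {
        synchronized (lock) {
            return total;
        }
    }

    boolean invariantHolds() {
        synchronized (lock) {
            long sum = 0;
            for (long v : balances.values()) {
                sum += v;
            }
            return sum == total;
        }
    }

    /** Hammers one ledger from several threads and returns it for inspection. */
    static Ledger stress(int threads, int depositsPerThread) {
        Ledger ledger = new Ledger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            String account = "acct-" + (i % 4); // several threads share each account
            workers[i] = new Thread(() -> {
                for (int n = 0; n < depositsPerThread; n++) {
                    ledger.deposit(account, 1);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return ledger;
    }
}
```

Because the TreeMap never escapes the class, no caller can touch it without the lock, which a synchronized wrapper alone cannot enforce for the accompanying total.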

Culture: warnings, tests, and maintenance

One thread debates whether “every warning/strange behavior must be fixed”: some argue strongly yes (otherwise you lose your mental model), others stress cost–benefit and project size.
Many advocate “warnings as errors” and keeping the codebase at zero warnings; others recount failed clean-up efforts with little visible ROI.
Another long subthread contrasts tests with understanding: tests can’t prove correctness (especially under concurrency), but the absence of tests for known bugs is seen as a smell.

Operational aspects: CPU metrics and access

Some complain about how CPU utilization is reported (per-core percentages summed, so readings can exceed 100%, versus normalized to the whole machine), but others like the current convention because it makes single-thread bottlenecks easy to spot.
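
The two reporting conventions differ only by a division by the core count. A minimal sketch, with CpuPercent and normalize as hypothetical names and the 32-core figure chosen purely for the arithmetic (the discussion does not state the machine's core count):

```java
class CpuPercent {
    /** Converts a per-core summed percentage into a whole-machine percentage. */
    static double normalize(double summedPercent, int logicalCores) {
        if (logicalCores <= 0) {
            throw new IllegalArgumentException("logicalCores must be positive");
        }
        return summedPercent / logicalCores;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // One fully busy thread reads as 100% summed, but far less once normalized.
        System.out.println("one busy thread on this machine: " + normalize(100.0, cores) + "%");
        // 3,200% summed is 32 fully busy cores' worth of work.
        System.out.println("3200% on 32 cores: " + normalize(3200.0, 32) + "%");
    }
}
```

Under the summed convention a single pegged thread shows up as roughly 100%, while the normalized figure shrinks as the core count grows, which is why some commenters prefer the summed form.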
Suggestions for maintaining ssh access under load include cgroups/systemd resource reservations, CPU pinning, and prioritizing sshd over heavy workloads.

Related topics