3,200% CPU Utilization
Non-thread-safe collections and race conditions
- Many commenters note that this failure pattern is common: using non-thread-safe data structures (Java TreeMap and HashMap, .NET Dictionary) from multiple threads leads to bizarre bugs, including infinite loops.
- The Java docs explicitly state that TreeMap is not synchronized; using it concurrently violates its contract regardless of which specific symptom appears. ConcurrentModificationException only catches iterator invalidation, not multi-threaded concurrent put calls like those in the story.
- Several people recall similar production incidents: corrupt hash chains, corrupt dictionaries, and hard-to-debug livelocks.
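As a rough illustration of the contract point above, the sketch below runs concurrent put calls against java.util.concurrent.ConcurrentSkipListMap, a sorted map that is documented as safe for concurrent use. The class name, the helper fillConcurrently, and the thread and key counts are hypothetical, and this is not the fix used in the story. The unsafe TreeMap variant is deliberately left out, because it may hang or corrupt the map.

```java
import java.util.concurrent.ConcurrentSkipListMap;

class ConcurrentPutSketch {
    // Hypothetical helper: each thread inserts `perThread` distinct keys
    // into one shared sorted map, then the final size is returned.
    static int fillConcurrently(int threads, int perThread) {
        ConcurrentSkipListMap<Integer, Integer> map = new ConcurrentSkipListMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread; // disjoint key range per thread
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, i); // concurrent put is within this map's contract
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(8, 10_000)); // prints 80000
    }
}
```

Swapping the map type to a plain TreeMap would compile unchanged, which is part of why this class of bug is easy to introduce.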
From correctness bugs to performance catastrophes
- Commenters highlight that race conditions don’t just corrupt data or deadlock; they can create cycles in internal structures and spin loops that peg all cores.
- Others add that even without corruption, races can trigger redundant work (same job done many times, one result kept), manifesting as huge slowdowns.
- Multiple anecdotes mention “can barely ssh in” situations when compute or I/O is saturated by pathological workloads.
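The redundant-work point can be sketched as follows, assuming a shared cache keyed by job name. ConcurrentHashMap.computeIfAbsent is documented to run atomically, so the mapping function is applied at most once per absent key, whereas a separate containsKey check followed by put would let every racing thread do the work. The class name, the helper runWorkers, and the "job" key are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ComputeOnceSketch {
    // Hypothetical helper: several threads request the same job result.
    // Returns how many times the "expensive" computation actually ran.
    static int runWorkers(int threads) {
        ConcurrentHashMap<String, Integer> cache = new ConcurrentHashMap<>();
        AtomicInteger computations = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() ->
                // Atomic per key: losers of the race wait and reuse the result.
                // A racy alternative would be:
                //   if (!cache.containsKey("job")) cache.put("job", compute());
                cache.computeIfAbsent("job", key -> {
                    computations.incrementAndGet(); // stands in for costly work
                    return key.length();
                }));
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return computations.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(16)); // prints 1
    }
}
```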
Concurrency models and language/tool support
- Discussion compares approaches: Java/C#/C++ with manual locking; Rust’s “fearless concurrency” and ownership; Go’s channels and race detector; STM and actors; immutable data structures.
- Consensus: concurrency primitives and “thread-safe” collections help but do not remove the need to reason about higher-level invariants and multi-operation transactions.
- Examples: checking size() then indexing, or keeping two collections in sync, is still unsafe even with concurrent containers.
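The size()-then-index example can be made concrete with a minimal sketch. On a CopyOnWriteArrayList, each call is thread-safe on its own, but another thread may remove an element between size() and get(), so the pair can still throw. One option, assuming a point-in-time view is acceptable, is to read from a single snapshot. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class SnapshotReadSketch {
    // Racy even on a concurrent list: the list may shrink between the calls.
    //   T last = list.get(list.size() - 1);

    // Reads the last element from one snapshot, so the length check and
    // the index refer to the same array.
    static <T> T lastOrNull(CopyOnWriteArrayList<T> list) {
        Object[] snapshot = list.toArray(); // single point-in-time copy
        if (snapshot.length == 0) {
            return null;
        }
        @SuppressWarnings("unchecked")
        T last = (T) snapshot[snapshot.length - 1];
        return last;
    }

    public static void main(String[] args) {
        CopyOnWriteArrayList<Integer> xs = new CopyOnWriteArrayList<>(List.of(1, 2, 3));
        System.out.println(lastOrNull(xs)); // prints 3
    }
}
```

The snapshot only makes this one read consistent. Keeping two collections in sync still needs a lock or a single combined structure.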
Critique of the specific fixes
- Wrapping TreeMap in Collections.synchronizedMap or swapping to a concurrent map only makes single operations safe; sequences of operations on the owning object may still be racy.
- The “track visited nodes to break cycles” idea is seen as a mitigation, not a real fix: the collection remains broken under races and may fail in other ways or in future JDK versions.
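To illustrate the first critique, here is a minimal sketch of a read-modify-write on a TreeMap wrapped with Collections.synchronizedMap. Each get and put is individually synchronized, but the pair is not atomic unless the caller holds a lock across both. The sketch synchronizes on the wrapper itself, which is what the Javadoc prescribes for iteration; treating that same lock as covering this compound update is an assumption here. The class name, the helper countConcurrently, and the "hits" key are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

class CompoundOpSketch {
    // Hypothetical helper: each thread increments one shared counter entry.
    static int countConcurrently(int threads, int perThread) {
        Map<String, Integer> counts = Collections.synchronizedMap(new TreeMap<>());
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // Without this block, two threads could read the same old
                    // value and one increment would be lost.
                    synchronized (counts) {
                        int old = counts.getOrDefault("hits", 0);
                        counts.put("hits", old + 1);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counts.getOrDefault("hits", 0);
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 1_000)); // prints 4000
    }
}
```

A single atomic call such as merge("hits", 1, Integer::sum) covers this particular case more simply, but invariants that span several fields of the owning object still need a lock at that level.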
Culture: warnings, tests, and maintenance
- One thread debates whether “every warning/strange behavior must be fixed”: some argue strongly yes (otherwise you lose your mental model), others stress cost–benefit and project size.
- Many advocate “warnings as errors” and keeping the codebase at zero warnings; others recount failed clean-up efforts with little visible ROI.
- Another long subthread contrasts tests vs understanding: tests can’t prove correctness (especially under concurrency), but missing tests for known bugs is seen as a smell.
Operational aspects: CPU metrics and access
- Some complain about CPU utilization reporting (per-core summed >100% vs normalized), but others like the current convention for spotting single-thread bottlenecks.
- Suggestions for maintaining ssh access under load include cgroups/systemd resource reservations, CPU pinning, and prioritizing sshd over heavy workloads.