I wasted weeks hand-optimizing assembly because I benchmarked on random data

Java and low‑latency / trading use cases

  • Several commenters note real-world trading systems whose “hot path” is in Java or C#, often with patterns like: no allocations in the hot path, GC disabled or effectively idle, and huge RAM/CPU overprovisioning.
  • Others see this as “writing C in Java”: lots of primitives, object pools, very limited String use, and even custom JIT tweaks or JVM forks.
  • Some argue Rust/C/C++ would be more natural for ultra‑low latency; others counter that Java offers strong memory safety, a large talent pool, and JIT tricks (pointer compression, dynamic realignment) that can make it surprisingly competitive.
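The "writing C in Java" style mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not code from any real trading system: the `Order`/`OrderPool` names are invented, and real pools track free slots rather than using a bare ring index.

```java
// Hypothetical sketch of "writing C in Java": a preallocated pool of mutable
// primitive-only objects, so the hot path allocates nothing and the GC stays idle.
public final class OrderPool {
    // Mutable record reused across messages; primitives only, no String/BigDecimal.
    public static final class Order {
        long id;
        long priceTicks;   // price as integer ticks
        int quantity;
    }

    private final Order[] pool;
    private int next;

    public OrderPool(int capacity) {
        pool = new Order[capacity];
        for (int i = 0; i < capacity; i++) pool[i] = new Order(); // allocate once, up front
    }

    // Hot path: hand out a preallocated object; no 'new', no boxing.
    public Order acquire() {
        Order o = pool[next];
        next = (next + 1) % pool.length; // simple ring; real pools track free slots
        return o;
    }

    public static void main(String[] args) {
        OrderPool p = new OrderPool(4);
        Order a = p.acquire();
        a.id = 1; a.priceTicks = 10_050; a.quantity = 100;
        System.out.println(a.priceTicks * a.quantity); // all-primitive arithmetic
    }
}
```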

GC behavior and memory model debates

  • Azul’s “pauseless” C4 collector is cited; another commenter clarifies that GC always has work to do, but C4 does it concurrently so application pauses are negligible.
  • A long thread debates whether Java’s boxing and String design impose a “heavy cost”, or whether generational GC makes most allocations little more than pointer bumps.
  • Counterarguments stress GC pressure, cache misses, and pointer chasing, especially with arrays of boxed types.
  • Future/value types (Project Valhalla) and .NET’s value/Span machinery are discussed as attempts to fix long‑standing layout/boxing pain.
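The layout difference behind the boxing debate can be sketched in a few lines: a `long[]` is one contiguous block of 8-byte values, while a `Long[]` is an array of references to separately allocated heap objects, so traversing it chases a pointer per element. (This illustrates the pointer-chasing argument only; it is not a benchmark.)

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrates the memory-layout gap between primitive and boxed arrays:
// same arithmetic, very different memory traffic.
public final class BoxedVsPrimitive {
    static long sumPrimitive(long[] a) {
        long s = 0;
        for (long v : a) s += v;      // sequential reads over contiguous memory
        return s;
    }

    static long sumBoxed(Long[] a) {
        long s = 0;
        for (Long v : a) s += v;      // each element dereferences (and unboxes) a heap object
        return s;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long[] prim = new long[n];
        Long[] boxed = new Long[n];
        for (int i = 0; i < n; i++) {
            long v = ThreadLocalRandom.current().nextLong(1000);
            prim[i] = v;
            boxed[i] = v;             // autoboxing: an object (or cache hit) per element
        }
        System.out.println(sumPrimitive(prim) == sumBoxed(boxed)); // prints true
    }
}
```

Project Valhalla’s value types aim to let the boxed case be flattened into the primitive-style layout.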

Benchmarking and data distributions

  • Central lesson: microbenchmarks must use data distributions that match production; random data can either be “too adversarial” (as in the article) or “too nice” (e.g., well‑conditioned random matrices).
  • Identifying representative scenarios is described as one of the hardest parts of performance work, especially on the web. Tools mentioned: continuous profiling, RUM, tracing, JS self‑profiling APIs.
  • Legal/privacy often block simply capturing real production inputs; even aggregate statistics can be sensitive.
  • Profile‑guided optimization (offline or built into JITs) helps, but cannot replace good workload modeling.
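The central lesson above can be made concrete with a tiny sketch: measure how often a varint would take the one-byte fast path under uniform-random inputs versus a skewed, production-like generator. The geometric-style generator below is purely illustrative, standing in for real workload data.

```java
import java.util.Random;

// Sketch of why data distribution matters: uniform-random 32-bit ints almost
// never fit a 1-byte varint (< 128), while a skewed small-value distribution
// almost always does — so the two benchmarks exercise opposite code paths.
public final class DistributionSketch {
    static double oneByteFraction(java.util.function.IntSupplier gen, int samples) {
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            if (Integer.toUnsignedLong(gen.getAsInt()) < 128) hits++;
        }
        return (double) hits / samples;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // "Random data": uniform over all 32-bit ints → essentially 0% one-byte values.
        double uniform = oneByteFraction(rnd::nextInt, 100_000);
        // "Production-like": small values dominate → mostly one-byte values.
        double skewed = oneByteFraction(() -> (int) (-Math.log(rnd.nextDouble()) * 20), 100_000);
        System.out.printf("uniform: %.3f  skewed: %.3f%n", uniform, skewed);
    }
}
```

A benchmark fed the first generator rewards SIMD-heavy multi-byte decoding; fed the second, it rewards a trivial branch.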

Optimization stories, value, and risk

  • Multiple anecdotes echo the article: elaborate optimizations beaten by a trivial fast path for the dominant case; huge engineering effort yielding modest real‑world gains.
  • Some see “wasted” assembly work as valuable skill‑building and proof‑of‑concept experience; others warn about burnout and the risk of losing sight of end‑user impact.
  • A recurring theme: write simple, obviously correct code first; only optimize after profiling on realistic loads, and be prepared to throw experimental code away.

Varint / LEB128 performance discussion

  • Commenters dig into LEB128/varint encoding: SIMD can greatly accelerate worst‑case multi‑byte decodes, but real workloads often consist mostly of 1–2 byte values where simple branches win.
  • Alternative encodings (e.g., MKV’s) are praised as more self‑synchronizing and stream‑friendly.
  • Streaming and very large messages complicate SIMD tricks, since you can’t safely over‑read or pre‑buffer arbitrary bytes.
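A minimal unsigned LEB128 decoder shows the shape of the argument: when most values fit in 1–2 bytes, a single branch on the continuation bit handles the common case, and the multi-byte loop that SIMD would accelerate is rarely taken. (Sketch only; a real decoder would also report bytes consumed and bounds-check the buffer.)

```java
// Unsigned LEB128: each byte carries 7 payload bits, little-endian;
// the high bit set means "more bytes follow".
public final class Leb128 {
    static long decode(byte[] buf, int pos) {
        long b = buf[pos] & 0xFFL;
        if (b < 0x80) return b;               // 1-byte fast path: the common case
        long result = b & 0x7F;
        int shift = 7;
        do {                                   // slow path: the rare multi-byte values
            b = buf[++pos] & 0xFFL;
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while (b >= 0x80);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Leb128.decode(new byte[]{0x05}, 0));                // prints 5
        System.out.println(Leb128.decode(new byte[]{(byte) 0xE5, 0x0E}, 0));   // prints 1893
    }
}
```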