I wasted weeks hand-optimizing assembly because I benchmarked on random data

Java and low‑latency / trading use cases

  • Several commenters note real-world trading systems whose “hot path” is in Java or C#, often with patterns like: no allocations in the hot path, GC disabled or effectively idle, and huge RAM/CPU overprovisioning.
  • Others see this as “writing C in Java”: lots of primitives, object pools, very limited String use, and even custom JIT tweaks or JVM forks.
  • Some argue Rust/C/C++ would be more natural for ultra‑low latency; others counter that Java offers strong memory safety, a large talent pool, and JIT tricks (pointer compression, dynamic realignment) that can make it surprisingly competitive.
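The "writing C in Java" style mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not code from any real trading system: the `Order`/`OrderPool` names are invented, and real pools track free slots rather than using a bare ring index.

```java
// Hypothetical sketch of "writing C in Java": a preallocated pool of mutable
// primitive-only objects, so the hot path allocates nothing and the GC stays idle.
public final class OrderPool {
    // Mutable record reused across messages; primitives only, no String/BigDecimal.
    public static final class Order {
        long id;
        long priceTicks;   // price as integer ticks
        int quantity;
    }

    private final Order[] pool;
    private int next;

    public OrderPool(int capacity) {
        pool = new Order[capacity];
        for (int i = 0; i < capacity; i++) pool[i] = new Order(); // allocate once, up front
    }

    // Hot path: hand out a preallocated object; no 'new', no boxing.
    public Order acquire() {
        Order o = pool[next];
        next = (next + 1) % pool.length; // simple ring; real pools track free slots
        return o;
    }

    public static void main(String[] args) {
        OrderPool p = new OrderPool(4);
        Order a = p.acquire();
        a.id = 1; a.priceTicks = 10_050; a.quantity = 100;
        System.out.println(a.priceTicks * a.quantity); // all-primitive arithmetic
    }
}
```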

GC behavior and memory model debates

  • Azul’s “pauseless” C4 collector is cited; another commenter clarifies that GC always has work to do, but C4 does it concurrently so application pauses are negligible.
  • A long thread debates whether Java’s boxing and String design impose a “heavy cost”, or whether generational GC makes most allocations little more than pointer bumps.
  • Counterarguments stress GC pressure, cache misses, and pointer chasing, especially with arrays of boxed types.
  • Future/value types (Project Valhalla) and .NET’s value/Span machinery are discussed as attempts to fix long‑standing layout/boxing pain.
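The layout difference behind the boxing debate can be sketched in a few lines: a `long[]` is one contiguous block of 8-byte values, while a `Long[]` is an array of references to separately allocated heap objects, so traversing it chases a pointer per element. (This illustrates the pointer-chasing argument only; it is not a benchmark.)

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrates the memory-layout gap between primitive and boxed arrays:
// same arithmetic, very different memory traffic.
public final class BoxedVsPrimitive {
    static long sumPrimitive(long[] a) {
        long s = 0;
        for (long v : a) s += v;      // sequential reads over contiguous memory
        return s;
    }

    static long sumBoxed(Long[] a) {
        long s = 0;
        for (Long v : a) s += v;      // each element dereferences (and unboxes) a heap object
        return s;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long[] prim = new long[n];
        Long[] boxed = new Long[n];
        for (int i = 0; i < n; i++) {
            long v = ThreadLocalRandom.current().nextLong(1000);
            prim[i] = v;
            boxed[i] = v;             // autoboxing: an object (or cache hit) per element
        }
        System.out.println(sumPrimitive(prim) == sumBoxed(boxed)); // prints true
    }
}
```

Project Valhalla’s value types aim to let the boxed case be flattened into the primitive-style layout.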

Benchmarking and data distributions

  • Central lesson: microbenchmarks must use data distributions that match production; random data can either be “too adversarial” (as in the article) or “too nice” (e.g., well‑conditioned random matrices).
  • Identifying representative scenarios is described as one of the hardest parts of performance work, especially on the web. Tools mentioned: continuous profiling, RUM, tracing, JS self‑profiling APIs.
  • Legal/privacy often block simply capturing real production inputs; even aggregate statistics can be sensitive.
  • Profile‑guided optimization (offline or built into JITs) helps, but cannot replace good workload modeling.
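The central lesson above can be made concrete with a tiny sketch: measure how often a varint would take the one-byte fast path under uniform-random inputs versus a skewed, production-like generator. The geometric-style generator below is purely illustrative, standing in for real workload data.

```java
import java.util.Random;

// Sketch of why data distribution matters: uniform-random 32-bit ints almost
// never fit a 1-byte varint (< 128), while a skewed small-value distribution
// almost always does — so the two benchmarks exercise opposite code paths.
public final class DistributionSketch {
    static double oneByteFraction(java.util.function.IntSupplier gen, int samples) {
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            if (Integer.toUnsignedLong(gen.getAsInt()) < 128) hits++;
        }
        return (double) hits / samples;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // "Random data": uniform over all 32-bit ints → essentially 0% one-byte values.
        double uniform = oneByteFraction(rnd::nextInt, 100_000);
        // "Production-like": small values dominate → mostly one-byte values.
        double skewed = oneByteFraction(() -> (int) (-Math.log(rnd.nextDouble()) * 20), 100_000);
        System.out.printf("uniform: %.3f  skewed: %.3f%n", uniform, skewed);
    }
}
```

A benchmark fed the first generator rewards SIMD-heavy multi-byte decoding; fed the second, it rewards a trivial branch.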

Optimization stories, value, and risk

  • Multiple anecdotes echo the article: elaborate optimizations beaten by a trivial fast path for the dominant case; huge engineering effort yielding modest real‑world gains.
  • Some see “wasted” assembly work as valuable skill‑building and proof‑of‑concept experience; others warn about burnout and the risk of losing sight of end‑user impact.
  • A recurring theme: write simple, obviously correct code first; only optimize after profiling on realistic loads, and be prepared to throw experimental code away.

Varint / LEB128 performance discussion

  • Commenters dig into LEB128/varint encoding: SIMD can greatly accelerate worst‑case multi‑byte decodes, but real workloads often consist mostly of 1–2 byte values where simple branches win.
  • Alternative encodings (e.g., MKV’s) are praised as more self‑synchronizing and stream‑friendly.
  • Streaming and very large messages complicate SIMD tricks, since you can’t safely over‑read or pre‑buffer arbitrary bytes.
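A minimal unsigned LEB128 decoder shows the shape of the argument: when most values fit in 1–2 bytes, a single branch on the continuation bit handles the common case, and the multi-byte loop that SIMD would accelerate is rarely taken. (Sketch only; a real decoder would also report bytes consumed and bounds-check the buffer.)

```java
// Unsigned LEB128: each byte carries 7 payload bits, little-endian;
// the high bit set means "more bytes follow".
public final class Leb128 {
    static long decode(byte[] buf, int pos) {
        long b = buf[pos] & 0xFFL;
        if (b < 0x80) return b;               // 1-byte fast path: the common case
        long result = b & 0x7F;
        int shift = 7;
        do {                                   // slow path: the rare multi-byte values
            b = buf[++pos] & 0xFFL;
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while (b >= 0x80);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Leb128.decode(new byte[]{0x05}, 0));                // prints 5
        System.out.println(Leb128.decode(new byte[]{(byte) 0xE5, 0x0E}, 0));   // prints 1893
    }
}
```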