We found a bug in Go's ARM64 compiler
Assembly, stack pointer rules, and unwinding
- Several comments dissect the root cause: Go’s ARM64 backend split a large stack-pointer adjustment into two instructions. A goroutine preempted between them was left with a stack pointer that was temporarily inconsistent for the unwinder/GC.
- People discuss “stack moves once” as an implicit invariant common in some ABIs/runtimes (especially with stack-walking GCs), contrasting it with C/C++ ABIs that use expressive unwind metadata (DWARF, Microsoft/Itanium-style bytecode) to track SP on a per-instruction basis.
- Alternatives proposed: build the full constant in a temporary register (for example with a MOV/MOVK sequence) and do a single ADD; some mention the LDR pseudo-instruction but note Go prefers register-constructed immediates.
- There’s debate whether the fix “belongs” in the compiler, the assembler, or unwinder tables; one camp calls it fundamentally a codegen bug, another frames it as missing/unexpressive unwind info.
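To make the window described in this section concrete, here is a minimal sketch in Go. It relies on one fact about ARM64: the immediate form of ADD/SUB encodes a 12-bit value, optionally shifted left by 12. The helper names (`splitImm`, `spStates`), the order of the two ADDs, and the example addresses are illustrative assumptions, not the Go compiler's actual code.

```go
package main

import "fmt"

// splitImm splits a frame size into the two pieces that fit the ARM64
// ADD/SUB immediate form: a 12-bit value, optionally shifted left by 12.
// ok is false when the size needs more than 24 bits.
// (Hypothetical helper for illustration only.)
func splitImm(size uint32) (lo, hi uint32, ok bool) {
	if size >= 1<<24 {
		return 0, 0, false
	}
	return size & 0xfff, size >> 12, true
}

// spStates lists every value SP holds while a frame of the given size is
// released: once with two immediate ADDs, once with a single ADD from a
// temporary register. Applying the low part first is an assumption.
func spStates(sp uint64, size uint32) (twoStep, oneStep []uint64) {
	lo, hi, ok := splitImm(size)
	if !ok {
		return nil, nil
	}
	final := sp + uint64(size)
	oneStep = []uint64{sp, final}
	if lo == 0 || hi == 0 {
		// One immediate is enough, so there is no intermediate value.
		return []uint64{sp, final}, oneStep
	}
	// After the first ADD, SP matches neither the old nor the new frame.
	mid := sp + uint64(lo)
	return []uint64{sp, mid, final}, oneStep
}

func main() {
	two, one := spStates(0x40000000, 0x12345)
	fmt.Printf("two ADDs: %#x\n", two)
	fmt.Printf("one ADD:  %#x\n", one)
}
```

The middle entry of the two-step list is the value a stack walker would observe if the goroutine were interrupted between the two instructions. The single-ADD variant has no such entry, which is the point of the "build the constant in a register first" proposal.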
Go’s runtime / tooling design choices
- Some criticize Go’s “NIH” tendencies (custom assembler, custom linker, signal-based preemption, rewriting the PC inside signal handlers) as fragile, arguing they invite subtle bugs.
- Others defend these as standard for serious language runtimes (e.g., HotSpot uses signals too) and argue complex invariants are unavoidable for async GC and M:N scheduling.
- A recurring theme: Go’s runtime has strong hidden invariants (like “SP always valid”) that aren’t systematically verified; commenters suggest more explicit documentation, tests that inspect generated machine code, and perhaps formal methods or certified toolchains for critical pieces.
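The suggestion of tests that inspect generated machine code could look roughly like the sketch below. It assumes a simplified listing with one Go-syntax ARM64 instruction per line and the destination as the last operand. The function names are hypothetical, and it only recognizes ADD/SUB, so it would miss other instructions that write RSP (for example pre-indexed stores).

```go
package main

import (
	"fmt"
	"strings"
)

// adjustsSP reports whether one line of Go-syntax ARM64 assembly is an ADD
// or SUB whose destination (the last operand) is RSP.
func adjustsSP(line string) bool {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return false
	}
	if fields[0] != "ADD" && fields[0] != "SUB" {
		return false
	}
	return fields[len(fields)-1] == "RSP"
}

// splitAdjustments returns the index of every instruction that adjusts RSP
// immediately after another RSP adjustment, i.e. a "stack moves twice" pair.
func splitAdjustments(listing []string) []int {
	var hits []int
	for i := 1; i < len(listing); i++ {
		if adjustsSP(listing[i-1]) && adjustsSP(listing[i]) {
			hits = append(hits, i)
		}
	}
	return hits
}

func main() {
	// Hand-written example listing, not real compiler output.
	epilogue := []string{
		"ADD $837, RSP, RSP",
		"ADD $73728, RSP, RSP",
		"RET",
	}
	fmt.Println("suspicious instruction indexes:", splitAdjustments(epilogue))
}
```

A check like this turns the implicit "SP moves once" rule into something a test suite can fail on, although a real version would need to parse actual disassembler output rather than this simplified form.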
Debugging experience and rarity of such bugs
- Many praise the write-up’s clarity and narrative, saying it showcases disciplined, high-level debugging skill.
- Several note how hard it is to even suspect the compiler; most developers assume their own code is wrong, so these bugs are disproportionately time‑consuming.
- There’s a split between people who find this kind of deep, racy compiler/runtime bug “fun” and those who find it hellish and satisfying only in hindsight.
- Anecdotes: earlier eras saw more compiler bugs; today they’re rarer but still show up in domains that push compilers hard (HFT, low-level systems code).
Cloudflare engineering, scale, and infrastructure
- Commenters admire Cloudflare’s culture of “no unexplained crashes,” noting this policy comes from past incidents and justifies spending serious time on rare bugs.
- The post reinforces a perception of Cloudflare as doing unusually deep engineering rather than chasing “ML buzz,” prompting multiple readers to consider applying.
- There’s discussion of remote work versus location requirements and of compensation, with mixed experiences reported.
- On infrastructure, people note Cloudflare’s long-running ARM experiments (Ampere Altra) alongside EPYC, especially at the edge; others point out Cloudflare uses both Go and Rust and is far from single-language.