We found a bug in Go's ARM64 compiler

Assembly, stack pointer rules, and unwinding

  • Several comments dissect the root cause: Go’s ARM64 backend split a large stack-pointer adjustment into two instructions, allowing preemption between them and leaving the stack pointer temporarily inconsistent for the unwinder/GC.
  • People discuss “stack moves once” as an implicit invariant common in some ABIs/runtimes (especially with stack-walking GCs), contrasting it with C/C++ ABIs that use expressive unwind metadata (DWARF, Microsoft/Itanium-style bytecode) to track SP on a per-instruction basis.
  • Alternatives proposed: build the full constant in a temp register and do one ADD, or use MOV/MOVK sequences; some mention the LDR pseudo-instruction but note Go prefers register-constructed immediates.
  • There’s debate whether the fix “belongs” in the compiler, the assembler, or unwinder tables; one camp calls it fundamentally a codegen bug, another frames it as missing/unexpressive unwind info.
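
To make the split-adjustment hazard concrete, here is a minimal Go sketch. It is not the Go compiler's actual code generation: the helper names (fitsSingleAdd, splitAdd, spStates), the frame size, and the starting SP are all hypothetical. It relies only on the fact that an ARM64 ADD/SUB immediate is a 12-bit value, optionally shifted left by 12 bits, so a larger adjustment needs either two ADDs or a constant built in a register first.

```go
package main

import "fmt"

// fitsSingleAdd reports whether n can be encoded in one ARM64 ADD/SUB
// (immediate): a 12-bit value, optionally shifted left by 12 bits.
func fitsSingleAdd(n uint32) bool {
	return n&^0xfff == 0 || n&^0xfff000 == 0
}

// splitAdd breaks an adjustment of up to 24 bits into two parts that are
// each encodable as a single ADD immediate.
func splitAdd(n uint32) (hi, lo uint32) {
	return n & 0xfff000, n & 0xfff
}

// spStates returns every value SP holds while the given adjustments are
// applied one instruction at a time, starting from sp.
func spStates(sp uint64, steps ...uint32) []uint64 {
	states := []uint64{sp}
	for _, s := range steps {
		sp += uint64(s)
		states = append(states, sp)
	}
	return states
}

func main() {
	const frame = 0x12345   // hypothetical frame size, too big for one ADD
	const sp = 0x4000_0000  // hypothetical SP inside the function body

	hi, lo := splitAdd(frame)
	fmt.Println("fits in one ADD:", fitsSingleAdd(frame))

	// Two ADDs: SP passes through an intermediate value that matches
	// neither the in-function frame nor the caller's frame.
	fmt.Printf("two ADDs: %#x\n", spStates(sp, hi, lo))

	// Constant built in a temp register, then one ADD: SP moves once.
	fmt.Printf("one ADD:  %#x\n", spStates(sp, frame))
}
```

The middle entry of the two-ADD sequence is the state the thread is worried about: if a preemption signal arrives there, an unwinder that assumes SP is either fully adjusted or not adjusted at all will compute the wrong frame.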

Go’s runtime / tooling design choices

  • Some criticize Go’s “NIH” tendencies (custom assembler, custom linker, signal-based preemption, rewriting the PC inside signal handlers) as fragile, arguing they invite subtle bugs.
  • Others defend these as standard for serious language runtimes (e.g., HotSpot uses signals too) and argue complex invariants are unavoidable for async GC and M:N scheduling.
  • A recurring theme: Go’s runtime has strong hidden invariants (like “SP always valid”) that aren’t systematically verified; commenters suggest more explicit documentation, tests that inspect generated machine code, and perhaps formal methods or certified toolchains for critical pieces.

Debugging experience and rarity of such bugs

  • Many praise the write-up’s clarity and narrative, saying it showcases disciplined, high-level debugging skill.
  • Several note how hard it is to even suspect the compiler; most developers assume their own code is wrong, so these bugs are disproportionately time‑consuming.
  • There’s a split between people who find this kind of deep, racy compiler/runtime bug “fun” and those who find it hellish and satisfying only in hindsight.
  • Anecdotes: earlier eras saw more compiler bugs; today they’re rarer but still show up in domains that push compilers hard (HFT, low-level systems code).

Cloudflare engineering, scale, and infrastructure

  • Commenters admire Cloudflare’s culture of “no unexplained crashes,” noting this policy comes from past incidents and justifies spending serious time on rare bugs.
  • The post reinforces a perception of Cloudflare as doing unusually deep, non-“ML buzz” engineering, prompting multiple readers to consider applying.
  • There’s discussion of remote vs location requirements and compensation, with mixed experiences reported.
  • On infrastructure, people note Cloudflare’s long-running ARM experiments (Ampere Altra) alongside EPYC, especially at the edge; others point out Cloudflare uses both Go and Rust and is far from single-language.