I helped fix sleep-wake hangs on Linux with AMD GPUs

User Experiences with AMD Sleep/Wake

  • Many commenters report that AMD’s Linux graphics stack is generally good, but sleep/wake has been the main recurring pain point, especially with desktop dGPUs and some laptop setups.
  • Aorus/X570/B550 motherboards and certain NVMe or USB‑C/Thunderbolt devices are repeatedly cited as problematic: machines freeze on wake, instantly re‑wake, or never fully reach sleep.
  • Various udev rules are shared to disable PCIe/USB wake sources (setting power/wakeup to "disabled" on specific buses or devices), with mixed success; some users only fixed their issues by removing flaky PCIe cards.
  • Some users on AMD laptops (including ThinkPads and handhelds with 7840U/8840U) report nearly flawless S0ix/suspend with only small tweaks to wakeup sources.

Suspend Reliability Across OSes

  • Multiple commenters note that suspend/hibernate is fragile not just on Linux but also on Windows (especially with Modern Standby, i.e. S0 low-power idle) and macOS; stories of laptops cooking in bags or draining batteries overnight are common.
  • Opinion is divided: some claim Windows is mostly fine and Linux much worse; others point to notorious Windows “Modern Standby” failures and say Linux on business ThinkPads or Linux‑focused vendors works comparably or better.
  • Apple is praised for generally good sleep behavior, but several people describe occasional or recurring failures even on MacBooks, including on ARM.

Workarounds, Debugging, and Tooling

  • Techniques discussed:
    • Using /proc/acpi/wakeup and /sys/.../power/wakeup to identify and disable spurious wake devices.
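    • Custom systemd units versus udev rules for applying wakeup settings at boot; the importance of Type=oneshot and RemainAfterExit so that a run-once unit still counts as active after its command exits.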
    • Serial consoles, systemd’s debug shell, and decompiling kernel modules to trace crashes.
    • Memtest to rule out bad RAM for GPU‑related black screens; powercfg /lastwake on Windows.
  • Some users give up on reliable sleep and instead script full session restoration (tmux resurrect, window manager layout scripts).
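
The first technique above can be sketched in Python. This is a minimal illustration, not a tool from the thread: the names WakeDevice, parse_acpi_wakeup, and enabled_wake_sources are hypothetical, and SAMPLE is made-up text in the usual /proc/acpi/wakeup layout (device, S-state, status, optional sysfs node). On a real machine the text would come from reading /proc/acpi/wakeup.

```python
from dataclasses import dataclass


@dataclass
class WakeDevice:
    name: str         # ACPI device name, e.g. "XHC0"
    sleep_state: str  # deepest sleep state the device can wake from
    enabled: bool     # True if the device may currently wake the system
    sysfs_node: str   # e.g. "pci:0000:0a:00.3", or "" if none is listed


def parse_acpi_wakeup(text):
    """Parse the contents of /proc/acpi/wakeup into WakeDevice records."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        status = fields[2].lstrip("*")
        if status not in ("enabled", "disabled"):
            continue  # header line or unexpected format
        node = fields[3] if len(fields) > 3 else ""
        devices.append(WakeDevice(fields[0], fields[1], status == "enabled", node))
    return devices


def enabled_wake_sources(text):
    """Return only the devices that are currently allowed to wake the system."""
    return [d for d in parse_acpi_wakeup(text) if d.enabled]


# Illustrative sample only; real output varies by machine.
SAMPLE = """\
Device  S-state   Status   Sysfs node
GPP0      S4    *enabled   pci:0000:00:01.1
GPP8      S4    *disabled
XHC0      S4    *enabled   pci:0000:0a:00.3
"""

for dev in enabled_wake_sources(SAMPLE):
    print(dev.name, dev.sleep_state, dev.sysfs_node)
```

Listing the enabled entries is only the diagnostic half. Disabling one is typically done as root, either by writing the device name to /proc/acpi/wakeup (which toggles it) or by writing "disabled" to the device's power/wakeup attribute in sysfs. Neither setting survives a reboot, which is why the thread discusses udev rules and oneshot systemd units.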

Root Cause and VRAM Handling

  • A concise summary is given: during suspend, GPU VRAM contents must be saved into system RAM; previously this could happen after swap had been disabled, so the VRAM contents plus existing RAM usage could exceed available memory with nowhere to page out to, hanging the system.
  • The fix in the article hooks GPU VRAM eviction earlier in the suspend path, so it runs before swap and the other relevant memory subsystems are shut down.
  • An earlier user-space workaround, memreserver, pre-allocated and mlock’d RAM to guarantee space for VRAM, at the cost of potentially huge reservations.
  • Discussion touches on Linux overcommit and the OOM killer: many see current OOM behavior as fundamentally brittle, with cgroups/zram viewed as mitigations, not real fixes.
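
The arithmetic behind the hang can be sketched as a rough check. This is a simplification under stated assumptions, not the kernel's actual accounting: the function names are hypothetical, the numbers are invented, and MemAvailable is itself only an estimate. On an amdgpu system the VRAM figure could be read from the driver's mem_info_vram_used sysfs attribute (for example under /sys/class/drm/card0/device/, where the card index varies), and the memory figures from /proc/meminfo.

```python
GIB = 1024 ** 3


def parse_meminfo_kib(text, key):
    """Return the value of one /proc/meminfo field, in KiB."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == key:
            return int(rest.split()[0])
    raise KeyError(key)


def vram_fits(vram_used_bytes, mem_available_kib, swap_free_kib=0):
    """Rough check: can the VRAM contents be evicted into RAM (plus swap)?"""
    room_bytes = (mem_available_kib + swap_free_kib) * 1024
    return vram_used_bytes <= room_bytes


# Made-up figures for illustration: 16 GiB available RAM, 32 GiB free swap.
MEMINFO_SAMPLE = """\
MemTotal:       32768000 kB
MemAvailable:   16777216 kB
SwapFree:       33554432 kB
"""

vram_used = 20 * GIB  # hypothetical amount of VRAM in use at suspend time
available = parse_meminfo_kib(MEMINFO_SAMPLE, "MemAvailable")
swap_free = parse_meminfo_kib(MEMINFO_SAMPLE, "SwapFree")

# Eviction while swap is still usable: other pages can be pushed out to swap.
print("fits with swap:", vram_fits(vram_used, available, swap_free))

# Eviction after swap has been disabled: only RAM is left to absorb the VRAM.
print("fits without swap:", vram_fits(vram_used, available))
```

With these figures the same eviction succeeds while swap is usable and fails once it is gone, which is the ordering problem described above. memreserver sidestepped that by locking enough RAM in advance.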

AMD vs Nvidia vs Intel on Linux

  • Views are split: some say AMD GPUs on Linux are “a nightmare” and recommend avoiding them; others say the opposite, that AMD and Intel iGPUs are the safest, while Nvidia causes more crashes, idle power issues, and Wayland problems.
  • Nvidia users report their own suspend/black‑screen bugs; suggestions include enabling/disabling nvidia-suspend services and trying older driver branches.
  • Intel Arc is mentioned as also hitting similar suspend/VRAM problems, suggesting this class of bug is not AMD‑exclusive.

Broader Reflections and Future Work

  • Many see suspend/hibernate as intrinsically hard because it crosses kernel, drivers, firmware, init system, graphics stack, and desktop environment, all developed somewhat independently.
  • There’s a call for better automated diagnostics, e.g. a “memtest for S3/S0” and more standardized tools akin to Microsoft’s sleep diagnostics.
  • Alibaba’s proposed refinements to AMD suspend/resume state machines are linked as further work aimed at systematic fixes rather than case‑by‑case patches.