Sudden Hard Hangs

This machine ran fine for months. Now I’m getting complete hard hangs requiring power-button reset. Display freezes, no keyboard, no SSH, no magic SysRq. Happens during routine desktop activity (browsing, terminal, container builds — no obvious common trigger). No video playback or gaming.

System

  • CPU: AMD Ryzen 9 5900X
  • GPU: AMD Radeon RX 9070 XT — Navi 48 / RDNA 4, PCI ID 1002:7550
  • OS: Fedora 43, btrfs root
  • Mesa: 25.3.6-3.fc43
  • linux-firmware: 20260410-1.fc43

Crash signature

  • Journal ends abruptly mid-stream. No kernel panic, oops, MCE, OOM, thermal event, soft-lockup warning, GPU reset attempt, or pstore artifact.
  • kdump was set up with reserved crashkernel; /var/crash empty after each hang — the lockup appears to block the kexec path.
  • No amdgpu warnings or ring timeouts in the seconds leading up to the freeze. Last messages are unrelated userspace (e.g., a Signal notification, a podman push).

Timeline

  • 2026-04-28 — dnf transaction 98: kernel 6.19.11 → 6.19.14, plus ~110 packages including mesa and linux-firmware bumps.
  • 2026-05-08 — first boot into the new stack.
  • Following days — first hang after ~3 days, then ~1 day.
  • 2026-05-11 — tried kernel 7.0.4 hoping newer driver would help RDNA 4. Hung twice within a minute. Reverted.
  • 2026-05-12 — switched to 6.19.13. Hung after 2 hours. Currently booted again with netconsole running.

Anyone else run into this? Any ideas on how to troubleshoot?

Random crashes are often hardware failures. Memory is a prime suspect. Run the standalone memtest86+ overnight for several nights.

If your vendor provides hardware test software, run that.

Modern manufacturing is highly consistent, so many hardware issues are seen across a given model and will be visible in online forums.

I ran memtest overnight for 13 hours and it passed. It’s a custom-built desktop from a few years ago, so I don’t think I have any vendor software to run.

I will run some stress tests tonight to test GPU and CPU. There was a BIOS update I did about a month ago (it was released last September).

The amdgpu firmware was bumped to 20260410-1.fc43 on 2026-04-23, and the first hang was two weeks later.

Thanks for your help. I’m just kind of at a loss without a smoking gun.

I’m seeing something similar on Fedora 44 Kinoite.

Upgraded to 44.20260511 and had 3 crashes in 2 hours.

Reverted to 44.20260506.0 and it ran correctly for two days.

Updated to 44.20260513.2 this afternoon and it crashed after 2 hours.

CPU: Intel Xeon w3-2423
GPU: nvidia tu117glm

It just crashed again while I was writing this message, with one monitor showing garbage graphics, like a bad duplicate of the other monitor.

BIOS diagnostics passed.

Reverting to 44.20260506.0 right now.

Last error in journalctl -b -1 -e

Kernel: nouveau 0000:47:00.0: gsp: mmu fault queued
Kernel: nouveau 0000:47:00.0: gsp: rc engn:00000001 chid:19 gfid:0 level:2 type:31 scope:1 part:323 fault_addr:0000003ff1400000 fault_type:00000000
Kernel: nouveau fifo:000000:0013:0013:[kwin_wayland[2205]] errors - disabling channel
Kernel: nouveau 0000:47:00.0: kwin_wayland[2205]: channel 19 killed!

(20 seconds later)

Systemd-logind[1467]: Power key pressed short.

Same message in journalctl -b -2 -e (only fault_addr and the kwin_wayland PID differ).

Please start a new thread. Your hardware appears to be different, which complicates efforts to understand issues. Please provide hardware details (posting the output from running inxi -Fzxx in a terminal as pre-formatted web-discoverable text is often effective at reaching others with similar hardware who can provide a solution).

Yes. Before retiring, I worked with colleagues at large institutions. IT groups often had collections of misbehaving systems set aside for troubleshooting as time allowed. That allowed them to swap power supplies, cables, mass storage devices, and system boards to see if the problem moves to the 2nd machine.

If you have a way to log in from another system with ssh, you can run journalctl --follow in the hope that some error messages appear before the problem system fully crashes. I often use Termius on an iPad for ssh access to Linux boxes.
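If you would rather keep a copy on the second machine than watch a terminal, something like the following could work. It is only a sketch: the host string and the log path are placeholders, and it assumes key-based ssh login and a user allowed to read the journal on the desktop.

```python
import shlex
import subprocess
import sys
import time


def build_command(host: str, kernel_only: bool = False) -> list[str]:
    """Build an ssh command line that follows the remote journal."""
    remote = ["journalctl", "--follow", "--output=short-precise"]
    if kernel_only:
        remote.append("--dmesg")  # restrict output to kernel messages
    return [
        "ssh",
        "-o", "ServerAliveInterval=5",  # notice a dead connection quickly
        host,
        shlex.join(remote),
    ]


def stream_to_file(host: str, path: str) -> None:
    """Append each remote journal line to a local file, flushing every line."""
    cmd = build_command(host)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(path, "a", encoding="utf-8") as log:
        for line in proc.stdout:
            # Prefix the local receive time so the moment of the freeze is visible.
            log.write(f"{time.strftime('%H:%M:%S')} {line}")
            log.flush()
    print("ssh session ended at", time.strftime("%H:%M:%S"), file=sys.stderr)


# Example (placeholder host and path):
# stream_to_file("user@desktop", "desktop-journal.log")
```

Flushing after every line means whatever reached the laptop before the freeze is already on disk there, and the time the ssh session dies gives a rough timestamp for the hang even when nothing useful was logged.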

You can also try just removing all but the very minimal optional hardware, even down to a “headless” configuration accessed by ssh.

I will make sure my laptop is running that all day. Thanks. I have had netconsole running as well this week, but no messages have made it out.
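For anyone wanting to check the receiving side of netconsole, here is a minimal listener sketch. netconsole sends plain UDP datagrams; port 6666 is assumed here because it is the usual default target port, so adjust it to whatever the netconsole configuration on the desktop actually uses. The function names are made up for this example.

```python
import socket
import time


def open_listener(port: int = 6666, host: str = "0.0.0.0") -> socket.socket:
    """Bind a UDP socket on the port netconsole is configured to send to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_lines(sock, max_packets=None, timeout=None):
    """Yield (local_time, sender_ip, text) for each datagram received."""
    sock.settimeout(timeout)
    count = 0
    while max_packets is None or count < max_packets:
        try:
            data, (sender, _port) = sock.recvfrom(65535)
        except socket.timeout:
            return  # nothing arrived within the timeout
        count += 1
        text = data.decode("utf-8", "replace").rstrip("\n")
        yield time.strftime("%Y-%m-%d %H:%M:%S"), sender, text


# Example: print everything that arrives until interrupted.
# for stamp, sender, text in receive_lines(open_listener()):
#     print(stamp, sender, text, flush=True)
```

If nothing ever shows up, it may be worth confirming the path works at all before the next hang, for example by writing a test line to /dev/kmsg as root on the desktop and checking that it arrives (whether it is forwarded can depend on the console log level).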

Update:

Had another crash yesterday, but this time there were errors in the journal:

May 14 18:06:43 jay-desktop kernel: mce: [Hardware Error]: Machine check events logged
May 14 18:06:43 jay-desktop kernel: mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000001000108
May 14 18:06:43 jay-desktop kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffffb2236db2 MISC d01a000000000000 SYND 4d000000 IPID 500b000000000
May 14 18:06:43 jay-desktop kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1778796400 SOCKET 0 APIC 16 microcode a201030
May 14 18:06:49 jay-desktop kernel: MCE: In-kernel MCE decoding enabled.

Not all hangs had this error.

Arch Linux often has excellent documentation. https://wiki.archlinux.org/title/Machine-check_exception isn’t specific to Arch. You can install rasdaemon with dnf5:

% sudo dnf5 info rasdaemon
[sudo: authenticate] Password: 
Updating and loading repositories:
Repositories loaded.
Available packages
Name           : rasdaemon
Epoch          : 0
Version        : 0.8.0
Release        : 9.fc44
Architecture   : x86_64
Download size  : 89.6 KiB
Installed size : 267.8 KiB
Source         : rasdaemon-0.8.0-9.fc44.src.rpm
Repository     : fedora
Summary        : Utility to receive RAS error tracings
URL            : http://git.infradead.org/users/mchehab/rasdaemon.git
License        : GPL-2.0-only
Description    : rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool.
               : It currently records memory errors, using the EDAC tracing events.
               : EDAC is drivers in the Linux kernel that handle detection of ECC errors
               : from memory controllers for most chipsets on i386 and x86_64 architectures.
               : EDAC drivers for other architectures like arm also exists.
               : This userspace component consists of an init script which makes sure
               : EDAC drivers and DIMM labels are loaded at system startup, as well as
               : an utility for reporting current error counts from the EDAC sysfs files.
Vendor         : Fedora Project

The Arch Linux article has a link to https://en.wikipedia.org/wiki/Machine-check_exception which includes some troubleshooting suggestions.
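As a rough aid for reading the MCE lines posted earlier, the status word and the TIME field can be partly decoded by hand. The sketch below only covers the architectural flag bits (63 down to 57) that are common to x86 machine-check status registers, plus the low 16-bit error code; the bank number and the remaining bits are vendor-specific and are deliberately left alone here. The function names are invented for this example.

```python
from datetime import datetime, timezone

# Architectural machine-check status flag bits (common across x86 vendors).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status: int) -> dict:
    """Return the set flag names and the low 16-bit MCA error code."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


def mce_time(epoch_seconds: int) -> str:
    """Convert the TIME field (seconds since the Unix epoch) to UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


info = decode_status(0xBEA0000001000108)  # status word from the log
print(info["flags"], hex(info["error_code"]))
print(mce_time(1778796400))               # TIME value from the log
```

For the values in the log this reports VAL, UC, EN, MISCV, ADDRV and PCC set, with error code 0x108, so the event was an uncorrected error with processor context marked corrupt rather than a corrected one. The TIME field works out to 2026-05-14T22:06:40 UTC, which would line up with the journal's 18:06:43 if the machine's local time is UTC-4 (an assumption on my part).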