Hard Freeze, Usually When Idle. Possibly AMD related

Virtually all of the March deployments have some sort of bug that causes my system to hard freeze, sometimes within 15 minutes of being left idle and nearly always after being left on overnight. The screen freezes with the clock still showing, so I can see when it happened, but nothing at all is logged (checked with journalctl -b -1 -k and journalctl -b -1). The only way to recover is to hold the power button until it shuts down.

Suspend is also disabled in power management settings because I was running into issues with the CPU intermittently overheating on resume before the fans could kick on.

Is anyone else running into this? All of the March builds seem to have the problem, so I’ve been deploying various commits going back into February; whether the February builds are affected is still TBD. Is there some way to get the logs before the system goes completely dead?

System Specs:

AMD Ryzen 7 5800X
AMD Radeon 6600XT
32GB RAM
Aorus x570 Elite Wifi (latest firmware)
Kinoite 41.20250331.0 → 41.20250221.0
KDE Plasma 6

Thanks!

When it freezes, are you able to switch to a different tty (ctrl+alt+F3)?

Nope, it’s completely toast. No tty switching, sysrq, SSH, etc.
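
Since nothing survives in the journal, one thing that might be worth trying is mirroring the kernel log to a plain file and forcing every line to disk as it arrives. As far as I know, journald only syncs lower-priority messages periodically, so the last few lines before a hard lock may never be written. This is only a rough sketch: `append_durably` and `mirror_kernel_log` are names I made up, the output path is whatever you choose, and it may still miss the final messages if the kernel locks up before userspace gets to run.

```python
import os
import subprocess


def append_durably(path, text):
    """Append text to path and force it to stable storage before returning."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to write it to the disk now


def mirror_kernel_log(path):
    """Follow the kernel log and durably copy each line to path.

    Uses `journalctl -k -f -o short-iso` (kernel messages, follow mode,
    ISO timestamps). Runs until interrupted.
    """
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-o", "short-iso"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in proc.stdout:
        append_durably(path, line)
```

Calling `mirror_kernel_log("/var/tmp/kmsg-mirror.log")` (path is just an example) in a terminal or a user service and checking the tail of that file after the next forced reboot would show whether anything was emitted right before the hang.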

I am having the same issue, also on AMD. Did you find a fix?

EDIT: I found a (horrible) workaround… a mouse jiggler keeps it from doing whatever it does when left idle, and it doesn’t lock up.


(sorry for the re-post, hit the wrong reply button and nested this under someone else’s post)

I’m having the exact same issue; hard freeze, system does not respond to ping, no ssh, keyboard/mouse totally non-responsive. Nothing in the logs to indicate a failure. memtest shows no errors. This machine does double duty as a workstation and server, but none of the containers, VMs or databases are throwing any errors, nor can I correlate any log event times to hangs.

System is fine when it’s used interactively, but hangs when left idle. I have disabled all ACPI features in the BIOS, disabled power management features in KDE and even tried disabling the lock screen.

I currently have the system on a smart plug connected to Home Assistant with an automation to power cycle the system if it stops responding to pings for 15 minutes, but there is no consistency in the times the automation has run, either by amount of time, time of day, or relative to other events.
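
For anyone who wants to replicate that watchdog without Home Assistant, the decision logic is roughly the following. This is a minimal sketch, not my actual automation: `PingWatchdog`, `host_responds` and `grace_s` are made-up names, the ping helper assumes the Linux iputils `ping`, and what "power cycle" means (toggling a smart plug, etc.) is left to the caller.

```python
import subprocess


def host_responds(host, timeout_s=2):
    """Send one ICMP echo request; return True if the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class PingWatchdog:
    """Tracks probe results and reports when a power cycle is due."""

    def __init__(self, grace_s=15 * 60, start=0.0):
        self.grace_s = grace_s  # how long the host may stay silent
        self.last_ok = start    # time of the last successful probe

    def observe(self, now, responded):
        """Record one probe taken at time `now` (seconds).

        Returns True once the host has been silent for grace_s seconds.
        """
        if responded:
            self.last_ok = now
            return False
        if now - self.last_ok >= self.grace_s:
            # Restart the timer so the machine gets a full grace period
            # to boot before it can be cycled again.
            self.last_ok = now
            return True
        return False
```

In a real loop you would call `observe(time.monotonic(), host_responds("my-host"))` every minute or so (host name is a placeholder) and trigger the plug when it returns True. Logging each trigger time is also how I checked for a pattern in when the hangs occur.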

This issue has come and gone over the last few months, but it’s been particularly persistent the last 2 weeks. I thought I might be having hardware failure but found this forum post with suspiciously similar hardware to mine, so now I’m wondering if it’s a bug…

CPU: AMD Ryzen 9 5900X (SR-IOV enabled for virtualization)
GPU: AMD Radeon 6700 XT
MEM: 4x 32G DDR4-3200 (128G total)
MOBO: MSI MEG X570 ACE motherboard (latest firmware, currently 7C35v1Q)
OS: Fedora 42 with KDE Plasma 6 (currently kernel 6.15.9-201) – Originally installed Fedora 34, but have upgraded in-place every ~6 months to the then-current release.

I’m using wired 2.5GbE. WiFi is shut off but Bluetooth is active and used for my speakers.

Anyone have any ideas? Hopefully this info helps someone track it down. I really don’t want to reinstall this machine if I don’t have to…

It appears to be a known issue with the amdgpu driver. Moving back to kernel 6.14 is a slightly better workaround than a mouse jiggler.

kernel-6.14.0-63.fc42 (listed on Fedora Packages) is the original kernel and is still available on their servers. Reverting to this kernel has stopped the lockups for me. Hopefully 6.16 fixes it, but reports in the link below suggest it may not.

Here’s some information about the bug

And indeed, kernel-6.16.2-200.fc42.x86_64 has been released, and after about 9 hours of idle time it locked up solid. By comparison, 6.14 had run for about 4 days without a lockup before 6.16 was released. 6.14 it is then.

Unfortunately I have not found a fix, other than to be more aggressive about putting the computer to sleep when I’m not using it. That’s avoided a hard freeze for the last couple of months at least.

It’s too bad because these bugs were the reason I switched to Kinoite when they were rendering my computer almost unusable on Arch Linux. I had a month or two of relief and then they seemed to start happening here as well. It feels like a showstopper that is hitting a lot of people yet I’m not sure how much progress is really being made to fix it.

It’s been a couple of days now and the 6.16.4-200 has been perfectly stable for me. I think they’ve fixed the problem.

Glad to hear it. I was about to post that I had to keep my computer on for ~2 days on 6.15.10-200.fc42.x86_64 with no crashes. I will say that the issue has seemed to come and go over the last year or so, so keep an eye out for it I guess.

Well, after weeks (or months) of no issues I had a hard freeze with exactly the same symptoms yesterday. I’m guessing it’s been partially mitigated to the point where it happens far less frequently but maybe there are still some edge cases?

And I got another freeze today, this time after only idling for a few minutes. Hopefully there hasn’t been a regression, but I’m almost ready to buy an Nvidia card because this show-stopping bug has been really persistent over the last ~8 months and it’s frankly getting a little ridiculous.

I also had another freeze this week, and probably 10 since my last comment.

Same, multiple freezes daily again. Back to the 6.14.0-63.fc42 kernel for me…

To everyone experiencing freezes:
if you keep power plugged in, and set power plan to Maximum performance, do freezes remain?

Also, what are your memory and NVMe drive temperatures?

My AMD laptop suffered from freezes, and they all went away after I put quality thermal pads (20 W/m·K) onto both memory modules AND the NVMe drive: the vents did not provide enough cooling and they did overheat.

After thermal pads, not a single freeze for almost a year.
And I update kernel daily as I’m on rawhide.
Thought you might have a look in this direction just in case.
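
If you want a quick way to check those temperatures, here is a rough sketch that reads the kernel's hwmon sysfs files, where `temp*_input` values are reported in millidegrees Celsius. `read_hwmon_temps` is just a name I picked. NVMe drives usually show up as a chip named `nvme`, but whether your memory modules expose a sensor at all depends on the hardware and loaded drivers, so treat missing entries as "not exposed" rather than "fine".

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM_input": degrees_celsius} for every sensor found."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # The "name" file holds the driver/chip name, e.g. nvme or amdgpu.
        chip_name = name_file.read_text().strip() if name_file.exists() else "unknown"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable right now; skip it
            readings[f"{chip.name}:{chip_name}/{sensor.name}"] = millideg / 1000.0
    return readings
```

Printing the returned dictionary while the machine sits idle, and again under load, should be enough to rule thermals in or out.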

The problem most of us are experiencing is not thermal; in my case I’m using a well-ventilated ATX-mid desktop PC. All power management is disabled, including display suspend.

My NVMe drives are 38°C and 40°C. CPU is 48°C at idle, and never climbs above 70°C under max load. Motherboard is also similarly cool, currently clocking 45°C.

The common thread is an AMD graphics card. Reverting to kernel 6.14 seems to fix the freezes, which indicates a software/driver problem. Using an old kernel is not an ideal solution since the new kernels provide security patches and other fixes, and at some point a new version of Fedora will come out without a problem-free kernel to revert to.

In my case, the culprit was actually a failing CPU. Replaced the CPU and the system is stable again.


I think I am also experiencing this. However, unlike the OP, freezes also occur during “normal” usage, i.e., just typing stuff into the terminal.

I am not entirely sure if this is a problem specific to one kernel, since I started noticing these freezes after upgrading to Fedora 43.

It has for sure happened on version kernel-core-6.17.7-300.fc43.x86_64. Booting into kernel-core-6.17.6-300.fc43.x86_64 seems to have fixed the issue.

I am running on a Framework 13 with an AMD Ryzen 7 7840U w/ Radeon 780M Graphics

I had a freeze a little while back while actively using the system, so it does not appear to be only when idle. That said, I haven’t had a freeze for a couple of weeks now (running Linux fedora 6.17.6-200.fc42.x86_64) but that could just be coincidence. The last crash I got was on 6.17.4-200.

Yeah, in my case 6.17.6 runs fine, but as soon as I boot into 6.17.7, my laptop freezes a couple of minutes after booting. Sometimes it outputs no logs, sometimes there is something like this:

logs
Nov 14 08:37:44 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:37:44 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:37:44 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:37:55 fedora-laptop kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* [CRTC:80:crtc-0] hw_done or flip_done timed out
Nov 14 08:37:55 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:80:crtc-0] flip_done timed out

If it helps, I usually also connect two monitors via USB-C.
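
To make excerpts like the one above easier to compare between boots or kernels, a small script can tally the amdgpu error lines by type and note when each type first appeared. This is only a sketch: `tally_amdgpu_errors` and the category names are my own, and the pattern assumes the default short journal format shown in the paste.

```python
import re
from collections import Counter

# Matches short-format journal lines such as:
#   Nov 14 08:37:44 host kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* <message>
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"amdgpu (?P<pci>[0-9a-f:.]+): .*\*ERROR\* (?P<msg>.*)$"
)


def tally_amdgpu_errors(lines):
    """Count amdgpu *ERROR* lines by category.

    Returns (counts, first_seen): a Counter of categories and a dict
    mapping each category to the timestamp of its first occurrence.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an amdgpu error line
        msg = match.group("msg")
        if "DMCUB error" in msg:
            key = "dmcub_error"
        elif "flip_done timed out" in msg:
            key = "flip_done_timeout"
        else:
            key = "other"
        counts[key] += 1
        first_seen.setdefault(key, match.group("stamp"))
    return counts, first_seen
```

Fed the five lines above, it reports three DMCUB errors starting at 08:37:44 and two flip_done timeouts starting at 08:37:55, i.e. the display firmware errors come about 11 seconds before the page flips start timing out.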

edit: scratch that, it happened on 6.17.6 too:

logs
Nov 14 08:55:23 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:23 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:23 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:33 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* [CRTC:80:crtc-0] flip_done timed out
Nov 14 08:55:33 fedora-laptop kernel: amdgpu 0000:c1:00.0: amdgpu: [drm] *ERROR* [CRTC:80:crtc-0] hw_done or flip_done timed out
Nov 14 08:55:34 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:34 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:34 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:34 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:35 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:35 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:35 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:35 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:35 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:36 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:36 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:36 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:36 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:37 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:37 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:37 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:37 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:38 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:38 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:38 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:38 fedora-laptop kernel: amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Nov 14 08:55:59 fedora-laptop kernel: rfkill: input handler enabled
Nov 14 08:55:59 fedora-laptop kernel: fbcon: Taking over console
Nov 14 08:55:59 fedora-laptop kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Nov 14 08:55:59 fedora-laptop kernel: #PF: supervisor read access in kernel mode
Nov 14 08:55:59 fedora-laptop kernel: #PF: error_code(0x0000) - not-present page