I have a problem I’ve been struggling with for a few days, and I’m not sure where to go from here.
Problem:
At random times my desktop will experience what appears to be a hard-lock. I lose the ability to SSH into the system, access TTY, etc. This forces a hard reset. Sometimes I can work for hours on end, maybe 8+ before I get a freeze, sometimes it’s 15 minutes; there doesn’t seem to be a link between the freeze and a specific application I’m using.
Things I’ve checked:
NVMe SSD errors / corruption. All tests have passed.
RAM. I ran a stress test overnight with memtest86+, all clean.
Stress tested my CPU / GPU with OCCT, no errors found.
I’ve reinstalled KDE (Fedora 43) and tested to see if perhaps a theme or a configuration I’m using is the problem; it still froze.
I’ve upgraded to kernel 6.18 and Mesa 26 to see if that would clear the issue; it did not. I’ve also downgraded to previous kernels like 6.17.4, as well as earlier versions of Mesa like 25.2.4.
Thermals & Hardware. No overheating, no dust build up, no scratches on the PCB of the motherboard. RAM is seated correctly, as is the GPU. Thermal paste applied properly. Standoffs are good as well.
There are only 70 or so lines in your journalctl log. Looks like you only copy/pasted the first page of output.
What is this, and can it be turned off just to rule it out?
Dec 26 10:06:25 fedora kernel: amdgpu: Overdrive is enabled, please disable it before reporting any bugs unrelated to overdrive.
Next time it happens (which won’t be long from your description!), see if the kernel is still responsive to a REISUB sequence to encourage anything still in buffers to be flushed to disk, just in case there’s something crucial sitting there telling you precisely why the detritus is hurtling towards the fan, but never actually written to disk.
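One thing worth checking before relying on REISUB: the kernel only honours the SysRq functions allowed by the bitmask in /proc/sys/kernel/sysrq, and some distributions ship a restricted value (I believe Fedora's default allows sync only, but treat that as an assumption and read the value on your own machine). Below is a rough sketch of a check; the function name sysrq_allows_reisub is made up for illustration, and the bit values are the ones listed in the kernel's SysRq documentation.

```shell
# Hypothetical helper: does a given kernel.sysrq value permit a full REISUB?
# $1 = the number read from /proc/sys/kernel/sysrq
# Bits needed: 4 (keyboard control, R), 64 (signalling, E and I),
#              16 (sync, S), 32 (remount read-only, U), 128 (reboot, B).
sysrq_allows_reisub() {
    mask=$1
    need=$((4 + 16 + 32 + 64 + 128))
    # A value of 1 means "all functions enabled".
    [ "$mask" -eq 1 ] || [ $((mask & need)) -eq "$need" ]
}
```

You could call it as sysrq_allows_reisub "$(cat /proc/sys/kernel/sysrq)" && echo "REISUB allowed". If it is not allowed, sudo sysctl kernel.sysrq=1 should enable every function until the next reboot. If the machine is truly hard-locked none of this will help, but it rules out the key sequence simply being ignored.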
Unfortunately, REISUB was not successful. Everything is locked down once this freeze happens, and I time stamped it a couple of minutes ago at 17:38.
I’m unsure why amdgpu overdrive was enabled, as it’s not enabled in my BIOS. I did use LACT for monitoring at one point but I checked that program too and it was disabled. I’m not using it, so I uninstalled it. I’ve confirmed overdrive is now disabled (and I still froze afterwards, at the time noted above).
As for the 70 lines of output, I just grabbed what was displayed in the terminal, which was those 70 lines.
Since I’ve got a time stamp, what commands can I run that will provide you with the most information?
Paste the last 100 or so lines of the journal from a boot which crashed - let’s see what the very last things happening were before everything ground to a halt.
That should be small enough to paste directly so you won’t have to use fpaste or a pastebin. Feel free to use the --no-hostname flag to trim out the machine name as we don’t need it, to make it even smaller.
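Something along these lines should do it. This is only a sketch: the function name crashed_boot_tail is made up, and it assumes the crash happened in the boot immediately before the current one (offset -1). Adjust the offset if you have rebooted more than once since.

```shell
# Hypothetical helper: print the tail of the journal from an earlier boot.
# $1 = boot offset (defaults to -1, the boot before the current one)
# $2 = number of lines to show (defaults to 100)
crashed_boot_tail() {
    journalctl -b "${1:--1}" -n "${2:-100}" --no-hostname --no-pager
}
```

Running crashed_boot_tail on its own gives the last 100 lines of the previous boot; crashed_boot_tail -2 200 would give the last 200 lines of the boot before that.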
I fear that there won’t be much of use in there, as so often in these cases the very data you need is never flushed to disk. To attempt to counteract this, you could have a journalctl -f running in a terminal so you can see if there’s anything written to screen which is not flushed to disk. Might give us a clue… probably won’t!
Since the journal contains massive amounts of detail, it is best if you can use “filters” and timestamps to focus on the records that may explain the problem. Since you have a time, you can view the journal for the corresponding boot using -b -N, where N is the count back to the boot in question, or using the boot ID that corresponds to the timestamp (use journalctl --list-boots to get the ID). Read man journalctl for explanations of the options. For posting here, using --no-hostname --no-pager in addition to the filter options will give you complete lines (we don’t care about your hostname).
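As a rough illustration of combining those options (the function names list_boots and boot_window are invented for this example, and the time window shown afterwards is only a guess built around the 17:38 freeze mentioned earlier):

```shell
# Show every boot the journal knows about, with its offset and boot ID.
list_boots() {
    journalctl --list-boots --no-pager
}

# Hypothetical helper: print records from one boot inside a time window.
# $1 = boot offset (e.g. -1) or boot ID taken from the list above
# $2 = start of the window, $3 = end of the window
boot_window() {
    journalctl -b "$1" --since "$2" --until "$3" --no-hostname --no-pager
}
```

For example, boot_window -1 "17:30" "17:39" would cover the minutes leading up to the freeze. A bare time like that is taken to mean today, so if the crash was on an earlier day, write the date out in full as "YYYY-MM-DD HH:MM".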