Since I updated from Fedora 29 to 30, I keep having random crashes at a rate of about one every 2-6 hours. I might be working on something, then the cursor freezes, but typically audio and video continue for a few more seconds if there is something playing. Then the entire computer shuts down and hangs on a black screen. I then need to power down the computer manually (holding the power button) before being able to boot again.
The system logs do not have anything useful in them, just typical log messages, no errors from the time of the crash.
I tried the following:
Lowered my memory speed, as I was overclocking RAM/CPU. Now I’m back to stock speeds (even the OC was stable on Windows)
Ran a memtest to verify my RAM is good. It ran over 24 hours, no errors.
Re-installed Fedora 30 from scratch.
Used nouveau instead of Nvidia’s proprietary driver.
I am using the KDE Plasma desktop version of Fedora.
Can anyone help me diagnose why this is happening?
My hardware is:
Mobo: ASUS PRIME X370-PRO
CPU: Ryzen R7 1700X @ stock speeds
GPU: NVIDIA GeForce GTX 1070 Ti @ stock speeds
RAM: 16GB DDR4 2400MHz
Storage: Samsung 970 EVO NVMe SSD
PSU: 600W
Again, the system is totally stable when running Windows for many hours, but crashes or hangs at least once a day on Fedora 30, which it did not do on Fedora 29. I’d like to find out why so I can hopefully solve the issue, or help solve it.
I would check your boot drive’s SMART data for errors. I use either the disk utility in Fedora or my favorite, GSmartControl. If the system becomes too unstable, these can be run from a thumb drive install or a live disk.
GParted live disk: GParted -- Live CD/USB/PXE/HD
If the drive checks out fine, the next thing I would check is the power supply. That takes either a power supply test unit or a swap to see whether the system still crashes with a different power supply. In my opinion, system instability tends to come from the hardware side more often than from software, but it’s a crapshoot nonetheless. Good luck.
Currently I’m using nouveau as it comes automatically configured after installing Fedora with the GPU installed. I don’t think it matters because with either driver, I’m still seeing crashes. As a last resort, I could try that.
I’m on 5.2.13-200.fc30.x86_64 currently. I have been on 5.1.xx for some time but upgrading did not help. Do you advise me to downgrade to 5.0?
smartctl -A /dev/nvme0n1
smartctl 7.0 2019-03-31 r4903 [x86_64-linux-5.2.13-200.fc30.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 11,014,547 [5.63 TB]
Data Units Written: 8,044,159 [4.11 TB]
Host Read Commands: 307,154,478
Host Write Commands: 181,758,668
Controller Busy Time: 755
Power Cycles: 510
Power On Hours: 1,675
Unsafe Shutdowns: 144
Media and Data Integrity Errors: 0
Error Information Log Entries: 390
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 45 Celsius
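For anyone who would rather not read output like the above line by line, here is a minimal Python sketch that pulls the counters into a dict and checks the ones that matter here. The names `parse_smart`, `drive_flags`, `unsafe_shutdown_ratio` and `MUST_BE_ZERO` are made up for this example, and the choice of which counters must be zero is my own assumption, not an official threshold list. The sample values are copied from the output above.

```python
import re

# Counters where any nonzero value would point at the drive itself.
# This selection is an assumption made for this sketch.
MUST_BE_ZERO = ("Critical Warning", "Media and Data Integrity Errors")

# A few lines copied from the smartctl output in this post.
SAMPLE = """\
Critical Warning: 0x00
Percentage Used: 0%
Data Units Written: 8,044,159 [4.11 TB]
Power Cycles: 510
Unsafe Shutdowns: 144
Media and Data Integrity Errors: 0
Error Information Log Entries: 390
"""


def parse_smart(text):
    """Turn 'Name: value' lines into a dict mapping names to integers."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # banner lines without a colon are skipped
        # Keep only the leading number, e.g. "8,044,159" or "0x00".
        match = re.match(r"0x[0-9a-fA-F]+|[\d,]+", value.strip())
        if not match:
            continue
        token = match.group(0)
        if token.startswith("0x"):
            fields[name.strip()] = int(token, 16)
        else:
            fields[name.strip()] = int(token.replace(",", ""))
    return fields


def drive_flags(fields):
    """Return the names of health counters that are unexpectedly nonzero."""
    return [name for name in MUST_BE_ZERO if fields.get(name, 0) != 0]


def unsafe_shutdown_ratio(fields):
    """Fraction of power cycles that ended in an unsafe shutdown."""
    return fields["Unsafe Shutdowns"] / fields["Power Cycles"]


if __name__ == "__main__":
    fields = parse_smart(SAMPLE)
    print("flagged counters:", drive_flags(fields))
    print("unsafe shutdown ratio: %.2f" % unsafe_shutdown_ratio(fields))
```

With these numbers, neither health counter is flagged, and roughly 28% of power cycles ended in an unsafe shutdown, which fits the forced power-offs after each freeze rather than a failing drive.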
I’m not sure about the percentage used number. If I use KDE Partition Manager, it shows me this:
So nothing seems to be critically limited in space either.
I would tend to agree with this, except that the system is rock solid under Windows, even when pushing it to its limits with stress tests and whatnot. The system used to be overclocked on both CPU and RAM and did fine. Now I’m back to stock clocks and it still crashes on Fedora, but not on Windows.
The crashes started from the moment I upgraded to Fedora 30, which makes me very suspicious that it might be software this time.
I unfortunately do not have a spare power supply to try out at this moment, nor a test unit.
I have since changed the power profile in my BIOS from ‘balanced’ to ‘performance’, which I believe makes the system avoid the lowest power states and generally raises clock speeds. Since this change, I have not experienced any freezes.
So some software or hardware component seems to have issues with low power states (perhaps unstable voltages at low power draw?).
I’ll consider this solved for now, even though the root cause is still unknown.
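If someone wants to probe the low-power-state suspicion further, one option is to look at how often the kernel actually enters each CPU idle state. Below is a minimal Python sketch that reads the Linux cpuidle sysfs interface. The function name `read_idle_states` is made up for this example, and I am assuming the idle states are exposed under `/sys/devices/system/cpu/cpu0/cpuidle`. Which states show up there depends on the hardware and the idle driver. It will not identify the root cause by itself, only show whether the deeper states are being used.

```python
from pathlib import Path


def read_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, usage, time_us) for each idle state found under cpu_dir.

    usage is how many times the state was requested; time_us is the total
    time spent in it, in microseconds, as reported by the kernel.
    """
    base = Path(cpu_dir)
    if not base.is_dir():
        return []  # no cpuidle information exposed on this system
    states = []
    # Sort numerically so that state10 comes after state9.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())
        time_us = int((state / "time").read_text())
        states.append((name, usage, time_us))
    return states


if __name__ == "__main__":
    for name, usage, time_us in read_idle_states():
        print(f"{name:10s} entered {usage} times, {time_us / 1e6:.1f} s total")
```

Running it once with the ‘balanced’ profile and once with ‘performance’ would show whether the BIOS change really keeps the CPU out of its deepest idle states, which is only my guess at what the setting does.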