I was running some CPU- and RAM-intensive tasks overnight and woke up to find that the system had restarted for some unknown reason in the middle of these tasks. I do this sort of thing all the time and have never experienced this before. I looked at the system logs of the previous boot by running sudo journalctl --boot=-1 and I don’t see anything notable to indicate what could have caused it. The output is too large to post here, but a shortened version of the last two hours or so before the restart looks like this:
May 03 04:12:27 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 04:12:39 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 04:12:39 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 04:12:50 joelsdesktop systemd[1]: Starting dnf-makecache.service - dnf makecache...
May 03 04:12:51 joelsdesktop dnf[4087737]: Copr repo for PyCharm owned by phracek 5.8 kB/s | 2.1 kB 00:00
May 03 04:12:51 joelsdesktop dnf[4087737]: Fedora 37 - x86_64 46 kB/s | 25 kB 00:00
May 03 04:12:52 joelsdesktop dnf[4087737]: Fedora 37 openh264 (From Cisco) - x86_64 2.2 kB/s | 989 B 00:00
May 03 04:12:52 joelsdesktop dnf[4087737]: Fedora Modular 37 - x86_64 58 kB/s | 25 kB 00:00
May 03 04:12:53 joelsdesktop dnf[4087737]: Fedora 37 - x86_64 - Updates 27 kB/s | 23 kB 00:00
May 03 04:12:55 joelsdesktop dnf[4087737]: Fedora 37 - x86_64 - Updates 315 kB/s | 515 kB 00:01
May 03 04:12:56 joelsdesktop dnf[4087737]: Fedora Modular 37 - x86_64 - Updates 55 kB/s | 24 kB 00:00
May 03 04:12:56 joelsdesktop dnf[4087737]: google-chrome 3.8 kB/s | 1.3 kB 00:00
May 03 04:12:57 joelsdesktop dnf[4087737]: RPM Fusion for Fedora 37 - Free 3.5 kB/s | 3.4 kB 00:00
May 03 04:12:57 joelsdesktop dnf[4087737]: RPM Fusion for Fedora 37 - Free - Updates 7.2 kB/s | 3.2 kB 00:00
May 03 04:12:58 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 04:12:58 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 04:12:58 joelsdesktop dnf[4087737]: RPM Fusion for Fedora 37 - Nonfree 15 kB/s | 6.6 kB 00:00
May 03 04:12:59 joelsdesktop dnf[4087737]: RPM Fusion for Fedora 37 - Nonfree - NVIDIA Dri 10 kB/s | 6.3 kB 00:00
May 03 04:12:59 joelsdesktop dnf[4087737]: RPM Fusion for Fedora 37 - Nonfree - Steam 11 kB/s | 6.1 kB 00:00
May 03 04:13:00 joelsdesktop dnf[4087737]: RPM Fusion for Fedora 37 - Nonfree - Updates 11 kB/s | 6.1 kB 00:00
May 03 04:13:00 joelsdesktop dnf[4087737]: Visual Studio Code 2.8 kB/s | 1.5 kB 00:00
May 03 04:13:01 joelsdesktop dnf[4087737]: Metadata cache created.
May 03 04:13:02 joelsdesktop systemd[1]: dnf-makecache.service: Deactivated successfully.
May 03 04:13:02 joelsdesktop systemd[1]: Finished dnf-makecache.service - dnf makecache.
May 03 04:13:02 joelsdesktop audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dnf-makecache comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 03 04:13:02 joelsdesktop audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dnf-makecache comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 03 04:13:02 joelsdesktop systemd[1]: dnf-makecache.service: Consumed 3.160s CPU time.
May 03 04:13:12 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 04:13:12 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 04:13:28 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
...
May 03 04:41:24 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 04:41:36 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 04:41:36 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 04:41:42 joelsdesktop cupsd[1591]: REQUEST localhost - - "POST / HTTP/1.1" 200 183 Renew-Subscription successful-ok
May 03 04:41:48 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 04:41:48 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 04:41:59 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
...
May 03 05:16:25 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 05:16:35 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 05:16:36 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 05:16:46 joelsdesktop systemd[1]: Starting dnf-makecache.service - dnf makecache...
May 03 05:16:47 joelsdesktop dnf[93493]: Metadata cache refreshed recently.
May 03 05:16:47 joelsdesktop systemd[1]: dnf-makecache.service: Deactivated successfully.
May 03 05:16:47 joelsdesktop systemd[1]: Finished dnf-makecache.service - dnf makecache.
May 03 05:16:47 joelsdesktop audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dnf-makecache comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 03 05:16:47 joelsdesktop audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dnf-makecache comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 03 05:16:49 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 05:16:49 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 05:17:01 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 05:17:01 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 05:17:13 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
...
May 03 05:39:59 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 05:39:59 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 05:40:02 joelsdesktop cupsd[1591]: REQUEST localhost - - "POST / HTTP/1.1" 200 183 Renew-Subscription successful-ok
May 03 05:40:11 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 05:40:11 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 05:40:27 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
...
May 03 06:03:54 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 06:03:54 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
May 03 06:04:07 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...
May 03 06:04:07 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.
where I have omitted a lot of the redundant “starting tracker” lines to keep within the character limit. The 06:04:07 entry is the last line before the restart.
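For anyone who wants to trim a journal the same way without doing it by hand, here is a minimal sketch. It assumes the noise is every line mentioning tracker-extract-3.service, as in the excerpt above; the function names and the script name in the usage note are made up for illustration.

```python
import re
import sys
from typing import Iterable, Iterator

# The repetitive unit name, taken from the excerpt above.
NOISE = re.compile(r"tracker-extract-3\.service")


def drop_noise(lines: Iterable[str], pattern: "re.Pattern[str]" = NOISE) -> Iterator[str]:
    """Yield only the lines that do not match the noise pattern."""
    for line in lines:
        if not pattern.search(line):
            yield line


def main() -> None:
    """Filter journal text arriving on stdin and write the rest to stdout."""
    for line in drop_noise(sys.stdin):
        sys.stdout.write(line)
```

Saved as, say, filter_journal.py with a call to main() added at the bottom, it could be used as: sudo journalctl --boot=-1 | python3 filter_journal.py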
Would a memory error have triggered a restart or would it have just shut down? Would that output anything notable in the logs? I monitor temperatures all the time and they are always fine. It’s a desktop so it would not have been a battery issue.
How would I even begin to diagnose which component could be at fault? When I first set up the system last October, I ran memtest86 for several days and zero errors popped up, so I think the memory is fine. (Memory always seems to be the most finicky component.) The only difference is that last night I left an external SSD plugged in, which I don’t usually do.
memtest86 and prime95 (it has a linux version) are popular testing tools. I never used anything outside of that, I think, so can’t really recommend anything further.
Thanks. This is the first time I have experienced this, so I guess worst case scenario if it is memory-related, one memory error in 7-8 months is not terrible, but a minor annoyance. My guess is that memory errors this rare would not show up on a test such as memtest86 running for a couple of days at most. If this becomes a frequent thing then that is a bit worrisome. I have downclocked the memory (strictly speaking, the memory is only supported up to 3600 MHz but I had it slightly overclocked at 4000 MHz) just to reduce the chances of this happening.
It may have been a power fluctuation, enough to cause the restart if the BIOS is configured to restart on a power loss. There would probably be no related log entries.
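One way to check that idea is to see whether the previous boot's journal simply stops, or whether it ends with shutdown messages. A rough sketch follows; the marker strings are an assumption about what an orderly shutdown usually leaves in the journal, not an exhaustive or guaranteed list, and the function name is made up.

```python
from typing import Sequence

# ASSUMPTION: substrings that typically show up near the end of the journal
# when systemd shuts down in an orderly way. Adjust for your own system.
CLEAN_SHUTDOWN_MARKERS = (
    "Journal stopped",
    "Shutting down",
    "Reached target shutdown",
)


def ended_cleanly(
    lines: Sequence[str],
    tail: int = 200,
    markers: Sequence[str] = CLEAN_SHUTDOWN_MARKERS,
) -> bool:
    """Return True if any of the last `tail` lines contains a shutdown marker."""
    return any(marker in line for line in lines[-tail:] for marker in markers)


# The last two lines of the excerpt posted above:
last_lines = [
    "May 03 06:04:07 joelsdesktop systemd[2191]: Starting tracker-extract-3.service - Tracker metadata extractor...",
    "May 03 06:04:07 joelsdesktop systemd[2191]: Started tracker-extract-3.service - Tracker metadata extractor.",
]
print(ended_cleanly(last_lines))  # False: the log just stops mid-activity
```

A False result only says the journal ended abruptly, which fits a power loss or a hard reset, but it cannot tell the two apart.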
The Zen4 platform with four dual-rank DIMMs installed can be tricky, and generally memory errors can surface weeks or months later on seemingly stable configurations when hit with a unique load scenario.
Updating the EFI to the latest version (really important) and slightly increasing the SoC voltage (within safe margins, <= 1.30V) could help alleviate instabilities.
I recommend testing with y-cruncher, not memtest86. memtest86 is fine when testing for defective RAM modules, but it tends to not reliably uncover system instabilities caused by overclocked RAM in a timely manner. Where memtest86 can run without errors for days, y-cruncher usually uncovers them within a few hours.
For the 7950x and 128 GiB RAM, this configuration should work fine:
Thanks for the suggestion. I am currently running the stress test. I would certainly hope it would not have trouble with this, seeing as currently it is not overclocked. (speed is currently at 3600 MHz and timings are set to default.)
So far it has passed 35 iterations of all the tests (10.15 hours). Will it just keep going forever until I stop it? How long would you recommend running it for? I notice there are other tests not being run. Is there a reason for not running these or should I run those as well?
After about 13 hours I had to stop the test and restart the machine. The CPU load was so intense that all I/O became unusable. (Extreme lag for ~10 hrs, then froze up completely after this.) I am hoping this is sufficient.
This seems to indicate that the stress test found a weakness. Whether due to temperature or something else you do not say, but I hope you were monitoring temps while the test was in progress.
So far it has passed 35 iterations of all the tests (10.15 hours). Will it just keep going forever until I stop it? How long would you recommend running it for?
I personally would consider 35 iterations sufficient. If you do not stop the test, it keeps running until something goes wrong: either it detects a memory error and exits with a warning, or the machine crashes.
I notice there are other tests not being run. Is there a reason for not running these or should I run those as well?
The other tests are not as demanding for testing RAM and would waste more time and energy, but you can of course run them as well.
After about 13 hours I had to stop the test and restart the machine. The CPU load was so intense that all I/O became unusable. (Extreme lag for ~10 hrs, then froze up completely after this.) I am hoping this is sufficient.
A slow machine is to be expected when running the test since it loads all CPU cores and most of your RAM, but a freeze should not have happened. If there was a thermal problem, it should have occurred much earlier since thermally saturating the hardware should not take 13 hours.
On Zen4, when the memory configuration is unstable, the machine usually just resets rather than freezing. Alternatively, without ECC memory, some bits get flipped and the errors go undetected unless you are specifically looking for wrong data, as y-cruncher does.
You could do another test with some cores excluded, and less RAM utilized to increase system responsiveness during testing. If you want to go that route, just modify the LogicalCores array and remove some (maybe up to 8) cores, and reduce TotalMemory from the stresstest.cfg so the system has some more free RAM to work with. Monitoring system temperatures might also be advisable.
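To make the numbers concrete, here is a small sketch that works out a shorter core list and a smaller memory figure to type into stresstest.cfg by hand. It does not read or write the file. Dropping the highest-numbered logical cores and scaling memory by a fraction are assumptions for illustration, and both function names are made up.

```python
from typing import List


def reduced_core_list(total_logical: int, drop: int) -> List[int]:
    """Return logical core IDs 0..N-1 with the `drop` highest-numbered removed.

    ASSUMPTION: which cores to leave free is arbitrary here; any subset works.
    """
    if not 0 <= drop < total_logical:
        raise ValueError("drop must be between 0 and total_logical - 1")
    return list(range(total_logical - drop))


def reduced_total_memory(current_value: int, keep_fraction: float) -> int:
    """Scale the current TotalMemory value, in whatever unit the file uses."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return int(current_value * keep_fraction)


# 32 logical cores with 8 left free for the desktop, Bluetooth, wifi, etc.
cores = reduced_core_list(32, 8)
print(len(cores), cores[0], cores[-1])  # 24 0 23
```

The resulting list and the scaled memory value would then replace the existing LogicalCores and TotalMemory entries.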
What I was able to observe in htop was that the CPU cores were not just loaded, they were extremely overloaded. It looked like it was constantly trying to run 60-70 processes simultaneously on 32 logical cores. My guess is that a sustained load like this will eventually cause extreme scheduling issues such that after a while basic functionality like bluetooth and wifi starts getting pushed out and stops working. For example, the wifi would continually disconnect and reconnect.
I can also say that the thermals are as expected for this chip. Unfortunately AMD designed them to run a bit hot, but even under full load the highest I saw was about 88 C (typically 70s).
I should clarify that what I mean by freeze is that the machine goes to the lock screen and Bluetooth and USB devices are unresponsive. The test could very well be continuing to run behind the lock screen, but I can’t say. I’m locked out of the machine since Bluetooth stops working. The time on the lock screen was also stuck at 6:13pm for a couple of hours, so clearly the clock was not working properly either. Here is the system log output for the duration of the test:
6:07:30 PM kernel: System encountered a non-fatal error in __audit_sockaddr()
6:07:30 PM kernel: System encountered a non-fatal error in __audit_sockaddr()
6:07:27 PM kernel: System encountered a non-fatal error in btrfs_alloc_delayed_item()
6:07:23 PM kernel: System encountered a non-fatal error in __audit_sockaddr()
6:07:06 PM kernel: logitech-hidpp-device 0005:046D:B023.0021: Device not connected
5:38:48 PM systemd: Failed to start dnf-makecache.service - dnf makecache.
5:31:36 PM kernel: logitech-hidpp-device 0005:046D:B023.0020: Device not connected
5:18:49 PM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
5:16:18 PM kernel: logitech-hidpp-device 0005:046D:B023.001E: Device not connected
4:47:35 PM gdm-session-wor: gkr-pam: the password for the login keyring was invalid.
4:45:50 PM kernel: logitech-hidpp-device 0005:046D:B023.001C: Device not connected
4:03:13 PM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
4:01:30 PM kernel: logitech-hidpp-device 0005:046D:B023.001A: Device not connected
3:35:45 PM gdm-session-wor: gkr-pam: the password for the login keyring was invalid.
3:35:00 PM kernel: logitech-hidpp-device 0005:046D:B023.0018: Device not connected
3:28:41 PM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
3:18:00 PM kernel: logitech-hidpp-device 0005:046D:B023.0016: Device not connected
2:47:52 PM kernel: Bluetooth: hci0: ACL packet for unknown connection handle 3586
2:07:59 PM kernel: logitech-hidpp-device 0005:046D:B023.0012: Device not connected
1:36:32 PM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
1:35:47 PM systemd: Failed to start dnf-makecache.service - dnf makecache.
1:35:42 PM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
12:37:12 PM kernel: Bluetooth: hci0: ACL packet for unknown connection handle 3585
12:27:59 PM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
12:27:16 PM kernel: logitech-hidpp-device 0005:046D:B023.000D: Device not connected
12:27:03 PM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
12:23:20 PM kernel: logitech-hidpp-device 0005:046D:B023.000C: Device not connected
11:20:56 AM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
9:38:17 AM systemd: Failed to start dnf-makecache.service - dnf makecache.
9:12:09 AM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
8:23:59 AM bluetoothd: profiles/audio/avctp.c:avctp_connect_cb() HUP or ERR on socket: Connection timed out (110)
8:12:51 AM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
8:10:51 AM gdm-session-wor: gkr-pam: the password for the login keyring was invalid.
7:07:15 AM kernel: iwlwifi 0000:0d:00.0: Not associated and the session protection is over already...
Is there a way to set a maximum number of iterations? Or alternatively I suppose I could have the results output to a log file. But if it can’t even maintain bluetooth or wifi, I’m not sure how it would manage writing to an SSD.
It looked like it was constantly trying to run 60-70 processes simultaneously on 32 logical cores.
It does the same on my system, though I have never experienced the kind of scheduling issues you are describing. But I am not using Bluetooth peripherals, and I am running with CPUSchedulingPolicy=idle.
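CPUSchedulingPolicy=idle is a systemd setting that puts the process in the SCHED_IDLE scheduling class. If you start the test from a script rather than a unit, something like the following sketch should have a similar effect on Linux. The function names are made up, and the command you pass in is whatever you normally use to launch the stress test.

```python
import os
import subprocess
from typing import Sequence


def _enter_idle_class() -> None:
    # Runs in the child between fork and exec; pid 0 means "this process".
    # SCHED_IDLE requires a static priority of 0.
    os.sched_setscheduler(0, os.SCHED_IDLE, os.sched_param(0))


def run_at_idle_priority(cmd: Sequence[str]) -> int:
    """Run `cmd` under SCHED_IDLE and return its exit code (Linux only)."""
    completed = subprocess.run(list(cmd), preexec_fn=_enter_idle_class)
    return completed.returncode
```

With this policy the test only gets CPU time that nothing else wants, so the desktop, Bluetooth and wifi should stay responsive at the cost of a somewhat less aggressive load.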
[…] the highest I saw was about 88 C (typically 70s)
That seems fine for that CPU, yes.
Is there a way to set a maximum number of iterations?
Not that I know of. Just set the maximum runtime by setting SecondsTotal to a sane value and maybe reduce the LogicalCores to 16 values.
If you keep having trouble with y-cruncher after that, maybe switch tools or call it stable enough after 35 successful iterations if you don’t want to bump up the RAM speeds again.
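Since there seems to be no iteration limit, a time limit can stand in for one. Going by the numbers reported earlier (35 iterations in 10.15 hours, so roughly 1044 seconds per iteration), a rough value for SecondsTotal can be estimated like this. It is only a sketch: the function name is made up, and the time per iteration will change if you reduce the cores or memory.

```python
def seconds_for_iterations(
    observed_iterations: int,
    observed_hours: float,
    wanted_iterations: int,
) -> int:
    """Estimate the run time in seconds needed for `wanted_iterations`,
    based on how long a previous run took."""
    if observed_iterations <= 0 or wanted_iterations < 0:
        raise ValueError("iteration counts must be positive")
    seconds_per_iteration = observed_hours * 3600 / observed_iterations
    return round(seconds_per_iteration * wanted_iterations)


# Figures from this thread: 35 iterations took 10.15 hours.
print(seconds_for_iterations(35, 10.15, 10))  # estimate for 10 iterations
```

The printed number is what would go into SecondsTotal for a run of about ten iterations.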
Yes, I think I am going to call it stable for now. I’m not particularly interested in overclocking the RAM. I’d much rather have stability. If I experience any more crashes, I’ll do more testing. Thanks for your help!
I did an upgrade on my F37 system (workstation) 4 days ago during which kernel 6.2.14 was installed and just after the reboot began getting major crashes (some causing a reboot) and constant kernel oops. I wound up with ~14000 oops files in /var/spool/abrt in about 10 hours.
I don’t feel it was kernel related since the issue remained even when I rebooted with both the 6.2.13 and 6.2.12 kernels.
Never did find out what the cause was, but I did a new clean install of F38 and the problems stopped. Something that was upgraded with that transaction seemed the cause, but over 40 packages were updated at that time so did not try to identify it any further…
I am in the process now of reinstalling all the software I was using and hopefully it will not restart the problems.
I did try running y-cruncher and was never able to complete even one test, since it gave repeated errors on one or two different CPUs, and random ones each time…
Yeah, my computer just randomly rebooted as well. This happened an hour after finishing a 24 hr error-free y-cruncher test, so I don’t think it’s cpu or ram-related either. The thing is this is kind of difficult to reproduce because it often doesn’t happen for days at a time.