AMDGPU crashes every 5 days

So, I have a problem that I’m now certain I cannot solve on my own: my GPU crashes every five days (give or take 20 minutes). The sound continues to play for a short time, and then everything stops. I have tried a lot, but since I have been using Linux for only nine months, I’m running out of ideas. What is normal and what is not? I don’t know.

Let’s start at the beginning: my entire computer (down to the power cord) is brand new. I assembled it myself last month:

CPU: Intel i7-12700K
AIO: Corsair iCUE H150i RGB ELITE
MB: ASUS STRIX Z690-G (DDR5)
RAM: DDR5 Corsair Vengeance 32 GB
SSD1: Samsung 980 PRO MZ-V8P500BW (500 GB)
SSD2: Samsung 870 EVO (1 TB)
SSD3: Crucial CT1000MX500SSD1 (1 TB)
Power: Seasonic PRIME TX-1000 80Plus Titanium
Case: Bequiet! Silent Base 802 Black
GPU: PowerColor AMD Radeon RX 7900 XT Hellhound
Displays: two HDMI monitors, one of them connected through an adapter to a DisplayPort output (this worked perfectly on my previous Nvidia setup).

Probably the worst idea I could have had, knowing that my GPU is far too new and therefore not fully supported yet. It is impossible to tell whether my trouble comes from the driver or the hardware. This machine was built to replace another one whose GPU failed some time ago, so I do not have a spare GPU to test with…

My system is Fedora 37 on kernel 6.2.0-0.rc5, which I updated today from 6.2.0-0.rc2. LLVM 15.0.6. Mesa 22.3.3. I ran a memtest (no errors) and a quick VRAM test (no errors). At first, the crash occurred every day (every 24 hours, give or take 20 minutes), with the same message as the one I get now:

[amdgpu]] *ERROR* ring gfx_0.0.0 timeout

I’m sure it means that yes, my GPU did indeed crash. I also noticed regular core dumps from Steam, specifically from gldriverquery, and experienced at least one complete failure triggered by Unreal Engine 5.1 (just by leaving it open and idle; same gfx timeout message). After completely disabling ASPM, I was able to stretch the interval between crashes to five days.

Also:

I’m able to play Cyberpunk 2077 for 4 hours, but it crashes in a rather unusual way: the process cannot be killed by any means short of a hard reboot… and yet I can keep using my system normally alongside it. Cool, but concerning.

If I change the performance mode of my GPU with CoreCtrl from auto to high, I get heavy stuttering and glitches in YouTube videos. It also looks like the voltage is constantly going down. I’m not sure whether that means something important or not.

I don’t know what to do now. What would you advise? Please remember, I’m still a beginner with Linux.

You seem to be running Steam to play games, and keeping the machine up with Steam active the whole time.

Steam is a 32-bit environment running on a 64-bit system. What repo did you choose to install Steam from?

While I do not have similar hardware to test, it is quite obvious that you have installed both a very new CPU and a very new GPU. You are also NOT running the fully tested Fedora 37 kernel (6.1.6) nor the one from Fedora 37 testing (6.1.7), but rather the one from Rawhide.

It will be difficult to track down the problem, but the first step should be to provide the logs, which may offer some hints. Since you said it crashes, the logs from the previous boot may help.
journalctl -b -1 | tail -n 500 would give the last 500 lines from the log of the previous boot as a starting point to look for hints. (The last 50 lines may suffice, but more information gives a better chance.) Post that as preformatted text using the </> button above.
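If 500 lines turn out to be mostly noise, one possible way to narrow them down is to keep only the lines that mention the GPU driver. The following is a minimal sketch, not an official procedure: the function name gpu_lines is made up for illustration, and the pattern simply assumes that relevant messages contain "amdgpu" or start a "[drm" tag, as the error quoted earlier in this thread does.

```shell
#!/usr/bin/env bash
# Keep only log lines that mention the amdgpu driver or the DRM subsystem.
# Reads log text on stdin, so it also works on a saved log file.
# (gpu_lines is a hypothetical helper name, not an existing tool.)
gpu_lines() {
    grep -E 'amdgpu|\[drm'
}

# Assumed usage on the affected machine
# (kernel messages only, from the previous boot):
#   journalctl -k -b -1 --no-pager | gpu_lines
```

Posting the full tail as requested is still the safer choice; a filter like this is mainly useful for spotting the first GPU-related error quickly.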

I installed Steam through GNOME Software, and while I’m not sure about the earlier crashes, I can confirm that yes, it was running for these last 5 days.

That was more a desperate measure than anything else. The 7900 XT didn’t work with the kernel from the live USB, so I used the Vanilla Repositories to update to the very latest kernel I could (it was a test at first, but since it got my GPU working… I kept it). Given the problem I have, I tried upgrading before downgrading. An old habit from Windows, hard to lose.

Well… I made a mistake… and lost my entire journal for the past 2 weeks… I’ll try to trigger another crash with Unreal Engine if I can (since it’s the only thing I can do besides waiting five days), and I will post the log when it’s done.

This might help you gather information:
Reporting Bugs — amdgpu graphics and compute stack unknown-build documentation

I found this interesting too:
https://www.amd.com/en/support/graphics/amd-radeon-rx-7000-series/amd-radeon-rx-7900-series/amd-radeon-rx-7900xt

The live USB has (I think) kernel 6.0.5 and outdated drivers. Installing and then doing a full sudo dnf upgrade would get the latest released packages (kernel, drivers, firmware, etc.) from the repos. I don’t have an AMD GPU, so I cannot say exactly when the 7900 XT began to be supported. I was just commenting on the items that seemed not to be stock, and noting that using things from outside the normal repos makes tracking down problems a little more difficult.

Upgrading is not a bad thing. It often gets improvements especially with kernels and drivers related to the newer hardware. I think the initiative to look for newer software that may support the newer hardware is a good thing.

What seems odd is how consistent this is: you state that it happens regularly every 5 days.

Ok, I was able to trigger a crash with the Unreal Engine, as predicted! Sorry for the delay.

Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2507_cacaaaac from uid 1000 finished with success after 9ms
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2508_bacedbce from uid 1000 finished with success after 7ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2509_aceecece from uid 1000 finished with success after 5ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2510_eceeeeed from uid 1000 finished with success after 5ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2511_adcbbcce from uid 1000 finished with success after 4ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2512_edceeeea from uid 1000 finished with success after 4ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2513_adedcaca from uid 1000 finished with success after 5ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2514_abdbaeaa from uid 1000 finished with success after 5ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2515_ecaabebe from uid 1000 finished with success after 5ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2516_dcdccbed from uid 1000 finished with success after 6ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2517_aeadecec from uid 1000 finished with success after 5ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2518_abcdccde from uid 1000 finished with success after 5ms
Jan 23 18:03:34 olorin PackageKit[1476]: search-file transaction /2519_aeebedcb from uid 1000 finished with success after 4ms
Jan 23 18:03:34 olorin PackageKit[1476]: resolve transaction /2520_cabceacc from uid 1000 finished with success after 2ms
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:34 olorin firefox.desktop[3099]: [2023-01-23T17:03:34Z ERROR mp4parse] Found 2 nul bytes in "\0\0"
Jan 23 18:03:34 olorin firefox.desktop[3099]: [2023-01-23T17:03:34Z ERROR mp4parse] Found 2 nul bytes in "\0\0"
Jan 23 18:03:34 olorin firefox.desktop[3099]: [2023-01-23T17:03:34Z ERROR mp4parse] Found 2 nul bytes in "\0\0"
Jan 23 18:03:34 olorin firefox.desktop[3099]: [2023-01-23T17:03:34Z ERROR mp4parse] Found 2 nul bytes in "\0\0"
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:34 olorin rtkit-daemon[982]: Successfully made thread 3385 of process 3099 (/usr/lib64/firefox/firefox) owned by '1000' RT at priority 10.
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:34 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:35 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:41 olorin systemd[1]: Stopping user@42.service - User Manager for UID 42...
Jan 23 18:03:41 olorin systemd[1356]: Activating special unit exit.target...
Jan 23 18:03:41 olorin systemd[1356]: Stopped target default.target - Main User Target.
Jan 23 18:03:41 olorin systemd[1356]: Stopping pipewire-pulse.service - PipeWire PulseAudio...
Jan 23 18:03:41 olorin systemd[1356]: Stopped pipewire-pulse.service - PipeWire PulseAudio.
Jan 23 18:03:41 olorin systemd[1356]: Stopping wireplumber.service - Multimedia Service Session Manager...
Jan 23 18:03:41 olorin wireplumber[1445]: stopped by signal: Terminated
Jan 23 18:03:41 olorin wireplumber[1445]: disconnected from pipewire
Jan 23 18:03:41 olorin systemd[1356]: Stopped wireplumber.service - Multimedia Service Session Manager.
Jan 23 18:03:41 olorin systemd[1356]: Stopping pipewire.service - PipeWire Multimedia Service...
Jan 23 18:03:41 olorin systemd[1356]: Stopped pipewire.service - PipeWire Multimedia Service.
Jan 23 18:03:41 olorin systemd[1356]: Stopped target basic.target - Basic System.
Jan 23 18:03:41 olorin systemd[1356]: Stopped target paths.target - Paths.
Jan 23 18:03:41 olorin systemd[1356]: Stopped target sockets.target - Sockets.
Jan 23 18:03:41 olorin systemd[1356]: Stopped target timers.target - Timers.
Jan 23 18:03:41 olorin systemd[1356]: Stopped systemd-tmpfiles-clean.timer - Daily Cleanup of User's Temporary Directories.
Jan 23 18:03:41 olorin systemd[1356]: Closed pipewire-pulse.socket - PipeWire PulseAudio.
Jan 23 18:03:41 olorin systemd[1356]: Closed pipewire.socket - PipeWire Multimedia System Socket.
Jan 23 18:03:41 olorin dbus-broker[1369]: Dispatched 208 messages @ 4(±14)μs / message.
Jan 23 18:03:41 olorin systemd[1356]: Stopping dbus-broker.service - D-Bus User Message Bus...
Jan 23 18:03:41 olorin systemd[1356]: Stopped systemd-tmpfiles-setup.service - Create User's Volatile Files and Directories.
Jan 23 18:03:41 olorin systemd[1356]: Stopped dbus-broker.service - D-Bus User Message Bus.
Jan 23 18:03:41 olorin systemd[1356]: Removed slice session.slice - User Core Session Slice.
Jan 23 18:03:41 olorin systemd[1356]: Closed dbus.socket - D-Bus User Message Bus Socket.
Jan 23 18:03:41 olorin systemd[1356]: Removed slice app.slice - User Application Slice.
Jan 23 18:03:41 olorin systemd[1356]: Reached target shutdown.target - Shutdown.
Jan 23 18:03:41 olorin systemd[1356]: Finished systemd-exit.service - Exit the Session.
Jan 23 18:03:41 olorin systemd[1356]: Reached target exit.target - Exit the Session.
Jan 23 18:03:41 olorin systemd[1]: user@42.service: Deactivated successfully.
Jan 23 18:03:41 olorin systemd[1]: Stopped user@42.service - User Manager for UID 42.
Jan 23 18:03:41 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=user@42 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 23 18:03:41 olorin systemd[1]: Stopping user-runtime-dir@42.service - User Runtime Directory /run/user/42...
Jan 23 18:03:41 olorin systemd[1]: run-user-42.mount: Deactivated successfully.
Jan 23 18:03:41 olorin systemd[1]: user-runtime-dir@42.service: Deactivated successfully.
Jan 23 18:03:41 olorin systemd[1]: Stopped user-runtime-dir@42.service - User Runtime Directory /run/user/42.
Jan 23 18:03:41 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=user-runtime-dir@42 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 23 18:03:41 olorin systemd[1]: Removed slice user-42.slice - User Slice of UID 42.
Jan 23 18:03:41 olorin systemd[1]: user-42.slice: Consumed 3.445s CPU time.
Jan 23 18:03:42 olorin firefox.desktop[3245]: libva info: VA-API version 1.16.0
Jan 23 18:03:42 olorin firefox.desktop[3245]: libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
Jan 23 18:03:42 olorin firefox.desktop[3245]: libva info: Found init function __vaDriverInit_1_16
Jan 23 18:03:42 olorin firefox.desktop[3245]: libva info: va_openDriver() returns 0
Jan 23 18:03:51 olorin systemd[1]: fprintd.service: Deactivated successfully.
Jan 23 18:03:51 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=fprintd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 23 18:03:51 olorin audit: BPF prog-id=73 op=UNLOAD
Jan 23 18:04:00 olorin systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Jan 23 18:04:00 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 23 18:04:00 olorin systemd[1]: systemd-localed.service: Deactivated successfully.
Jan 23 18:04:00 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-localed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 23 18:04:00 olorin audit: BPF prog-id=69 op=UNLOAD
Jan 23 18:04:00 olorin audit: BPF prog-id=68 op=UNLOAD
Jan 23 18:04:00 olorin audit: BPF prog-id=67 op=UNLOAD
Jan 23 18:04:00 olorin audit: BPF prog-id=72 op=UNLOAD
Jan 23 18:04:00 olorin audit: BPF prog-id=71 op=UNLOAD
Jan 23 18:04:00 olorin audit: BPF prog-id=70 op=UNLOAD
Jan 23 18:04:19 olorin systemd[1840]: Started app-gnome-ckb\x2dnext-3741.scope - Application launched by gnome-shell.
Jan 23 18:04:19 olorin ckb-next[3741]: QSocketNotifier: Can only be used with threads started with QThread
Jan 23 18:04:19 olorin ckb-next.desktop[3741]: ckb-next is already running. Exiting.
Jan 23 18:04:19 olorin ckb-next[2335]: qt.qpa.wayland: Wayland does not support QWindow::requestActivate()
Jan 23 18:04:21 olorin geoclue[1444]: Service not used for 60 seconds. Shutting down..
Jan 23 18:04:21 olorin systemd[1]: geoclue.service: Deactivated successfully.
Jan 23 18:04:21 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=geoclue comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 23 18:04:22 olorin systemd[1840]: app-gnome-ckb\x2dnext.autostart-2335.scope: Consumed 2.165s CPU time.
Jan 23 18:04:22 olorin realmd[1684]: quitting realmd service after timeout
Jan 23 18:04:22 olorin realmd[1684]: stopping service
Jan 23 18:04:22 olorin systemd[1]: realmd.service: Deactivated successfully.
Jan 23 18:04:22 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=realmd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 23 18:04:30 olorin systemd[1840]: Starting gvfs-metadata.service - Virtual filesystem metadata service...
Jan 23 18:04:30 olorin systemd[1840]: Started gvfs-metadata.service - Virtual filesystem metadata service.
Jan 23 18:04:31 olorin chronyd[1042]: Selected source 54.38.222.63 (2.fedora.pool.ntp.org)
Jan 23 18:04:48 olorin gnome-shell[2008]: Object .Gjs_ui_workspaceThumbnail_ThumbnailsBox (0x55e747d16c30), has been already disposed — impossible to get any property from it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Jan 23 18:04:48 olorin gnome-shell[2008]: == Stack trace for context 0x55e744a9a4b0 ==
Jan 23 18:04:48 olorin gnome-shell[2008]: #0   7ffe4bce3ae0 b   resource:///org/gnome/shell/ui/workspacesView.js:639 (28a9f942ac90 @ 33)
Jan 23 18:04:48 olorin gnome-shell[2008]: #1   7ffe4bce3c30 b   resource:///org/gnome/shell/ui/workspacesView.js:686 (28a9f942ad30 @ 467)
Jan 23 18:04:48 olorin gnome-shell[2008]: #2   55e747de9b78 i   resource:///org/gnome/shell/ui/workspace.js:854 (3bc5b9fc11f0 @ 369)
Jan 23 18:04:48 olorin gnome-shell[2008]: #3   55e747de9af0 i   resource:///org/gnome/shell/ui/workspace.js:805 (3bc5b9fc1150 @ 17)
Jan 23 18:04:48 olorin gnome-shell[2008]: #4   55e747de9a58 i   resource:///org/gnome/shell/ui/workspacesView.js:1016 (28a9f942b790 @ 124)
Jan 23 18:04:48 olorin gnome-shell[2008]: #5   55e747de99c8 i   resource:///org/gnome/shell/ui/overviewControls.js:715 (3bc5b9f87380 @ 39)
Jan 23 18:04:48 olorin gnome-shell[2008]: #6   55e747de9940 i   resource:///org/gnome/shell/ui/layout.js:347 (38640bdcb1a0 @ 22)
Jan 23 18:04:48 olorin gnome-shell[2008]: #7   55e747de98b0 i   resource:///org/gnome/shell/ui/overview.js:599 (3bc5b9f851f0 @ 170)
Jan 23 18:04:48 olorin gnome-shell[2008]: #8   55e747de9830 i   resource:///org/gnome/shell/ui/overview.js:582 (3bc5b9f851a0 @ 12)
Jan 23 18:04:48 olorin gnome-shell[2008]: #9   55e747de97b0 i   resource:///org/gnome/shell/ui/overviewControls.js:757 (3bc5b9f87560 @ 55)
Jan 23 18:04:48 olorin gnome-shell[2008]: #10   7ffe4bcea820 b   resource:///org/gnome/shell/ui/environment.js:150 (38640bd8c9c0 @ 39)
Jan 23 18:04:48 olorin gnome-shell[2008]: #11   55e747de9728 i   resource:///org/gnome/shell/ui/environment.js:316 (38640bd8cec0 @ 14)
Jan 23 18:04:48 olorin gnome-shell[2008]: == Stack trace for context 0x55e744a9a4b0 ==
Jan 23 18:04:48 olorin gnome-shell[2008]: #0   7ffe4bce3c30 b   resource:///org/gnome/shell/ui/workspacesView.js:688 (28a9f942ad30 @ 484)
Jan 23 18:04:48 olorin gnome-shell[2008]: #1   55e747de9b78 i   resource:///org/gnome/shell/ui/workspace.js:854 (3bc5b9fc11f0 @ 369)
Jan 23 18:04:48 olorin gnome-shell[2008]: #2   55e747de9af0 i   resource:///org/gnome/shell/ui/workspace.js:805 (3bc5b9fc1150 @ 17)
Jan 23 18:04:48 olorin gnome-shell[2008]: #3   55e747de9a58 i   resource:///org/gnome/shell/ui/workspacesView.js:1016 (28a9f942b790 @ 124)
Jan 23 18:04:48 olorin gnome-shell[2008]: #4   55e747de99c8 i   resource:///org/gnome/shell/ui/overviewControls.js:715 (3bc5b9f87380 @ 39)
Jan 23 18:04:48 olorin gnome-shell[2008]: #5   55e747de9940 i   resource:///org/gnome/shell/ui/layout.js:347 (38640bdcb1a0 @ 22)
Jan 23 18:04:48 olorin gnome-shell[2008]: #6   55e747de98b0 i   resource:///org/gnome/shell/ui/overview.js:599 (3bc5b9f851f0 @ 170)
Jan 23 18:04:48 olorin gnome-shell[2008]: #7   55e747de9830 i   resource:///org/gnome/shell/ui/overview.js:582 (3bc5b9f851a0 @ 12)
Jan 23 18:04:48 olorin gnome-shell[2008]: #8   55e747de97b0 i   resource:///org/gnome/shell/ui/overviewControls.js:757 (3bc5b9f87560 @ 55)
Jan 23 18:04:48 olorin gnome-shell[2008]: Object .Gjs_ui_workspaceThumbnail_ThumbnailsBox (0x55e747d16c30), has been already disposed — impossible to get any property from it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Jan 23 18:04:48 olorin gnome-shell[2008]: #9   7ffe4bcea820 b   resource:///org/gnome/shell/ui/environment.js:150 (38640bd8c9c0 @ 39)
Jan 23 18:04:48 olorin gnome-shell[2008]: #10   55e747de9728 i   resource:///org/gnome/shell/ui/environment.js:316 (38640bd8cec0 @ 14)
Jan 23 18:04:48 olorin gnome-shell[2008]: JS ERROR: TypeError: windowInfo is undefined
                                          vfunc_allocate@resource:///org/gnome/shell/ui/workspace.js:705:17
                                          vfunc_allocate@resource:///org/gnome/shell/ui/workspacesView.js:555:25
                                          vfunc_allocate@resource:///org/gnome/shell/ui/workspacesView.js:711:30
                                          removeWindow@resource:///org/gnome/shell/ui/workspace.js:854:29
                                          addWindow/<.destroyId<@resource:///org/gnome/shell/ui/workspace.js:805:22
                                          vfunc_hide@resource:///org/gnome/shell/ui/workspacesView.js:1016:38
                                          vfunc_unmap@resource:///org/gnome/shell/ui/overviewControls.js:715:33
                                          hideOverview@resource:///org/gnome/shell/ui/layout.js:347:28
                                          _hideDone@resource:///org/gnome/shell/ui/overview.js:599:32
                                          _animateNotVisible/<@resource:///org/gnome/shell/ui/overview.js:582:55
                                          onStopped@resource:///org/gnome/shell/ui/overviewControls.js:757:21
                                          _makeEaseCallback/<@resource:///org/gnome/shell/ui/environment.js:150:22
                                          _easeActorProperty/<@resource:///org/gnome/shell/ui/environment.js:316:60
Jan 23 18:04:49 olorin gnome-shell[2008]: Object St.Button (0x55e746499480), has been already disposed — impossible to get any property from it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Jan 23 18:04:49 olorin gnome-shell[2008]: == Stack trace for context 0x55e744a9a4b0 ==
Jan 23 18:04:49 olorin gnome-shell[2008]: #0   55e747de9728 i   resource:///org/gnome/shell/ui/windowPreview.js:566 (3bc5b9fc42e0 @ 10)
Jan 23 18:04:49 olorin gnome-shell[2008]: == Stack trace for context 0x55e744a9a4b0 ==
Jan 23 18:04:49 olorin gnome-shell[2008]: #0   55e747de9728 i   resource:///org/gnome/shell/ui/windowPreview.js:567 (3bc5b9fc42e0 @ 36)
Jan 23 18:04:49 olorin gnome-shell[2008]: == Stack trace for context 0x55e744a9a4b0 ==
Jan 23 18:04:49 olorin gnome-shell[2008]: #0   55e747de9728 i   resource:///org/gnome/shell/ui/windowPreview.js:570 (3bc5b9fc42e0 @ 77)
Jan 23 18:04:49 olorin gnome-shell[2008]: Object St.Label (0x55e7463e9f70), has been already disposed — impossible to get any property from it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Jan 23 18:04:49 olorin gnome-shell[2008]: Object .Gjs_ui_windowPreview_WindowPreview (0x55e747e1cc60), has been already disposed — impossible to get any property from it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Jan 23 18:05:04 olorin systemd[1840]: Started app-gnome-org.gnome.Terminal-3873.scope - Application launched by gnome-shell.
Jan 23 18:05:04 olorin systemd[1840]: Created slice app-org.gnome.Terminal.slice - Slice /app/org.gnome.Terminal.
Jan 23 18:05:04 olorin systemd[1840]: Starting gnome-terminal-server.service - GNOME Terminal Server...
Jan 23 18:05:04 olorin systemd[1840]: Started gnome-terminal-server.service - GNOME Terminal Server.
Jan 23 18:05:04 olorin systemd[1840]: Started vte-spawn-abc62c66-a55d-4b29-9751-9138e93dc44f.scope - VTE child process 3897 launched by gnome-terminal-server process 3879.
Jan 23 18:05:18 olorin kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=12613, emitted seq=12615
Jan 23 18:05:18 olorin kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2008 thread gnome-shel:cs0 pid 2022
Jan 23 18:05:18 olorin kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Jan 23 18:05:19 olorin kernel: amdgpu 0000:03:00.0: amdgpu: IP block:gfx_v11_0 is hung!

I also noticed severe coil whine this time, and a second crash immediately after reaching the desktop. I also had a double crash the last time I did it. The error message is the one I usually get. Note that I sometimes see anomalies in videos in Firefox (hardware accelerated), which is quite logical given my situation.

I just did what they asked, in case it’s needed. I’m not sure how to read them myself, to be frank.

Me too! Just… The process to use the RHEL official driver on Fedora is different, no?

And that’s precisely why I’m here! It’s not only bizarre but incredibly hard to decipher since I’m in a situation where it could come from anywhere…

It looks a little like a RAM problem… but the fact that it gets better after disabling ASPM is… strange. I can play Cyberpunk for hours and end up with an unkillable process… but Unreal Engine crashes my machine twice in a row…

That bit of the log definitely shows the crash, and near the end there is an error related to Wayland as well as the crash data.
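To see at a glance which process the kernel blames for the timeout, the "Process information" line from the posted log can be reduced to just the process name. This is a rough sketch that assumes the message keeps the exact wording seen in the log above; hung_process is a name invented for this example.

```shell
#!/usr/bin/env bash
# Print the name of the process reported in an amdgpu
# "*ERROR* Process information" line. Reads journal text on stdin and
# prints one name per matching line.
# (hung_process is a hypothetical helper; the pattern is based only on
# the message format visible in the log posted in this thread.)
hung_process() {
    sed -n 's/.*\*ERROR\* Process information: process \([^ ]*\) pid [0-9]*.*/\1/p'
}

# Assumed usage: journalctl -k -b -1 --no-pager | hung_process
```

For the log posted above this yields gnome-shell, meaning the compositor's GPU work was the job that timed out, even though Unreal Engine was the trigger.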

Please provide us the output of inxi -Fzxx so we can see the hardware details.

Do you by chance have lm_sensors installed and the output of sensors available? That might be worth looking into as well.

Here it is:

System:
  Kernel: 6.2.0-0.rc5.236.vanilla.fc37.x86_64 arch: x86_64 bits: 64
    compiler: gcc v: 2.38-25.fc37 Desktop: GNOME v: 43.2 tk: GTK v: 3.24.36
    wm: gnome-shell dm: GDM Distro: Fedora release 37 (Thirty Seven)
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: ROG STRIX Z690-G GAMING WIFI v: Rev 1.xx
    serial: <superuser required> UEFI: American Megatrends v: 0231
    date: 09/06/2021
CPU:
  Info: 12-core (8-mt/4-st) model: 12th Gen Intel Core i7-12700K bits: 64
    type: MST AMCP arch: Alder Lake rev: 2 cache: L1: 1024 KiB L2: 12 MiB
    L3: 25 MiB
  Speed (MHz): avg: 2060 high: 3600 min/max: 800/4900:5000:3800 cores:
    1: 800 2: 3600 3: 800 4: 3600 5: 800 6: 3600 7: 800 8: 3600 9: 800 10: 3600
    11: 800 12: 3600 13: 800 14: 3600 15: 800 16: 3600 17: 800 18: 800
    19: 3600 20: 800 bogomips: 144383
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] vendor: Tul / PowerColor
    driver: amdgpu v: kernel arch: RDNA-3 pcie: speed: 16 GT/s lanes: 16 ports:
    active: DP-3,HDMI-A-1 empty: DP-1,DP-2 bus-ID: 03:00.0 chip-ID: 1002:744c
  Display: wayland server: X.org v: 1.20.14 with: Xwayland v: 22.1.7
    compositor: gnome-shell driver: gpu: amdgpu display-ID: 0
  Monitor-1: DP-3 model: VG245 res: 1920x1080 dpi: 92 diag: 609mm (24")
  Monitor-2: HDMI-A-1 model: VG245 res: 1920x1080 dpi: 92 diag: 609mm (24")
  API: OpenGL v: 4.6 Mesa 22.3.3 renderer: AMD Radeon Graphics (gfx1100
    LLVM 15.0.6 DRM 3.49 6.2.0-0.rc5.236.vanilla.fc37.x86_64)
    direct render: Yes
Audio:
  Device-1: Intel Alder Lake-S HD Audio vendor: ASUSTeK driver: snd_hda_intel
    bus-ID: 1-4:2 v: kernel chip-ID: 0b05:1996 bus-ID: 00:1f.3
    chip-ID: 8086:7ad0
  Device-2: AMD driver: snd_hda_intel v: kernel pcie: speed: 16 GT/s
    lanes: 16 bus-ID: 03:00.1 chip-ID: 1002:ab30
  Device-3: ASUSTek USB Audio type: USB
    driver: hid-generic,snd-usb-audio,usbhid
  Sound API: ALSA v: k6.2.0-0.rc5.236.vanilla.fc37.x86_64 running: yes
  Sound Server-1: PulseAudio v: 16.1 running: no
  Sound Server-2: PipeWire v: 0.3.64 running: yes
Network:
  Device-1: Intel Alder Lake-S PCH CNVi WiFi driver: iwlwifi v: kernel
    bus-ID: 00:14.3 chip-ID: 8086:7af0
  IF: wlo1 state: down mac: <filter>
  Device-2: Intel Ethernet I225-V vendor: ASUSTeK driver: igc v: kernel
    pcie: speed: 5 GT/s lanes: 1 port: N/A bus-ID: 07:00.0 chip-ID: 8086:15f3
  IF: enp7s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Bluetooth:
  Device-1: Intel type: USB driver: btusb v: 0.8 bus-ID: 1-14:9
    chip-ID: 8087:0033
  Report: rfkill ID: hci0 rfk-id: 0 state: down bt-service: enabled,running
    rfk-block: hardware: no software: yes address: see --recommends
RAID:
  Hardware-1: Intel Volume Management Device NVMe RAID Controller driver: vmd
    v: 0.6 bus-ID: 00:0e.0 chip-ID: 8086:467f
Drives:
  Local Storage: total: 2.27 TiB used: 77.17 GiB (3.3%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 980 PRO 500GB
    size: 465.76 GiB speed: 63.2 Gb/s lanes: 4 serial: <filter> temp: 38.9 C
  ID-2: /dev/sda vendor: Crucial model: CT1000MX500SSD1 size: 931.51 GiB
    speed: 6.0 Gb/s serial: <filter>
  ID-3: /dev/sdb vendor: Samsung model: SSD 870 EVO 1TB size: 931.51 GiB
    speed: 6.0 Gb/s serial: <filter>
Partition:
  ID-1: / size: 464.17 GiB used: 76.84 GiB (16.6%) fs: btrfs
    dev: /dev/nvme0n1p3
  ID-2: /boot size: 973.4 MiB used: 323.7 MiB (33.3%) fs: ext4
    dev: /dev/nvme0n1p2
  ID-3: /boot/efi size: 598.8 MiB used: 17.4 MiB (2.9%) fs: vfat
    dev: /dev/nvme0n1p1
  ID-4: /home size: 464.17 GiB used: 76.84 GiB (16.6%) fs: btrfs
    dev: /dev/nvme0n1p3
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) priority: 100
    dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 25.0 C mobo: N/A gpu: amdgpu temp: 36.0 C
    mem: 55.0 C
  Fan Speeds (RPM): N/A gpu: amdgpu fan: 915
Info:
  Processes: 677 Uptime: 1h 28m Memory: 31.15 GiB used: 3.92 GiB (12.6%)
  Init: systemd v: 251 target: graphical (5) default: graphical Compilers:
  gcc: 12.2.1 clang: 15.0.7 Packages: pm: rpm pkgs: N/A note: see --rpm
  pm: flatpak pkgs: 3 Shell: Bash v: 5.2.15 running-in: gnome-terminal
  inxi: 3.3.24

I think so, yes: older stable kernels are supported by RHEL. Fedora is ahead in every sense.
There is an LTS kernel available for Fedora. You might give it a try so that you have an alternative that works the way you would like.

kwizart/kernel-longterm-5.15 Copr
With this you are probably closer to the RHEL setup.

Anyway, if you want to file a bug report, you can do so with your Fedora Project account at
https://bugzilla.redhat.com/

I would do it with default Fedora settings, such as the official kernels.

I note that you are using Wayland and have two monitors attached, one by HDMI and one by DP. Have you by chance tried logging in with Xorg to see if there is any difference in performance?

The 7900 XT GPU is relatively new and has had some issues with proper support in the kernel/driver/firmware. Wayland is still not quite as reliable as the older X11 desktop, so there may be differences there. Switching to Xorg as a test would simply take potential Wayland issues out of the picture.

I’ll try and see if something changes, one way or the other.

I note that, okay.

I had no cursor in Xorg… so I tried to troubleshoot on my own and accidentally killed Xorg by doing something I shouldn’t have, like an idiot. Thankfully, I reverted the change and I’m back.

New question: how do I get a cursor on Xorg without destroying everything?

When logging in at the GUI login screen where you enter the password, use the gear icon at the lower right of the screen to select “GNOME on Xorg”. It will log you in with the Xorg environment instead of Wayland.

A quick search for something like “how to log in on fedora with xorg” should give a lot of pointers.
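
Once logged in, it can be worth confirming which session type you actually ended up in. A minimal sketch in Python, assuming the session was started through systemd-logind, which sets the `XDG_SESSION_TYPE` environment variable (the function name `session_type` is only illustrative):

```python
import os


def session_type(env=None):
    """Return 'wayland', 'x11', or 'unknown' based on XDG_SESSION_TYPE."""
    env = os.environ if env is None else env
    value = env.get("XDG_SESSION_TYPE", "").strip().lower()
    # Other values (for example 'tty') are reported as 'unknown' here.
    return value if value in ("wayland", "x11") else "unknown"


if __name__ == "__main__":
    print(session_type())
```

If this prints `x11` after choosing the Xorg entry, the switch worked; `wayland` means the login screen still started the default session.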

I do not have any option to switch from Wayland to Xorg on my login screen (no gear icon for me)… that’s why I ended up modifying files manually. Also, it doesn’t solve the issue that when I do get into Xorg, I have no visible mouse cursor…

Sorry, I didn’t answer that:

amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:      546.00 mV 
fan1:         978 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +33.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +38.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +58.0°C  (crit = +108.0°C, hyst = -273.1°C)
                       (emerg = +113.0°C)
PPT:          32.00 W  (cap = 274.00 W)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +27.8°C  (crit = +105.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +28.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +24.0°C  (high = +80.0°C, crit = +100.0°C)
Core 4:        +24.0°C  (high = +80.0°C, crit = +100.0°C)
Core 8:        +24.0°C  (high = +80.0°C, crit = +100.0°C)
Core 12:       +24.0°C  (high = +80.0°C, crit = +100.0°C)
Core 16:       +26.0°C  (high = +80.0°C, crit = +100.0°C)
Core 20:       +22.0°C  (high = +80.0°C, crit = +100.0°C)
Core 24:       +23.0°C  (high = +80.0°C, crit = +100.0°C)
Core 28:       +23.0°C  (high = +80.0°C, crit = +100.0°C)
Core 36:       +26.0°C  (high = +80.0°C, crit = +100.0°C)
Core 37:       +26.0°C  (high = +80.0°C, crit = +100.0°C)
Core 38:       +26.0°C  (high = +80.0°C, crit = +100.0°C)
Core 39:       +25.0°C  (high = +80.0°C, crit = +100.0°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:            N/A  

nvme-pci-0900
Adapter: PCI adapter
Composite:    +34.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +34.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +38.9°C  (low  = -273.1°C, high = +65261.8°C)

Looking at those sensor outputs, I would be concerned about the temps on the GPU. For the relatively low load you seem to have, it seems odd that the temp for the mem on that GPU should be at 58C (20C higher than the next nearest temp).
Under heavy GPU load it seems quite possible that temp may get high enough to become a problem, especially since it only shows one fan there.
The indicated power is 32W against a 274W cap, so assuming the card can draw roughly 8.5 times its current power load, it could quickly overheat.

These may be mere suppositions, but under heavy load the GPU temp goes up, the power draw goes up, fan speed should increase, and everything should top out at temps of 80 - 90C max. That does not leave a lot of room for a sensor that already indicates 58C at low load.
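
To make that arithmetic explicit, here is a small Python sketch using only the numbers from the sensors output above (PPT 32 W with a 274 W cap, mem at 58C with crit at 108C). The variable names are mine, and the 80 - 90C figure is the rough expectation stated above, not a measured limit:

```python
# Values copied from the amdgpu section of the sensors output.
ppt_now_w = 32.0     # current power draw (PPT)
ppt_cap_w = 274.0    # power cap reported by the driver
mem_now_c = 58.0     # memory temperature at low load
mem_crit_c = 108.0   # critical threshold reported for mem

# How many times the current draw fits into the cap.
headroom_ratio = ppt_cap_w / ppt_now_w

# Distance from the current reading to the critical threshold.
mem_margin_c = mem_crit_c - mem_now_c

print(f"power headroom: {headroom_ratio:.2f}x")
print(f"mem margin to crit: {mem_margin_c:.0f} C")
```

So the card has a bit more than eight times its present power draw available, while the memory sensor has about 50C left before its critical threshold.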

Looking at that card online it seems it should have 3 fans. If one fan is not spinning properly it seems it could create a hot spot on the card. Similarly, if one of the heat spreaders did not have a proper thermal paste contact to the chip it could create a hot spot.

My guess from info available right now – potential GPU overheating causing the crashes.

I noticed that CoreCtrl was unable to access the fan control, so it’s possible something is wrong here.

Jan 24 00:53:17 olorin kernel: amdgpu: manual fan speed control should be enabled first

Ok, I’ll try to perform a full benchmark asap in the morning (I’m in Europe), case open to see if the fans actually accelerate properly, and check if the temperature goes too high. Keeping you informed.
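
For watching the fan and temperatures during such a benchmark, one possibility is to read the same values that `sensors` shows straight from the hwmon sysfs files. A minimal Python sketch, assuming the usual hwmon layout (`name`, `temp*_input` in millidegrees, `temp*_label`, `fan1_input`); the function name `read_amdgpu_hwmon` is only illustrative:

```python
from pathlib import Path


def read_amdgpu_hwmon(root="/sys/class/hwmon"):
    """Return {label: value} for the first hwmon device named 'amdgpu'.

    Temperatures are converted from millidegrees to degrees C; the fan
    value is in RPM. Returns an empty dict if no amdgpu device is found.
    """
    for dev in sorted(Path(root).glob("hwmon*")):
        name_file = dev / "name"
        if not name_file.is_file() or name_file.read_text().strip() != "amdgpu":
            continue
        readings = {}
        for temp in sorted(dev.glob("temp*_input")):
            label_file = dev / temp.name.replace("_input", "_label")
            # Fall back to the file name when the driver provides no label.
            label = label_file.read_text().strip() if label_file.is_file() else temp.name
            readings[label] = int(temp.read_text()) / 1000.0
        fan = dev / "fan1_input"
        if fan.is_file():
            readings["fan1_rpm"] = int(fan.read_text())
        return readings
    return {}


if __name__ == "__main__":
    print(read_amdgpu_hwmon())
```

Calling it once per second in a loop while the stress test runs would show whether the fan speed rises together with the edge, junction and mem temperatures.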

After an entire day of testing, I got something.

First, I did as I said, and put my machine under a full stress test. CPU? Seems ok. RAM? Looks ok. Strangely, my GPU always starts at 60°C and never goes above 73°C, but never below 52°C either. It consistently crashes after being pushed for one or two minutes by just one object rendered with Nanite in Unreal Engine 5.1.

I was finally able to boot into Xorg, so I tried pushing from this side too. Crash with the same configuration (instead of Wayland, it’s Xorg that fails).

I changed the GPU PCIe port, the power cables, everything. One by one. Always a crash. Then, I noticed something very, very odd: I cannot access /sys/kernel/debug/dri/1

I tried to get inside using GNOME Files, instant crash. Another time, instant crash. With absolutely nothing open in the background, instant crash. Here is the log:

Jan 24 17:48:38 olorin audit[2831]: USER_AUTH pid=2831 uid=1000 auid=1000 ses=3 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_usertype,pam_localuser,pam_unix acct="ashizian" exe="/usr/>
Jan 24 17:48:38 olorin audit[2831]: USER_ACCT pid=2831 uid=1000 auid=1000 ses=3 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:accounting grantors=pam_unix,pam_localuser acct="ashizian" exe="/usr/lib/polkit-1/polk>
Jan 24 17:48:38 olorin polkitd[933]: Operator of unix-session:2 successfully authenticated as unix-user:ashizian to gain TEMPORARY authorization for action org.gtk.vfs.file-operations for unix-process:2679:2798 [/usr/bin/nautilus --gappl>
Jan 24 17:48:42 olorin kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:40 param:0x00000000 message:AllowGfxOff?
Jan 24 17:48:42 olorin kernel: amdgpu 0000:03:00.0: amdgpu: Failed to enable gfxoff!
Jan 24 17:48:42 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:42 olorin kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Jan 24 17:48:43 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:43 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:44 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:44 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:45 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:48 olorin systemd[1]: systemd-localed.service: Deactivated successfully.
Jan 24 17:48:48 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-localed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 24 17:48:49 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:50 olorin nautilus[2679]: Error flushing display: Resource temporarily unavailable
Jan 24 17:48:52 olorin audit: BPF prog-id=72 op=UNLOAD
Jan 24 17:48:52 olorin audit: BPF prog-id=71 op=UNLOAD
Jan 24 17:48:52 olorin audit: BPF prog-id=70 op=UNLOAD
Jan 24 17:48:52 olorin kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jan 24 17:48:52 olorin kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jan 24 17:48:52 olorin kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jan 24 17:48:52 olorin kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jan 24 17:48:52 olorin kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jan 24 17:48:52 olorin kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jan 24 17:48:52 olorin kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jan 24 17:48:52 olorin kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jan 24 17:48:52 olorin kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:40 param:0x00000000 message:AllowGfxOff?
Jan 24 17:48:52 olorin kernel: amdgpu 0000:03:00.0: amdgpu: Failed to enable gfxoff!
Jan 24 17:48:53 olorin systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Jan 24 17:48:53 olorin audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 24 17:48:54 olorin audit: BPF prog-id=69 op=UNLOAD
Jan 24 17:48:54 olorin audit: BPF prog-id=68 op=UNLOAD
Jan 24 17:48:54 olorin audit: BPF prog-id=67 op=UNLOAD
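
To separate the amdgpu messages from the audit and systemd noise in a log like this, a short Python sketch can count each kind of kernel error. The patterns are copied from the messages above; the category names (`fence_timeout` and so on) and the `summarize` function are my own labels, not anything the driver defines:

```python
import re
from collections import Counter

# Message fragments taken verbatim from the kernel lines in the log.
PATTERNS = {
    "fence_timeout": re.compile(r"Fence fallback timer expired on ring (\S+)"),
    "mes_no_response": re.compile(r"MES failed to response msg=(\d+)"),
    "gfxoff_failed": re.compile(r"Failed to enable gfxoff"),
    "smu_bad_response": re.compile(r"SMU: response:0xFFFFFFFF"),
}


def summarize(lines):
    """Count amdgpu-related kernel messages, keyed by category and detail."""
    counts = Counter()
    for line in lines:
        if " kernel: " not in line:
            continue  # skip audit, systemd, nautilus, ...
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                key = f"{name}:{match.group(1)}" if match.groups() else name
                counts[key] += 1
    return counts


SAMPLE = """\
Jan 24 17:48:42 olorin kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:40 param:0x00000000 message:AllowGfxOff?
Jan 24 17:48:42 olorin kernel: amdgpu 0000:03:00.0: amdgpu: Failed to enable gfxoff!
Jan 24 17:48:42 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:42 olorin kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Jan 24 17:48:43 olorin kernel: [drm] Fence fallback timer expired on ring sdma0
Jan 24 17:48:48 olorin systemd[1]: systemd-localed.service: Deactivated successfully.
Jan 24 17:48:52 olorin kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
"""

if __name__ == "__main__":
    for key, count in sorted(summarize(SAMPLE.splitlines()).items()):
        print(f"{count:3d}  {key}")
```

The same function could be fed the kernel log of the previous boot (for example the output of `journalctl -k -b -1` read from standard input) to see which ring stalls first across several crashes.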

I am not an expert in analyzing that snippet from the log, but have to ask.
Is SELinux in permissive or enforcing mode? That can be checked with getenforce and if it is enforcing then can be set to permissive with sudo setenforce 0.

Changing it to permissive may be one way to check whether the failure is a result of SELinux or not. If the crash still occurs in permissive mode, then the cause would appear not to be SELinux related. If the crash no longer occurs, then it would seem to be caused by SELinux.
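
If you want to record the SELinux mode alongside each crash test, the same information that `getenforce` prints can be read from the selinuxfs `enforce` file. A minimal Python sketch, assuming selinuxfs is mounted at its usual location `/sys/fs/selinux`; the function name `selinux_mode` is only illustrative:

```python
from pathlib import Path


def selinux_mode(enforce_file="/sys/fs/selinux/enforce"):
    """Return 'enforcing', 'permissive', or 'disabled'.

    'disabled' is also returned when selinuxfs is simply not mounted.
    """
    path = Path(enforce_file)
    if not path.is_file():
        return "disabled"
    # The file contains '1' in enforcing mode and '0' in permissive mode.
    return "enforcing" if path.read_text().strip() == "1" else "permissive"


if __name__ == "__main__":
    print(selinux_mode())
```

Running it before and after `sudo setenforce 0` should show the mode change from `enforcing` to `permissive`.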

I tried, and yes, the crash still occurs in permissive mode.

That seems to indicate it may be a hardware (GPU) or driver issue, since it occurs with heavy GPU usage. I would consider a bug report, directed mostly toward AMD, since they seem to be the provider of both the driver and the GPU itself.
