Hello, I recently installed Fedora 39 Workstation (KDE Plasma spin), moving from Linux Mint. I have an AMD Ryzen 7 3800X and a 6700 XT. About once per day, seemingly at random, the computer freezes for about 3 seconds before both displays go completely black. The computer does not shut off, and in some cases I can use my media keys to pause and play a video, although with severe input lag (about 3 seconds), so the computer must still be running. Unplugging the display from the GPU and plugging it in again has no effect. I should also note that this usually happens while I am playing games; however, since it happens so rarely, I cannot confirm whether it’s related, especially because I play games very often. I have tried switching to a TTY after the computer freezes, but that also had no effect. Any help is appreciated. Thank you!
Have a look for clues in two places.
What does sudo dmesg show after this happens?
What logs are in sudo journalctl leading up to the event?
See man journalctl for options like --since to help focus on logs.
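As a rough sketch of how to cut those two sources down to the interesting lines, the helper below filters kernel log text for amdgpu/DRM messages that mention an error, timeout, reset or failure. The function name gpu_errors and the keyword list are assumptions made for illustration, not an official set of patterns.

```shell
# gpu_errors: read kernel log lines on stdin, keep amdgpu/DRM lines that
# mention an error, timeout, reset or failure.
# The keyword list is an assumption; extend it if other messages matter.
gpu_errors() {
    grep -E 'amdgpu|\[drm' | grep -E -i 'error|timeout|reset|fail'
}

# Typical use on the affected machine:
#   sudo dmesg | gpu_errors
#   journalctl -k -b -1 --no-pager | gpu_errors   # kernel messages, previous boot
```

The second form is handy after a hard reset, because dmesg only covers the current boot while the journal keeps earlier boots when persistent storage is enabled.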
It would be helpful to know your full hardware configuration and whether your F39 is fully updated.
Please post (as text using the </> button) the output from inxi -Fzxx.
If you note the time of a freeze event you can try journalctl --since ... to find details of the problem. You may also want to monitor memory usage and temperatures to see if there is an association with spikes in either variable.
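One possible way to turn a noted freeze time into a journal query is sketched below. It assumes GNU date (as shipped on Fedora); the helper names shift_time and journal_around are made up for this example, and the 5 minute default window is arbitrary.

```shell
# shift_time "YYYY-MM-DD HH:MM:SS" SECONDS
# Print the timestamp shifted by SECONDS (may be negative), same format.
shift_time() (
    epoch=$(date -d "$1" +%s) || exit 1
    date -d "@$((epoch + $2))" '+%Y-%m-%d %H:%M:%S'
)

# journal_around TIMESTAMP [MINUTES]
# Show kernel messages within MINUTES (default 5) either side of TIMESTAMP.
journal_around() (
    ts=$1
    span=$(( ${2:-5} * 60 ))
    journalctl -k --no-pager \
        --since "$(shift_time "$ts" "-$span")" \
        --until "$(shift_time "$ts" "$span")"
)

# Example: journal_around "2024-01-18 10:30:07" 5
```

Going through epoch seconds avoids the ambiguity GNU date has with strings such as "10:30:07 -5 minutes", where the "-5" can be read as a time zone offset.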
Sorry for not including journalctl output. It would fail to store logs of the incident because of the sudden crash. However, I was able to get logs of it occurring and found an issue on GitLab tracking my problem, found at *ERROR* ring gfx_0.0.0 timeout when using firefox, chrome or icaclient when dpm performance level = auto (#1974) · Issues · drm / amd · GitLab. Of the many solutions in that thread, I am trying a combination of adding amdgpu.ppfeaturemask=0xfffd3fff to the kernel command line parameters and setting the performance level to high using echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
I will continue testing today but it seems that one of these has fixed my issue. Thank you for the help.
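For anyone repeating this, here is a hedged sketch of how the two settings might be applied on Fedora, plus a small helper that shows which bits a given ppfeaturemask value turns off. The helper name cleared_bits is invented for this example; which driver feature each bit controls depends on the kernel's amdgpu headers and is not asserted here.

```shell
# cleared_bits MASK
# List the bit positions (0-31) that are 0 in a 32-bit feature mask.
cleared_bits() (
    mask=$(( $1 ))
    bit=0
    out=
    while [ "$bit" -lt 32 ]; do
        if [ $(( (mask >> bit) & 1 )) -eq 0 ]; then
            out="$out${out:+ }$bit"
        fi
        bit=$((bit + 1))
    done
    echo "$out"
)

# Applying the settings (run manually, both need root):
#   sudo grubby --update-kernel=ALL --args="amdgpu.ppfeaturemask=0xfffd3fff"
#   echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
# A plain "sudo echo high > file" fails because the redirect runs unprivileged.
# The sysfs write is not persistent and has to be repeated after each boot.
```

Running cleared_bits 0xfffd3fff prints 14 15 17, so this mask disables three feature bits relative to an all-ones mask.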
Just after typing this, I have experienced another crash.
Journalctl:
Jan 17 20:44:18 localhost-live.lan systemd-journald[766]: /var/log/journal/7473050a3f174a3fbb09c81e7b556f89/user-1000.journal: Journal file uses a different sequence number ID, rotating.
Jan 18 10:30:07 localhost-live.lan kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=51434543, emitted seq=51434545
Jan 18 10:30:07 localhost-live.lan kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process java pid 343208 thread java:cs0 pid 343342
Jan 18 10:30:07 localhost-live.lan kernel: amdgpu 0000:2d:00.0: amdgpu: GPU reset begin!
Jan 18 10:30:07 localhost-live.lan kernel: amdgpu 0000:2d:00.0: amdgpu: MODE1 reset
Jan 18 10:30:07 localhost-live.lan kernel: amdgpu 0000:2d:00.0: amdgpu: GPU mode1 reset
Jan 18 10:30:07 localhost-live.lan kernel: amdgpu 0000:2d:00.0: amdgpu: GPU smu mode1 reset
Jan 18 10:30:18 localhost-live.lan kernel: amdgpu 0000:2d:00.0: amdgpu: GPU reset succeeded, trying to resume
I should note that without the extra configuration mentioned in my previous post, this would be followed by shutdown messages from xorg, but this time there are none and this is the end of the log. Also, my display did not seem to turn off entirely but simply froze; however, that might be because I restarted my computer immediately after the freeze out of habit, not giving xorg the time to shut down. Also, without my configuration, the GPU reset would fail, although I can’t seem to find evidence of this now, so I may be misremembering. Because I can’t get dmesg to show output from previous boots, and I can’t use dmesg after a freeze, I’m using the -k argument in journalctl above, which should include any kernel messages. If there are some I’m missing please tell me.
inxi output:
System:
Kernel: 6.6.11-200.fc39.x86_64 arch: x86_64 bits: 64 compiler: gcc
v: 2.40-13.fc39 Desktop: KDE Plasma v: 5.27.10 tk: Qt v: 5.15.11
wm: kwin_x11 dm: SDDM Distro: Fedora release 39 (Thirty Nine)
Machine:
Type: Desktop Mobo: Micro-Star model: B550M PRO-VDH WIFI (MS-7C95) v: 1.0
serial: <superuser required> UEFI: American Megatrends LLC. v: 2.80
date: 06/22/2021
CPU:
Info: 8-core model: AMD Ryzen 7 3800X bits: 64 type: MT MCP arch: Zen 2
rev: 0 cache: L1: 512 KiB L2: 4 MiB L3: 32 MiB
Speed (MHz): avg: 2288 high: 3900 min/max: 2200/4559 boost: enabled cores:
1: 2200 2: 2200 3: 2200 4: 2200 5: 2200 6: 2200 7: 2200 8: 2200 9: 2200
10: 2200 11: 2200 12: 2200 13: 3900 14: 2057 15: 2200 16: 2056
bogomips: 124805
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: AMD Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT]
vendor: ASUSTeK driver: amdgpu v: kernel arch: RDNA-2 pcie: speed: 16 GT/s
lanes: 16 ports: active: DP-3,HDMI-A-1 empty: DP-1,DP-2 bus-ID: 2d:00.0
chip-ID: 1002:73df
Display: x11 server: X.Org v: 1.20.14 with: Xwayland v: 23.2.3
compositor: kwin_x11 driver: X: loaded: amdgpu
unloaded: fbdev,modesetting,radeon,vesa dri: radeonsi gpu: amdgpu
display-ID: :0 screens: 1
Screen-1: 0 s-res: 3840x1080 s-dpi: 96
Monitor-1: DP-3 mapped: DisplayPort-2 pos: primary,left
model: LG (GoldStar) ULTRAGEAR res: 1920x1080 dpi: 93 diag: 604mm (23.8")
Monitor-2: HDMI-A-1 mapped: HDMI-A-0 pos: right
model: LG (GoldStar) 32 FHD res: 1920x1080 dpi: 70 diag: 801mm (31.5")
API: EGL v: 1.5 platforms: device: 0 drv: radeonsi device: 1 drv: swrast
surfaceless: drv: radeonsi x11: drv: radeonsi inactive: gbm,wayland
API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 23.3.3 glx-v: 1.4
direct-render: yes renderer: AMD Radeon RX 6700 XT (radeonsi navi22 LLVM
17.0.6 DRM 3.54 6.6.11-200.fc39.x86_64) device-ID: 1002:73df
API: Vulkan v: 1.3.268 surfaces: xcb,xlib device: 0 type: discrete-gpu
driver: mesa radv device-ID: 1002:73df device: 1 type: cpu
driver: mesa llvmpipe device-ID: 10005:0000
Audio:
Device-1: AMD Navi 21/23 HDMI/DP Audio driver: snd_hda_intel v: kernel pcie:
speed: 16 GT/s lanes: 16 bus-ID: 2d:00.1 chip-ID: 1002:ab28
Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI
driver: snd_hda_intel v: kernel pcie: speed: 16 GT/s lanes: 16
bus-ID: 2f:00.4 chip-ID: 1022:1487
Device-3: FiiO [] driver: hid-generic,snd-usb-audio,usbhid type: USB
rev: 2.0 speed: 480 Mb/s lanes: 1 bus-ID: 1-1:2 chip-ID: 2972:0051
API: ALSA v: k6.6.11-200.fc39.x86_64 status: kernel-api
Server-1: PipeWire v: 1.0.1 status: active with: 1: pipewire-pulse
status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
4: pw-jack type: plugin
Network:
Device-1: Intel Dual Band Wireless-AC 3168NGW [Stone Peak] driver: iwlwifi
v: kernel pcie: speed: 2.5 GT/s lanes: 1 bus-ID: 29:00.0 chip-ID: 8086:24fb
IF: wlo1 state: down mac: <filter>
Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
vendor: Micro-Star MSI driver: r8169 v: kernel pcie: speed: 2.5 GT/s
lanes: 1 port: f000 bus-ID: 2a:00.0 chip-ID: 10ec:8168
IF: enp42s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Bluetooth:
Device-1: Intel Wireless-AC 3168 Bluetooth driver: btusb v: 0.8 type: USB
rev: 2.0 speed: 12 Mb/s lanes: 1 bus-ID: 1-9:6 chip-ID: 8087:0aa7
Report: btmgmt ID: hci0 rfk-id: 0 state: up address: <filter> bt-v: 4.2
lmp-v: 8
Drives:
Local Storage: total: 1.82 TiB used: 467.77 GiB (25.1%)
ID-1: /dev/nvme0n1 vendor: Western Digital model: WDS100T2B0C-00PXH0
size: 931.51 GiB speed: 31.6 Gb/s lanes: 4 serial: <filter> temp: 41.9 C
ID-2: /dev/sda model: SATA3 1TB SSD size: 931.51 GiB speed: 6.0 Gb/s
serial: <filter> temp: 40 C
Partition:
ID-1: / size: 929.93 GiB used: 467.41 GiB (50.3%) fs: btrfs
dev: /dev/nvme0n1p3
ID-2: /boot size: 973.4 MiB used: 353.4 MiB (36.3%) fs: ext4
dev: /dev/nvme0n1p2
ID-3: /boot/efi size: 598.8 MiB used: 17.4 MiB (2.9%) fs: vfat
dev: /dev/nvme0n1p1
ID-4: /home size: 929.93 GiB used: 467.41 GiB (50.3%) fs: btrfs
dev: /dev/nvme0n1p3
Swap:
ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) priority: 100
dev: /dev/zram0
Sensors:
System Temperatures: cpu: 45.8 C mobo: N/A gpu: amdgpu temp: 58.0 C
mem: 62.0 C
Fan Speeds (rpm): fan-2: 601 fan-3: 602 fan-4: 627 fan-5: 599 fan-6: 598
gpu: amdgpu fan: 0
Info:
Processes: 422 Uptime: 14m Memory: total: 16 GiB available: 15.54 GiB
used: 3.7 GiB (23.8%) Init: systemd v: 254 target: graphical (5)
default: graphical Compilers: N/A Packages: pm: rpm pkgs: N/A
note: see --rpm Shell: fish v: 3.7.0 running-in: alacritty inxi: 3.3.31
Thanks again for all the help.
Do you have to power off the PC to recover?
Can you ssh into it from another host and look at the state while it’s broken?
The PC in previous crashes has not responded until the restart button is pressed, and in some cases until the switch on the PSU is flipped. Next crash I will attempt to SSH.
When I do SSH, what should I be looking for? dmesg?
Grab dmesg and recent logs in the journal.
After grabbing information try a controlled reboot.
If it’s a GPU crash you should be able to reboot under control.
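A minimal sketch of that grab-then-reboot sequence, meant to be run in the SSH session on the hung machine. The function name collect_logs and the default output directory are assumptions for this example.

```shell
# collect_logs [DIR]
# Save kernel messages and the current boot's journal into DIR and print DIR.
# Default DIR is a timestamped folder under $HOME (an arbitrary choice).
collect_logs() (
    out=${1:-"$HOME/gpu-hang-$(date +%Y%m%d-%H%M%S)"}
    mkdir -p "$out" || exit 1
    # A collector may fail (missing tool, no privileges); keep going regardless.
    dmesg > "$out/dmesg.txt" 2>&1 || true
    journalctl -b --no-pager > "$out/journal.txt" 2>&1 || true
    sync || true   # flush to disk in case a hard reset is still needed
    echo "$out"
)

# On Fedora dmesg normally needs root, so run the function from a root shell,
# then attempt the controlled reboot:
#   sudo systemctl reboot
```

If the reboot command hangs as well, that is useful information in itself, since it suggests the problem goes beyond the GPU.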
Will do, but I don’t know how long it will be until then.
SSH doesn’t work; I only get “network unreachable” followed by “connection refused” despite sshd running. It seems that the crash is damaging to more than just the GPU? I should also note that my area is experiencing a snowstorm, which is unusual here. Could this be caused by brownouts?
At this point I suspect you have hardware issues.
It could be thermal issues. Is something overheating?
It might be that the power supply is underrated, only failing at points of high load?
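To check either theory, one option is to log GPU temperature and power draw once a second and look at the last lines written before a freeze. The sketch below assumes the amdgpu hwmon files temp1_input (millidegrees Celsius) and power1_average (microwatts) exist under card0; file names can differ between GPUs and kernel versions, and the helper names scale and gpu_sample are invented for this example.

```shell
# scale VALUE DIVISOR: print VALUE / DIVISOR with one decimal place
scale() { awk -v v="$1" -v d="$2" 'BEGIN { printf "%.1f", v / d }'; }

# gpu_sample [HWMON_DIR]
# Print one line: "HH:MM:SS <temp> C <power> W".
# The default directory assumes a single hwmon entry under card0.
gpu_sample() (
    dir=${1:-$(echo /sys/class/drm/card0/device/hwmon/hwmon*)}
    t=$(cat "$dir/temp1_input") || exit 1       # millidegrees Celsius
    p=$(cat "$dir/power1_average") || exit 1    # microwatts
    printf '%s %s C %s W\n' "$(date +%T)" "$(scale "$t" 1000)" "$(scale "$p" 1000000)"
)

# Log continuously to a file that survives the crash:
#   while sleep 1; do gpu_sample; done | tee -a ~/gpu.log
```

A temperature that climbs steadily before each freeze points toward cooling, while freezes at high power draw with normal temperatures fit the PSU theory better.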
From these logs it seems to be a problem in the kernel, most probably in the amdgpu driver. I think you should file a bug here:
and add a link here.
Looking at that gitlab repo I found a fellow with the same problem who opened an issue about 2 weeks ago found here: AMD gpu crash *ERROR* ring gfx_0.0.0 timeout while playing modded minecraft. (#3025) · Issues · drm / amd · GitLab
He notes that he is playing the same game that has been the biggest cause of these crashes for me recently (Minecraft), and he determines by the end that his power draw was too high. I didn’t suspect this because my system ran perfectly fine previously (my PSU is about 100 W above the rating, not including RGB, so probably just barely strong enough to power the computer). It’s very possible that Fedora simply has higher power consumption than Linux Mint or Windows. I am going to try setting the performance profile in amdgpu to low and see if that reduces the frequency of the crashes. If so, I’m going to investigate CoreCtrl and see if there are other ways to lower power draw. I will report back on this in a few days’ time at most. Thanks for the help.
It does seem to be an issue of power. Setting the profile to “low” causes the crashes to stop. I have begun to notice coil whine from the PSU when it is delivering a lot of power. I calculated the wattage of my PC and determined that my PSU is actually below the typical threshold of double the PC’s wattage by about 100 watts, and it has run like this for some time (about 3 years). I think I forgot to recalculate the wattage when I upgraded my GPU a while ago. So I believe my PSU is at fault. Do you all think I should buy a new PSU? Is it safe for the rest of the computer to keep running while I wait for a new one? The coil whine is very loud and rather consistent, and it is starting to appear at lower power thresholds as the machine is used, so I am worried for its health. Thank you for all of your help.
Yes, replace it as soon as possible.
No, I don’t think it is safe to use a computer with a failing PSU. Who knows how quickly or in what way it may fail. Under-voltage causes instability and overheating, with fire or component damage possible. Over-voltage can cause fires, overheating, instability, and other component failure.
Unregulated voltages that are out of spec can cause damage that may not be apparent immediately but surely will cause other problems at some time.
I don’t think we can say with certainty that the PSU is at fault here. Commonly, if it can’t deliver enough power, the PC just reboots instantly. Also a coil whine doesn’t automatically mean it’s faulty. But if you can try your workload with a stronger PSU (perhaps borrow from a friend for a few days), that would certainly help.
If you suspect the PSU to be too weak, I’d avoid stressing the system. Regular usage should be fine.
With modern powerful GPUs what matters is the amps that the PSU can supply to the GPU.
Find out what the GPU can draw at peak and check that you have enough amps.
Usually it’s the 12V supply that is critical, and your GPU may require 2 extra power connections from the PSU. That is the situation with my nvidia GPU.
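The arithmetic behind that check is just current = power / voltage. A small sketch follows; the function names are invented for this example, and the 230 W figure in the comment is only an illustrative board power, so use the number from the card's own specification. The connector limits used (75 W from the slot, 75 W per 6-pin, 150 W per 8-pin) are the PCIe specification values.

```shell
# amps_at_12v WATTS: current drawn from the 12 V rail for a given power
amps_at_12v() { awk -v w="$1" 'BEGIN { printf "%.1f\n", w / 12 }'; }

# connector_budget N_8PIN N_6PIN
# Watts the PCIe slot plus auxiliary connectors are specified to deliver.
connector_budget() { echo $(( 75 + 150 * $1 + 75 * $2 )); }

# Example with an assumed 230 W card fed by one 8-pin and one 6-pin plug:
#   amps_at_12v 230        # amps needed on the 12 V rail for the GPU alone
#   connector_budget 1 1   # watts available through slot + connectors
```

Compare the result with the 12 V rating printed on the PSU label, leaving room for the CPU and for short spikes above the average figure.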