AMD Radeon graphics lockup: could this be a hardware problem?

I’ve had this computer, an AMD based desktop with a Radeon Vega 56 graphics card, for about 2½ years. It’s been pretty solid all throughout, including playing games which make it run like a space heater. It’s crashed a couple of times in the past month, which is not great but I’ve been busy so I rebooted and moved on. Today, though, it’s crashing constantly. The crashes result in logs like this:

Jan 16 17:05:16 [hostname] kernel: rfkill: input handler disabled
Jan 16 17:05:21 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D0 to D3hot (config space inaccessible)
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=77, emitted seq=79
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1396 thread gnome-shel:cs0 pid 1453
Jan 16 17:05:28 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:955
Jan 16 17:05:48 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DF8C (len 824, WS 0, PS 0) @ 0xE10C
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DE46 (len 326, WS 0, PS 0) @ 0xDF36
Jan 16 17:05:48 [hostname] kernel: [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
Jan 16 17:06:08 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:06:08 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C0B6 (len 62, WS 0, PS 0) @ 0xC0D2
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: [drm:dce110_vblank_set [amdgpu]] *ERROR* Failed to get VBLANK!
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x22, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x25, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x30, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x4000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x400, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x30f, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x2000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x80000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x40, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 16 17:06:10 [hostname] kernel: [drm] Timeout wait for RLC serdes 0,0
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 0000000034d786ac; ring_buffer_end = 00000000c05dc59d; write_frame = 0000000094e0183d
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Jan 16 17:06:10 [hostname] kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to unload asd
Jan 16 17:06:10 [hostname] kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: MODE1 reset
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset
Jan 16 17:06:10 [hostname] kernel: [drm] psp is not working correctly before mode1 reset!
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset failed
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:28:00.0
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset(2) failed
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D3cold to D0 (config space inaccessible)
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset end with ret = -22
Jan 16 17:06:20 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jan 16 17:06:30 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

When this happens, the monitor abruptly says there’s no signal and goes dark. However, the system isn’t actually down: I can SSH in and look at the logs, add and remove software, etc.
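
For anyone triaging a similar log, here is a rough Python sketch that tallies the powerplay failures by error code and picks out the "config space inaccessible" lines. It is not part of any amdgpu tooling: the function name `summarize` and the regex are illustrative, with the message format copied from the excerpt above. In that excerpt every powerplay failure carries error code 0xffffffff. All-ones reads are commonly what a PCI device returns once it stops responding, though reading the log that way is an assumption.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
FAILED_MSG = re.compile(
    r"\[powerplay\] Failed message: (0x[0-9a-f]+), "
    r"input parameter: (0x[0-9a-f]+), error code: (0x[0-9a-f]+)"
)


def summarize(lines):
    """Count powerplay failures per error code and collect bus-level warnings.

    `lines` is any iterable of log lines (a list, an open file, sys.stdin).
    Returns (Counter keyed by error code string, list of warning lines).
    """
    error_codes = Counter()
    bus_warnings = []
    for line in lines:
        match = FAILED_MSG.search(line)
        if match:
            # group(3) is the error code reported for the failed message
            error_codes[match.group(3)] += 1
        elif "config space inaccessible" in line:
            # The audio function of the card (28:00.1 in the excerpt)
            # reports this when its PCI config space cannot be read.
            bus_warnings.append(line.strip())
    return error_codes, bus_warnings
```

Pass it any iterable of lines, for example an open file holding the output of `journalctl -k -b -1` (kernel messages from the previous boot).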

I know there’s a new kernel (5.10) and updated mesa in updates-testing, so the first thing I did was roll those back. But the problems persist. In fact, they’re getting worse: at first, it was a couple of times over the course of a few hours, but as I’m trying to diagnose it, sometimes it won’t even let me log in before there’s a crash. So, the problem happens with:

  • kernel 5.10.7
  • kernel 5.9.16

and with

  • mesa-* 20.2.6
  • mesa-* 20.3.3

AND I even booted a Fedora 33 Live image, and, while I can’t SSH in to test, I get the same crash, with the monitor cutting out, in under 5 minutes.

It’s weird for this to start all of a sudden. I’ve done some basic web searches, but most of what I see is old and points to various problems with drivers and card quirks. It seems like if that were the problem, this would have been happening all along.

Any insight?

I also don’t think it’s particularly hot — I’ve played Baldur’s Gate 3 under Wine previously (like, over the holiday break for quite a lot of hours) and I didn’t have any problems even though the fan was definitely running and pumping out heat like a space heater. Today, I let it sit turned off for half an hour and it still froze within a few minutes of booting up again.

More info. I made a loop to capture /sys/kernel/debug/dri/0/amdgpu_pm_info every half second, and right before the crash this went from:

GFX Clocks and Power:
	167 MHz (MCLK)
	29 MHz (SCLK)
	1138 MHz (PSTATE_SCLK)
	800 MHz (PSTATE_MCLK)
	806 mV (VDDGFX)
	4.0 W (average GPU)

GPU Temperature: 28 C
GPU Load: 16 %

SMC Feature Mask: 0x000000001ba1ff4f
UVD: Disabled

VCE: Disabled
Clock Gating Flags Mask: 0x30fe9fcf
	Graphics Medium Grain Clock Gating: On
	Graphics Medium Grain memory Light Sleep: On
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: On
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: On
	Graphics Run List Controller Light Sleep: On
	Graphics 3D Coarse Grain Clock Gating: On
	Graphics 3D Coarse Grain memory Light Sleep: On
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: On
	System Direct Memory Access Medium Grain Clock Gating: On
	Bus Interface Medium Grain Clock Gating: On
	Bus Interface Light Sleep: On
	Unified Video Decoder Medium Grain Clock Gating: Off
	Video Compression Engine Medium Grain Clock Gating: Off
	Host Data Path Light Sleep: On
	Host Data Path Medium Grain Clock Gating: Off
	Digital Right Management Medium Grain Clock Gating: On
	Digital Right Management Light Sleep: On
	Rom Medium Grain Clock Gating: On
	Data Fabric Medium Grain Clock Gating: On
	Address Translation Hub Medium Grain Clock Gating: On
	Address Translation Hub Light Sleep: On

to

GFX Clocks and Power:
	42949671 MHz (SCLK)
	1138 MHz (PSTATE_SCLK)
	800 MHz (PSTATE_MCLK)
	65493 mV (VDDGFX)
	16777215.0 W (average GPU)

GPU Temperature: 511 C
GPU Load: 100 %

SMC Feature Mask: 0x00000000ffffffff
UVD: Disabled

VCE: Disabled
Clock Gating Flags Mask: 0x30bc95ce
	Graphics Medium Grain Clock Gating: Off
	Graphics Medium Grain memory Light Sleep: On
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: On
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: On
	Graphics Run List Controller Light Sleep: On
	Graphics 3D Coarse Grain Clock Gating: On
	Graphics 3D Coarse Grain memory Light Sleep: On
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: Off
	System Direct Memory Access Light Sleep: On
	System Direct Memory Access Medium Grain Clock Gating: Off
	Bus Interface Medium Grain Clock Gating: On
	Bus Interface Light Sleep: On
	Unified Video Decoder Medium Grain Clock Gating: Off
	Video Compression Engine Medium Grain Clock Gating: Off
	Host Data Path Light Sleep: On
	Host Data Path Medium Grain Clock Gating: Off
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: On
	Rom Medium Grain Clock Gating: Off
	Data Fabric Medium Grain Clock Gating: On
	Address Translation Hub Medium Grain Clock Gating: On
	Address Translation Hub Light Sleep: On

Now, most notable is GPU load 100% and temperature 511C (oh my, I hope that’s not right), but also, five of the Medium Grain Clock Gating settings (Graphics, Memory Controller, System Direct Memory Access, Digital Right Management and Rom) go from On to Off. I don’t know if this means anything to anyone.
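
In case it helps anyone repeat this, here is a minimal Python sketch of the capture loop and of a before/after comparison. Only the debugfs path comes from my setup; the names `capture`, `parse_snapshot`, `changed_flags` and `mask_bits_changed` are made up for illustration. Reading debugfs needs root, and the `dri/0` index may differ on other machines. Several of the bogus values look like all-ones fields (511 is 0x1ff, 16777215 is 0xffffff, and the low 32 bits of the SMC Feature Mask are 0xffffffff). That might mean the reads themselves are failing rather than the card really being that hot, but that is a guess.

```python
import re
import time

# Path taken from the post; assumes debugfs is mounted and the card is dri/0.
PM_INFO = "/sys/kernel/debug/dri/0/amdgpu_pm_info"

FLAG_LINE = re.compile(r"^\s*(.+?): (On|Off)\s*$")
MASK_LINE = re.compile(r"Clock Gating Flags Mask: (0x[0-9a-fA-F]+)")


def capture(path=PM_INFO, interval=0.5):
    """Yield (timestamp, text) snapshots forever, one every `interval` seconds."""
    while True:
        with open(path) as handle:
            yield time.time(), handle.read()
        time.sleep(interval)


def parse_snapshot(text):
    """Return (mask, flags): the gating mask as an int (or None if absent)
    and a dict mapping each On/Off flag name to True (On) or False (Off)."""
    mask = None
    flags = {}
    for line in text.splitlines():
        found = MASK_LINE.search(line)
        if found:
            mask = int(found.group(1), 16)
            continue
        found = FLAG_LINE.match(line)
        if found:
            flags[found.group(1)] = found.group(2) == "On"
    return mask, flags


def changed_flags(before, after):
    """List (name, old_state, new_state) for flags that differ between snapshots."""
    _, old = parse_snapshot(before)
    _, new = parse_snapshot(after)
    return [
        (name, old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    ]


def mask_bits_changed(before_mask, after_mask):
    """Number of bits that differ between two clock gating masks."""
    return bin(before_mask ^ after_mask).count("1")
```

For the two masks above, 0x30fe9fcf and 0x30bc95ce differ in five bits (the XOR is 0x00420a01), which matches the five Medium Grain Clock Gating lines that flip from On to Off in the dumps.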

I’m not doing anything of particular interest when this happens — normal web browsing which I’ve done all along.

This sounds to me like a hardware issue (just a guess). You could try e.g. running a live F31 to see if you have the same problem. It could be your video card (probably) or e.g. the PSU. To be sure, you have to replace the hardware to see if the problem disappears.

See if it still happens with the powersave/battery profile in radeon-profile.
Could it be related to https://bodhi.fedoraproject.org/updates/FEDORA-2020-39f7307164 ?

Yeah, the Live doesn’t last very long without a freeze either. Even happens with an Ubuntu Live I booted just to be sure. Soooo, yeah, I guess it’s hardware.

If this was true you would smell it.

Yeah, seriously. Also, the fan stays at normal speed and I can feel that the exhaust is cool. So, it’s clearly spurious.