I’ve had this computer, an AMD-based desktop with a Radeon Vega 56 graphics card, for about 2½ years. It’s been pretty solid throughout, including when playing games that make it run like a space heater. It’s crashed a couple of times in the past month, which is not great, but I’ve been busy, so I rebooted and moved on. Today, though, it’s crashing constantly. The crashes result in logs like this:
Jan 16 17:05:16 [hostname] kernel: rfkill: input handler disabled
Jan 16 17:05:21 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D0 to D3hot (config space inaccessible)
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=77, emitted seq=79
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1396 thread gnome-shel:cs0 pid 1453
Jan 16 17:05:28 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:955
Jan 16 17:05:48 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DF8C (len 824, WS 0, PS 0) @ 0xE10C
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DE46 (len 326, WS 0, PS 0) @ 0xDF36
Jan 16 17:05:48 [hostname] kernel: [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
Jan 16 17:06:08 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:06:08 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C0B6 (len 62, WS 0, PS 0) @ 0xC0D2
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: [drm:dce110_vblank_set [amdgpu]] *ERROR* Failed to get VBLANK!
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x22, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x25, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x30, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x4000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x400, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x30f, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x2000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x80000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x40, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 16 17:06:10 [hostname] kernel: [drm] Timeout wait for RLC serdes 0,0
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 0000000034d786ac; ring_buffer_end = 00000000c05dc59d; write_frame = 0000000094e0183d
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Jan 16 17:06:10 [hostname] kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to unload asd
Jan 16 17:06:10 [hostname] kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: MODE1 reset
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset
Jan 16 17:06:10 [hostname] kernel: [drm] psp is not working correctly before mode1 reset!
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset failed
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:28:00.0
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset(2) failed
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D3cold to D0 (config space inaccessible)
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset end with ret = -22
Jan 16 17:06:20 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jan 16 17:06:30 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
When this happens, the monitor abruptly says there’s no signal and goes dark. However, the system isn’t actually down: I can SSH in and look at the logs, add and remove software, etc.
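Since the machine stays reachable over SSH, the kernel log can be saved to a text file and summarized. Here is a minimal sketch of how I'd tally it, assuming the lines keep the `date time host kernel: message` layout shown above. The function name `summarize` and the two category labels are my own choices for illustration, not anything the driver defines.

```python
import re
from collections import Counter

# Matches kernel lines in the layout shown above, e.g.
# "Jan 16 17:05:28 [hostname] kernel: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:\s+(.*)$")


def summarize(log_text):
    """Count kernel messages by coarse category.

    Returns (counts, first_reset), where first_reset is the timestamp of
    the first "GPU reset begin" line, or None if there is none.
    The category names are illustrative only.
    """
    counts = Counter()
    first_reset = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a kernel log line
        stamp, msg = match.groups()
        if "[powerplay] Failed message" in msg:
            counts["powerplay_failed"] += 1
        elif "*ERROR*" in msg:
            counts["drm_error"] += 1
        if first_reset is None and "GPU reset begin" in msg:
            first_reset = stamp
    return counts, first_reset


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py saved-kernel-log.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        totals, reset_at = summarize(handle.read())
    print("first GPU reset:", reset_at)
    for name, count in totals.most_common():
        print(name, count)
```

The input file could come from something like `journalctl -k -b -1 > saved-kernel-log.txt`, which writes the previous boot's kernel messages, though the exact timestamp format may depend on the journal's output settings.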
I know there’s a new kernel (5.10) and updated mesa in updates-testing, so the first thing I did was roll those back. But the problems persist. In fact, they’re getting worse: at first, it was a couple of times over the course of a few hours, but as I’m trying to diagnose it, sometimes it won’t even let me log in before there’s a crash. So, the problem happens with:
- kernel 5.10.7
- kernel 5.9.16
and with
- mesa-* 20.2.6
- mesa-* 20.3.3
I even booted with a Fedora 33 Live image and, while I can’t SSH in to test, I get the same crash in under 5 minutes, with the monitor cutting out.
It’s weird for this to start all of a sudden. I’ve done some basic web searches, but most of what I see is old and points to various problems with drivers and card quirks. It seems like if that were the problem, this would have been happening all along.
Any insight?