(On Fedora Kinoite 41; the problem has persisted since 40)
For the last month or so I’ve been trying to narrow down the cause of a GPU crash during intensive 3D use, e.g. video games. The kernel logs look something like this when it happens, though numbers/addresses vary:
Nov 02 16:58:34 clevergirl kernel: [drm] scheduler comp_1.1.0 is not ready, skipping
Nov 02 16:58:34 clevergirl kernel: [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: ring gfx timeout, signaled seq=2271933, emitted seq=2271935
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Process information: process Wrath.exe pid 56036 thread dxvk-submit pid 56085
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Nov 02 16:58:34 clevergirl kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 02 16:58:34 clevergirl steam[52635]: radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.
Nov 02 16:58:34 clevergirl kernel: amdgpu: cp is busy, skip halt cp
Nov 02 16:58:34 clevergirl kernel: amdgpu: rlc is busy, skip halt rlc
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Dumping IP State
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Dumping IP State Completed
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Nov 02 16:58:34 clevergirl kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400380000).
Nov 02 16:58:34 clevergirl kernel: [drm] VRAM is lost due to GPU reset!
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
Nov 02 16:58:35 clevergirl kernel: [drm] UVD and UVD ENC initialized successfully.
Nov 02 16:58:35 clevergirl kernel: [drm] VCE initialized successfully.
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Nov 02 16:58:35 clevergirl kwin_wayland[2295]: kwin_scene_opengl: 0x3: GL_CONTEXT_LOST in context lost
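Since the numbers vary from crash to crash, a small script can pull out the fields worth comparing between incidents (which ring hung, the signaled and emitted sequence numbers, and the process that was submitting work). This is only a rough sketch: the regexes are written against the lines above and may not match other kernel versions, process names containing spaces will not match, and `parse_crash` is a made-up name.

```python
import re

# Patterns written against the log excerpt above; other kernels may word these lines differently.
TIMEOUT_RE = re.compile(r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)")
PROCESS_RE = re.compile(r"Process information: process (\S+) pid (\d+) thread (\S+) pid (\d+)")


def parse_crash(lines):
    """Collect the hung ring, fence sequence numbers and submitting process from log lines."""
    info = {}
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            info["ring"] = m.group(1)
            info["signaled"] = int(m.group(2))
            info["emitted"] = int(m.group(3))
        m = PROCESS_RE.search(line)
        if m:
            info["process"] = m.group(1)
            info["pid"] = int(m.group(2))
            info["thread"] = m.group(3)
    return info
```

Passing it the lines of `journalctl -k` output for the affected boot should give one small dictionary per crash that is easier to compare than the raw log.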
When the crash happens, the system becomes unresponsive to input for a couple of seconds; then the cursor freezes, the screen goes black, and finally a frozen, corrupted image is displayed.
The only way I’ve found to avoid this crash is using CoreCtrl to limit the maximum clock speeds of my GPU. With a bit of trial and error, I’ve found that limiting only the VRAM speed (to 1000 MHz, as opposed to the maximum of 2120 MHz) seems to reliably prevent the crash. I’ve also noticed that, if the maximum VRAM speed is allowed, the GPU fans often make an annoying whining noise during or after any 3D rendering. Limiting the VRAM speed reliably prevents this as well.
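For anyone who wants to check the same limit without CoreCtrl: amdgpu lists the memory clock states in the sysfs file `pp_dpm_mclk` under the card's device directory, one state per line, with the active one marked by `*`. The sketch below parses that listing and picks out the states at or below a cap. The `card0` path is an assumption (the index may differ on other systems), the function names are hypothetical, and whether CoreCtrl applies its limit this way is not something verified here.

```python
import re
from pathlib import Path

# Assumed location; the card index (card0) may differ on other systems.
MCLK_PATH = Path("/sys/class/drm/card0/device/pp_dpm_mclk")

# Matches lines such as "1: 1000Mhz *" (index, clock in MHz, optional active marker).
STATE_RE = re.compile(r"^(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_mclk_states(text):
    """Return a list of (index, mhz, is_active) tuples from pp_dpm_mclk contents."""
    states = []
    for line in text.splitlines():
        m = STATE_RE.match(line.strip())
        if m:
            states.append((int(m.group(1)), int(m.group(2)), m.group(3) is not None))
    return states


def states_at_or_below(states, cap_mhz):
    """Indices of the states whose clock does not exceed cap_mhz."""
    return [idx for idx, mhz, _ in states if mhz <= cap_mhz]


if __name__ == "__main__":
    if MCLK_PATH.exists():
        found = parse_mclk_states(MCLK_PATH.read_text())
        print("states:", found)
        print("allowed at <= 1000 MHz:", states_at_or_below(found, 1000))
    else:
        print(f"{MCLK_PATH} not found; adjust the card index")
```

The script only reads the file. According to the amdgpu documentation, restricting the states means setting `power_dpm_force_performance_level` to `manual` and then writing the allowed indices back to `pp_dpm_mclk` as root; that step is left out here because it has not been tested on this card.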
Anyone know what might cause this? Hardware issue maybe? I’d been hoping the most recent kernel updates would fix the crash without the workaround (since the changelogs show some amdgpu fixes), but they have not.