AMD RX580 system freeze on maximum VRAM speed

diziet-sma · November 12, 2024, 10:45pm

(On Fedora Kinoite 41, persisting since 40)

For the last month or so I’ve been trying to narrow down the cause of a GPU crash during intensive 3D use, e.g. video games. The kernel logs look something like this when it happens, though numbers/addresses vary:

Nov 02 16:58:34 clevergirl kernel: [drm] scheduler comp_1.1.0 is not ready, skipping
Nov 02 16:58:34 clevergirl kernel: [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: ring gfx timeout, signaled seq=2271933, emitted seq=2271935
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Process information: process Wrath.exe pid 56036 thread dxvk-submit pid 56085
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Nov 02 16:58:34 clevergirl kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 02 16:58:34 clevergirl steam[52635]: radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.
Nov 02 16:58:34 clevergirl kernel: amdgpu: cp is busy, skip halt cp
Nov 02 16:58:34 clevergirl kernel: amdgpu: rlc is busy, skip halt rlc
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Dumping IP State
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Dumping IP State Completed
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Nov 02 16:58:34 clevergirl kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400380000).
Nov 02 16:58:34 clevergirl kernel: [drm] VRAM is lost due to GPU reset!
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
Nov 02 16:58:35 clevergirl kernel: [drm] UVD and UVD ENC initialized successfully.
Nov 02 16:58:35 clevergirl kernel: [drm] VCE initialized successfully.
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Nov 02 16:58:35 clevergirl kwin_wayland[2295]: kwin_scene_opengl: 0x3: GL_CONTEXT_LOST in context lost
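
For anyone comparing several of these crashes, here is a minimal Python sketch that pulls the key fields (hung ring, sequence numbers, offending process, reset outcome) out of journal lines like the ones above. The regular expressions are written against this excerpt only and the helper name `summarize_amdgpu_hang` is made up for illustration; other kernel versions may word these messages differently.

```python
import re

# Patterns are based on the log excerpt in this post; treat them as assumptions
# that may need adjusting for other kernel versions.
TIMEOUT_RE = re.compile(r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)")
PROCESS_RE = re.compile(r"Process information: process (\S+) pid (\d+)")
RESET_OK_RE = re.compile(r"GPU reset(?:\(\d+\))? succeeded")


def summarize_amdgpu_hang(lines):
    """Collect the ring, sequence numbers, process and reset result from log lines."""
    summary = {
        "ring": None,
        "signaled": None,
        "emitted": None,
        "process": None,
        "pid": None,
        "reset_succeeded": False,
    }
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            summary["ring"] = m.group(1)
            summary["signaled"] = int(m.group(2))
            summary["emitted"] = int(m.group(3))
        m = PROCESS_RE.search(line)
        if m:
            summary["process"] = m.group(1)
            summary["pid"] = int(m.group(2))
        if RESET_OK_RE.search(line):
            summary["reset_succeeded"] = True
    return summary


if __name__ == "__main__":
    import sys

    # Example: journalctl -k -b -1 | python3 this_script.py
    print(summarize_amdgpu_hang(sys.stdin))
```

The gap between the emitted and signaled sequence numbers shows how many submissions were still outstanding when the ring timed out.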

When the crash happens, the system will become unresponsive to input for a couple seconds; then the cursor freezes, the screen goes black, and finally a frozen and corrupted image is displayed.

The only way I’ve found to avoid this crash is using CoreCtrl to limit the maximum clock speeds of my GPU. With a bit of trial and error, I’ve found that limiting only the VRAM speed (to 1000 MHz as opposed to the maximum of 2120 MHz) seems to reliably prevent the crash. I’ve also noticed that, if the maximum VRAM speed is allowed, the GPU fans will often generate an annoying whiny noise during or after any 3D rendering. Limiting the VRAM speed reliably prevents this as well.
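
For reference, the same kind of memory clock cap can in principle be expressed through the amdgpu sysfs interface. Below is a minimal Python sketch that parses the text of `pp_dpm_mclk` and works out which power states stay at or below a cap. It is not necessarily how CoreCtrl does it. The sample state table and the `card0` path are assumptions for illustration, and the function names are hypothetical. The script only prints the value it would write; actually applying it needs root and `power_dpm_force_performance_level` set to `manual` first.

```python
import re

# Each line of pp_dpm_mclk looks like "1: 1000Mhz" with "*" marking the active state.
STATE_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_states(text):
    """Return a list of (index, mhz, is_active) tuples from pp_dpm_mclk text."""
    states = []
    for line in text.splitlines():
        m = STATE_RE.match(line)
        if m:
            states.append((int(m.group(1)), int(m.group(2)), m.group(3) is not None))
    return states


def allowed_state_mask(states, cap_mhz):
    """Space-separated state indices whose clock is at or below cap_mhz."""
    return " ".join(str(index) for index, mhz, _ in states if mhz <= cap_mhz)


if __name__ == "__main__":
    # Assumed path; the card index differs between systems.
    path = "/sys/class/drm/card0/device/pp_dpm_mclk"
    try:
        with open(path) as f:
            text = f.read()
    except OSError:
        # Hypothetical table for machines without this file.
        text = "0: 300Mhz\n1: 1000Mhz\n2: 2120Mhz *\n"
    mask = allowed_state_mask(parse_dpm_states(text), cap_mhz=1000)
    print(f"would write {mask!r} to {path}")
```

With the hypothetical three-state table in the fallback branch, a 1000 MHz cap selects states `0 1`, which leaves out the top memory state.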

Anyone know what might cause this? Hardware issue maybe? I’d been hoping the most recent kernel updates would fix the crash without the workaround (since the changelogs show some amdgpu fixes), but they have not.

computersavvy · November 12, 2024, 11:22pm

Added f41, kinoite

siosm · November 13, 2024, 2:15pm

I would report that upstream to the AMD DRM folks: https://gitlab.freedesktop.org/drm/amd

diziet-sma · November 13, 2024, 9:16pm

Reported, thanks: RX580 GPU crash on maximum VRAM clockspeed (kernel 6.8 and later) (#3761), in the drm/amd issue tracker on GitLab.

Espionage724 · November 13, 2024, 10:54pm

Running VRAM at max speed shouldn’t cause problems compared with a lower speed unless the VRAM itself is defective, but it could also be something with power delivery.

Before the bug report I’d try memtestCL (ihaque/memtestCL on GitHub, an OpenCL memory tester for GPUs) from Linux (Fedora and an openSUSE Tumbleweed LiveUSB), and also from Windows. I imagine on Fedora it’ll crash with the same mem clock thing, but a LiveUSB should be a completely different environment, along with Windows.

I had a few RX 580s and usually ran them at max 3D core/mem clocks with no problems on Windows and Linux (Fedora 30s).

diziet-sma · November 14, 2024, 2:10pm

Oh that’s interesting, thanks. From experience I’d expect defective RAM to just prevent the video card from working, but I will run the test and see.
