Severe instability after switching to AMD graphics card

Can you try a different HDMI cable? We know it was fine with the old GPU… But over the years I’ve found GPU + monitor combinations that were overly sensitive to cable length/quality.

I’ve also experienced ground loops that caused video+audio to glitch. In one case I broke a ground loop in my desktop system by removing the docking station for my laptop (which wasn’t even part of the signal chain!). It was connected to two of my four display panels on their secondary inputs. Just having it present and connected was destabilizing the entire setup.

On that note, try cutting the system back to the barest minimum required to ascertain correct operation. Ideally just the computer, the display, the mouse, and the keyboard, with the computer and the display both plugged into the same outlet (or power strip or UPS).

Get everything else out of the picture and set it aside. Any additional hardware, printers, scanners, switchboxes: gone. An A/V receiver for sound? Get it away. USB signaling from your UPS? Disconnect it. A second signal source to your display panel? Set it aside.

If this cures the issue then bring back one item at a time.

Hi there,

I read about your stability issues and the hitching you’re experiencing. I’ve encountered very similar symptoms on my own system running Fedora 43 with an RX 9060 XT. In my case, I had frequent stutters and micro-lags that felt exactly like what you’re describing.

For me, the issue was completely resolved by disabling FreeSync directly in the monitor’s OSD settings.

Hope this helps!

I almost thought this worked, since it was fine for the last couple of days, but this morning it only took about 24 minutes for the issue to appear again. Also worth noting: between my last post and now, I updated my BIOS (v1.1 > v3.9, wow).

I’ll try this, although my monitor has no controls for freesync directly in the OSD. I can only turn it off on the system side, which I’ve done through the display config window so I’ll see if that helps anything.

I’ll try this as well if the above doesn’t work. I intend to potentially grab a standalone battery bank and run off of that and see if it changes anything.

Had a graphics driver crash about 42 minutes into a new session. I let it sit for a while and everything eventually recovered (which I haven’t seen happen before, but maybe I just wasn’t patient enough). There is still NOTHING helpful in the logs apart from two entries. LACT logged “... :got kernel drm subsystem event, reloading GPUs” every 10 seconds from when the crash happened until it recovered (about 3 minutes 20 seconds). There was also “Process 2014 (maliit-keyboard) of user 1000 terminated with signal 11/SEGV, processing...”, but that only appeared over a minute after the crash (presumably while I was pressing keys to try to make anything happen).
I even have a notification from KWin Window Manager that says “Desktop effects were restarted due to a graphics reset”, but nothing reflecting this in the system log. Is there somewhere else I can/should be looking for more specific logs?

Also worth noting - no hitches/audio pops or anything so far, even during the graphics crash. Don’t know if that means anything yet, I just have to continue keeping an eye on it.

Just found out I can enter the “service menu” by a special button combination, where I can explicitly disable freesync. I will again keep an eye on it and see if this helps.

Hi,

before we continue adjusting ALSA settings, I’d first like to confirm whether a kernel-level GPU reset is actually happening.

Please start this in a terminal and leave it running:

sudo dmesg -w

Then wait for the next freeze or hitch.

If you see messages like amdgpu: GPU reset or ring timeout around the time it happens, we’ll likely need to investigate PCIe or power management.

If nothing relevant appears in the kernel log, we should shift focus toward userspace instead.

That distinction will determine the direction of further troubleshooting.

If you can share what shows up when the issue occurs, we can narrow this down properly.
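To make the live kernel log easier to watch, the output can be narrowed down to GPU and display messages. This is only a minimal sketch: the function name gpu_events and its pattern list are assumptions built from the messages mentioned in this thread, not a complete list of what amdgpu can log.

```shell
# Keep only kernel log lines that look GPU- or display-related.
# The pattern list is an assumption based on messages quoted in this thread.
gpu_events() {
    grep -Ei 'amdgpu|\[drm\]|ring .* timeout|GPU reset'
}

# Usage (left as comments because both need a real system):
#   sudo dmesg -wT | gpu_events                      # follow live, readable timestamps
#   journalctl -k -b -1 --no-pager | gpu_events      # kernel log of the previous boot
```

The second usage line may help after a hard freeze that forced a reboot, since the kernel messages from that boot are still in the journal if persistent journaling is enabled.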

Everything seemed to be working for the last day or so, but it’s back to the same as it was before again.

This hasn’t yielded anything useful, but I also haven’t had any (specifically) graphical issues so far, I’m just getting the hitches/audio pops. Nothing shows up in here that lines up with when the lag occurs.

I had a weird idea for a test, and the results are very interesting.
I decided to try running FurMark to see if extra GPU load would make the instability worse. Instead, it actually removed it entirely. While FurMark is running, there has not been a single audio glitch (and presumably not a single system freeze at all), but as soon as I stop FurMark, or shortly after, the pops start happening again.

Okay, now something actually happened for once. I was just scrolling in discord, and specifically my second monitor (the one on HDMI, which discord is fullscreen on) froze, which resulted in this message in dmesg:

[ 8217.983109] amdgpu 0000:08:00.0: [drm] *ERROR* [CRTC:96:crtc-1] flip_done timed out

I tried turning off the monitor but nothing changed, so I was forced to unplug/replug the HDMI from the GPU, which then caused both monitors to freeze for about 10 seconds, and output this:

[ 8365.946586] amdgpu 0000:08:00.0: [drm] *ERROR* flip_done timed out
[ 8365.946591] amdgpu 0000:08:00.0: [drm] *ERROR* [CRTC:96:crtc-1] commit wait timed out
[ 8376.186360] amdgpu 0000:08:00.0: [drm] *ERROR* flip_done timed out
[ 8376.186366] amdgpu 0000:08:00.0: [drm] *ERROR* [PLANE:93:plane-9] commit wait timed out

Afterwards, everything was back to normal. Hopefully this means literally anything? I’m not really sure. Seeing [drm] makes me think HDMI shenanigans, but again this was still happening even with the HDMI completely unplugged before.

Hi, this narrows down the likely cause quite a bit.

The fact that the issues disappear under GPU load strongly suggests a problem with power management states or PCIe power saving.

The flip_done timed out message indicates that the driver is waiting for a response from the GPU that isn’t arriving in time — often due to deep sleep states or unstable clock transitions.

As a test, try forcing the GPU into a higher performance state:

echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

Then check whether the desktop stuttering disappears.
If it does, the issue is very likely related to DPM (Dynamic Power Management).

Unfortunately this doesn’t seem to have any effect (after correcting card0 to card1). I also tried changing similar settings via LACT, equally without effect. (Unless there’s a conflict between the two somehow? But I would expect at least one of them to work.)
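Since the card index differs between systems (card0 in the suggestion, card1 here), the index can be looked up instead of guessed. The following is a rough sketch, assuming the usual sysfs layout where each card's device/driver symlink points at the bound driver; the function name amdgpu_cards and its optional directory argument are made up for illustration.

```shell
# Print the DRM card directories that are bound to the amdgpu driver.
# $1 is the DRM class directory; it defaults to /sys/class/drm and is only
# a parameter so the function can be tried against a scratch directory.
amdgpu_cards() {
    drm_root="${1:-/sys/class/drm}"
    for dev in "$drm_root"/card*/device; do
        # Connector entries (card1-HDMI-A-1 and so on) have no driver link; skip them.
        [ -e "$dev/driver" ] || continue
        driver=$(basename "$(readlink -f "$dev/driver")")
        if [ "$driver" = "amdgpu" ]; then
            dirname "$dev"
        fi
    done
}

# Usage (commented out, needs a real GPU):
#   card=$(amdgpu_cards | head -n 1)
#   cat "$card/device/power_dpm_force_performance_level"
```

Reading the file back after writing "high" to it also shows whether the setting actually stuck or was reset by another tool.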

If the issue is related to aggressive power saving, you can try disabling some of these features at the kernel level temporarily as a test:

sudo grubby --update-kernel=ALL --args="amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 pcie_aspm=off"

Reboot afterwards and check if the problem persists.

Note that this disables certain PCIe and GPU power-saving mechanisms, so power consumption may increase. This is mainly intended as a diagnostic step to confirm the cause.

If it doesn’t help, you can easily revert it with:
sudo grubby --update-kernel=ALL --remove-args="amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 pcie_aspm=off"
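After the reboot it is worth confirming that the arguments really reached the running kernel, otherwise a "no change" result says nothing. A small sketch, assuming /proc/cmdline as the source of truth; the helper name has_karg and its optional second parameter are hypothetical.

```shell
# Succeed if kernel argument $1 appears as a whole word on the command line.
# $2 is the command-line text; it defaults to that of the running kernel.
has_karg() {
    cmdline="${2:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (commented out, results depend on the booted kernel):
#   for arg in amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 pcie_aspm=off; do
#       has_karg "$arg" && echo "active:  $arg" || echo "missing: $arg"
#   done
```

The same check works in reverse after --remove-args, where all three should be reported as missing.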

Have you had a chance to test the system using that grounded extension cord from the basement yet to rule out the ungrounded outlet as the cause?

I have these notes from debugging an issue with my AMD GPU related to power management.

Fix PCIe BadDLLP errors

This can be caused by the kernel applying power management that the motherboard or device cannot handle.

Example logs:

kernel: pcieport 0000:00:1c.4: AER: Multiple Correctable error message received from 0000:06:00.0
kernel: pcieport 0000:06:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
kernel: pcieport 0000:06:00.0:   device [1002:1478] error status/mask=00000080/00002000
kernel: pcieport 0000:06:00.0:    [ 7] BadDLLP

Add pcie_aspm=off to the kernel command line.

sudo grubby --args=pcie_aspm=off --update-kernel=ALL
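To see whether a system logs this kind of error at all, and from which device, the kernel log can be summarized per PCI address. This is a sketch that assumes log lines shaped like the example above; the function name aer_by_device is invented for illustration.

```shell
# Read kernel log text on stdin and count correctable PCIe bus errors
# per reporting device (PCI address in domain:bus:device.function form).
aer_by_device() {
    grep 'PCIe Bus Error: severity=Correctable' |
        grep -Eo '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' |
        sort | uniq -c
}

# Usage (commented out, needs a real journal):
#   journalctl -k -b --no-pager | aer_by_device
```

Comparing the counts from a boot with and without pcie_aspm=off gives a rough before/after picture. An empty result only means no such errors were logged, not that ASPM is ruled out.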

It’s been a while, and I’m happy to report I haven’t had any significant issues since doing this.

I would love to figure out what’s actually going on here that causes this to be a problem in the first place, but I know that’s way over my head.

I greatly appreciate the help from everyone!

Good that you were able to narrow down the cause.

Check the BIOS for PCIe Native Power Management.

• If Enabled: Try disabling it (letting the OS handle it).
• If Disabled: Try enabling it (letting the BIOS handle it).

Then remove the pcie_aspm=off kernel argument again, as shown earlier.

pcie_aspm=off disables power savings for all PCIe devices, including NVMe SSDs. In the long run this can lead to higher temperatures and a shorter lifespan for all of them, so it is not recommended as a permanent solution.
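Before changing anything, it can help to see which ASPM policy the kernel is currently using. A minimal sketch, assuming a kernel built with ASPM support, where /sys/module/pcie_aspm/parameters/policy lists the available policies with the active one in square brackets; the helper name active_aspm_policy and its optional argument are hypothetical.

```shell
# Print the active ASPM policy, i.e. the bracketed word in the policy list.
# $1 is the policy text; it defaults to the content of the sysfs file.
active_aspm_policy() {
    policy_text="${1:-$(cat /sys/module/pcie_aspm/parameters/policy)}"
    printf '%s\n' "$policy_text" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage (commented out, needs a real system):
#   active_aspm_policy
#   sudo lspci -vv | grep -E 'ASPM'    # per-device ASPM capability and state
```

Whether the policy can be changed at runtime depends on the firmware handing ASPM control to the OS, so treat this as an inspection aid rather than a fix that has been tested on this system.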

Agree with @benjamin-scheuer – you probably don’t want pcie_aspm=off disabling power management system-wide.

After reverting the above try echo on | sudo tee /sys/class/drm/card1/power/control – the default is auto.

[I wonder if anything in tuned might interfere with this effort…]

The only thing I could find was PM L1 SS, which is set to Disabled. For now I’m not touching that.
I’ve removed the pcie_aspm=off argument now.

I’m trying this now (after noting it does not persist after a restart), will report back with any findings.

I initially thought this was working (it seemed to be fine yesterday), but now that it’s actively giving me the lag spikes again, putting that command into the terminal makes no difference (in fact my primary monitor froze while I was trying to type this post…).

I’ll have to go back to the pcie argument for now just so I have a stable system.
