AMDGPU crash every 5 days

Ok, I’ll do that. Thank you for your time!

Here is a list of open AMD GPU issues. Do any of them sound like what you are seeing?


Unfortunately, no! I did see something very close once, but I tried everything they proposed with no result. I came here precisely because I was totally out of ideas. Now I think it could be hardware related, but that is impossible to confirm without another machine to swap the card into…

I will check with my reseller to see if they can do something for me about that. If it’s not hardware, it cannot be anything but a driver issue.

I just saw this post, it looks like they have some of the same error messages as you. It has a link to a bug.

That seems strange. Everyone should be able to access that directory.

# ls /sys/kernel/debug/dri
1  128

# ls /sys/kernel/debug/dri/1
clients  framebuffer  internal_clients  state                virtio-gpu-host-visible-mm  Virtual-1
crtc-0   gem_names    name              virtio-gpu-features  virtio-gpu-irq-fence

# ls -ld /sys/kernel/debug/dri/1
drwxr-xr-x. 4 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1

# ls -ld /sys/kernel/debug/dri/1/*
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/clients
drwxr-xr-x. 2 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/crtc-0
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/framebuffer
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/gem_names
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/internal_clients
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/name
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/state
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/virtio-gpu-features
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/virtio-gpu-host-visible-mm
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/virtio-gpu-irq-fence
drwxr-xr-x. 2 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/Virtual-1

Have you checked the permissions?
Is this potentially caused by the repeated crashes?

It is known that a crash during a write (to memory or to a drive) can corrupt data. After a crash, a full power-off before rebooting is a good way to minimize the chance of corrupt data remaining in memory. The in-RAM filesystem structures (/sys, /proc, /dev, /run, among others, and including GPU memory) may all retain corrupt data across a reboot after a crash unless a full power-off is performed.

How do I check? I mean, I access the directory with sudo, so… I should have all permissions to get in.
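One way to check, as a minimal sketch: list the mode and owner of the directory and of every parent above it, since a restrictive parent blocks access even when the directory itself looks open. The helper name `show_path_perms` is made up for illustration, and the sketch assumes GNU coreutils (`stat -c`, `readlink -f`).

```shell
# Hypothetical helper: print mode, owner:group, and name for a path
# and for each parent directory up to /.
show_path_perms() {
    # Resolve to an absolute path so the walk always ends at /.
    p=$(readlink -f -- "$1") || return 1
    [ -n "$p" ] || return 1
    while :; do
        stat -c '%A %U:%G %n' "$p"
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
}
```

Run from a root shell (like the `#` prompt in the listings above), `show_path_perms /sys/kernel/debug/dri/1` prints one line per path component. On many systems `/sys/kernel/debug` itself is typically readable by root only, so a non-root file manager may be refused there no matter what the files inside show.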

I will try that now and see if anything changes.

Yes, that’s precisely the thread I’m talking about. I tried a lot of things (albeit not every single feature mask), to no avail. But I admit I did not have a reliable way to crash my computer at that time… now I do. I will look more closely, but it could take a while.

I can confirm that it still crashes even after a complete shutdown. Note that I can access the directory from my terminal, just not with GNOME Files (instant crash), and I can access the elements inside it from a terminal too if needed… but any attempt to use GNOME Files results in an instant (GPU!) crash.

I do note that this time, ‘Problem Reporting’ was triggered and now shows me this message:

The kernel log indicates that hardware errors were detected.
This is most likely not a software problem.

Hardware related, you think?
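To see which kernel messages might be behind that report, a rough sketch is to filter the kernel log for hardware-error and amdgpu lines. The function name `filter_gpu_hw_errors` and the search patterns are assumptions chosen for illustration, not the rules that Problem Reporting actually applies.

```shell
# Hypothetical filter: keep kernel log lines that mention hardware errors,
# machine checks, or amdgpu timeouts/resets/faults.
# The pattern list is a guess, not an exhaustive or official set.
filter_gpu_hw_errors() {
    grep -iE 'hardware error|machine check|amdgpu.*(timeout|reset|fault)'
}
```

For the boot that crashed, something like `journalctl -k -b -1 --no-pager | filter_gpu_hw_errors` would show the matching lines (`-k` selects kernel messages, `-b -1` the previous boot). Lines tagged `[Hardware Error]` point more toward hardware, while a log with only amdgpu resets leaves the driver as a candidate.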

The ‘gear’ appears after you select the user name, before you enter your password. It would be remarkably unusual if it is not available on your system.
The gear is also useful if you have an alternative desktop environment. You can choose gnome or anything else you wish on login. (After installing it that is) Some only support xorg at this time, gnome should offer X and Wayland.
I think your most stable option would be to get an LTS stable release running on your system. If you search the downloads for the ‘everything’ version the live environment will have the best chance of success. It is bloated with everything, as the name implies, but has all possible drivers. On installation you can choose just what you need.

If the login screen has the wrong resolution, the gear might get cut off.
I never understood why they moved it to that corner. Normally it would be under or beside the login field.


That is a good point given that all this discussion started over graphics driver issues. When it is wrong the display usually appears much smaller than the physical screen, but I have seen it bleed over all the edges in weird circumstances.