AMD System Will Not Boot on Kernels >= 6.8

Hello everyone!

I’ve been having some issues booting my Fedora 39 install ever since installing my first 6.8.X-series kernel (via normal dnf upgrade and the default repo list). At the time, I thought it was a LUKS issue, as it seemed similar to this issue since the main thing I noticed was that the LUKS prompt no longer came up at boot. However, after doing some more digging (including upgrading my LUKS headers from LUKS1 to LUKS2), it seems this issue goes a bit deeper than LUKS.

Configurations I’ve tested that do not boot:

  • Fedora 39 installation using kernel-6.8.5-201.fc39.x86_64
  • Fedora 39 installation using kernel-6.8.6-200.fc39.x86_64
  • Fedora 39 installation using kernel-6.8.7-200.fc39.x86_64
  • Fedora 39 installation using kernel-6.8.8-200.fc39.x86_64
  • Fedora 40 Live USB
  • Ubuntu 24.04 Live USB

All of the above configurations end up resulting in a black screen (although seemingly not a “no signal” screen, as my monitors stay active). For the configurations involving my Fedora 39 installation, they reach this state before any LUKS prompt is shown (my root, home, and swap directories are LUKS2 encrypted). This still seemed to be the case even when rhgb and quiet were excluded from the kernel parameters (they would show some output for a bit, but eventually the screen would go black). The Fedora 40 and Ubuntu 24.04 Live USB tests were included as evidence that this may be related to kernel 6.8.

Configurations I’ve tested that do boot (and their resultant behavior):

  • Fedora 39 installation using kernel-6.7.11-200.fc39.x86_64
    • This has been my default boot method lately - everything works fine this way other than not using the latest kernel.
  • Fedora 39 installation using any of the 6.8.X kernels above in addition to kernel parameter nomodeset
    • This boots properly, but only with basic/minimal graphics.
  • Fedora 40 Live USB using the “Basic Graphics” boot option in the troubleshooting menu.
    • Same result as above.
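
For anyone who wants the nomodeset workaround to persist for a single kernel instead of editing the boot entry by hand each time, a minimal sketch using grubby could look like the following. The function names are made up for illustration, and the /boot/vmlinuz-<version> path is an assumption based on the usual Fedora layout, so check it against your own /boot first.

```shell
# Sketch: add or remove the nomodeset parameter for one installed kernel.
# Assumes the kernel image is at /boot/vmlinuz-<version> (usual Fedora layout).
set_nomodeset() {
    # $1 = kernel version, e.g. 6.8.8-200.fc39.x86_64
    sudo grubby --update-kernel="/boot/vmlinuz-$1" --args="nomodeset"
}

# Undo the change once a fixed kernel is available.
unset_nomodeset() {
    sudo grubby --update-kernel="/boot/vmlinuz-$1" --remove-args="nomodeset"
}
```

Calling `set_nomodeset 6.8.8-200.fc39.x86_64` would only touch that one boot entry, leaving the working 6.7.11 entry unchanged.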

My system’s hardware:

  • Motherboard: ASRock X470 Taichi (BIOS v2.00)
  • CPU: AMD Ryzen 7 2700X
  • RAM: 4x8GB DDR4-3000
  • GPU: Sapphire Nitro+ Radeon RX 580 8GB GDDR5
  • Displays: 2x 3840x2160 60Hz monitors

Does anyone have any other ideas about how to isolate what’s causing this apparent issue with kernel 6.8? I’m just about out of ideas.

Hmm, I feel for you, @elijahmathews. Nothing obviously wrong jumps out at me, and as best I can tell your hardware is supported. You clearly noticed that the common denominator is kernel 6.8, but I don’t understand why that would cause your particular symptom. The RX 580 is fairly old now, so I wouldn’t be surprised if support had been dropped, but what I see says it’s still supported.

You don’t have any “old” AMD pkgs from amdgpu-install.sh or from AMD direct repos, do you?

The only other thing I can suggest is something I found when I went through a similar issue with kernel 6.8. Try getting a boot log from kernel 6.8 by adding ‘rd.debug’ boot parameter, and probably the ‘single’ parameter, too, so it will drop you to a root shell. With rd.debug enabled, just try to get the current boot (‘journalctl -xb’) into a file and off the system before you reboot. You could post that here and/or start a bugzilla report.

Oh, and you’ll need your root password to log in.

Hello @elijahmathews ,
Welcome to Fedora!
Have you tried to get an alternate tty to login at the command line with? Try <Ctrl><Alt><F2> to get TTY2 at the time you are sitting with the black screen. Then at least you can diagnose from the running system.

Do you have another system you can use to log in to the problem system using ssh? You may need to enable sshd if you have not already done so.

If you boot using nomodeset or a 6.7 kernel you can use journalctl -b -N where N is the number for the previous 6.8.x boot to search for details. journalctl collects copious data, but there are “filter” options (see man journalctl) to help focus on relevant entries. Try journalctl -b -N -p P (where P=3 or 4 is often useful) and journalctl -b -N -g radeon. Read man journalctl for explanations of the above options. Note that journalctl creates very long lines. For posting I often use journalctl ... | cat to get wrapped lines.
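
As a rough illustration of the filters described above, here is a small sketch with the options spelled out. The function names are hypothetical helpers, not part of journalctl, and this only helps if the failed boot got far enough to write its journal to disk.

```shell
# Sketch: inspect the journal of an earlier (failed) boot from a working boot.
# Boot offsets are relative: -1 is the previous boot, -2 the one before, and so on.
list_boots() {
    journalctl --list-boots --no-pager
}

# $1 = boot offset (e.g. -1); shows only messages of priority "err" (3) or more severe.
boot_errors() {
    journalctl --no-pager -b "$1" -p 3
}

# $1 = boot offset, $2 = pattern (e.g. amdgpu) to search for in the messages.
boot_grep() {
    journalctl --no-pager -b "$1" -g "$2"
}
```

Run `list_boots` first to find the offset of the 6.8.x boot, then for example `boot_grep -1 amdgpu`. On cards driven by the amdgpu kernel module, `amdgpu` may be a more useful pattern than `radeon`.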

Thanks to all of you for your help! I think I’ve gotten a little closer to diagnosing this issue.

Regarding old AMD packages: I don’t believe so. As far as I can remember, I have only worked with the AMDGPU drivers supplied by default with Fedora. Just to check (this was run on a working kernel-6.7.11-200.fc39.x86_64 boot):

$ dnf list installed | grep amdgpu
xorg-x11-drv-amdgpu.x86_64                        23.0.0-2.fc39                          @fedora

$ dnf list installed | grep mesa
mesa-dri-drivers.i686                             23.3.6-1.fc39                          @updates
mesa-dri-drivers.x86_64                           23.3.6-1.fc39                          @updates
mesa-filesystem.i686                              23.3.6-1.fc39                          @updates
mesa-filesystem.x86_64                            23.3.6-1.fc39                          @updates
mesa-libEGL.i686                                  23.3.6-1.fc39                          @updates
mesa-libEGL.x86_64                                23.3.6-1.fc39                          @updates
mesa-libGL.i686                                   23.3.6-1.fc39                          @updates
mesa-libGL.x86_64                                 23.3.6-1.fc39                          @updates
mesa-libGLU.i686                                  9.0.3-1.fc39                           @fedora
mesa-libGLU.x86_64                                9.0.3-1.fc39                           @fedora
mesa-libOSMesa.i686                               23.3.6-1.fc39                          @updates
mesa-libOSMesa.x86_64                             23.3.6-1.fc39                          @updates
mesa-libOpenCL.x86_64                             23.3.6-1.fc39                          @updates
mesa-libgbm.i686                                  23.3.6-1.fc39                          @updates
mesa-libgbm.x86_64                                23.3.6-1.fc39                          @updates
mesa-libglapi.i686                                23.3.6-1.fc39                          @updates
mesa-libglapi.x86_64                              23.3.6-1.fc39                          @updates
mesa-libxatracker.x86_64                          23.3.6-1.fc39                          @updates
mesa-va-drivers.i686                              23.3.6-1.fc39                          @updates
mesa-va-drivers.x86_64                            23.3.6-1.fc39                          @updates
mesa-vulkan-drivers.i686                          23.3.6-1.fc39                          @updates
mesa-vulkan-drivers.x86_64                        23.3.6-1.fc39                          @updates

$ glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 580 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.7.11-200.fc39.x86_64) (0x67df)
    Version: 23.3.6
    Accelerated: yes
    Video memory: 8192MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 7191 MB, largest block: 7191 MB
    VBO free aux. memory - total: 15923 MB, largest block: 15923 MB
    Texture free memory - total: 7191 MB, largest block: 7191 MB
    Texture free aux. memory - total: 15923 MB, largest block: 15923 MB
    Renderbuffer free memory - total: 7191 MB, largest block: 7191 MB
    Renderbuffer free aux. memory - total: 15923 MB, largest block: 15923 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 8192 MB
    Total available memory: 24200 MB
    Currently available dedicated video memory: 7191 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX 580 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.7.11-200.fc39.x86_64)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.3.6
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.3.6
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.3.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

Furthermore, given that this issue also exhibits itself when booting Live USB versions of Fedora 40 and Ubuntu 24.04, my guess is that this is independent of my Fedora 39 install.

I tried adding the rd.debug and single kernel parameters to a kernel-6.8.8-200.fc39.x86_64 boot, but that unfortunately resulted in the same black screen behavior.

I also tried that, but the black screen persists even after using the <Ctrl><Alt><F2> key combination.

The tricky thing I’ve had here is that this issue presents itself before the LUKS prompt is displayed (and blindly typing my LUKS passphrase and hitting enter doesn’t seem to result in any change, so it seems LUKS decryption is halted by whatever this issue is), so SSHD hasn’t loaded yet.

However, I have used dracut-sshd on one of my servers (for remote LUKS decryption), so I decided to give that a try on this system. Thankfully, after setting that up and running dracut -f -v --kver <kernel version> (tested for the 6.7.11 and 6.8.8 kernels), I can now SSH into the desktop for all kernels. In the case of kernel-6.7.11-200.fc39.x86_64, I can type in my LUKS passphrase and decrypt the device (as expected). However, in the case of kernel-6.8.8-200.fc39.x86_64, the LUKS decryption hangs after entering the passphrase (although the system itself doesn’t hang, as I can SSH in using a different session and still run commands while LUKS is frozen), and the desktop does not continue booting.
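
For anyone wanting to reproduce this setup, the steps can be sketched roughly as below. This assumes a dracut-sshd package is installable from a repository you already have enabled and that root’s authorized_keys already holds your public key; the function name is hypothetical. Networking inside the initramfs may need extra configuration (for example kernel parameters such as rd.neednet=1 ip=dhcp), which depends on the system and is not covered here.

```shell
# Sketch: put an SSH server into the initramfs so the machine is reachable
# before the LUKS volumes are unlocked.
# Assumption: dracut-sshd is available from an enabled repository.
setup_dracut_sshd() {
    # $1 = kernel version, e.g. 6.8.8-200.fc39.x86_64
    sudo dnf install -y dracut-sshd &&
    sudo dracut -f -v --kver "$1"
}

# Once connected to the initramfs over SSH, the pending passphrase prompt
# can be answered with:
#   systemd-tty-ask-password-agent
```

The initramfs has to be regenerated once per kernel version you want to reach this way, e.g. `setup_dracut_sshd 6.8.8-200.fc39.x86_64`.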

Thankfully, I can run journalctl commands within the SSH session I mentioned above. I’ve put the output of journalctl -b -0 in that session booted with kernel-6.8.8-200.fc39.x86_64 over on Pastebin (link here) since it’s too long for here.

I’m not that experienced at reading journalctl output, but of note to me are the amdgpu errors that begin popping up at around line 1330. At line 3090 you can see where I SSH’d in to the desktop from a different machine.

Hopefully some of that is helpful for diagnosis!

I just tried installing the latest Rawhide kernel (kernel-6.9.0-0.rc7.58.fc41.x86_64) to see if the issue was resolved upstream, but it still has the same issue. Thus, this appears to not be exclusively limited to 6.8.X-series kernels.

It would be a good time to file a bugzilla report. It sounds pretty major.

I’ve now filed a bug report on Bugzilla regarding this issue.

Bug report: 2279621 – Boot fails due to AMDGPU on kernel 6.8

Sounds like you did the right thing. Let us know how it goes.

If the single parameter didn’t get you to a shell, then that seems like it’s hanging, not just failing to switch to graphical mode.

Thank you for reporting this. I’m seeing the exact same thing in combination with Bug 2278534, which feels loosely related.

Update - the issue is caused by revisions in this commit: drm/amd: Use the first non-dGPU PCI device for BW limits · torvalds/linux@466a7d1 · GitHub

That commit changed how the PCI speed and bandwidth are established for the GPU. It’s unclear why this affects my specific hardware, but I’ve confirmed this is indeed the culprit.

Refer to the drm/amd bug tracker for more details on this issue: amdgpu driver not being loaded (#3289) · Issues · drm / amd · GitLab

If you happen to be running into this issue, you should be able to use this patch to make your system bootable again:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7753a2e64d41..c17c3204a7bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5852,8 +5852,8 @@ static void amdgpu_device_get_pcie_info(struct amdgpu_device *adev)
 	if (adev->pm.pcie_gen_mask && adev->pm.pcie_mlw_mask)
 		return;
 
-	amdgpu_device_partner_bandwidth(adev, &platform_speed_cap,
-					&platform_link_width);
+	pcie_bandwidth_available(adev->pdev, NULL,
+				 &platform_speed_cap, &platform_link_width);
 
 	if (adev->pm.pcie_gen_mask == 0) {
 		/* asic caps */

This will, however, require you to build a custom (patched) kernel. I found this guide to be particularly useful for building a custom kernel for Fedora. I was building mainline kernels at specific commits, but it should be way easier to use dist-git now that a patch is known (and probably safer too since you’re just slightly modifying a couple lines in an otherwise known-good kernel).
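
A rough sketch of the dist-git route follows, based on my understanding of the Fedora kernel packaging rather than on the guide mentioned above. The function name and the patch file path are hypothetical, and linux-kernel-test.patch is assumed to be the spec file’s hook for local test patches, so check kernel.spec before relying on it. Setting a custom buildid in kernel.spec is also commonly recommended so the patched kernel is distinguishable from the stock one.

```shell
# Sketch: build a Fedora kernel with one extra patch using dist-git.
# Runs in a subshell so the caller's working directory is left untouched.
build_patched_kernel() (
    # $1 = dist-git branch (e.g. f39), $2 = absolute path to the patch file
    fedpkg clone -a kernel &&
    cd kernel &&
    fedpkg switch-branch "$1" &&
    sudo dnf builddep -y kernel.spec &&
    cat "$2" >> linux-kernel-test.patch &&   # assumed test-patch hook in the spec
    fedpkg local
)
```

For example, with the diff saved to a file: `build_patched_kernel f39 /home/me/amdgpu-bw.patch` (a placeholder path). Expect the build to take a long time and a lot of disk space.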

Impressive sleuthing. Never had the patience for git-bisect, but I’m sure it’s way more helpful.
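
For readers who have not used it, the bisect workflow mentioned here can be sketched as follows. This is a generic outline, not a record of the exact steps taken in this thread. It assumes a clone of the mainline kernel tree and that the upstream tags v6.7 and v6.8 behave as good and bad respectively, which matches the Fedora kernels reported above but is not verified for the mainline tags. The function name is hypothetical.

```shell
# Sketch: start a bisect between a known-good and a known-bad kernel ref.
# Run inside a clone of the mainline kernel tree.
start_kernel_bisect() {
    # $1 = good ref (e.g. v6.7), $2 = bad ref (e.g. v6.8)
    git bisect start &&
    git bisect bad "$2" &&
    git bisect good "$1"
}

# After building and boot-testing the commit that git checks out, record the result:
#   git bisect good    # the kernel booted normally
#   git bisect bad     # black screen / hang
# Repeat until git names the first bad commit, then clean up with:
#   git bisect reset
```

Each round halves the remaining range of commits, so even a full release cycle’s worth of changes narrows down in a manageable number of kernel builds.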