AMD System Will Not Boot on Kernels >= 6.8

Hello everyone!

I’ve been having some issues booting my Fedora 39 install ever since installing my first 6.8.X-series kernel (via normal dnf upgrade and the default repo list). At the time, I thought it was a LUKS issue, as it seemed similar to this issue since the main thing I noticed was that the LUKS prompt no longer came up at boot. However, after doing some more digging (including upgrading my LUKS headers from LUKS1 to LUKS2), it seems this issue goes a bit deeper than LUKS.

Configurations I’ve tested that do not boot:

  • Fedora 39 installation using kernel-6.8.5-201.fc39.x86_64
  • Fedora 39 installation using kernel-6.8.6-200.fc39.x86_64
  • Fedora 39 installation using kernel-6.8.7-200.fc39.x86_64
  • Fedora 39 installation using kernel-6.8.8-200.fc39.x86_64
  • Fedora 40 Live USB
  • Ubuntu 24.04 Live USB

All of the above configurations end up resulting in a black screen (although seemingly not a “no signal” screen, as my monitors stay active). For the configurations involving my Fedora 39 installation, they reach this state before any LUKS prompt is shown (my root, home, and swap directories are LUKS2 encrypted). This still seemed to be the case even when rhgb and quiet were excluded from the kernel parameters (they would show some output for a bit, but eventually the screen would go black). The Fedora 40 and Ubuntu 24.04 Live USB tests were included as evidence that this may be related to kernel 6.8.

Configurations I’ve tested that do boot (and their resultant behavior):

  • Fedora 39 installation using kernel-6.7.11-200.fc39.x86_64
    • This has been my default boot method lately - everything works fine this way other than not using the latest kernel.
  • Fedora 39 installation using any of the 6.8.X kernels above in addition to kernel parameter nomodeset
    • This boots properly, but only with basic/minimal graphics.
  • Fedora 40 Live USB using the “Basic Graphics” boot option in the troubleshooting menu.
    • Same result as above.

My system’s hardware:

  • Motherboard: ASRock X470 Taichi (BIOS v2.00)
  • CPU: AMD Ryzen 7 2700X
  • RAM: 4x8GB DDR4-3000
  • GPU: Sapphire Nitro+ Radeon RX 580 8GB GDDR5
  • Displays: 2x 3840x2160 60Hz monitors

Does anyone have any other ideas about how to isolate what’s causing this apparent issue with kernel 6.8? I’m just about out of ideas.

Hmm, I feel for you, @elijahmathews. Nothing obviously wrong jumps out at me, and your hardware looks like it is supported. You clearly noticed that the common denominator is kernel 6.8, but I don’t understand why it causes your particular symptom. The RX 580 is getting old now, so I wouldn’t be surprised if support had been dropped, but everything I see says it’s still supported.

You don’t have any “old” AMD pkgs from amdgpu-install.sh or from AMD direct repos, do you?

The only other thing I can suggest is something I found when I went through a similar issue with kernel 6.8. Try getting a boot log from kernel 6.8 by adding ‘rd.debug’ boot parameter, and probably the ‘single’ parameter, too, so it will drop you to a root shell. With rd.debug enabled, just try to get the current boot (‘journalctl -xb’) into a file and off the system before you reboot. You could post that here and/or start a bugzilla report.
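
A minimal sketch of adding these parameters to a single installed kernel with grubby, instead of editing the GRUB entry by hand at every boot. The function names are illustrative, and the /boot/vmlinuz-<version> path is an assumption based on a default Fedora layout. Both functions need root.

```shell
# Sketch: add or remove boot parameters for one installed kernel using grubby.
# Assumption: the kernel image lives at /boot/vmlinuz-<version>, as on a
# default Fedora install. Run as root.

add_debug_args() {
    # $1 = kernel version, e.g. 6.8.8-200.fc39.x86_64
    grubby --update-kernel="/boot/vmlinuz-$1" --args="rd.debug single"
}

remove_debug_args() {
    # Undo the change once the log has been captured.
    grubby --update-kernel="/boot/vmlinuz-$1" --remove-args="rd.debug single"
}

# Example (commented out so that sourcing this file changes nothing):
# add_debug_args 6.8.8-200.fc39.x86_64
```

Running grubby --info=/boot/vmlinuz-<version> afterwards should show the resulting argument list. Because only one kernel entry is touched, the working 6.7 entry stays as a fallback.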

Oh, and you’ll need your root password to login.

Hello @elijahmathews ,
Welcome to :fedora: !
Have you tried to get an alternate tty to login at the command line with? Try <Ctrl><Alt><F2> to get TTY2 at the time you are sitting with the black screen. Then at least you can diagnose from the running system.

Do you have another system you can use to log in to the problem system using ssh? You may need to enable sshd if you have not already done so.

If you boot using nomodeset or a 6.7 kernel you can use journalctl -b -N where N is the number for the previous 6.8.x boot to search for details. journalctl collects copious data, but there are “filter” options (see man journalctl) to help focus on relevant entries. Try journalctl -b -N -p P (where P=3 or 4 is often useful) and journalctl -b -N -g radeon. Read man journalctl for explanations of the above options. Note that journalctl creates very long lines. For posting I often use journalctl ... | cat to get wrapped lines.
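
The filters above can be wrapped in two small helpers, roughly like this. It is only a sketch: the function names are made up for illustration, and --no-pager is used here in place of the | cat trick to get unpaged output.

```shell
# Sketch: inspect an earlier (failed) boot from a working one.
# Find the boot offset first with: journalctl --list-boots

prev_boot_errors() {
    # $1 = boot offset (1 = previous boot); priority 3 means "err" and worse.
    journalctl -b "-$1" -p 3 --no-pager
}

prev_boot_grep() {
    # $1 = boot offset, $2 = pattern, e.g. amdgpu or radeon
    journalctl -b "-$1" -g "$2" --no-pager
}

# Example (commented out):
# prev_boot_errors 1
# prev_boot_grep 1 amdgpu
```

This only helps if the failed boot got far enough to write its journal to disk. A boot that hangs before the root filesystem is unlocked and mounted may leave nothing behind to read.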

Thanks to all of you for your help! I think I’ve gotten a little closer to diagnosing this issue.

I don’t believe so - as far as I can remember I only have worked with the AMDGPU drivers supplied by default with Fedora. Just to check (this was run on a working kernel-6.7.11-200.fc39.x86_64 boot):

$ dnf list installed | grep amdgpu
xorg-x11-drv-amdgpu.x86_64                        23.0.0-2.fc39                          @fedora

$ dnf list installed | grep mesa
mesa-dri-drivers.i686                             23.3.6-1.fc39                          @updates
mesa-dri-drivers.x86_64                           23.3.6-1.fc39                          @updates
mesa-filesystem.i686                              23.3.6-1.fc39                          @updates
mesa-filesystem.x86_64                            23.3.6-1.fc39                          @updates
mesa-libEGL.i686                                  23.3.6-1.fc39                          @updates
mesa-libEGL.x86_64                                23.3.6-1.fc39                          @updates
mesa-libGL.i686                                   23.3.6-1.fc39                          @updates
mesa-libGL.x86_64                                 23.3.6-1.fc39                          @updates
mesa-libGLU.i686                                  9.0.3-1.fc39                           @fedora
mesa-libGLU.x86_64                                9.0.3-1.fc39                           @fedora
mesa-libOSMesa.i686                               23.3.6-1.fc39                          @updates
mesa-libOSMesa.x86_64                             23.3.6-1.fc39                          @updates
mesa-libOpenCL.x86_64                             23.3.6-1.fc39                          @updates
mesa-libgbm.i686                                  23.3.6-1.fc39                          @updates
mesa-libgbm.x86_64                                23.3.6-1.fc39                          @updates
mesa-libglapi.i686                                23.3.6-1.fc39                          @updates
mesa-libglapi.x86_64                              23.3.6-1.fc39                          @updates
mesa-libxatracker.x86_64                          23.3.6-1.fc39                          @updates
mesa-va-drivers.i686                              23.3.6-1.fc39                          @updates
mesa-va-drivers.x86_64                            23.3.6-1.fc39                          @updates
mesa-vulkan-drivers.i686                          23.3.6-1.fc39                          @updates
mesa-vulkan-drivers.x86_64                        23.3.6-1.fc39                          @updates

$ glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 580 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.7.11-200.fc39.x86_64) (0x67df)
    Version: 23.3.6
    Accelerated: yes
    Video memory: 8192MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 7191 MB, largest block: 7191 MB
    VBO free aux. memory - total: 15923 MB, largest block: 15923 MB
    Texture free memory - total: 7191 MB, largest block: 7191 MB
    Texture free aux. memory - total: 15923 MB, largest block: 15923 MB
    Renderbuffer free memory - total: 7191 MB, largest block: 7191 MB
    Renderbuffer free aux. memory - total: 15923 MB, largest block: 15923 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 8192 MB
    Total available memory: 24200 MB
    Currently available dedicated video memory: 7191 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon RX 580 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.7.11-200.fc39.x86_64)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.3.6
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.3.6
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.3.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

Furthermore, given that this issue also exhibits itself when booting Live USB versions of Fedora 40 and Ubuntu 24.04, my guess is that this is independent of my Fedora 39 install.

I tried adding the rd.debug and single kernel parameters to a kernel-6.8.8-200.fc39.x86_64 boot, but that unfortunately resulted in the same black screen behavior.

I also tried that, but the black screen persists even after using the <Ctrl><Alt><F2> key combination.

The tricky thing I’ve had here is that this issue presents itself before the LUKS prompt is displayed (and blindly typing my LUKS passphrase and hitting enter doesn’t seem to result in any change, so it seems LUKS decryption is halted by whatever this issue is), so SSHD hasn’t loaded yet.

However, I have used dracut-sshd on one of my servers (for remote LUKS decryption), so I decided to give that a try on this system. Thankfully, after setting that up and running dracut -f -v --kver <kernel version> (tested for the 6.7.11 and 6.8.8 kernels), I can now SSH into the desktop for all kernels. In the case of kernel-6.7.11-200.fc39.x86_64, I can type in my LUKS passphrase and decrypt the device (as expected). However, in the case of kernel-6.8.8-200.fc39.x86_64, the LUKS decryption hangs after entering the passphrase (although the system itself doesn’t hang, as I can SSH in using a different session and still run commands while LUKS is frozen), and the desktop does not continue booting.
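
For anyone repeating this, the initramfs rebuild for several kernels can be sketched as a loop around the same dracut command. This assumes dracut-sshd is already installed and configured. The function name is illustrative, and it needs root.

```shell
# Sketch: regenerate the initramfs for several kernels after setting up
# dracut-sshd, using the same dracut invocation quoted above. Run as root.

rebuild_initramfs() {
    # Arguments: one or more kernel versions, e.g. 6.7.11-200.fc39.x86_64
    for kver in "$@"; do
        # Stop at the first failure so a broken image is noticed early.
        dracut -f -v --kver "$kver" || return 1
    done
}

# Example (commented out):
# rebuild_initramfs 6.7.11-200.fc39.x86_64 6.8.8-200.fc39.x86_64
```

Rebuilding the known-good kernel's image as well means there is still a way in over SSH if the newer kernel hangs.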

Thankfully, I can run journalctl commands within the SSH session I mentioned above. I’ve put the output of journalctl -b -0 in that session booted with kernel-6.8.8-200.fc39.x86_64 over on Pastebin (link here) since it’s too long for here.

I’m not that experienced at reading through journalctl output, but what stands out to me are the amdgpu errors that begin popping up at around line 1330. At line 3090 you can see where I SSH’d in to the desktop from a different machine.
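
A quick way to find lines like these in a saved journal is a case-insensitive grep with line numbers. The sketch below assumes the journalctl output was saved to a plain text file. The function names are illustrative.

```shell
# Sketch: list amdgpu-related lines in a saved journal, with line numbers.

amdgpu_lines() {
    # $1 = path to a text file holding journalctl output
    grep -n -i 'amdgpu' "$1"
}

first_amdgpu_line() {
    # Print only the line number of the first match; prints nothing if none.
    grep -n -i 'amdgpu' "$1" | head -n 1 | cut -d: -f1
}

# Example (commented out):
# journalctl -b -0 --no-pager > boot.log
# first_amdgpu_line boot.log
```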

Hopefully some of that is helpful for diagnosis!

I just tried installing the latest Rawhide kernel (kernel-6.9.0-0.rc7.58.fc41.x86_64) to see if the issue was resolved upstream, but it still has the same issue. Thus, this appears to not be exclusively limited to 6.8.X-series kernels.

It would be a good time to file a bugzilla report. It sounds pretty major.

I’ve filed a bug report on Bugzilla now in regards to this issue.

Bug report: 2279621 – Boot fails due to AMDGPU on kernel 6.8

Sounds like you did the right thing. Let us know how it goes.

If the single parameter didn’t get you to a shell, then that seems like it’s hanging, not just failing to switch to graphical mode.

Thank you for reporting this. I’m seeing the exact same thing in combination with Bug 2278534, which feels loosely related.

Update - the issue is caused by revisions in this commit: drm/amd: Use the first non-dGPU PCI device for BW limits · torvalds/linux@466a7d1 · GitHub

That commit dealt with how the PCI speed and bandwidth is established for the GPU. It’s unclear why this affects my specific hardware, but I’ve confirmed this is indeed the culprit.

Refer to the drm/amd bug tracker for more details on this issue: amdgpu driver not being loaded (#3289) · Issues · drm / amd · GitLab

If you happen to be running into this issue, you should be able to use this patch to make your system bootable again:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7753a2e64d41..c17c3204a7bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5852,8 +5852,8 @@ static void amdgpu_device_get_pcie_info(struct amdgpu_device *adev)
 	if (adev->pm.pcie_gen_mask && adev->pm.pcie_mlw_mask)
 		return;
 
-	amdgpu_device_partner_bandwidth(adev, &platform_speed_cap,
-					&platform_link_width);
+	pcie_bandwidth_available(adev->pdev, NULL,
+				 &platform_speed_cap, &platform_link_width);
 
 	if (adev->pm.pcie_gen_mask == 0) {
 		/* asic caps */

This will, however, require you to build a custom (patched) kernel. I found this guide to be particularly useful for building a custom kernel for Fedora. I was building mainline kernels at specific commits, but it should be way easier to use dist-git now that a patch is known (and probably safer too since you’re just slightly modifying a couple lines in an otherwise known-good kernel).
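
Before building, it can help to check whether the source tree you are about to compile still contains the call that the patch replaces. The following is a rough sketch with an illustrative function name. Finding the call only shows that the tree has not been patched this way, not that your hardware is affected.

```shell
# Sketch: report whether a kernel source tree still calls the helper that the
# patch above replaces. $1 = path to the top of the kernel source tree.

uses_partner_bandwidth() {
    # Matches the call site (first argument "adev"), not the function
    # definition, whose parameter list starts with "struct".
    grep -q 'amdgpu_device_partner_bandwidth(adev' \
        "$1/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c"
}

# Example (commented out):
# if uses_partner_bandwidth ~/src/linux; then echo "patch still needed"; fi
```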

Impressive sleuthing. Never had the patience for git-bisect, but I’m sure it’s way more helpful.

Hi, is there any news regarding this issue? I’ve been experiencing this for several months (my last bootable kernel is 6.6.11) and I can’t find how to fix it. Is there a solution that doesn’t involve patching the kernel?

Usually bug fixes appear in newer kernels – rare exceptions are patches that break other systems or introduce security issues.

Try the pre-release Live USB installer for Fedora 41 to see if your system can boot. If not, you should report the problem against Basic Release Criteria: Installer must run.

I’ve tested with the latest kernel, 6.11. The system boots with the acpi=off or noapic boot parameters, but only with one CPU.

I think your system board only supports one CPU, but multiple cores. Now that it boots, please post the output from running inxi -Fzxx in a terminal (as pre-formatted text).

acpi=off or noapic are usually overkill. Some Dell systems support acpi=linux.

Arch Linux ACPI Modules has troubleshooting advice that may help you find more appropriate ACPI settings (Arch Linux is known for high-quality documentation – ACPI settings depend mainly on the system BIOS, not the Linux distro, so most of the Arch document should also apply to Fedora).

Here is the information you asked:

# inxi -Fzxx
System:
Kernel: 6.11.3-100.fc39.x86_64 arch: x86_64 bits: 64 compiler: gcc v: 2.40-14.fc39
Console: pty pts/2 DM: SDDM Distro: Fedora Linux 39 (KDE Plasma)
Machine:
Type: Desktop System: ASUS product: N/A v: N/A serial: N/A
Mobo: ASUSTeK model: ROG CROSSHAIR VIII EXTREME v: Rev X.0x serial: <filter> part-nu: SKU
UEFI: American Megatrends v: 1602 date: 08/29/2024
Battery:
Device-1: hidpp_battery_0 model: Logitech MX Ergo Multi-Device Trackball serial: <filter>
charge: 50% (should be ignored) status: N/A
CPU:
Info: single core model: AMD Ryzen 9 5950X bits: 64 type: UP arch: Zen 3+ rev: 2 cache:
L1: 64 KiB L2: 512 KiB L3: 32 MiB
Speed (MHz): 550 min/max: 550/5084 boost: enabled core: 1: 550 bogomips: 6787
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3
Graphics:
Device-1: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] vendor: Sapphire NITRO+
driver: amdgpu v: kernel arch: RDNA-3 pcie: speed: 16 GT/s lanes: 16 ports: active: HDMI-A-1
empty: DP-1, DP-2, HDMI-A-2, Writeback-1 bus-ID: 45:00.0 chip-ID: 1002:744c
Display: server: X.org v: 1.20.14 with: Xwayland v: 23.2.7 compositor: kwin_x11 driver: X:
loaded: amdgpu unloaded: fbdev,modesetting,radeon,vesa dri: radeonsi gpu: amdgpu tty: 139x34
Monitor-1: HDMI-A-1 model: Samsung SE790C res: 3440x1440 dpi: 110 diag: 864mm (34")
API: EGL v: 1.5 platforms: device: 0 drv: radeonsi device: 1 drv: swrast surfaceless:
drv: radeonsi inactive: gbm,wayland,x11
API: OpenGL v: 4.6 compat-v: 4.5 vendor: mesa v: 23.3.6 note: console (EGL sourced)
renderer: AMD Radeon RX 7900 XTX (radeonsi navi31 LLVM 17.0.6 DRM 3.59 6.11.3-100.fc39.x86_64),
llvmpipe (LLVM 17.0.6 256 bits)
API: Vulkan v: 1.3.275 surfaces: N/A device: 0 type: discrete-gpu driver: N/A
device-ID: 1002:744c device: 1 type: cpu driver: N/A device-ID: 10005:0000
Audio:
Device-1: AMD Navi 31 HDMI/DP Audio driver: snd_hda_intel v: kernel pcie: speed: 16 GT/s
lanes: 16 bus-ID: 45:00.1 chip-ID: 1002:ab30
Device-2: AMD Starship/Matisse HD Audio driver: snd_hda_intel v: kernel pcie: speed: 16 GT/s
lanes: 16 bus-ID: 47:00.4 chip-ID: 1022:1487
Device-3: ASUSTek USB Audio driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 2.0
speed: 480 Mb/s lanes: 1 bus-ID: 7-5:3 chip-ID: 0b05:1a27
API: ALSA v: k6.11.3-100.fc39.x86_64 status: kernel-api
Server-1: PipeWire v: 1.0.8 status: off with: 1: pipewire-pulse status: active 2: wireplumber
status: active 3: pipewire-alsa type: plugin 4: pw-jack type: plugin
Network:
Device-1: Aquantia AQtion AQC113CS NBase-T/IEEE 802.3an Ethernet [Antigua 10G] vendor: ASUSTeK
ProArt X570-CREATOR WIFI driver: atlantic v: kernel pcie: speed: 16 GT/s lanes: 2 port: N/A
bus-ID: 3d:00.0 chip-ID: 1d6a:94c0
IF: enp61s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Device-2: Intel Wi-Fi 6E AX210/AX1675 2x2 [Typhoon Peak] driver: iwlwifi v: kernel pcie:
speed: 5 GT/s lanes: 1 bus-ID: 3e:00.0 chip-ID: 8086:2725
IF: wlp62s0 state: down mac: <filter>
Device-3: Intel Ethernet I225-V vendor: ASUSTeK driver: igc v: kernel pcie: speed: 5 GT/s
lanes: 1 port: N/A bus-ID: 3f:00.0 chip-ID: 8086:15f3
IF: enp63s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
IF-ID-1: bond0 state: up speed: 2000 Mbps duplex: full mac: <filter>
IF-ID-2: bonding_masters state: N/A speed: N/A duplex: N/A mac: N/A
IF-ID-3: bridge0 state: up speed: 2000 Mbps duplex: unknown mac: <filter>
Bluetooth:
Device-1: Intel AX210 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0 speed: 12 Mb/s lanes: 1
bus-ID: 7-2:2 chip-ID: 8087:0032
Report: btmgmt ID: hci0 rfk-id: 0 state: up address: <filter> bt-v: 5.3 lmp-v: 12
Drives:
Local Storage: total: 10.01 TiB used: 1.26 TiB (12.6%)
ID-1: /dev/nvme0n1 vendor: Western Digital model: WD BLACK SN850X 2000GB size: 1.82 TiB
speed: 63.2 Gb/s lanes: 4 serial: <filter> temp: 47.9 C
ID-2: /dev/sda vendor: Seagate model: ST8000NM0055-1RM112 size: 7.28 TiB speed: 6.0 Gb/s
serial: <filter>
ID-3: /dev/sdb vendor: Western Digital model: WD5003ABYX-01WERA1 size: 465.76 GiB
speed: 3.0 Gb/s serial: <filter> temp: 43 C
ID-4: /dev/sdc vendor: Samsung model: SSD 860 EVO 500GB size: 465.76 GiB speed: 6.0 Gb/s
serial: <filter> temp: 33 C
Partition:
ID-1: / size: 97.87 GiB used: 55.51 GiB (56.7%) fs: ext4 dev: /dev/nvme0n1p1
ID-2: /boot/efi size: 1022 MiB used: 21.3 MiB (2.1%) fs: vfat dev: /dev/nvme0n1p3
ID-3: /home size: 1.69 TiB used: 1.21 TiB (71.2%) fs: ext4 dev: /dev/nvme0n1p2
Swap:
ID-1: swap-1 type: zram size: 8 GiB used: 8 MiB (0.1%) priority: 100 dev: /dev/zram0
Sensors:
System Temperatures: cpu: 44.0 C mobo: 46.0 C gpu: amdgpu temp: 38.0 C mem: 40.0 C
Fan Speeds (rpm): fan-1: 760 fan-2: 767 fan-3: 636 fan-4: 0 fan-5: 0 fan-6: 0 fan-7: 0
gpu: amdgpu fan: 0
Info:
Memory: total: 128 GiB available: 125.7 GiB used: 4.07 GiB (3.2%)
Processes: 283 Power: uptime: 6h 16m wakeups: 0 Init: systemd v: 254 target: graphical (5)
default: graphical
Packages: pm: flatpak pkgs: 23 Compilers: clang: 17.0.6 gcc: 13.3.1 Shell: Bash v: 5.2.26
running-in: pty pts/2 inxi: 3.3.34

I will try with the Arch page you linked to test some things

Please show us inxi -Cza, which should include processor topology. I’m not sure if Fedora 39 has the current inxi version – you can compare with lscpu. Can you boot the Fedora 41 beta installer? If so, you can install inxi and see if the results agree with what you see on Fedora 39.

This is the output of inxi -Cza. The command doesn’t show the real topology of the processor. It is a 16-core/32-thread CPU.

# inxi -Cza
CPU:
  Info: model: AMD Ryzen 9 5950X socket: AM4 bits: 64 type: UP arch: Zen 3+ gen: 4 level: v3
    note: check built: 2022 process: TSMC n6 (7nm) family: 0x19 (25) model-id: 0x21 (33)
    stepping: 2 microcode: 0xA201210
  Topology: cpus: 1x cores: 1 smt: <unsupported> cache: L1: 64 KiB
    desc: d-1x32 KiB; i-1x32 KiB L2: 512 KiB desc: 1x512 KiB L3: 32 MiB desc: 1x32 MiB
  Speed (MHz): 4695 min/max: 550/5084 boost: enabled base/boost: 3400/5050 scaling:
    driver: amd-pstate-epp governor: powersave volts: 1.1 V ext-clock: 100 MHz core: 1: 4695
    bogomips: 6787
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3
  Vulnerabilities:
  Type: gather_data_sampling status: Not affected
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: reg_file_data_sampling status: Not affected
  Type: retbleed status: Not affected
  Type: spec_rstack_overflow mitigation: Safe RET
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
  Type: spectre_v2 mitigation: Retpolines; IBPB: conditional; IBRS_FW; STIBP: disabled; RSB
    filling; PBRSB-eIBRS: Not affected; BHI: Not affected
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected

lscpu and /proc/cpuinfo don’t show the correct information either.

Tomorrow I will share the info you requested from fedora 41
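
To compare what the kernel actually brought up against what the processor should offer (32 threads for the 5950X mentioned above), a small check along these lines may be handy when testing different boot parameters. It is a sketch only, and the function names are illustrative.

```shell
# Sketch: count the logical CPUs listed in a cpuinfo-style file and compare
# with the number expected for the processor.

logical_cpus() {
    # $1 = file to read; defaults to /proc/cpuinfo when empty or missing
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

check_cpus() {
    # $1 = expected logical CPU count, $2 = optional cpuinfo-style file
    found=$(logical_cpus "$2")
    if [ "$found" -eq "$1" ]; then
        echo "ok: $found logical CPUs"
    else
        echo "mismatch: found $found, expected $1"
    fi
}

# Example (commented out):
# check_cpus 32
```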