Amdgpu: Fatal error during GPU init on 5.19.8 (works on 5.19.6)

Hey everyone,

After a recent kernel update to 5.19.8, my AMD GPU no longer works in Fedora. The system still boots (with a black screen during RHGB), but it’s stuck at a lower resolution with what looks like software compositing. The dmesg output shows the following:

Kernel 5.19.8 boot
[    6.182789] [drm] Initialized simpledrm 1.0.0 20200625 for simple-framebuffer.0 on minor 0
[    6.182973] simple-framebuffer simple-framebuffer.0: [drm] fb0: simpledrmdrmfb frame buffer device
[    7.459139] AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.
[    8.497633] [drm] amdgpu kernel modesetting enabled.
[    8.497721] amdgpu: CRAT table not found
[    8.497723] amdgpu: Virtual CRAT table created for CPU
[    8.497732] amdgpu: Topology: Add CPU node
[    8.509404] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[    8.509497] [drm] initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1002:0x0B36 0xC4).
[    8.509504] [drm] register mmio base: 0x50300000
[    8.509505] [drm] register mmio size: 524288
[    8.510544] [drm] add ip block number 0 <nv_common>
[    8.510545] [drm] add ip block number 1 <gmc_v10_0>
[    8.510546] [drm] add ip block number 2 <navi10_ih>
[    8.510547] [drm] add ip block number 3 <psp>
[    8.510548] [drm] add ip block number 4 <smu>
[    8.510549] [drm] add ip block number 5 <dm>
[    8.510549] [drm] add ip block number 6 <gfx_v10_0>
[    8.510550] [drm] add ip block number 7 <sdma_v5_0>
[    8.510551] [drm] add ip block number 8 <vcn_v2_0>
[    8.510551] [drm] add ip block number 9 <jpeg_v2_0>
[    8.510560] amdgpu 0000:03:00.0: No more image in the PCI ROM
[    8.510572] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    8.510573] amdgpu: ATOM BIOS: 113-D1820201-101
[    8.510579] [drm] VCN decode is enabled in VM mode
[    8.510580] [drm] VCN encode is enabled in VM mode
[    8.510580] [drm] JPEG decode is enabled in VM mode
[    8.510581] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    8.510598] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    8.510602] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    8.510603] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    8.510604] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    8.510611] [drm] Detected VRAM RAM=8176M, BAR=256M
[    8.510611] [drm] RAM width 256bits GDDR6
[    8.510635] [drm] amdgpu: 8176M of VRAM memory ready
[    8.510636] [drm] amdgpu: 7965M of GTT memory ready.
[    8.510640] [drm] GART: num cpu pages 131072, num gpu pages 131072
[    8.510710] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[    8.510778] amdgpu 0000:03:00.0: Direct firmware load for amdgpu/navi10_sos.bin failed with error -2
[    8.510781] amdgpu 0000:03:00.0: amdgpu: failed to init sos firmware
[    8.510781] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
[    8.510957] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <psp> failed -2
[    8.511123] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    8.511124] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[    8.511125] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[    8.511271] amdgpu: probe of 0000:03:00.0 failed with error -2
[    8.511343] [drm] amdgpu: ttm finalized
[   10.106075] systemd[1]: Starting modprobe@drm.service - Load Kernel Module drm...

Thankfully, Fedora keeps previous kernels installed, and booting with the older version makes everything work, including the GPU:

Kernel 5.19.6 boot
[    6.248132] [drm] Initialized simpledrm 1.0.0 20200625 for simple-framebuffer.0 on minor 0
[    6.248339] simple-framebuffer simple-framebuffer.0: [drm] fb0: simpledrmdrmfb frame buffer device
[    7.405006] AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.
[    8.411044] [drm] amdgpu kernel modesetting enabled.
[    8.411133] amdgpu: CRAT table not found
[    8.411135] amdgpu: Virtual CRAT table created for CPU
[    8.411141] amdgpu: Topology: Add CPU node
[    8.423868] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[    8.423957] [drm] initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1002:0x0B36 0xC4).
[    8.423963] [drm] register mmio base: 0x50300000
[    8.423964] [drm] register mmio size: 524288
[    8.425016] [drm] add ip block number 0 <nv_common>
[    8.425018] [drm] add ip block number 1 <gmc_v10_0>
[    8.425018] [drm] add ip block number 2 <navi10_ih>
[    8.425019] [drm] add ip block number 3 <psp>
[    8.425019] [drm] add ip block number 4 <smu>
[    8.425020] [drm] add ip block number 5 <dm>
[    8.425020] [drm] add ip block number 6 <gfx_v10_0>
[    8.425021] [drm] add ip block number 7 <sdma_v5_0>
[    8.425022] [drm] add ip block number 8 <vcn_v2_0>
[    8.425023] [drm] add ip block number 9 <jpeg_v2_0>
[    8.425031] amdgpu 0000:03:00.0: No more image in the PCI ROM
[    8.425043] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    8.425044] amdgpu: ATOM BIOS: 113-D1820201-101
[    8.425050] [drm] VCN decode is enabled in VM mode
[    8.425051] [drm] VCN encode is enabled in VM mode
[    8.425051] [drm] JPEG decode is enabled in VM mode
[    8.425052] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    8.425070] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    8.425074] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    8.425075] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    8.425077] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    8.425082] [drm] Detected VRAM RAM=8176M, BAR=256M
[    8.425083] [drm] RAM width 256bits GDDR6
[    8.425106] [drm] amdgpu: 8176M of VRAM memory ready
[    8.425107] [drm] amdgpu: 7965M of GTT memory ready.
[    8.425112] [drm] GART: num cpu pages 131072, num gpu pages 131072
[    8.425181] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[    8.434226] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[    8.434229] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[    8.464178] [drm] Found VCN firmware Version ENC: 1.17 DEC: 5 VEP: 0 Revision: 2
[    8.464183] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[    8.515369] [drm] reserve 0x900000 from 0x81fe400000 for PSP TMR
[    8.557166] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    8.563168] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[    8.563169] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    8.563218] amdgpu 0000:03:00.0: amdgpu: use vbios provided pptable
[    8.563219] amdgpu 0000:03:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[    8.598649] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[    8.598794] [drm] Display Core initialized with v3.2.187!
[    8.656942] [drm] kiq ring mec 2 pipe 1 q 0
[    8.659069] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[    8.659454] [drm] JPEG decode initialized successfully.
[    8.661072] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    8.661196] amdgpu: sdma_bitmap: ffff
[    8.690262] amdgpu: HMM registered 8176MB device memory
[    8.690388] amdgpu: SRAT table not found
[    8.690389] amdgpu: Virtual CRAT table created for GPU
[    8.690504] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[    8.690507] kfd kfd: amdgpu: added device 1002:731f
[    8.690521] amdgpu 0000:03:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 36
[    8.690577] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    8.690578] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    8.690579] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    8.690580] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    8.690580] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    8.690581] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    8.690581] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    8.690582] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    8.690583] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    8.690584] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    8.690584] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    8.690585] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[    8.690586] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[    8.690586] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[    8.690587] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[    8.690588] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[    8.691545] amdgpu 0000:03:00.0: amdgpu: Using BACO for runtime pm
[    8.692130] [drm] Initialized amdgpu 3.47.0 20150101 for 0000:03:00.0 on minor 0
[    8.696228] fbcon: amdgpudrmfb (fb0) is primary device
[    8.696274] [drm] DSC precompute is not needed.
[    8.837141] amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[   10.515630] systemd[1]: Starting modprobe@drm.service - Load Kernel Module drm...
[   11.184293] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])

I’ve been dual-booting Windows a lot over the past week, and I had a couple of improper shutdowns during OS boot after the GPU froze under stress (maybe that’s why loading the amdgpu/navi10_sos.bin firmware now fails). So I can’t say for sure whether 5.19.8 worked at first and somehow got corrupted, or whether it was broken from the start.

The GPU itself works without any issues on Windows and Fedora with 5.19.6, so I have a couple of questions:

  • Can I verify the installation of kernel and firmware to rule out corruption from an improper shutdown?
  • How can I downgrade the kernel to 5.19.6 or make it the default option on boot, so that I don’t need to go through GRUB every time?
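
For both questions, here is a minimal sketch of what I would try, assuming the stock Fedora tools rpm and grubby are available. The helper names (run, verify_packages, set_default_kernel) are made up for illustration, and the helpers only print the commands when DRY_RUN=1 is set:

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Verify installed files against the RPM database.
# rpm -V prints nothing for a package whose files all match.
verify_packages() {
    for pkg in "$@"; do
        run rpm -V "$pkg"
    done
}

# Make an installed kernel the default boot entry.
# $1 is a full kernel release string, in the form printed by `uname -r`.
set_default_kernel() {
    run sudo grubby --set-default "/boot/vmlinuz-$1"
}
```

For example, `verify_packages kernel-core linux-firmware` checks the kernel and firmware files, and `set_default_kernel "$(uname -r)"`, run while booted into 5.19.6, keeps that kernel as the default. `sudo grubby --default-kernel` shows which kernel is currently selected.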

After a bit of searching I found the issue being discussed here:
https://bugzilla.redhat.com/show_bug.cgi?id=2125536#c18

Steps to fix:

  1. Install the amd-gpu-firmware package
  2. Run sudo dracut --force to rebuild the initramfs while running 5.19.8
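
The two steps above can be sketched as a small script. The function names are my own, and the check assumes the firmware is installed under /lib/firmware either uncompressed or with an .xz or .zst suffix, which may not match every setup:

```shell
# Succeed if firmware NAME exists under DIR, plain or compressed.
# The suffix list is an assumption about how the files are packaged.
firmware_present() {
    dir=$1
    name=$2
    for f in "$dir/$name" "$dir/$name.xz" "$dir/$name.zst"; do
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}

# The fix from the list above: install the firmware package, then
# rebuild the initramfs for the currently running kernel.
apply_fix() {
    sudo dnf install amd-gpu-firmware &&
        sudo dracut --force
}
```

After `apply_fix`, running `firmware_present /lib/firmware amdgpu/navi10_sos.bin && echo found` should print "found" if the package put the file in place. `sudo lsinitrd | grep navi10_sos` can then be used to check whether the rebuilt initramfs picked it up.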