What caused my kernel to log an exception?

Context

I recently hit a kernel bug that I can’t report:

  1. #!/usr/bin/env sh
    abrt info 002f75f
    
  2. Id            002f75f  
    Component     kernel  
    Count         1  
    Time          2025-08-29 11:55:05  
    Command line  BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.15.10-200.fc42.x86_64 root=UUID=0548c378-fdad-49db-9a7b-2e4779892f93 ro rootflags=subvol=root rd.luks.uuid=luks-31b94f4a-915c-47ee-af7a-a7c948b23cf4 rhgb quiet  
    Package       kernel-core-6.15.10-200.fc42  
    Path          /var/spool/abrt/oops-2025-08-29-11:55:05-3117-0  
                  Not reportable  
                  A kernel problem occurred, but your kernel has been tainted (flags:GW). Explanation:  
                  W - Kernel issued warning.  
                  Kernel maintainers are unable to diagnose tainted reports.
    

It is strange not only because it is unreportable, but also because I appear to have hit it only once. Can anyone explain what might have caused it? I ask because I’m unable to read most stack traces.

Diagnosis

  1. #!/usr/bin/env sh
    abrt bt 002f75f
    
  2. WARNING: CPU: 10 PID: 152466 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8997 amdgpu_dm_atomic_commit_tail+0xf38/0x1150 [amdgpu]
    Modules linked in: squashfs hid_logitech_hidpp uinput rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct uhid nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables qrtr bnep binfmt_misc vfat fat snd_sof_amd_acp70 snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof amd_atl intel_rapl_msr snd_sof_utils intel_rapl_common snd_pci_ps cros_usbpd_logger leds_cros_ec cros_usbpd_charger snd_soc_acpi_amd_match cros_charge_control cros_ec_sysfs led_class_multicolor gpio_cros_ec snd_amd_sdw_acpi cros_ec_hwmon cros_ec_chardev cros_usbpd_notify soundwire_amd snd_hda_codec_realtek soundwire_generic_allocation soundwire_bus mt7921e snd_hda_codec_generic mt7921_common snd_soc_sdca snd_hda_scodec_component edac_mce_amd mt792x_lib snd_soc_core btusb snd_hda_codec_hdmi mt76_connac_lib
     btrtl cros_ec_dev uvcvideo btintel snd_hda_intel snd_compress mt76 btbcm ac97_bus kvm_amd uvc snd_pcm_dmaengine snd_intel_dspcfg btmtk videobuf2_vmalloc cros_ec_lpcs snd_intel_sdw_acpi snd_rpl_pci_acp6x videobuf2_memops spd5118 snd_hda_codec cros_ec snd_acp_pci videobuf2_v4l2 mac80211 bluetooth snd_amd_acpi_mach hid_sensor_als kvm snd_hda_core snd_acp_legacy_common videobuf2_common hid_sensor_trigger snd_hwdep snd_pci_acp6x hid_sensor_iio_common snd_seq snd_pci_acp5x videodev industrialio_triggered_buffer libarc4 kfifo_buf snd_seq_device snd_rn_pci_acp3x irqbypass snd_acp_config industrialio rapl i2c_piix4 snd_pcm wmi_bmof snd_soc_acpi thunderbolt pcspkr mc cfg80211 k10temp i2c_smbus snd_pci_acp3x snd_timer amd_pmf snd amdtee rfkill soundcore amd_sfh tee platform_profile amd_pmc joydev loop zram lz4hc_compress lz4_compress dm_crypt typec_displayport typec_thunderbolt amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec drm_suballoc_helper drm_panel_backlight_quirks drm_buddy drm_display_helper cec
     cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet nvme polyval_clmulni nvme_core polyval_generic ghash_clmulni_intel amdxdna sha512_ssse3 hid_multitouch sha256_ssse3 ucsi_acpi hid_sensor_hub nvme_keyring typec_ucsi sha1_ssse3 gpu_sched sp5100_tco typec nvme_auth video i2c_hid_acpi wmi i2c_hid r8152 mii sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse i2c_dev nfnetlink
    CPU: 10 UID: 0 PID: 152466 Comm: kworker/u64:44 Tainted: G        W           6.15.10-200.fc42.x86_64 #1 PREEMPT(lazy) 
    Tainted: [W]=WARN
    Hardware name: Framework Laptop 16 (AMD Ryzen 7040 Series)/FRANMZCP07, BIOS 03.05 11/13/2024
    Workqueue: async async_run_entry_fn
    RIP: 0010:amdgpu_dm_atomic_commit_tail+0xf38/0x1150 [amdgpu]
    Code: 95 70 fe ff ff e8 c8 67 75 c3 4c 8b 95 70 fe ff ff e9 68 fd ff ff 4c 89 e7 e8 64 53 74 c3 e9 8c f5 ff ff 0f 0b e9 8c f3 ff ff <0f> 0b e9 0e f4 ff ff 0f 0b e9 f7 f3 ff ff 0f 0b e9 48 f7 ff ff 49
    RSP: 0018:ffffd10f482439a0 EFLAGS: 00010282
    RAX: 00000000ffffffea RBX: ffff893154204c00 RCX: 0000000000000000
    RDX: 0000000000000286 RSI: 0000000000000286 RDI: ffff893256d0015c
    RBP: ffffd10f48243b60 R08: ffff893257afa000 R09: ffff893154204c00
    R10: ffff893256d45908 R11: 000000000000000a R12: ffff8932579b4000
    R13: ffff89376f5e7a00 R14: 0000000000000000 R15: 0000000000000000
    FS:  0000000000000000(0000) GS:ffff893a17770000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f6560dd7eb8 CR3: 000000030942e000 CR4: 0000000000f50ef0
    PKRU: 55555554
    Call Trace:
     <TASK>
     commit_tail+0xad/0x150
     drm_atomic_helper_commit+0x13c/0x180
     drm_atomic_commit+0xae/0xe0
     ? __pfx___drm_printfn_info+0x10/0x10
     drm_atomic_helper_commit_duplicated_state+0xde/0xf0
     drm_atomic_helper_resume+0xa1/0x170
     dm_resume+0x398/0xaa0 [amdgpu]
     amdgpu_ip_block_resume+0x24/0x50 [amdgpu]
     amdgpu_device_ip_resume_phase3+0x88/0x110 [amdgpu]
     amdgpu_device_resume+0xae/0x2e0 [amdgpu]
     ? __pfx_pci_pm_resume+0x10/0x10
     amdgpu_pmops_resume+0x46/0x80 [amdgpu]
     ? __pfx_pci_pm_resume+0x10/0x10
     dpm_run_callback+0x47/0x140
     device_resume+0x152/0x250
     async_resume+0x21/0x30
     async_run_entry_fn+0x31/0x130
     process_one_work+0x18f/0x350
     worker_thread+0x25a/0x3a0
     ? __pfx_worker_thread+0x10/0x10
     kthread+0xf9/0x240
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x31/0x50
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    

See https://linux-audit.com/kernel/faq/how-to-find-the-specific-cause-of-a-tainted-kernel/, which has a table explaining the taint flags, or parse /proc/sys/kernel/tainted. Note that “G” means that only GPL-compatible modules are loaded; “P” is the flag for “Proprietary kernel module loaded”. Proprietary and other out-of-tree modules are usually located in /lib/modules/<kernel_version>/extra/.
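As a rough sketch of parsing the taint value, the function below turns a bitmask into letter flags. The name decode_taint is made up for this example, and the letter order is taken from the kernel's "Tainted kernels" table for bits 0 to 17 only, so treat it as an illustration rather than a complete decoder.

```shell
#!/usr/bin/env sh
# Decode a kernel taint bitmask (as found in /proc/sys/kernel/tainted)
# into letter flags. decode_taint is a hypothetical helper name.
# Bit 0 is shown as G when clear and P when set; only bits 0-17 are handled.
decode_taint() {
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "not tainted"
        return 0
    fi
    flags=""
    bit=0
    for letter in P F S R M B U D A W C I O E L K X T; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            flags="${flags}${letter}"
        elif [ "$bit" -eq 0 ]; then
            flags="G"
        fi
        bit=$((bit + 1))
    done
    echo "$flags"
}

# Example: 512 has only bit 9 set, which is the W (warning) flag.
decode_taint 512
```

On a running system it could be called as decode_taint "$(cat /proc/sys/kernel/tainted)". The value 512 corresponds to the "GW" flags shown in the abrt report above.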

@gnwiii, not in this case:

  1. #!/usr/bin/env sh
    tree -aL1 /lib/modules/$(uname -r)
    
    RokeJulianLockhart@Beedell:/lib/modules/6.16.3-200.fc42.x86_64$ tree -aL1 /lib/modules/$(uname -r)
    /lib/modules/6.16.3-200.fc42.x86_64
    ├── build -> /usr/src/kernels/6.16.3-200.fc42.x86_64
    ├── config
    ├── kernel
    ├── modules.alias
    ├── modules.alias.bin
    ├── modules.block
    ├── modules.builtin
    ├── modules.builtin.alias.bin
    ├── modules.builtin.bin
    ├── modules.builtin.modinfo
    ├── modules.dep
    ├── modules.dep.bin
    ├── modules.devname
    ├── modules.drm
    ├── modules.modesetting
    ├── modules.networking
    ├── modules.order
    ├── modules.softdep
    ├── modules.symbols
    ├── modules.symbols.bin
    ├── modules.weakdep
    ├── source -> build
    ├── symvers.xz
    ├── System.map
    ├── systemtap
    ├── updates
    ├── vdso
    ├── vmlinuz
    ├── .vmlinuz.hmac
    └── weak-updates
    
    6 directories, 25 files
    
    1. #!/usr/bin/env sh
      file /lib/modules/$(uname -r)/extra
      
    2. /lib/modules/6.16.3-200.fc42.x86_64/extra: cannot open `/lib/modules/6.16.3-200.fc42.x86_64/extra' (No such file or directory)

I imagine that is because the running kernel doesn’t currently appear to be tainted:

  1. sudo dmesg | grep Tainted returned nothing.

  2. #!/usr/bin/env sh
    for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
    
    RokeJulianLockhart@Beedell:/lib/modules/6.16.3-200.fc42.x86_64$ for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
    0 0
    1 0
    2 0
    3 0
    4 0
    5 0
    6 0
    7 0
    8 0
    9 0
    10 0
    11 0
    12 0
    13 0
    14 0
    15 0
    16 0
    17 0
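The taint state is kept per boot and is reset by a reboot, and the report was logged under 6.15.10 while the prompt above shows 6.16.3, so an all-zero result on the current boot is expected. A minimal sketch for looking at the earlier boot instead follows; taint_lines is a hypothetical helper name, and the journalctl usage assumes a persistent systemd journal.

```shell
#!/usr/bin/env sh
# Keep only kernel log lines that mention a warning or a taint state.
# taint_lines is a made-up helper name; it reads a log on stdin.
taint_lines() {
    grep -E 'WARNING: |Tainted: '
}

# Usage (assumes the journal still holds the boot in question):
#   journalctl -k --since "2025-08-29 11:50" --until "2025-08-29 12:00" | taint_lines
```

The time window is simply a few minutes either side of the Time field in the abrt report.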
    

You already posted a list of modules from the abrt report. It is possible that one of them is somehow misconfigured or corrupt. If your root uses btrfs, boot a Live USB and use btrfs check --readonly <filesystem> to rule out bitrot.

Per https://www.kernel.org/doc/html/v6.16-rc1/admin-guide/module-signing.html,

If CONFIG_MODULE_SIG_FORCE is enabled or module.sig_enforce=1 is supplied on the kernel command line, the kernel will only load validly signed modules for which it has a public key. Otherwise, it will also load modules that are unsigned.

Here:

% grep CONFIG_MODULE_SIG_FORCE /lib/modules/*/config                     
/lib/modules/6.15.10-200.fc42.x86_64/config:# CONFIG_MODULE_SIG_FORCE is not set
/lib/modules/6.16.2-200.fc42.x86_64/config:# CONFIG_MODULE_SIG_FORCE is not set
/lib/modules/6.16.3-200.fc42.x86_64/config:# CONFIG_MODULE_SIG_FORCE is not set

Assuming none of the modules are rotten, try adding module.sig_enforce=1 to the kernel command line.
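A hedged sketch of how that might look on Fedora: the grubby commands in the comments assume grubby manages the boot entries, and has_sig_enforce is a hypothetical helper that only checks whether a command-line string contains the parameter.

```shell
#!/usr/bin/env sh
# To add the parameter (assumes grubby manages the boot entries):
#   sudo grubby --update-kernel=ALL --args="module.sig_enforce=1"
# To remove it again later:
#   sudo grubby --update-kernel=ALL --remove-args="module.sig_enforce=1"

# Report whether a kernel command line enables module signature enforcement.
# has_sig_enforce is a made-up name; $1 is the command line as one string.
has_sig_enforce() {
    case " $1 " in
        *" module.sig_enforce=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

After rebooting, has_sig_enforce "$(cat /proc/cmdline)" && echo "enforcing" would confirm that the parameter took effect. With enforcement on, a module that fails signature verification should be refused at load time.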

Your kernel issued a warning, which causes it to be marked as tainted.

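That particular warning is raised when the amdgpu driver is finishing a DRM (Direct Rendering Manager) atomic commit, i.e. programming a new display state into the hardware, and detects a problem. In your trace this happened while the GPU was resuming from suspend (amdgpu_pmops_resume → dm_resume → drm_atomic_helper_resume). A comment at the top of the function amdgpu_dm_atomic_commit_tail says: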

This will tell DC to commit the constructed DC state from atomic_check, programming the hardware. Any failures here implies a hardware failure, since atomic check should have filtered anything non-kosher.

So your AMD GPU apparently had some kind of hardware hiccup. If it has only happened once, I wouldn’t worry about it.
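For anyone who finds the raw trace hard to read, a small filter can reduce it to the call-trace frames that belong to one module. This is only a sketch: module_frames is a hypothetical helper, and it assumes frames look like "name+0xoff/0xsize [module]", as they do in the backtrace posted above.

```shell
#!/usr/bin/env sh
# Print the function names of call-trace frames that belong to one module.
# module_frames is a made-up name; $1 is the module, the trace comes on stdin.
module_frames() {
    module=$1
    # Match "  [? ]function+0xOFF/0xSIZE [module]" and keep only the function.
    sed -n "s/^[[:space:]]*\(? \)\{0,1\}\([A-Za-z_][A-Za-z0-9_.]*\)+0x[0-9a-f]*\/0x[0-9a-f]* \[$module][[:space:]]*\$/\2/p"
}

# Usage, with the report ID from this thread:
#   abrt bt 002f75f | module_frames amdgpu
```

Read from the bottom up, the amdgpu frames in this report go from amdgpu_pmops_resume to dm_resume, which is what places the warning on the resume path.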

@gnwiii, although @jjames’s response likely renders this redundant:

        1. #!/usr/bin/env sh
          lsblk
          
        2. NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
          loop0         7:0    0   2.4G  1 loop /run/rootfsbase
          sda           8:0    1  14.5G  0 disk
          ├─sda1        8:1    1   2.6G  0 part /run/initramfs/live
          └─sda2        8:2    1    30M  0 part
          zram0       251:0    0     8G  0 disk [SWAP]
          nvme0n1     259:0    0 931.5G  0 disk
          ├─nvme0n1p1 259:1    0   600M  0 part
          ├─nvme0n1p2 259:2    0     1G  0 part
          └─nvme0n1p3 259:3    0 929.9G  0 part
          
        3. #!/usr/bin/env sh
          sudo cryptsetup luksOpen /dev/nvme0n1p3 cryptroot && \
          lsblk
          
        4. NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
          loop0           7:0    0   2.4G  1 loop  /run/rootfsbase
          sda             8:0    1  14.5G  0 disk
          ├─sda1          8:1    1   2.6G  0 part  /run/initramfs/live
          └─sda2          8:2    1    30M  0 part
          zram0         251:0    0     8G  0 disk  [SWAP]
          nvme0n1       259:0    0 931.5G  0 disk
          ├─nvme0n1p1   259:1    0   600M  0 part
          ├─nvme0n1p2   259:2    0     1G  0 part
          └─nvme0n1p3   259:3    0 929.9G  0 part
            └─cryptroot 252:0    0 929.9G  0 crypt
          
        1. #!/usr/bin/env sh
          sudo btrfs check --readonly /dev/mapper/cryptroot
          
        2. Opening filesystem to check...
          Checking filesystem on /dev/mapper/cryptroot
          UUID: 0548c378-fdad-49db-9a7b-2e4779892f93
          [1/8] checking log skipped (none written)
          [2/8] checking root items
          [3/8] checking extents
          [4/8] checking free space tree
          [5/8] checking fs roots
          [6/8] checking only csums items (without verifying data)
          [7/8] checking root refs
          [8/8] checking quota groups skipped (not enabled on this FS)
          found 219027546112 bytes used, no error found
          total csum bytes: 211609660
          total tree bytes: 1566031872
          total fs tree bytes: 1219887104
          total extent tree bytes: 101728256
          btree space waste bytes: 274194642
          file data blocks allocated: 320589209600
           referenced 274643537920
          
  1. Because:

    1. #!/usr/bin/env sh
      grep CONFIG_MODULE_SIG_FORCE /lib/modules/*/config
      
    2. /lib/modules/6.15.10-200.fc42.x86_64/config:# CONFIG_MODULE_SIG_FORCE is not set
      /lib/modules/6.15.9-201.fc42.x86_64/config:# CONFIG_MODULE_SIG_FORCE is not set
      /lib/modules/6.16.3-200.fc42.x86_64/config:# CONFIG_MODULE_SIG_FORCE is not set
      

    …I’ll do so. However, what am I looking for? (That it boots successfully with that enabled?)

If you had an incorrectly configured module, this should produce an error. Not all modules are loaded at boot time, so you would need to reproduce the conditions under which the abrt report was generated.

We are in a period of high solar activity. At work we had SGI IRIX64 systems with ECC RAM and would get ECC notices during periods with high solar activity. See:
https://www.natureworldnews.com/articles/62449/20241113/beyond-the-aurora-how-do-solar-flares-affect-electronics.htm.

If your system crashes again, note the time and what you were doing and compare with the previous crash.

You can also check to see if any of the modules listed in the abrt have been updated recently.

@gnwiii, I’ve just reproduced it (ignore the entry beneath it):

However, because abrt appears to discard duplicate recurrences, I don’t know whether I can glean anything from it. Only gnome-abrt reports what that screenshot depicts: abrt displays the old data, and gnome-abrt directs me to the retained directory of the old report.

This FW16 has passed memtest86+-7.20-2.fc42.x86_64 4 times. [1]

How do you advise that I do so?
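One possible way to check whether the listed modules were updated recently, sketched under the assumption of an RPM-based system such as Fedora with modinfo and rpm installed, is to ask which package owns each module file and when that package was installed. The name module_packages is hypothetical.

```shell
#!/usr/bin/env sh
# For each named kernel module, print the owning package and its install time.
# module_packages is a made-up helper name; assumes modinfo and rpm exist.
module_packages() {
    for module in "$@"; do
        # modinfo -n prints the path of the module file for the running kernel.
        path=$(modinfo -n "$module" 2>/dev/null) || continue
        # Skip built-in or unknown modules, which have no file on disk.
        [ -f "$path" ] || continue
        rpm -qf --queryformat '%{INSTALLTIME:date}  %{NAME}-%{VERSION}-%{RELEASE}\n' "$path"
    done | sort -u
}

# Usage, with a few modules from the backtrace:
#   module_packages amdgpu mt7921e btusb
```

In-tree modules ship with the kernel packages, so most names will likely resolve to the same few packages; the install time then shows when that kernel build arrived.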


  1. bugzilla.mozilla.org/attachment.cgi?id=9509970