Kernel 6.4.11-200 - Emergency mode during boot

Hello. I’m running Fedora 38 on my Dell XPS 9560 and recently I have been having boot issues with it.
Right now I have three kernels installed on my system:

  • 6.4.10
  • 6.4.11
  • 6.4.12

With 6.4.10 everything works perfectly; with 6.4.11 and 6.4.12 the boot process ends up in emergency mode. I attached a photo of the journalctl -xe output. I’m sorry I can’t paste a complete file, but I’ve spent two hours trying to copy the log to a USB drive to attach it here, without success.

None of the tests I’ve run on my laptop’s disk shows any kind of malfunction or error.

I’d be very thankful if someone can support me with this issue.
Thank you in advance!

When booted into the older kernel, can you check:
sudo grubby --info=ALL

You probably want to compare the args value between 6.4.10 and the others.
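
If the full output is long, a quick way to compare just the relevant lines is to filter them; a small convenience sketch (grep is assumed, which is standard on Fedora):
sudo grubby --info=ALL | grep -E '^(kernel|args)='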

Thanks


Thank you for the quick reply!
This is the output of the requested command:

index=0
kernel="/boot/vmlinuz-6.4.12-200.fc38.x86_64"
args="ro rootflags=subvol=root rhgb quiet"
root="UUID=0c50e137-fb61-4e35-b2e5-4ba13b5a972d"
initrd="/boot/initramfs-6.4.12-200.fc38.x86_64.img"
title="Fedora Linux (6.4.12-200.fc38.x86_64) 38 (Workstation Edition)"
id="94d1775986f54659b9aac924e55650af-6.4.12-200.fc38.x86_64"
index=1
kernel="/boot/vmlinuz-6.4.11-200.fc38.x86_64"
args="ro rootflags=subvol=root rhgb quiet"
root="UUID=0c50e137-fb61-4e35-b2e5-4ba13b5a972d"
initrd="/boot/initramfs-6.4.11-200.fc38.x86_64.img"
title="Fedora Linux (6.4.11-200.fc38.x86_64) 38 (Workstation Edition)"
id="94d1775986f54659b9aac924e55650af-6.4.11-200.fc38.x86_64"
index=2
kernel="/boot/vmlinuz-6.4.10-200.fc38.x86_64"
args="ro rootflags=subvol=root rhgb quiet"
root="UUID=0c50e137-fb61-4e35-b2e5-4ba13b5a972d"
initrd="/boot/initramfs-6.4.10-200.fc38.x86_64.img"
title="Fedora Linux (6.4.10-200.fc38.x86_64) 38 (Workstation Edition)"
id="94d1775986f54659b9aac924e55650af-6.4.10-200.fc38.x86_64"
index=3
kernel="/boot/vmlinuz-0-rescue-94d1775986f54659b9aac924e55650af"
args="ro rootflags=subvol=root rhgb quiet"
root="UUID=0c50e137-fb61-4e35-b2e5-4ba13b5a972d"
initrd="/boot/initramfs-0-rescue-94d1775986f54659b9aac924e55650af.img"
title="Fedora Linux (0-rescue-94d1775986f54659b9aac924e55650af) 36 (Workstation Edition)"
id="94d1775986f54659b9aac924e55650af-0-rescue"

Can you check:
df -h

For logs, can you run journalctl -b -1 -p 4 --no-pager > logs.txt? You may need to post that to a pastebin.
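
If -b -1 doesn’t point at the failed boot (the numbering can shift after several boot attempts), journalctl can list the recorded boots so you can pick the right one; a small sketch, with -2 as an example offset:
journalctl --list-boots
journalctl -b -2 -p 4 --no-pager > logs.txt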

Thanks

@All

Are you aware that this (or a similar) bug is already being discussed on the Fedora users list, and, perhaps more importantly, that there is a bug report in the first post?

https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org/thread/UVOKYO55JBEDXCZVOFG6S27JG6TKPMEY/

The report itself:
https://bugzilla.redhat.com/show_bug.cgi?id=2232838


Sure, this is the output of df -h:

devtmpfs        4,0M     0    4,0M   0% /dev
tmpfs           7,8G     0    7,8G   0% /dev/shm
tmpfs           3,1G  2,1M    3,1G   1% /run
/dev/nvme0n1p6  172G  136G     34G  81% /
/dev/nvme0n1p6  172G  136G     34G  81% /home
/dev/nvme0n1p5  974M  296M    611M  33% /boot
tmpfs           7,8G  208K    7,8G   1% /tmp
/dev/nvme0n1p1   96M   62M     35M  64% /boot/efi
tmpfs           1,6G  172K    1,6G   1% /run/user/1000

And this is the output of `journalctl -b -1 -p 4 --no-pager > logs.txt`: journalctl - Pastebin.com

Thank you for your support!

The bug report looks like my exact case, if there’s anything I can do to help, I’m here. Thank you!


Yup, if you have a Red Hat Bugzilla account you can do the following:

  • add a comment that you’re also suffering from this bug
  • add your email address to the “CC List” (top right, third row) to get informed when a solution is found, etc.

If you do not have a RH BZ account:

  • bookmark the BZ and check the comments there from time to time
  • drop me a note so I can add a link to this thread in the RH BZ, but I can’t add you to the CC List (I guess)

Care to test a boot parameter?

If yes:

  • when GRUB comes up and has the latest (faulty) kernel selected to boot (the default)

  • immediately hit the “E” key once

  • move the cursor to just behind “rhgb quiet”
    quickest way:
    * move down to the “i” of initrd
    * move left and you’re just behind rhgb quiet

  • add a space

  • add “mitigations=off” without the quotes
    be aware the keyboard layout is English, so the = sign is on the “´” key (the key between “?” and Backspace)
    if you mistyped something, hit the Esc key once to start over (with the “E” key)

  • if you typed the boot parameter correctly (double-check it), hit the keys “Ctrl + X” simultaneously

=> the selected kernel will boot

In emergency mode again?

===
Background:
Kernel 6.4.9 introduced some CPU bug fixes, which led to some weird behaviour on some boxes.
“mitigations=off” switches them off, which isn’t the desired option for daily use!
In short: mitigations should be ON!
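
If the parameter turns out to help and you don’t want to retype it at every boot, it can also be set persistently for a single entry with grubby; a sketch, using the 6.4.12 kernel path from the grubby output above:
sudo grubby --update-kernel=/boot/vmlinuz-6.4.12-200.fc38.x86_64 --args="mitigations=off"
and removed again once testing is done:
sudo grubby --update-kernel=/boot/vmlinuz-6.4.12-200.fc38.x86_64 --remove-args="mitigations=off"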

Thank you for the suggestions!
I’ve added a comment on the bug report on bugzilla so I’ll be updated.

I’ve also tried the “mitigations=off” parameter, but I still ended up in emergency mode, with two differences:

  • It took much longer to end up in emergency mode (a couple of minutes vs. a few seconds)

  • journalctl shows a warning about dracut-initqueue (not present when I don’t use the mitigations=off parameter); screenshot attached

Thank you and have a great day!

This may be a significant clue.

Okay, it was just a try.

I would do the following:

  • try the boot parameters mentioned in the journal: “nvme_core.default_ps …”
  • do an NVMe test from a working kernel if you’re unsure whether the SSD is healthy

On my Samsung NVMe drives I do:

sudo nvme device-self-test /dev/nvme0n1 -n 1 -s 2h 
or with "1h" instead of "2h" at the end

And afterwards (the test runs ~2 minutes):

sudo nvme smart-log /dev/nvme0 -o normal

Post the output of the last command.

I should mention that I can’t interpret all of the values, but some are obvious …
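
If you want to confirm the self-test has actually finished before reading the SMART values, nvme-cli can also dump the self-test log; a sketch, using the same device as above:
sudo nvme self-test-log /dev/nvme0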

I’m encountering the same issue with kernel versions higher than 6.4.10-200 on Fedora 38 as well. (I’m also subscribed to the Bugzilla issue.) The PC is a Dell XPS 9560 (07BE) with an Intel 1 TB NVMe drive (lshw says: INTEL SSDPEKNW010T9).

Here’s the output of the above nvme command on my system:

Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning			: 0
temperature					: 29 °C (302 K)
available_spare				: 100%
available_spare_threshold	: 10%
percentage_used				: 2%
endurance group critical warning summary: 0
Data Units Read				: 24,506,297 (12.55 TB)
Data Units Written			: 33,940,787 (17.38 TB)
host_read_commands			: 304,459,260
host_write_commands			: 703,649,052
controller_busy_time		: 12,138
power_cycles				: 480
power_on_hours				: 20,086
unsafe_shutdowns			: 29
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time	: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0

It looks pretty normal to me.

I am happy to continue helping troubleshoot this issue.

Thanks!

I’m also seeing this bug on my XPS 15 9560 – 6.4.10 is the last working kernel for me.

Yeah, your SSD seems to be okay.

BTW:
for Intel SSDs you could also use the “Solidigm Storage Tool”
=> Linux => CLI (rpm)
docs are included in the zip file!

The commands are somewhat cryptic.

Some important commands:

list all devices:    sudo sst show -ssd

secure erase:        sudo sst delete -ssd 1                # For SATA devices, this will issue an "ATA Secure Erase" if supported, or "Sanitize erase" if supported.
               or    sudo sst start -ssd 1 -secureerase    # "ATA Secure Erase" !

Data Integrity:      sudo sst start -ssd 1 -scan      [(DataIntegrity|ReadScan|Logs)]
                           
smart selftest:      sudo sst start -ssd 1 -selftest [short|extended]
smart selftest done: sudo sst show -ssd 1 -selftest 
smart Values:         sudo sst show -ssd 1 -smart

health/hours:        sudo sst show -ssd 1 -sensor 
LBA written:         sudo sst show -ssd 1 -performance
Wearout Indicator:   sudo sst show -ssd 1 -smart E9 

Device Identify:     sudo sst show -ssd 1 -identify
Drive Info/Feature:  sudo sst show -a -ssd 1 

I think we’re getting somewhere… It’s quite clear that there’s something wrong with kernels 6.4.11 and 6.4.12 on the Dell XPS 9560, considering that both @dklann and @cnelsonak have the same machine as me.

Two things:

This is the output of the sudo nvme smart-log /dev/nvme0 -o normal command on my machine (I’m using an NVMe Samsung SSD 980 500GB):

Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning			: 0
temperature				: 38 °C (311 K)
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 1%
endurance group critical warning summary: 0
Data Units Read				: 10.989.615 (5,63 TB)
Data Units Written			: 11.373.766 (5,82 TB)
host_read_commands			: 164.925.265
host_write_commands			: 162.967.221
controller_busy_time			: 424
power_cycles				: 1.411
power_on_hours				: 205
unsafe_shutdowns			: 154
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time		: 334
Critical Composite Temperature Time	: 0
Temperature Sensor 1           : 38 °C (311 K)
Temperature Sensor 2           : 40 °C (313 K)
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 16184
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 15106

Second thing: as @sixpack13 suggested, I’ve tried to boot with the parameters nvme_core.default_ps_max_latency_us=0 pcie_aspm=off. I’ve also removed the rhgb quiet parameters to see what’s going on. In this case I don’t end up in emergency mode, but the system hangs at this point:

Thank you again for your support

I’m seeing this issue as well with both 6.4.11-200 and 6.4.12-200. My specific config is a SW RAID0 with 2x Samsung 860 SATA SSDs. This isn’t the first time a kernel has broken due to a SW RAID for me. Maybe a clue?

I’ve been able to install this just fine on another machine with a single NVMe drive.

It wasn’t me, it was the kernel maintainer who wrote that parameter in the log file.
Anyway, it was also just a try.

Another thing to try would be an SELinux relabel of the filesystem (I guess it won’t hurt).

In a terminal, run:

sudo touch /.autorelabel

and

  • reboot into a working kernel
    the box boots twice: it boots, starts the relabeling, and boots again
    and
  • then reboot into the faulty kernel
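
(An equivalent way to schedule the relabel, if the plain touch doesn’t take for some reason, is the fixfiles helper from policycoreutils; a sketch:)
sudo fixfiles onboot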

Emergency mode again?

To me the idea of using RAID0 across multiple drives is repugnant.
Having the data spread across two devices does gain space, but it at minimum doubles the failure points and the risk of total data loss from a single failure. The total number of failure points depends on the number of devices involved, and in every case a single failure may cause total data loss when using RAID0.

In your case, what happens if you do not use the RAID config and instead install to a single drive in that machine? If that works, then it would seem the failure may be related to the RAID config itself.

It may be related to using two different SSDs in that config. Even though they are supposed to be identical and work the same, it is possible that some small factor in the hardware is incompatible with RAID of that style.
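
If you want to rule out the array itself, the md state can be checked from a working kernel; a sketch (/dev/md0 is a placeholder for the actual array device):
cat /proc/mdstat
sudo mdadm --detail /dev/md0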

Hey @sixpack13, all good suggestions! Thanks for those.

I relabeled the file system with sudo touch /.autorelabel early on in troubleshooting this. No effect. The problem is that the kernel cannot even mount the root file system because it thinks the NVMe drive is unavailable. So SELinux never even enters the picture here.

If I had better kernel debugging skills (and more time) I would run the kernel under a debugger and try to learn what happens when it initializes the NVMe subsystem and the attached SSD. Sigh…
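
Short of a full debugger, a quick sanity check from the emergency shell can at least confirm whether the drive is detected at all; a sketch (these tools are normally available in the dracut emergency shell):
dmesg | grep -i nvme     # did the controller and namespace show up?
cat /proc/partitions     # is nvme0n1 listed at all?
blkid                    # can the root filesystem UUID be found?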

@j-pow, I do not think your software RAID0 is causing this issue. The RAID0 configuration may be compounding your troubles, but I think, at the core, this is something else.

I welcome others’ thoughts and suggestions!

Thanks!
