Hello. I’m running Fedora 38 on my Dell XPS 9560, and recently I’ve been having boot issues with it.
Right now I have three kernels on my system:
* 6.4.10
* 6.4.11
* 6.4.12
With 6.4.10 everything works perfectly; with 6.4.11 and 6.4.12 the boot process ends up in emergency mode. I attached a photo of the journalctl -xe output. I’m sorry I can’t paste a complete file, but I’ve spent two hours trying to copy the log to a USB drive to attach it here, without success.
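(For reference, what I was attempting from the emergency shell was roughly the following; /dev/sdb1 is only an example device name and may differ on your machine:)

mount /dev/sdb1 /mnt
journalctl -xb > /mnt/boot-failure.log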
when GRUB comes up and has the latest (faulty) kernel selected to boot (default):
* hit the “E” key once
* move the cursor to just behind “rhgb quiet”; quickest way: move down to the “i” of initrd, then move left and you’re just behind “rhgb quiet”
* add a space
* add “mitigations=off” without the quotes. Be aware the keyboard layout is English, so the = sign is on the “´” key (the key between “?” and backspace). If you typed junk, hit the ESC key once to start over (with the “E” key)
* once you have typed the boot parameter correctly (double check), hit “Ctrl + x”
=> the selected kernel will boot
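For reference, the edited linux line should then end in something roughly like this (kernel version, root device and the other options will differ on your system; this is only an illustration):

linux ($root)/vmlinuz-6.4.12-200.fc38.x86_64 root=UUID=... ro rhgb quiet mitigations=off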
Still ending up in emergency mode?
===
background:
Kernel 6.4.9 introduced some CPU bug fixes, which led to some weird behaviour on some boxes.
“mitigations=off” switches them off, which is not the desired option for daily use!
In short: mitigations should be ON!
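If you want to verify which mitigations are currently active (or confirm they are really off while testing), the generic way is to read the sysfs vulnerability files, e.g.:

grep . /sys/devices/system/cpu/vulnerabilities/*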
I’m encountering the same issue with kernel versions higher than 6.4.10-200 on Fedora 38 as well. (I’m also subscribed to the Bugzilla issue.) The PC is a Dell XPS 9560 (07BE) with an Intel 1 TB NVMe drive (lshw says: INTEL SSDPEKNW010T9).
Here’s the output of the above nvme command on my system:
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 29 °C (302 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 2%
endurance group critical warning summary: 0
Data Units Read : 24,506,297 (12.55 TB)
Data Units Written : 33,940,787 (17.38 TB)
host_read_commands : 304,459,260
host_write_commands : 703,649,052
controller_busy_time : 12,138
power_cycles : 480
power_on_hours : 20,086
unsafe_shutdowns : 29
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
It looks pretty normal to me.
I am happy to continue helping troubleshoot this issue.
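If another data point would help, I can also dump the controller’s error log with nvme-cli (a generic command, nothing specific to this drive):

sudo nvme error-log /dev/nvme0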
I think we’re getting somewhere… It’s quite clear that there’s something wrong with kernels 6.4.11 and 6.4.12 on the Dell XPS 9560, considering that both @dklann and @cnelsonak have the same machine as me.
Two things:
First thing: this is the output of the sudo nvme smart-log /dev/nvme0 -o normal command on my machine (I’m using an NVMe Samsung SSD 980 500GB):
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 38 °C (311 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 1%
endurance group critical warning summary: 0
Data Units Read : 10.989.615 (5,63 TB)
Data Units Written : 11.373.766 (5,82 TB)
host_read_commands : 164.925.265
host_write_commands : 162.967.221
controller_busy_time : 424
power_cycles : 1.411
power_on_hours : 205
unsafe_shutdowns : 154
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 334
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 38 °C (311 K)
Temperature Sensor 2 : 40 °C (313 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 16184
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 15106
Second thing: as Sixpack13 suggested, I’ve tried booting with the parameters nvme_core.default_ps_max_latency_us=0 pcie_aspm=off. I’ve also removed the rhgb quiet parameters to see what’s going on. In this case I don’t end up in emergency mode, but the system hangs at this point:
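(In case it is useful: instead of retyping these at the GRUB prompt every boot, the parameters can also be added to a specific kernel entry with grubby, roughly like below; adjust the kernel path to the entry you are testing, and --remove-args takes them out again.)

sudo grubby --update-kernel=/boot/vmlinuz-6.4.12-200.fc38.x86_64 --args="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"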
I’m seeing this issue as well with both 6.4.11-200 and 6.4.12-200. My specific config is a SW RAID0 with 2x Samsung 860 SATA SSDs. This isn’t the first time a kernel has broken due to a SW RAID for me. Maybe a clue?
I’ve been able to install this just fine on another machine with a single NVMe drive.
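(To compare notes: from the emergency shell the RAID state can be checked with something like the following; /dev/md127 is only an example device name.)

cat /proc/mdstat
mdadm --detail /dev/md127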
To me the idea of using RAID0 for multiple drives is repugnant.
Spreading the data across two devices does gain space, but it at least doubles the failure points, and with RAID0 a single failure means total data loss. The number of failure points grows with the number of devices involved, and in every case one failed device can destroy the entire array.
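As a rough illustration (the 3% figure is only an example annual failure rate, not a spec for any particular drive): if each of two drives fails independently with probability 0.03 in a year, the chance that at least one of them fails, and with RAID0 that the whole array is lost, is 1 - (1 - 0.03)^2 ≈ 5.9%, roughly double the single-drive risk.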
In your case, what happens if you do not use the raid config and instead install to a single drive in that machine? If that works then it would seem the failure may be related to the raid config itself.
It may also be related to using two different SSDs in that config. Even though they are supposed to be identical and work the same, it is possible that some small factor in the hardware is incompatible with RAID of that style.
Hey @sixpack13, all good suggestions! Thanks for those.
I relabeled the file system with sudo touch /.autorelabel early on in troubleshooting this. No effect. The problem is that the kernel cannot even mount the root file system because it thinks the NVME drive is unavailable. So SELinux never even enters the picture here.
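(For anyone following along, this is easy to confirm from the emergency shell; if the controller really is not coming up, the device nodes simply are not there. Example checks:)

ls /dev/nvme*
lsblk
journalctl -b | grep -i nvme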
If I had better kernel debugging skillz (and more time) I would run the kernel under a debugger and try to learn what happens when it initializes the NVME subsystem and the attached SSD. Sigh…
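A lighter-weight alternative to a full debugger might be the kernel’s dynamic debug facility, assuming it is enabled in the Fedora kernel (it normally is): adding these to the kernel command line turns on verbose debug output from the nvme modules during boot:

nvme.dyndbg=+p nvme_core.dyndbg=+p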
@j-pow, I do not think your software RAID0 is causing this issue. The RAID0 configuration may be compounding your troubles, but I think, at the core, this is something else.