> nov 03 09:03:56 kernel: BTRFS info (device nvme0n1p3): start tree-log replay
This suggests a prior crash or power failure (or forced power off) during writes with fsync. The good news: the replay succeeds, there are no errors, and the file system mounts OK. There are ten of these in the provided log.
However, the prior boot containing the crash/power fail doesn’t seem to be in this log. Chances are it never made it to persistent media because of the problem you’re having: the kernel messages related to the problem exist only in the volatile journal, and are therefore gone upon reboot.
Here’s an example from the provided log:
nov 10 08:18:07 systemd[1]: Started systemd-journald.service - Journal Service.
nov 10 09:20:38 kernel: Linux version 6.14.0-63.fc42.x86_64 (mockbuild@d5701c6d040c430c8283c8c9847dc93f) (gcc (GCC) 15.0.1 20250228 (Red Hat 15.0.1-0), GNU ld version 2.44-3.fc42) #1 SMP PREEMPT_DYNAMIC Mon Mar 24 19:53:37 UTC 2025
…
nov 10 09:20:40 kernel: BTRFS info (device nvme0n1p3): start tree-log replay
The first line, time stamp 08:18:07, is the last line for that boot. Nothing happens after that, so we have no idea what happened. And the very next boot has a tree-log replay, which only happens if a log tree is present; a log tree exists only during tree logging, which only happens during writes with fsync. Tree logging is an fsync performance optimization, and it’s crash safe: the log simply gets replayed at the next boot.
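To illustrate the pattern that triggers tree logging: any application that writes a file and then syncs it will cause btrfs to record that file in the log tree. A minimal sketch, assuming coreutils `sync` with the `--data` option (the file path is arbitrary):

```shell
# Write a file, then fsync its data, the write-plus-fsync pattern that makes
# btrfs create a log tree. A crash right after this point is exactly what
# tree-log replay recovers at the next mount.
echo "important data" > /tmp/demo.txt
sync --data /tmp/demo.txt   # coreutils sync: fdatasync(2) on the file
```

Databases and journald itself do this constantly, which is why seeing a tree-log replay after an unclean shutdown is completely normal.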
There are no btrfs or nvme problems in the log. So whatever is happening doesn’t offhand seem to be file system related.
It might be storage device related: maybe the drive is wigging out at some point during boot, the device itself goes read-only (?), and therefore no more journal entries. That would be consistent with the earlier screenshot showing dropped reads and writes. The nvme device is just hung. (Speculation.)
What we need is the full text version of that screenshot, but that log exists only in volatile memory. If the drive is having firmware or kernel driver issues (a kernel bug seems more likely, since it works OK with the 6.14 kernel), that likely prevents the journal from being written to persistent media, which is why the issue isn’t in the provided log. All we see is the tree-log replay, which tells us, yeah, there was a crash/power fail (or the whole drive got reset, or dropped off the PCI bus, or … who knows).
OK so now what?
I suggest booting the new kernel (I’m sorta guessing this is the problem kernel) with this boot parameter:
rd.systemd.debug-shell=1
That will provide a root shell on tty9 (reachable with Ctrl-Alt-F9). So when the system hangs at boot, you should still be able to get to tty9 and use the root shell there to extract the journal for that boot, so we can see what’s failing.
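The simplest way to add the parameter is a one-off edit at the GRUB menu (press `e` on the entry, append it to the `linux` line, then boot with Ctrl-x). If you’d rather make it persistent across reboots while debugging, grubby can do it; a sketch, to be reverted once you’re done:

```shell
# Add the debug-shell parameter to the boot entries of all installed kernels
sudo grubby --update-kernel=ALL --args="rd.systemd.debug-shell=1"

# Later, when finished debugging, remove it again (an always-open root shell
# on tty9 is a security hole you don't want to leave enabled):
sudo grubby --update-kernel=ALL --remove-args="rd.systemd.debug-shell=1"
```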
How do you extract the journal in this situation? The nvme drive may not be available at all here to save the journal to.
You will need some other drive, like a USB stick drive, but anything will do.
You can use blkid to find the device node for the stick, e.g. /dev/sda1, and mount it somewhere; anywhere will do. A mount point under /tmp or /run is fine: those are volatile, so you can’t hurt anything there, and it all goes away at the next boot anyway.
mkdir /tmp/mnt
mount /dev/sda1 /tmp/mnt
journalctl -k --no-hostname > /tmp/mnt/journal.log
umount /tmp/mnt
reboot
So let’s see if this reveals what’s going on.
Alternative ideas to also check in the meantime, since apparently my turnaround time is 20+ hours.
Make sure both the logic board firmware (UEFI) and the drive firmware are up to date as provided by the system manufacturer. I just ran into another user having an issue with an nvme drive that was intermittently disappearing and a firmware update purports to fix that problem specifically.
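On Fedora, fwupd can often handle both the UEFI and the drive firmware, provided the vendor publishes updates to LVFS (not all do, so treat this as a maybe):

```shell
# Refresh LVFS metadata, list available updates, and apply them.
# Covers UEFI capsule updates and many NVMe drive firmwares.
sudo fwupdmgr refresh
sudo fwupdmgr get-updates   # shows what's available for your devices
sudo fwupdmgr update
```

If the vendor isn’t on LVFS, check the manufacturer’s support page for a firmware updater instead.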
But since it seems to cooperate with kernel 6.14 and not 6.17, we probably need to test 6.18, and possibly even 6.16, to see whether it works in a newer kernel, and to narrow down whether it stopped working before 6.17 or in 6.17 itself.
Introducing the Koji build system; this is the link for kernel builds.
Example: kernel-6.18.0-0.rc5.44.fc44 is the current upstream 6.18-rc5 kernel. Click on that and you’ll see a list of RPMs by arch. Find x86_64; you can download four files for this version: kernel, kernel-core, kernel-modules, and kernel-modules-core. Then install them using either dnf install or rpm -iv, your choice.
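Assuming the four RPMs were downloaded into the current directory, the install looks something like this (the exact version string in the filenames will differ):

```shell
# Install a Koji-built kernel from locally downloaded RPMs.
# dnf works out the dependency order between the four packages for you,
# and the new kernel is added as a boot entry alongside the existing ones.
sudo dnf install ./kernel-*.rpm

# Or with plain rpm (note: rpm downloads nothing, so all four files
# must be present in the directory):
# sudo rpm -iv kernel-*.rpm
```

Either way the old kernels stay installed, so if the test kernel misbehaves you can just pick a previous one from the GRUB menu.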
It’s possible the bug is fixed in 6.18 and just hasn’t been backported yet to stable. It’s also possible it hasn’t been fixed which would be good to know now because it might mean no one has reported it yet.
Ergo, this is now a bug hunt.
It might also be a good idea to get a Red Hat Bugzilla account or a Fedora account; either can be used to log into bugzilla.redhat.com and file a bug against the kernel, attaching the dmesg you hopefully capture and save to your USB stick. Then we can start asking around for a fix: most likely search upstream, and if no one has reported it there, report it. It should get fixed pretty quickly once they’re aware there’s been a regression.
But they will need dmesg showing the failure.
Whew!