Boot freezes after starting Grub successfully | btrfs filesystem errors

Hello community,

my Fedora installation is about 2 1/2 years old and has always run very stably and reliably. This is my production system, running Fedora Workstation 42. I always keep the system current, even if there are only tiny updates, and have never had problems.

Some days ago, I downloaded about six small (non-kernel, non-NVIDIA) updates as usual through GNOME Software and used the option to install the updates during shutdown of the workstation.

I was astonished that the update did not work as expected. Even after 5 hours, I still saw a spinning circle along with the Fedora logo after the bootloader had started successfully. So I warm-restarted the system: the reboot via keyboard worked, which seems to indicate that the system was still working to a certain extent.

Now, every time I reboot, GRUB starts successfully, but no matter which kernel I choose, I end up in the boot hang described above, with Fedora showing the spinning circle and the Fedora logo at the bottom.

I successfully chrooted from boot media into my Btrfs installation. The file system seems to be okay.

There are plenty of explanations on the net on how to fix GRUB, but that does not seem to be the problem in my case. What should I try next, e.g. from the chroot environment?

My assumption: the update is blocking the boot. How can I remove the "update on next boot" setting? I cannot find documentation on this.
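For reference, a hedged sketch: offline updates staged by GNOME Software are normally signalled to systemd by a /system-update symlink (on newer systemd also /etc/system-update) in the root of the filesystem, and removing that symlink from the chroot should cancel the pending "update on next boot":

# inside the chroot: check whether an offline update is still staged (either path may not exist)
ls -l /system-update /etc/system-update
# removing the symlink cancels the pending offline update
rm -f /system-update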

I did try to update Fedora through "dnf update" in the chroot environment, but this did not solve the problem either.

Any idea on how to solve this?

I am running on a standard Intel i5 with an NVIDIA GeForce 3060 Ti and UEFI boot (currently with Secure Boot disabled). Other OSes boot successfully on this machine.

I know that rebuilding the NVIDIA kernel modules can block the boot process. Since the updates involved neither a new kernel nor an NVIDIA update, I believe that is not the case here.

Any ideas on how to fix this are highly welcome.

Thank you in advance
Thomas

If you hit the Escape key at this stage, does it show you log messages instead of the graphical screen? If so, those messages would help in diagnosing what’s happening.

Oh god, that simple: Yes, it does. :face_with_open_eyes_and_hand_over_mouth:

There is an error message:

ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20240827/dsfield-184)

ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20240827/dswload2-477)

ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM.USRG due to previous error (AE_ALREADY_EXISTS) (20240827/psparse-529)

The rest shows green okays, but I cannot scroll up to search for further messages.

This is strange, as the system worked reliably for 2.5 years…

Any hint on this?

Screenshot:

OK, great that we can see those messages at least!

I don’t think my hints are very reliable, but I would try disconnecting any non-essential USB devices and see if that makes any difference.

However, from a bit of searching around, it seems those ACPI messages are often non-fatal errors so they might not be the source of your actual problem.

Someone else will hopefully have a better idea!

In your screenshot, is that the "frozen" state of the system? Is that message "Started iio-sensor-proxy.service" the last one that you see?

Yes, this is the frozen state.

I made another attempt with all USB devices disconnected, but with the same result.

Furthermore, I tried to watch the boot messages as they scrolled by. Some error message about mounting the /home volume flashed past, which I need to follow up on. Maybe I need to check and repair the Btrfs file system; home and root are both located on the same Btrfs volume (standard installation). Is there a kind of fsck for Btrfs which might help?
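For illustration, the closest Btrfs equivalents would look roughly like this ($DEV being the unmounted Btrfs partition, e.g. when booted from a live USB, and /mountpoint a placeholder; btrfs check is read-only by default):

# offline consistency check from a live system; makes no changes
btrfs check --readonly $DEV
# verifies all checksums of a mounted filesystem
btrfs scrub start -Bd /mountpoint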

There have been issues recently where people needed to do a btrfs rescue, for example:

Their screenshots look different from yours. But then again, maybe it looks different when the problem is on the home subvolume rather than root.

Sorry, I don’t want to advise anything too definitive since I don’t use btrfs and I’m honestly not too sure here!

I see, we are on the right track!

Until now, I had only looked at the subvolume @root, which seems to be okay and can be mounted without errors.

When I try mounting the subvolume @home, I receive the following error: "Can't read superblock on volume".

I had tried the suggested command "btrfs rescue zero-log". It completed instantly, but did not show an improvement. The subvolume @home still cannot be mounted.

dmesg shows some checksum failures. See screenshot:

All this looks to me like a btrfs bug and/or corruption, as explained.

The good point is: if I mount subvolume "/" (btrfs ID 0), I can access @home. Thus, the data still seems to be available.

Does anybody have a recommendation on how to repair this btrfs file system, or is there a need to create a new @home subvolume?

I am not very fluent with btrfs…

Still, I have tried btrfs check on the volume, which shows a lot of checksum errors but does not actually repair the volume. The "--repair" option should only be used on the advice of developers, and thus I did not dare to try it.

Thank you and best regards
Thomas

This might require interactive conversation. Can you meet me in here and just @cmurf?

btrfs check /dev/nvme1n1p4 | fpaste
btrfs insp dump-t -b 12714019888096 /dev/nvme1n1p4 | fpaste

These are safe commands and might tell us more about what’s going on.

The warning is complaining about checksum verification failing on a block. That it refers to mirror 1 and mirror 2 tells me the block is a metadata block (two copies of metadata), and it seems both have a problem.

I’m not sure if it’s a memory bitflip or a storage device failure yet.

Dear Chris @chrismurphy ,

thank you for your prompt follow-up. Here are the outputs of the commands through fpaste:

btrfs check: UNTITLED - Pastebin Service / see also output below.

btrfs insp dump-t only returns an error:

ERROR: tree block bytenr 12714019888096 is not aligned to sectorsize 4096

The point is that the btrfs check command writes a significant number of errors to stderr, which is not included in the fpaste. Redirecting stderr to stdout and using fpaste returns an error that the output is too big for fpaste (error: "no file hosting service").

I redirected it to a file instead and show the output here:

root@localhost-live:~# btrfs check /dev/nvme0n1p4 2>btrfs_check.error.log | tee btrfs_check.log
Opening filesystem to check…
Checking filesystem on /dev/nvme0n1p4
UUID: 364f2d1f-b939-4aae-8b1a-bbc6d242a18e
The following tree block(s) is corrupted in tree 256:
tree block bytenr: 493174784, level: 1, node key: (14696013, 1, 0)
found 1009981825024 bytes used, error(s) found
total csum bytes: 882381192
total tree bytes: 8690319360
total fs tree bytes: 7226294272
total extent tree bytes: 419790848
btree space waste bytes: 1710313252
file data blocks allocated: 2037460811776
referenced 1603884122112

The error log contains the following (shortened):

[1/8] checking log skipped (none written)
[2/8] checking root items
[3/8] checking extents
checksum verify failed on 1271401988096 wanted 0x1e498126 found 0x09cacbe1
checksum verify failed on 1271401988096 wanted 0x1e498126 found 0x09cacbe1
checksum verify failed on 1271401988096 wanted 0x1e498126 found 0x09cacbe1
Csum didn’t match
ref mismatch on [16531456 4096] extent item 1, found 0
data extent[16531456, 4096] bytenr mimsmatch, extent item bytenr 16531456 file item bytenr 0
data extent[16531456, 4096] referencer count mismatch (root 256 owner 14696018 offset 1388544) wanted 1 have 0
backpointer mismatch on [16531456 4096]
owner ref check failed [16531456 4096]
ref mismatch on [21241856 4096] extent item 1, found 0
data extent[21241856, 4096] bytenr mimsmatch, extent item bytenr 21241856 file item bytenr 0
data extent[21241856, 4096] referencer count mismatch (root 256 owner 14696018 offset 57344) wanted 1 have 0
backpointer mismatch on [21241856 4096]
owner ref check failed [21241856 4096]
ref mismatch on [1112039424 4096] extent item 1, found 0
data extent[1112039424, 4096] bytenr mimsmatch, extent item bytenr 1112039424 file item bytenr 0
data extent[1112039424, 4096] referencer count mismatch (root 256 owner 14696018 offset 270336) wanted 1 have 0
backpointer mismatch on [1112039424 4096]
owner ref check failed [1112039424 4096]
ref mismatch on [1115729920 4096] extent item 1, found 0
data extent[1115729920, 4096] bytenr mimsmatch, extent item bytenr 1115729920 file item bytenr 0

…. shortened ….

root 256 inode 14747577 errors 2001, no inode item, link count wrong
unresolved ref dir 613 index 28793 namelen 64 name c578ea83c97fc8fae3b8014d6f76080b13a42b65-app.zen_browser.zen.png filetype 1 errors 4, no inode ref
root 256 inode 14747578 errors 2001, no inode item, link count wrong
unresolved ref dir 613 index 28794 namelen 68 name c863804b04b425df0bf5ed2ea1a5e845a5cf0c8e-app.opencomic.OpenComic.png filetype 1 errors 4, no inode ref
root 256 inode 14747579 errors 2001, no inode item, link count wrong
unresolved ref dir 272 index 125488 namelen 4 name user filetype 1 errors 4, no inode ref
root 256 inode 14747580 errors 2001, no inode item, link count wrong
unresolved ref dir 295 index 53372 namelen 27 name session-active-history.json filetype 1 errors 4, no inode ref
root 256 inode 14747581 errors 2001, no inode item, link count wrong
unresolved ref dir 98912 index 3662 namelen 11 name soup.cache2 filetype 1 errors 4, no inode ref
root 256 inode 14747582 errors 2001, no inode item, link count wrong
unresolved ref dir 295 index 53374 namelen 12 name session.gvdb filetype 1 errors 4, no inode ref
ERROR: errors found in fs roots

(Please note that the device numbering changes between boot environments, e.g. with the live system. But these should be the correct device files, and the renaming does not matter imo.)

Does this help in analyzing the problem? Any hint is very welcome!

What is the status right now:

  • I restored a (too old) snapshot of @home and put it in place, renaming the currently damaged subvolume to @home-corrupted.
  • The system boots again with the outdated @home in place.
  • I have updated the kernel from 6.15.10 to the latest 6.16.x, hoping that a potential bug behind the corruption has been fixed.
  • @home-corrupted mounts in the booted system. Trying to copy files from this corrupted file system shows only 2 broken files: a log file of Nextcloud and a cache file of GNOME Evolution, neither of which is important.
  • The current problem is: I cannot delete those files due to an error message (the checksums do not match), nor can I eliminate the file system error otherwise.
  • The device now mainly holds an updated @root file system, @home-corrupted, and a working but outdated @home subvolume. But the filesystem contains errors, which I do not feel comfortable with on a production system!

Thus my question is: is there a way to remove those errors and repair the device (e.g. through a sort of fsck)?

Or do I need to go through the full cycle of backup, reformatting, and restoring?

Fixing the filesystem would allow for a much faster and easier process…

Thank you in advance & best regards
Thomas

This contained a typo; it should be

btrfs insp dump-t -b 1271401988096 /dev/nvme1n1p4 | fpaste

I’ll look at the btrfs check output in the meantime. It’s still better to have interactive chat on matrix for these kinds of issues, I think.

There is a single Btrfs file system with a single block device that’s sometimes /dev/nvme0n1p4 and other times /dev/nvme1n1p4 but they are actually the same block device, correct?

I’ve asked a Btrfs developer to look at the btrfs check output and see if it’s repairable. Some are repairable but others I’m not sure about. However, in the meantime it’s best to get the important data out no matter whether you decide to use --repair or reformat, reinstall, and restore.

Boot a recent Fedora LiveOS from a USB stick. (Where $DEV is the device node and $MNT is the path where you want it mounted.)

This first set of steps is optional, but it allows you to find out exactly what files and portions of them are corrupt.

In one terminal tab:

journalctl -fk

In another terminal tab:

mount -o ro $DEV $MNT
btrfs scrub start -Bdr $MNT

In the first tab you will see any checksum mismatches, both metadata and data. Any messages that do not include a full path to a filename can be ignored, those are just problems with metadata. You want to know if there’s any bad data, so that you can decide how to handle such files if you choose to copy them out… warts and all.

NOTE: other file systems permit corrupt data blocks to enter user space. Btrfs doesn’t do that; it withholds bad data blocks and returns EIO for those blocks instead. It’s up to the application how to handle both cases.

The next set of steps is how to recover the files on a damaged Btrfs:

umount $MNT
mount -o ro,rescue=all $DEV $MNT

That mount option should permit Btrfs to skip over damaged parts of the file system, hence it’s read-only in order to avoid writing more confusion back to the file system.

There is a gotcha. rescue=all implies rescue=ignoredatacsums, which means any data corruption in files may be copied out. Hence the reason for the earlier scrub, so you know in advance which files have problems.

And this means you can copy the files out normally - using DE, cp, rsync, whatever.
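For illustration, copying a home directory out of the rescue mount might look like this (the destination path is a placeholder for any disk with enough space):

rsync -aAXH --info=progress2 $MNT/home/ /run/media/liveuser/backup/home/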

Hi Chris,

yes, the device is always the same. It just appears under different device files (either 0 or 1), depending on the boot medium (USB live system or installed system).

Sorry for the delay I am introducing right now, but I am out of the office at the moment. I am very thankful for your advice and will follow up as soon as possible.

Best regards
Thomas

Hi Chris,

here are the results of the commands:

btrfs insp dump-t -b 1271401988096 /dev/nvme1n1p4 | fpaste

stderr output:

root@localhost-live:~# btrfs insp dump-t -b 1271401988096 /dev/nvme1n1p4 | fpaste
checksum verify failed on 1271401988096 wanted 0x1e498126 found 0x09cacbe1
checksum verify failed on 1271401988096 wanted 0x1e498126 found 0x09cacbe1
ERROR: failed to read tree block 1271401988096
Uploading (0.1KiB)…

I am unsure about this command. Was this correct?

root@localhost-live:~# mount -o ro /dev/nvme1n1p4 /mnt

journalctl output:

Sep 01 17:26:06 localhost-live kernel: BTRFS: device label fedora devid 1 transid 483030 /dev/nvme1n1p4 (259:4) scanned by mount (5045)
Sep 01 17:26:06 localhost-live kernel: BTRFS info (device nvme1n1p4): first mount of filesystem 364f2d1f-b939-4aae-8b1a-bbc6d242a18e
Sep 01 17:26:06 localhost-live kernel: BTRFS info (device nvme1n1p4): using crc32c (crc32c-x86_64) checksum algorithm
Sep 01 17:26:06 localhost-live kernel: BTRFS info (device nvme1n1p4): using free-space-tree

root@localhost-live:~# btrfs scrub start -Bdr /mnt
Starting scrub on devid 1
Scrub device /dev/nvme1n1p4 (id 1) done
Scrub started: Mon Sep 1 17:28:22 2025
Status: finished
Duration: 0:06:15
Total to scrub: 948.31GiB
Rate: 2.53GiB/s
Error summary: verify=8
Corrected: 0
Uncorrectable: 8
Unverified: 0
ERROR: there are 1 uncorrectable errors

Journal output:

Sep 01 17:28:22 localhost-live kernel: BTRFS info (device nvme1n1p4): scrub: started on devid 1
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 1 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 1 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 1 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 1 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1093169184768
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1093169184768: metadata leaf (level 0) in tree 256
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1093169184768
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1093169184768: metadata leaf (level 0) in tree 256
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1093169184768
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1093169184768: metadata leaf (level 0) in tree 256
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1093169184768
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1093169184768: metadata leaf (level 0) in tree 256
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 2 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 2 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 2 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): tree block 1271401988096 mirror 2 has bad csum, has 0x1e498126 want 0x09cacbe1
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1094242926592
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1094242926592: metadata leaf (level 0) in tree 256
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1094242926592
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1094242926592: metadata leaf (level 0) in tree 256
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1094242926592
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1094242926592: metadata leaf (level 0) in tree 256
Sep 01 17:34:36 localhost-live kernel: BTRFS error (device nvme1n1p4): unable to fixup (regular) error at logical 1271401938944 on dev /dev/nvme1n1p4 physical 1094242926592
Sep 01 17:34:36 localhost-live kernel: BTRFS warning (device nvme1n1p4): header error at logical 1271401938944 on dev /dev/nvme1n1p4, physical 1094242926592: metadata leaf (level 0) in tree 256
Sep 01 17:34:37 localhost-live kernel: BTRFS info (device nvme1n1p4): scrub: finished on devid 1 with status: 0

The scrub finds 8 uncorrectable errors. There is no information about the files involved; they had been listed when copying them via rsync. The good point: they were not important (one cache file and one log file).

I will do another backup following your advice above before trying to repair the disk.

Does the log output help in analyzing the root cause and the repair options?

If you have advice on repairing, I will certainly try it out.

Would it be safe to just leave the errors on the unused @home-corrupted? In the meantime I have a working @home subvolume, where I merged recent files into an old snapshot.

The files cannot simply be deleted with rm. Would that be possible with the above-mentioned mount option rescue=all, thus eliminating the errors?

Would it be safe to delete the subvolume @home-corrupted and eliminate the errors with that step?

I am afraid that the last two ideas won’t work on Btrfs with the checksum errors, but I wanted to hear your advice before keying in different attempts…

Thank you very much again. I am learning a lot!

Best regards
Thomas

Yep, that works.

It looks like a single metadata leaf has some kind of corruption (header); both copies are affected, and (I think) identically, because the "has" checksums are the same for both copies.

I don’t know if --repair will safely fix this problem. If you’re prepared for things to get worse, then give it a try. The worst-case scenario is reformat, reinstall, restore.

Optional: Before --repair, create an image of the unmounted file system. This omits file data, but includes metadata (the fs itself) for developers to inspect and maybe figure out what went wrong, and also how to improve --repair if it can’t fix the problem.

btrfs-image -c7 -t4 -ss /dev/nvme1n1p4 /path/to/file.btrfsimage

-ss just hashes the filenames

You can just put that aside in case a dev wants to see it. It’s a holiday weekend in the U.S. so it might be a few more days before a dev gets around to responding.

I’m a little confused. I thought the system was not booting, and dnf update hangs. That suggests the problem is with a b-tree shared by all the subvolumes.

A bit of background: a subvolume is a dedicated file b-tree, but it shares the other b-trees (root, extent, dev, chunk, uuid, etc.) with all other subvolumes.

I think if the problem is isolated to a specific file b-tree then it would be avoidable by just not accessing that subvolume; and in fact we can just btrfs sub del that subvolume, and the problem gets fixed by virtue of the fact that subvolume delete does not require reading that subvolume’s file b-tree.

No, because rescue=all implies ro, and it’s not possible to make file system changes in this mode; that avoids any chance of making the file system problem worse by writing to it.

Only if the problem is really isolated to a leaf or node in a particular file (subvolume) b-tree, and not a shared one. It’s worth a try if you don’t mind losing that subvolume.

The only reason why I think the problem might not be isolated to a home subvolume, is that you’re having boot and dnf problems, which suggests the corrupt block is something shared by the root subvolume as well. Normally I’d be able to tell, if the block could be dumped, but the tool is refusing to dump the block due to the checksum verify failure.

The common handling in Btrfs is to treat blocks with checksum verify failure as entirely unreliable. It doesn’t have a concept (yet) of treating the block with suspicion, or of figuring out what it should be based on the checksum. We don’t know if it’s wrong by one bit, one byte, or many. So it’s ignored.


If this command works:

mount -o subvolid=5 /dev/nvme1n1p4 /mnt

And you get a read-write mounted file system, then you can try to btrfs sub del the subvolume you think contains the corruption.

(I suggest explicitly mounting subvolid 5, the top level of the file system, because it avoids reading the contents of the subvolumes themselves and thus may avoid hitting the corruption. I’m not sure whether a default subvolume other than ID 5 is set on this file system; that’s common with 3rd-party btrfs utilities, none of which I’m very familiar with.)
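As a sketch, assuming the damaged subvolume is the home-corrupted one described above (the exact path is a placeholder; confirm it first):

mount -o subvolid=5 /dev/nvme1n1p4 /mnt
btrfs subvolume list /mnt
btrfs subvolume delete /mnt/home-corrupted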

Because of the concept of shared extents (both data and metadata) it’s not always obvious that a corrupt block can affect multiple subvolumes and snapshots.

Sorry for the possible confusion. This is the current status of the device, which mainly holds root and home:

  • The Fedora ā€˜root’ subvolume is functional and I did not touch it.
  • The ā€˜home’ subvolume was damaged, preventing boot, and I have moved it aside to ā€˜home-corrupted’.
  • I moved an older snapshot of ā€˜home’ into place.
  • This allows me to boot the system again (with an outdated home directory).
  • Next, I mounted ā€˜home-corrupted’ and restored all healthy files back to the now active ā€˜home’ (only two files were damaged).
  • Now all the required data (root and home) are located on healthy subvolumes of this Btrfs, but I have a damaged subvolume (called home-corrupted) on the same device.
  • Best option: some way to eliminate the errors on the device in order to avoid reformatting.
  • Second-best option: full backup, reformat, and restore.

I guess on a simpler ext4 drive, an fsck would have already solved the issue of the remaining fs errors.

Currently I would consider an attempt to delete the subvolume ā€˜home-corrupted’ (the data has already been copied to another subvolume) in order to get rid of the file system errors, but I am unsure whether this is good enough or whether it may cause additional problems.

After that, I would try the --repair option.

Should I try this now, or wait for your (or the developers’) advice?

Best regards
Thomas

OK now I understand.

I recommend btrfs subvolume delete home-corrupted, which should complete immediately. A background cleaner thread will eventually kick off to do the work of removing the subvolume and freeing the extents. This might take a minute or two.

In a separate terminal tab you can journalctl -fk to monitor any kernel messages that might indicate further issues.

I’d follow this up with a scrub, btrfs scrub start /, and see if that comes up clean or not. Then reboot the LiveOS and run another fsck, btrfs check /dev/nvme1n1p4.
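Putting those steps together, a minimal sketch (the subvolume path is a placeholder; check it with btrfs subvolume list /):

# in a separate terminal tab, watch for new kernel errors:
journalctl -fk
# on the running system:
btrfs subvolume delete /path/to/home-corrupted
btrfs scrub start -B /
# then, booted from the LiveOS with the filesystem unmounted:
btrfs check /dev/nvme1n1p4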

The scrub checks the metadata and data checksums for every in-use block; it just tells us whether changes occurred since a block was written, not whether the file system is correct. The offline check tool is a file system consistency check, making sure the file system is correct.

I guess on a simpler ext4 drive, an fsck would have already solved the issue of the remaining fs errors.

It’s speculation, but it’s probably true. e2fsck has the benefit of a very long history and a file system with fixed locations, so it’s possible for the tool to infer a lot about what each metadata block should look like. Btrfs metadata doesn’t have any fixed locations, which is a side effect of COW. That should make it safer compared to overwrites, but as a consequence it’s difficult to do repairs by inference.

But ext4 only checksums metadata. Btrfs also checksums data. That means there’s a much bigger target for any corruption, and therefore Btrfs is more likely to detect problems than other file systems.

Anyway, the nature of the problem you encountered isn’t yet understood. I can’t tell you with certainty whether this is the result of a Btrfs bug or a hardware issue. My advice is biased toward Btrfs plus backups.

Btrfs protects the working copy by warning early if there’s a problem, and everything is unambiguous. The backup protects the work effort. With send/receive replication of snapshots, this is very cheap to do often, losing very little if disaster strikes. (See btrbk for automation assistance with such a backup workflow, including remote backup via ssh.)
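For illustration, a bare-bones version of that replication, which btrbk automates (paths and snapshot names are placeholders; the destination must itself be a Btrfs filesystem):

# read-only snapshot, required for send
btrfs subvolume snapshot -r /home /home/.snapshots/home.2025-09-01
btrfs send /home/.snapshots/home.2025-09-01 | btrfs receive /backup/
# later runs send only the difference relative to a common parent snapshot:
btrfs subvolume snapshot -r /home /home/.snapshots/home.2025-09-08
btrfs send -p /home/.snapshots/home.2025-09-01 /home/.snapshots/home.2025-09-08 | btrfs receive /backup/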