BTRFS no longer mounts

I am having a hard btrfs failure. The symptom started when my laptop locked up for about 12 hours while downloading, compiling, and juggling a ton of Chromium/Firefox tabs.
In situations like this I usually just hit the hard reset and continue on my way. That usually works, and I attempted it here. However, on reboot the system dropped me to the emergency console, saying that the btrfs partition with everything on it could not be mounted.
I have a Fedora 40 KDE thumbdrive handy, and get the following set of errors when running almost any btrfs command:

liveuser@localhost-live:~$ sudo mount /dev/nvme0n1p3 /mnt/orig
mount: /mnt/orig: can't read superblock on /dev/nvme0n1p3.
       dmesg(1) may have more information after failed mount system call.
[23013.297352] BTRFS: device label fedora devid 1 transid 420465 /dev/nvme0n1p3 scanned by pool-udisksd (14419)
[23013.301458] BTRFS info (device nvme0n1p3): first mount of filesystem e40e2cfb-83f5-48ff-a481-bf9f0cc22543
[23013.301475] BTRFS info (device nvme0n1p3): using crc32c (crc32c-intel) checksum algorithm
[23013.301479] BTRFS info (device nvme0n1p3): using free-space-tree
[23013.303686] BTRFS error (device nvme0n1p3): bad tree block start, mirror 1 want 823396630528 have 0
[23013.303789] BTRFS error (device nvme0n1p3): bad tree block start, mirror 2 want 823396630528 have 0
[23013.303800] BTRFS warning (device nvme0n1p3): couldn't read tree root
[23013.304150] BTRFS error (device nvme0n1p3): open_ctree failed
[27001.727157] BTRFS: device label fedora devid 1 transid 420465 /dev/nvme0n1p3 scanned by mount (16522)
[27001.728732] BTRFS info (device nvme0n1p3): first mount of filesystem e40e2cfb-83f5-48ff-a481-bf9f0cc22543
[27001.728745] BTRFS info (device nvme0n1p3): using crc32c (crc32c-intel) checksum algorithm
[27001.728748] BTRFS info (device nvme0n1p3): using free-space-tree
[27001.729786] BTRFS error (device nvme0n1p3): bad tree block start, mirror 1 want 823396630528 have 0
[27001.729887] BTRFS error (device nvme0n1p3): bad tree block start, mirror 2 want 823396630528 have 0
[27001.729891] BTRFS warning (device nvme0n1p3): couldn't read tree root
[27001.730126] BTRFS error (device nvme0n1p3): open_ctree failed
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
checksum verify failed on 823396630528 wanted 0x00000000 found 0x8b095422
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
bad tree block 823396630528, bytenr mismatch, want=823396630528, have=0
Couldn't read tree root
ERROR: cannot open file system

I spent a good day looking for commands to diagnose and fix the issue, to no avail.
Following is a series of btrfs commands whose output does not give me anything I can use:

checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
checksum verify failed on 823396630528 wanted 0x00000000 found 0x8b095422
Couldn't read tree root
ERROR: could not open ctree
liveuser@localhost-live:~$ sudo btrfs rescue super-recover /dev/nvme0n1p3
All supers are valid, no need to recover
liveuser@localhost-live:~$ sudo btrfs inspect-internal rootid /dev/nvme0n1p3
ERROR: not a btrfs filesystem: /dev/nvme0n1p3
liveuser@localhost-live:~$ sudo btrfs-find-root /dev/nvme0n1p3
Couldn't read tree root
Superblock thinks the generation is 420465
Superblock thinks the level is 0
Well block 822913368064(gen: 419921 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 822850994176(gen: 419920 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 822769500160(gen: 419917 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 822661070848(gen: 419903 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 822652043264(gen: 419902 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 822608560128(gen: 419901 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 822536175616(gen: 419890 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 822557655040(gen: 419889 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 335032057856(gen: 419888 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334896906240(gen: 419873 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334756773888(gen: 419858 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334651310080(gen: 419853 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334631600128(gen: 419852 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334524530688(gen: 419839 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334506164224(gen: 419838 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334489812992(gen: 419837 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334387347456(gen: 419826 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334370078720(gen: 419825 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334333640704(gen: 419824 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334192967680(gen: 419821 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334104657920(gen: 419818 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
Well block 334067269632(gen: 419817 level: 0) seems good, but generation/level doesn't match, want gen: 420465 level: 0
<< snip 190 more entries>>
liveuser@localhost-live:~$ sudo btrfs check /dev/nvme0n1p3
Opening filesystem to check...
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
checksum verify failed on 823396630528 wanted 0x00000000 found 0x8b095422
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
bad tree block 823396630528, bytenr mismatch, want=823396630528, have=0
Couldn't read tree root
ERROR: cannot open file system

Any time I try to do anything to recover, I only get the above set of “Couldn’t read tree root” errors.
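
For reference, this is the kind of invocation I was attempting against the alternate roots that btrfs-find-root reported (a sketch; the bytenr is the newest candidate from the list above):

```shell
# Read-only check against an older tree root; 822913368064 is the newest
# "Well block" candidate reported by btrfs-find-root above.
sudo btrfs check --tree-root 822913368064 /dev/nvme0n1p3
```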
Checking the NVMe drive for hardware failure shows nothing:

liveuser@localhost-live:~$ sudo smartctl -x /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.5-301.fc40.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SOLIDIGM SSDPFKKW020X7
Serial Number:                      SSC6N492010506M4K
Firmware Version:                   001C
PCI Vendor/Subsystem ID:            0x025e
IEEE OUI Identifier:                0xace42e
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            aca32f 036500880f
Local Time is:                      Wed May  1 16:23:33 2024 EDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     86 Celsius
Critical Comp. Temp. Threshold:     87 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        5     305
 1 +   3.9000W       -        -    1  1  1  1       30     330
 2 +   1.5000W       -        -    2  2  2  2      100     400
 3 -   0.0500W       -        -    3  3  3  3      500    1500
 4 -   0.0050W       -        -    4  4  4  4     1000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    385,133,057 [197 TB]
Data Units Written:                 20,969,617 [10.7 TB]
Host Read Commands:                 2,179,118,293
Host Write Commands:                434,322,982
Controller Busy Time:               30,833
Power Cycles:                       32
Power On Hours:                     3,786
Unsafe Shutdowns:                   18
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               37 Celsius
Thermal Temp. 1 Transition Count:   5
Thermal Temp. 1 Total Time:         508

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

I do not have a backup of the critical files (about 10 GB) on this system. There is also about 1 TB of data I would like to keep in place. Is there a way to rebuild the missing file tree? Can I copy (part of) the file system off the drive?
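
Based on the btrfs-progs documentation, this is the kind of read-only extraction I am hoping might work for copying files off (a sketch, untried as of this writing; the bytenr is a candidate from btrfs-find-root above, and the target directory is just an example path):

```shell
# Dry run: list what btrfs restore could recover without writing anything.
# -t points restore at an older tree root candidate from btrfs-find-root.
sudo btrfs restore -D -t 822913368064 /dev/nvme0n1p3 /run/media/liveuser/Backup/restore
```

Dropping -D would copy the recoverable files into the target directory.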


It sounds like you likely have a failing drive, especially given:

In situations like this I usually just hit the hard reset, and continue on my way. This usually works

That’s not a normal thing you should have to do, and cutting the power while there’s a ton of IO in flight isn’t particularly healthy for the drive either. Also, NVMe drives are generally very fast, so the fact that you were getting regular IO lockups is another bad sign.


First, you could try a scrub of the device, which verifies your checksums (and repairs from a good copy where one exists), but it will take time for sure. Boot from the live USB and run the scrub on the drive. There are other rescue options, but try the scrub first: it doesn’t really “fix” anything on the device, it just re-checks the checksums.
If you can get a kernel backtrace, it can have some info about what is happening. If it includes frames like the following …

? replay_one_dir_item+0xb5/0xb5 [btrfs]
? walk_log_tree+0x9c/0x19d [btrfs]
? btrfs_read_fs_root_no_radix+0x169/0x1a1 [btrfs]
? btrfs_recover_log_trees+0x195/0x29c [btrfs]
? replay_one_dir_item+0xb5/0xb5 [btrfs]
? btree_read_extent_buffer_pages+0x76/0xbc [btrfs]
? open_ctree+0xff6/0x132c [btrfs]

then a sudo btrfs rescue zero-log <device> should fix things, but don’t do it unless your kernel log backtrace supports this premise.
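
A minimal sketch of that check-then-fix sequence, assuming the device path from the original post:

```shell
# Reproduce the failed mount to capture a fresh kernel backtrace:
sudo mount /dev/nvme0n1p3 /mnt
sudo dmesg | tail -n 50
# ONLY if the backtrace shows log-replay frames like the ones above:
sudo btrfs rescue zero-log /dev/nvme0n1p3
```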


That is what I’m afraid of. The laptop only has a single NVMe slot, so I don’t have a mirror setup. The disk is relatively new, less than 8 months old. The lockup issue seems to come from Chromium, or web browsing in general. This behavior also happened on another NVMe drive and on an external USB boot disk I used for a while, so given that prior behavior I had discounted the drive as the culprit.

As far as using “btrfs scrub”, I get this result:

root@localhost-live:/home/liveuser# btrfs scrub start -B -R /dev/nvme0n1p3 > /run/media/liveuser/Backup/scrub_results.txt
ERROR: '/dev/nvme0n1p3' is not a mounted btrfs device

I can’t seem to get past the mounting error, as noted in the original post.

How would I get a kernel backtrace for the failed btrfs mount from a live copy of the Fedora 40 KDE spin?

As mentioned, “unsafe shutdowns” can damage filesystems. Have you tried getting a text console with Ctrl-Alt-F3 or connecting via ssh from another system? It may help with recovery to understand more about the lockups. In addition to a failing SSD, heat can cause lockups. How hot does the system get, and have you checked that the cooling system is free of accumulated dust, etc. (see: Intel Overheating Symptoms and Troubleshooting for Intel® Boxed Processors)? There have been reports of problems in Linux with some CPUs marketed with overly aggressive clock speeds.

It is best if you can clone the disk (e.g., using GNOME Disks or dd after booting a live Linux USB drive) before attempting any repair.
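
A minimal sketch of a block-level clone with dd; /dev/sdX is a placeholder for your actual backup device, so verify device names with lsblk first, since dd overwrites its target:

```shell
# Identify the source and target devices first; dd will overwrite the target.
lsblk -o NAME,SIZE,MODEL
# Clone the whole NVMe to an equal-or-larger device, continuing past read
# errors; /dev/sdX is a placeholder for your actual backup drive.
sudo dd if=/dev/nvme0n1 of=/dev/sdX bs=4M status=progress conv=sync,noerror
```

GNU ddrescue copes better with failing media, if it is available.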


I had something similar on an SSD a while ago, in fact a couple of times. The SSD was not the problem; it still works today.

I think the problem was always related to space issues on the BTRFS volume, specifically running out of unallocated space. BTRFS was never clear about what the problem was exactly. I tried all the restore methods and everything looked fine, only to realize I had run out of space to allocate.

I would like to thank you for your help, and the team’s continued support as I work through this.
I am currently backing up the 2 TB partition.
As a point of order, the drive is only about 56% full (1 TB out of 1.86 TB).
The SMART output shows no overheating issues for the NVMe drive (n.b. the original post). The airflow is in good shape.
My processor is 11th-gen Intel, so it’s not suffering from the 13th/14th-gen configuration problem.
I’ve run memory tests (but not in the last month).

11th-gen Intel is not that old. Do SSDs typically fail after less than 3 years?

This doesn’t indicate heavy wear either.

They will if overheated or subjected to electrostatic discharge (ESD). Did the person who installed the SSDs use ESD mitigation practices?

Quoting makeuseof.com:

ESD damage isn’t always immediately apparent. While it will often damage electronic components beyond repair, this damage may not be fatal right away. This means that electronics that have been hit with ESD can end up breaking weeks or even months down the line.

Hi, I wanted to try that first since it is a safe action that shouldn’t corrupt data. You can try to mount the volume from the live USB image by creating a temporary directory, mounting the drive there, and then running the scrub. There are other tools, such as restore and rescue, that we could try later after exhausting checks for the actual error. Mounting the drive from the live USB will also generate the kernel stack trace if the error occurs. Basically, your device is not mounting because of a mismatch when reading the tree root: the block reports 0 where some other number is expected. This was likely caused by a forced power-off before the filesystem could complete an orderly shutdown. I think the device is very likely still OK.
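
A sketch of that sequence from the live USB (device path taken from the earlier posts):

```shell
# Create a temporary mount point and attempt the mount; a failure here
# leaves a stack trace in the kernel log (dmesg or journalctl -k).
sudo mkdir -p /mnt/orig
sudo mount /dev/nvme0n1p3 /mnt/orig
# If the mount succeeds, scrub the mounted filesystem in the foreground:
sudo btrfs scrub start -B /mnt/orig
```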

[quote=“Stephen Snow, post:13, topic:115851, username:jakfrost”]
This is likely caused by forced power off before the filesystem was able to complete an orderly shutdown. I think the device is very likely still ok.
[/quote]

The drive is still suspect until the cause of lockups is identified – hopefully there are hints available using journalctl if the drive can be recovered.

Yes, I tend to practice good ESD mitigation. After seeing X-rays of accidentally zapped chips, I treat my electronics with a bit more caution.

The journalctl output is as noted:

May 01 08:40:22 localhost-live polkit-kde-authentication-agent-1[2720]: qt.qml.typeresolution.cycle: Cyclic dependency detected between "qrc:/qt/qml/org/kde/desktop/private/TextFieldContextMenu.qml" and "qrc:/qt/qml/org/kde/desktop/MenuItem.qml"
May 01 08:40:22 localhost-live polkit-agent-helper-1[3664]: pam_unix(polkit-1:auth): user [liveuser] has blank password; authenticated without it
May 01 08:40:22 localhost-live audit[3664]: USER_AUTH pid=3664 uid=1000 auid=1000 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_unix acct="liveuser" exe="/usr/lib/polkit-1/polkit-agent-helper-1" hostname=? addr=? terminal=? res=success'
May 01 08:40:22 localhost-live audit[3664]: USER_ACCT pid=3664 uid=1000 auid=1000 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:accounting grantors=pam_unix acct="liveuser" exe="/usr/lib/polkit-1/polkit-agent-helper-1" hostname=? addr=? terminal=? res=success'
May 01 08:40:22 localhost-live polkitd[1854]: Operator of unix-session:1 successfully authenticated as unix-user:liveuser to gain TEMPORARY authorization for action org.freedesktop.udisks2.filesystem-mount-system for system-bus-name::1.83 [/usr/bin/dolphin] (owned by unix-user:liveuser)
May 01 08:40:22 localhost-live maliit-keyboard[2556]: QSGContext::initialize: depth buffer support missing, expect rendering errors
May 01 08:40:22 localhost-live maliit-keyboard[2556]: QSGContext::initialize: stencil buffer support missing, expect rendering errors
May 01 08:40:22 localhost-live kernel: BTRFS: device label fedora devid 1 transid 420465 /dev/nvme0n1p3 scanned by pool-udisksd (3656)
May 01 08:40:22 localhost-live kernel: BTRFS info (device nvme0n1p3): first mount of filesystem e40e2cfb-83f5-48ff-a481-bf9f0cc22543
May 01 08:40:22 localhost-live kernel: BTRFS info (device nvme0n1p3): using crc32c (crc32c-intel) checksum algorithm
May 01 08:40:22 localhost-live kernel: BTRFS info (device nvme0n1p3): using free-space-tree
May 01 08:40:22 localhost-live kernel: BTRFS error (device nvme0n1p3): bad tree block start, mirror 1 want 823396630528 have 0
May 01 08:40:22 localhost-live kernel: BTRFS error (device nvme0n1p3): bad tree block start, mirror 2 want 823396630528 have 0
May 01 08:40:22 localhost-live kernel: BTRFS warning (device nvme0n1p3): couldn't read tree root
May 01 08:40:22 localhost-live kernel: BTRFS error (device nvme0n1p3): open_ctree failed

and more recently:

May 02 10:53:22 localhost-live polkit-kde-authentication-agent-1[2720]: qt.qml.typeresolution.cycle: Cyclic dependency detected between "qrc:/qt/qml/org/kde/desktop/private/TextFieldContextMenu.qml" and "qrc:/qt/qml/org/kde/desktop/MenuItem.qml"
May 02 10:53:22 localhost-live polkit-agent-helper-1[22042]: pam_unix(polkit-1:auth): user [liveuser] has blank password; authenticated without it
May 02 10:53:22 localhost-live audit[22042]: USER_AUTH pid=22042 uid=1000 auid=1000 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_unix acct="liveuser" exe="/usr/lib/polkit-1/polkit-agent-helper-1" hostname=? addr=? terminal=? res=success'
May 02 10:53:22 localhost-live audit[22042]: USER_ACCT pid=22042 uid=1000 auid=1000 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:accounting grantors=pam_unix acct="liveuser" exe="/usr/lib/polkit-1/polkit-agent-helper-1" hostname=? addr=? terminal=? res=success'
May 02 10:53:22 localhost-live polkitd[1854]: Operator of unix-session:1 successfully authenticated as unix-user:liveuser to gain TEMPORARY authorization for action org.freedesktop.udisks2.filesystem-mount-system for system-bus-name::1.430 [/usr/bin/dolphin /run/media/liveuser/Games] (owned by unix-user:liveuser)
May 02 10:53:22 localhost-live kernel: BTRFS: device label fedora devid 1 transid 420465 /dev/nvme0n1p3 scanned by pool-udisksd (22031)
May 02 10:53:22 localhost-live kernel: BTRFS info (device nvme0n1p3): first mount of filesystem e40e2cfb-83f5-48ff-a481-bf9f0cc22543
May 02 10:53:22 localhost-live kernel: BTRFS info (device nvme0n1p3): using crc32c (crc32c-intel) checksum algorithm
May 02 10:53:22 localhost-live kernel: BTRFS info (device nvme0n1p3): using free-space-tree
May 02 10:53:22 localhost-live kernel: BTRFS error (device nvme0n1p3): bad tree block start, mirror 1 want 823396630528 have 0
May 02 10:53:22 localhost-live kernel: BTRFS error (device nvme0n1p3): bad tree block start, mirror 2 want 823396630528 have 0
May 02 10:53:22 localhost-live kernel: BTRFS warning (device nvme0n1p3): couldn't read tree root
May 02 10:53:22 localhost-live kernel: BTRFS error (device nvme0n1p3): open_ctree failed

Do those BTRFS error messages correspond to lockups, or are they from boots following an unsafe shutdown? It helps to note the time a lockup occurs so you can look for journalctl entries from when the issue happened.
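
For example (the boot index and time window are placeholders to adjust to when a lockup actually happened):

```shell
# List recorded boots, then pull kernel messages around the lockup:
journalctl --list-boots
sudo journalctl -k -b -2 --since "2024-05-01 08:00" --until "2024-05-01 09:00"
```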

I don’t get it. I do abrupt, unsafe shutdowns of BTRFS filesystems all the time, and I never get any errors.

Not all hardware is created equal.


Is that really hardware-dependent? If it’s abrupt, it’s abrupt, so wouldn’t it be a software issue?

I’d say you’ve just been lucky.