BTRFS no longer mounts

It is a little bit from Column A and Column B.
Chromium has a tendency to livelock the system, especially with the number of tabs I keep open. My typical response is to wait, which is the best answer but doesn’t always work. Next I try another terminal to run “shutdown now” and restart, which sometimes works; the same goes for logging in via SSH from outside. Failing that, I hit the power button as a hard reset, which until this time had always worked.
I think that in this case it was the improper shutdown; at least I’m guessing. The hard part is that there is no troubleshooting flowchart, and many of the solutions online seem to be auto-generated. I could note the large number of suggested command-line switches that don’t actually exist.

Memory is not saturated when this happens, and stress tests come up with nothing.

As for the listed btrfs errors, these are all from the bad shutdown, and the drive is currently behaving as follows:

  1. It will not mount; journalctl reports “bad tree block start”.
  2. Checksums are not valid for the mirrors: two copies agree with each other, but all three are wrong.
  3. All superblocks register as O.K.
  4. The “generation:level” values across about 210 blocks seem to be off.
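
The superblock claims in points 3 and 4 can be sanity-checked directly against a raw image with nothing but the standard library. A minimal sketch, assuming the documented superblock mirror offsets and field layout from the btrfs on-disk format docs (the image path is a placeholder):

```python
# Sketch: check the three btrfs superblock mirrors in a raw image.
# Offsets and field layout follow the btrfs on-disk format docs.
import struct

SUPER_OFFSETS = (0x1_0000, 0x400_0000, 0x40_0000_0000)  # 64 KiB, 64 MiB, 256 GiB
BTRFS_MAGIC = b"_BHRfS_M"  # magic sits at byte 64 of the superblock

def read_super(path, offset):
    """Return (generation, root_bytenr) for the superblock copy at
    `offset`, or None if the copy is absent or the magic is wrong."""
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(4096)
    if len(raw) < 4096 or raw[64:72] != BTRFS_MAGIC:
        return None
    # csum[32] fsid[16] bytenr(8) flags(8) magic(8) generation(8) root(8) ...
    generation, root_bytenr = struct.unpack_from("<QQ", raw, 72)
    return generation, root_bytenr

def check_mirrors(path):
    for off in SUPER_OFFSETS:
        info = read_super(path, off)
        print(f"superblock @ {off:#x}: {info or 'missing or bad magic'}")
```

If all superblocks really are O.K., the surviving copies should report the same generation; a generation mismatch between mirrors is itself a useful clue about which writes were lost.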

I don’t need to get the filesystem back into bootable shape; I just need a read-only path to pull the critical data off. After that I can clobber the partition and start over with a copy of Fedora 40 KDE. It’s currently running a copy of Fedora 39 GNOME.

As a point of order, none of the critical data was being worked on at the time, so if I lose a day’s worth of filesystem changes, that is fine.

Yes, it is hardware dependent. Some hardware has capacitors rated to save all queued writes; other hardware may not have that safety built in.


That is the reason why some HDDs are rated as internal versus external drives. Some make it into servers; some are rated only for external USB use. “Shucked” drives usually have problems with sleep modes and other features. Also, the silicon is never perfect.

Have you tried btrfs repair?

If you can recover the filesystem, journalctl may tell you why Chromium is causing the problem. If using a terminal or SSH works, the kernel is still alive, and journalctl should have relevant data from those previous boots.

btrfs repair is not an option on F40 KDE spin, version 6.8-1.fc40:

  • receive
  • replace
  • rescue

are available options.

Sorry, I meant btrfs check --repair /path/to-my/drive, run from a LiveUSB and pointed at the drive.

Results of check repair:

root@localhost-live:/home/liveuser# btrfs check --repair /dev/nvme0n1p3
enabling repair mode
WARNING:

        Do not use --repair unless you are advised to do so by a developer
        or an experienced user, and then only after having accepted that no
        fsck can successfully repair all types of filesystem corruption. E.g.
        some software or hardware bugs can fatally damage a volume.
        The operation will start in 10 seconds.
        Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
checksum verify failed on 823396630528 wanted 0x00000000 found 0x8b095422
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
bad tree block 823396630528, bytenr mismatch, want=823396630528, have=0
Couldn't read tree root
ERROR: cannot open file system
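
For context on those “checksum verify failed … wanted X found Y” lines: btrfs stores a checksum in the first 32 bytes of every metadata block and recomputes it over the rest of the block on read; a mismatch means the block contents no longer match what was written. The default algorithm is CRC-32C (Castagnoli). A bitwise sketch, with a hypothetical verify helper that assumes the 4-byte checksum sits at the start of the 32-byte csum area:

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli, reflected poly 0x82F63B78) --
    the default checksum btrfs applies to its blocks."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def verify_tree_block(block: bytes) -> bool:
    """Recompute the checksum over everything past the 32-byte csum
    area and compare it with the stored little-endian value."""
    stored = int.from_bytes(block[:4], "little")
    return stored == crc32c(block[32:])
```

A production implementation would use a table-driven or hardware CRC, but the arithmetic is the same.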

@wrotach There may be a way forward with btrfs rescue chunk-recover, also run from a LiveUSB device.

I would look into recovery tools if you have important or sensitive data. I would recommend some, but they all require the filesystem to be mounted. Cloning the bad drive might be an option to preserve what you have as a means to explore a repair.
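
On cloning: ddrescue is the usual tool for imaging a failing drive, and it is the better choice here. Purely to illustrate the idea, a crude stand-in that copies chunk by chunk and zero-fills any chunk that errors out, so the good data past a bad region still makes it into the image:

```python
import os

CHUNK = 1 << 20  # 1 MiB per read

def clone_with_skip(src, dst):
    """Copy `src` to `dst`, zero-filling any chunk that raises an I/O
    error. Returns the offsets of the chunks that failed.
    (Illustrative only -- ddrescue does this properly, with retries.)"""
    bad = []
    with open(src, "rb", buffering=0) as s, open(dst, "wb") as d:
        size = os.lseek(s.fileno(), 0, os.SEEK_END)  # works for block devices too
        off = 0
        while off < size:
            want = min(CHUNK, size - off)
            try:
                os.lseek(s.fileno(), off, os.SEEK_SET)
                data = os.read(s.fileno(), want)
                data = data.ljust(want, b"\x00")  # pad short reads to keep offsets aligned
            except OSError:
                data = b"\x00" * want
                bad.append(off)
            d.write(data)
            off += want
    return bad
```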

Little to no success:

root@localhost-live:/home/liveuser# btrfs rescue chunk-recover /dev/nvme0n1p3
Scanning: DONE in dev0                         
Check chunks successfully with no orphans
root@localhost-live:/home/liveuser# btrfs check --repair /dev/nvme0n1p3
enabling repair mode
WARNING:

        Do not use --repair unless you are advised to do so by a developer
        or an experienced user, and then only after having accepted that no
        fsck can successfully repair all types of filesystem corruption. E.g.
        some software or hardware bugs can fatally damage a volume.
        The operation will start in 10 seconds.
        Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
checksum verify failed on 823396630528 wanted 0x00000000 found 0x8b095422
checksum verify failed on 823396630528 wanted 0x02000000 found 0xabe960d0
bad tree block 823396630528, bytenr mismatch, want=823396630528, have=0
Couldn't read tree root
ERROR: cannot open file system

That’s about the limit of my experience with BTRFS; as for recovery, I’m not sure you will get far. If your drive is encrypted it would be even more difficult, as the header could be corrupt. If the data you need is really important, there might be hope with recovery tools like DMDE (https://dmde.com/download.html).

I’m not sure the forensic tools offered in Fedora Security Lab would be enough, since you can’t even mount the filesystem to attempt it.

Manually editing this can have adverse effects…

I’m thinking this is really bad, folks. I have a couple of backup files to compare and test against for safety, but I’m hitting a stone wall.

Starting with a find-root pass, btrfs-find-root -a fc39.raw.disk, I get a really long list, roughly 200 entries, of “bad blocks”.

I’m getting the following results using a series of btrfs check -r <block> -s <super> fc39.raw.disk runs.

The first in the chain gives this:

using SB copy 2, bytenr 274877906944
Opening filesystem to check...
parent transid verify failed on 822913368064 wanted 420465 found 419921
checksum verify failed on 822913368064 wanted 0x47180c50 found 0xad52f1e7
checksum verify failed on 822913368064 wanted 0x47180c50 found 0xad52f1e7
bad tree block 822913368064, bytenr mismatch, want=822913368064, have=0
Couldn't read tree root
ERROR: cannot open file system

Assuming that “couldn’t read tree root” indicates a deeper problem, I searched for a block that didn’t have that issue, had the smallest number of checksum errors and the like, and was closest to the top of the chain. Results:

   btrfs check -r 822769500160 -s 0 fc39.raw.disk
using SB copy 0, bytenr 65536
Opening filesystem to check...
parent transid verify failed on 822769500160 wanted 420465 found 419917
checksum verify failed on 822769500160 wanted 0x40a0ba19 found 0x14b83719
parent transid verify failed on 822769500160 wanted 420465 found 419917
Ignoring transid failure
checksum verify failed on 823433723904 wanted 0x00a06b01 found 0x456ef27c
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e
bad tree block 823433723904, bytenr mismatch, want=823433723904, have=18446635083965854016
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system
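
The “bytenr mismatch, want=…, have=…” lines come straight from the tree-block header: every btrfs node records its own byte offset, and the tool compares that stored value against the offset it actually read from. have=0 means the header is zeroed out; the huge have= value above is what a scrambled header decodes to. A small decoder for the fields involved, with the layout taken from the on-disk format docs:

```python
import struct

def parse_tree_header(path, bytenr):
    """Decode the btrfs tree-block header at byte offset `bytenr`.
    Layout: csum[32] fsid[16] bytenr(8) flags(8) chunk_tree_uuid[16]
    generation(8) owner(8) nritems(4) level(1)."""
    with open(path, "rb") as f:
        f.seek(bytenr)
        hdr = f.read(101)
    stored_bytenr, = struct.unpack_from("<Q", hdr, 48)
    generation, owner = struct.unpack_from("<QQ", hdr, 80)
    nritems, = struct.unpack_from("<I", hdr, 96)
    return {
        "bytenr": stored_bytenr,   # should equal the offset we read from
        "generation": generation,  # the "generation" btrfs-find-root reports
        "owner": owner,
        "nritems": nritems,
        "level": hdr[100],         # the "level"; 0 means leaf
    }
```

Comparing `bytenr` against the offset you asked for reproduces the want/have check, and `generation`/`level` are the same pair btrfs-find-root lists for each candidate.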

I then tried the repair switch, with this result:

$ btrfs check --repair --tree-root 822769500160 -s 0 fc39.raw.disk

enabling repair mode
using SB copy 0, bytenr 65536
WARNING:

        Do not use --repair unless you are advised to do so by a developer
        or an experienced user, and then only after having accepted that no
        fsck can successfully repair all types of filesystem corruption. E.g.
        some software or hardware bugs can fatally damage a volume.
        The operation will start in 10 seconds.
        Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
parent transid verify failed on 822769500160 wanted 420465 found 419917
checksum verify failed on 822769500160 wanted 0x40a0ba19 found 0x14b83719
parent transid verify failed on 822769500160 wanted 420465 found 419917
Ignoring transid failure
checksum verify failed on 823433723904 wanted 0x00a06b01 found 0x456ef27c
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e
bad tree block 823433723904, bytenr mismatch, want=823433723904, have=18446635083965854016
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system

I refuse to believe that recovery from a bad shutdown is this hard or hidden.

Yeah, I sort of alluded to this in my previous post.

You have corrupted data; this was not just a bad shutdown. You almost certainly have had issues in the past with the drive and/or the filesystem itself, and this is just the compounded result.


It seems like a “bit flip” problem caused by faulty RAM, at least in my limited experience.

checksum verify failed on 823433723904 wanted 0x00a06b01 found 0x456ef27c
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e

You can try mounting the filesystem in “rescue” mode:
sudo mount -o ro,rescue=all /dev/nvme0n1p3 /mnt

As a last option, if it is indeed a “bit flip” problem and you have created a backup of the partition, you can take a look at this page and see if it helps: btrfs-bitflip.md · GitHub

Comment on this solution from the Btrfs maintainer: Reddit - Dive into anything
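
The core idea in that gist is simple enough to sketch: if a single flipped bit is the only damage to a block, you can brute-force it by flipping each bit in turn until the checksum matches again. Illustrative only, using zlib.crc32 as a stand-in for the crc32c that btrfs actually uses:

```python
import zlib

def find_single_bitflip(block: bytes, want: int):
    """Flip each bit in turn; if some flip makes the CRC match `want`,
    return (bit_index, repaired_block), else None.
    Uses zlib.crc32 as a stand-in for btrfs's crc32c."""
    buf = bytearray(block)
    for bit in range(len(buf) * 8):
        buf[bit // 8] ^= 1 << (bit % 8)
        if zlib.crc32(bytes(buf)) == want:
            return bit, bytes(buf)
        buf[bit // 8] ^= 1 << (bit % 8)  # undo and try the next bit
    return None
```

This only helps when the stored checksum is intact and exactly one bit of the block flipped, which matches the faulty-RAM theory above.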


As USB was unstable, I moved to a desktop with the hard drive containing the backup copies.
I attempted the prior fix sequence, double-checking the raw disk file permissions and ownership, and I got the same result:

# btrfs check --repair --tree-root 822769500160 -s 0 fc39.raw.disk
enabling repair mode
using SB copy 0, bytenr 65536
WARNING:

        Do not use --repair unless you are advised to do so by a developer
        or an experienced user, and then only after having accepted that no
        fsck can successfully repair all types of filesystem corruption. E.g.
        some software or hardware bugs can fatally damage a volume.
        The operation will start in 10 seconds.
        Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
parent transid verify failed on 822769500160 wanted 420465 found 419917
checksum verify failed on 822769500160 wanted 0x40a0ba19 found 0x14b83719
parent transid verify failed on 822769500160 wanted 420465 found 419917
Ignoring transid failure
checksum verify failed on 823433723904 wanted 0x00a06b01 found 0x456ef27c
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e
checksum verify failed on 823433723904 wanted 0x00406def found 0x1d612e9e
bad tree block 823433723904, bytenr mismatch, want=823433723904, have=18446635083965854016
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system

As I’m doing this on a local copy of the partition, I can see that no changes have been made to 75% of the binary file despite all the attempts at --repair. I’ll take a deeper look at the Python code and attempt a repair at that level.

FYI, I’m still looking into this. It is a deep rabbit hole, but I think I can get some kind of file-recovery support from the following:
https://btrfs.readthedocs.io/en/latest/dev/dev-btrfs-design.html
https://btrfs.readthedocs.io/en/latest/dev/On-disk-format.html

A quick update: I found a filesystem copy from November, so I have a fallback. I would rather lose six months than the whole of the system.

However, I’m creating an application to scan through the backup copy and display the different potential FS/data trees, in the hope that I can retrieve more recent documents. This is going to take a few weekends, as I am starting from the Linux kernel source for btrfs, with some of the above links to make sense of the on-disk format.
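
The first pass of such a scanner can be very small: walk nodesize-aligned offsets in the image and report every tree-block header whose fsid matches the filesystem's, which is roughly the candidate list btrfs-find-root enumerates. A sketch, assuming the modern default nodesize of 16 KiB:

```python
import struct

NODESIZE = 16384  # default btrfs nodesize; an assumption for this image

def scan_candidates(path, fsid: bytes):
    """Yield (offset, generation, level) for every nodesize-aligned
    block whose header carries the filesystem's fsid -- candidate
    tree blocks to feed to `btrfs check -r <offset>`."""
    with open(path, "rb") as f:
        off = 0
        while True:
            f.seek(off)
            hdr = f.read(101)
            if len(hdr) < 101:
                break
            if hdr[32:48] == fsid:  # fsid sits at bytes 32..47 of the header
                generation, = struct.unpack_from("<Q", hdr, 80)
                yield off, generation, hdr[100]  # byte 100 is the level
            off += NODESIZE
```

Sorting the hits by generation (descending) and level gives a priority order for recovery attempts: the newest high-level nodes are the most likely surviving roots.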


I’ve learned the hard lesson before and had to pay big money to forensically recover a drive. Now I make sure I have both an on-site and an off-site backup for everything I care about.