Btrfs woes

I ran into btrfs issues on a Fedora 38 btrfs install. The filesystem flipped to read-only, and then failed to mount on reboot. It’s a two-device btrfs setup, with a 1.8 TB NVMe partition and a 137 GB SSD.
The mount errors refer to checksum failures and a failure to read the superblock, and actually trigger a kernel panic (see below).

It doesn’t seem to be a hardware error: I successfully copied the NVMe partition to an external drive with ddrescue, and SMART/NVMe logs do not show any errors.

btrfs filesystem show reports the two-device setup described above. I tried mount -t btrfs -o recovery,ro and it fails with the same checksum errors (also noting that -o recovery is deprecated and rescue=usebackuproot should be used instead). I haven’t seen that option mentioned anywhere else, so I haven’t tried it yet.

There are a bunch of options in the btrfs tools, and internet advice goes in several directions, so I thought I’d ask for consensus. Should I try mount -o ro,rescue=usebackuproot, or proceed to btrfs check --repair, or maybe btrfs check --init-csum-tree? There are ominous warnings about those, so I wasn’t sure what to do even though I do have a snapshot.
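Spelled out, the rescue mount I’m asking about would be something like this (assuming /dev/nvme0n1p3 is one of the btrfs member devices and /mnt is an empty mount point; any one member device should do):

# assumption: device node and mount point as described above
mount -t btrfs -o ro,rescue=usebackuproot /dev/nvme0n1p3 /mnt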

BTW, should the repair be done on one partition (the main NVMe one, for instance) or on both?

The kernel messages are as follows (OCRed from a screen photo, so there may be errors). Is it worth submitting this as a kernel bug, since the kernel shouldn’t panic?


[Sat Sep 16 16:42:20 2023] BTRFS info (device nvmeenip3): using ere32c (crc32e-intel) checksum algorithm
 [Sat Sep 16 16:42:20 2023] BTRFS warning (device nvmeonip3): 'recovery is deprecated, use 'rescue-usebackuproot' instead
[Sat Sep 16 16:42:20 2023] BTRFS info (device numeonlp3): trying to use backup root at mount time
[Sat Sep 16 16:42:20 2023] BTRFS info (device nvmeonip3): using free space tree
[Sat Sep 16 16:42:20 2023] BTRFS info (device nvmeenip3): bdev /dev/nvme0n1p3 errs: wr 0, rd 0, flush 6, corrupt 19, gen
[Sat Sep 16 16:42:20 2623] BTRFS info (device nvmeonlp3): enabling ssd optimizations
[Sat Sep 16 16:42:20 2023] BTRFS info (device nvmeonlp3): auto enabling async discard 
[Sat Sep 16 16:42:20 2023] BTRFS info (device nvmeonlp3): start tree-log replay
[Sat Sep 16 16:42:21 2023] BTRFS warning (device nvmeonlp3): checksum verify failed on logical 330481664 mirror 2 wanted 0x0079615b found 8x269dalbb
[Sat Sep 16 16:42:21 2023] BTRFS warning (device nvmeenlp3): checksum verify failed on logical 330481664 mirror 1 wanted ex8879615b found 8x269dalbb lev
[Sat Sep 16 16:42:21 2023] BTRFS: error (device nvme@nlp3) in btrfs_replay_log:2482: errno=-5 10 failure (Failed to recover log tree) 
[Sat Sep 16 16:42:21 2023] I cut here ]----------
[Sat Sep 16 16:42:21 2023] WARNING: CPU: 3 PID: 14207 at fs/btrfs/block-rsv.c:446 btrfs_release_global_block_rsv @xac/@xd0
[Sat Sep 16 16:42:21 2823] Modules linked in: vfat fat uinput .......
[Sat Sep 16 16:42:21 2023] CPU: 3 PID: 14207 Comm; mount Tainted: G 6.2.9-300.fc38.x86_64 #1
[Sat Sep 16 16:42:21 2023] Hardware name: Dell Inc, OptiPlex 7060/0KYJ8C, BIOS 1.21.0 07/06/2022
[Sat Sep 16 16:42:21 2023] RIP: 0010:btrfs_release_global_block_rsv+0xac/exde
[Sat Sep 16 16:42:21 2023] Code: 01 00 00 00 74 as of ob 48 83 bb 50 01 00 00 00 74 as of ob 48 83 bb 58 01 00 00 00 74 a6 of ob 48 83 bb 80 81 00 00 74 a4 <0f> ob 48 83 bb 88 01 00 00 00 74 a2 of ob 48 83 bb b8 01 00 00 00 
[Sat Sep 16 16:42:21 2023] RSP: 0018:ffffba79c48a3aa8 EFLAGS: 00010286

This is almost certainly a failing-drive problem. When the kernel detects that the disk is dying due to I/O errors, it puts the filesystem in read-only mode so you have a chance to copy off any data you need before it fully dies. I would not try to do any more filesystem recovery on it; copy whatever data you haven’t backed up and replace that drive in your system ASAP.

One note: dd doesn’t do any checksum checking, so it’s not surprising that the copy worked; it doesn’t guarantee good data on the other end. If you look at dmesg, I’m assuming you’ll see I/O errors at some point, which is an indicator that all is not well. If you don’t see any I/O errors or messages in dmesg about that drive, then you might be onto something, but this is very likely a hardware problem.


[Sat Sep 16 16:42:20 2023] BTRFS info (device nvmeenip3): bdev /dev/nvme0n1p3 errs: wr 0, rd 0, flush 6, corrupt 19, gen

“flush 6” errors seem like a disk problem, at least going by the Btrfs maintainer’s analysis of a similar problem:

Lots of missed writes and reads, plus some flush errors (ie. failed super block writes)

“corrupt 19”:

Here corrupt means that garbage was read from the disk, which could mean
that the sector was eg. zeroed (like replaced from the internal HDD
pool) or stale data found, or crc mismatch.

Source: https://www.spinics.net/lists/linux-btrfs/msg139799.html

I really don’t think this is hardware.

  1. there are no I/O errors in the logs; “flush 6” was an OCR error (the kernel message actually says wr 0, rd 0, flush 0). The case you quoted had tons of wr/rd errors as well as flush errors. The only errors seen are the checksum errors.

  2. badblocks /dev/sda and badblocks /dev/nvme0n1 read the entire devices just fine

  3. both disks show clean logs and pass self-tests (smartctl -t long and nvme device-self-test -s 1; invocations sketched below)
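For reference, the self-test invocations were roughly the following (flags reconstructed from memory, so treat this as a sketch; device nodes as in point 2):

# long SMART self-test on the SATA SSD, then read the result once it finishes
smartctl -t long /dev/sda
smartctl -l selftest /dev/sda
# short device self-test on the NVMe drive (-s 1)
nvme device-self-test /dev/nvme0n1 -s 1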

I don’t have time to look at the changelogs, but kernel 6.2.9 might be in the vicinity of a tree-log-related bug that sometimes caused this issue. I suggest two things:

btrfs rescue zero-log to remove the bad tree log, so you can boot. And then upgrade the kernel.
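Roughly, and assuming the device node from your dmesg (this is just a sketch; the filesystem must be unmounted, and pointing at any one member device is enough):

# clears the unreplayable tree log for the whole filesystem
btrfs rescue zero-log /dev/nvme0n1p3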

Tree log is only used as an fsync optimization, so replay is not required to make the file system consistent following a crash. But it’s possible up to a minute of writes at the time of the crash will be lost.

For what it’s worth, specifically this sequence of messages:

[Sat Sep 16 16:42:20 2023] BTRFS info (device nvmeonlp3): start tree-log replay
[Sat Sep 16 16:42:21 2023] BTRFS: error (device nvme@nlp3) in btrfs_replay_log:2482: errno=-5 10 failure (Failed to recover log tree)
[Sat Sep 16 16:42:21 2023] BTRFS: error (device nvme@nlp3) in btrfs_replay_log:2482: errno=-5 10 failure (Failed to recover log tree)

is why I’m recommending zeroing the tree log. It’s not indicated in most btrfs issues so it’s not a thing to try in all or most cases. It’s very specific to failure to replay the log.


badblocks is meant for spinning hard drives, not solid-state NVMe drives. smartctl is still valid, but keep in mind that it reports the results from the last time a test was run, which may have been before the drive started failing. The devil is in the details: I’ve seen head-crashed HDDs still show a passing smartctl result because the test was run before the head crash and has refused to run since. If the -t long test actually ran to completion and the drive still shows healthy, you’re not entirely out of the woods; you should take a look at the values and thresholds, especially for recent errors, ECC, corrections, etc.

Also, I would definitely take @chrismurphy’s recommendations when it comes to btrfs. :slight_smile:

[Sat Sep 16 16:42:21 2023] RIP: 0010:btrfs_release_global_block_rsv+0xac/exde

I did a bit of git log digging on that message and the most recent commit touching it is from 2020, which is too long ago for the bug I’m thinking of.

Anyway, I suggest upgrading the kernel to the 6.4 series for sure, since the 6.2 series is EOL anyway. A normal dnf update or GNOME Software will get you the current stable kernel.
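For example, from a terminal it would be something like this (standard Fedora update commands, nothing specific to this issue):

sudo dnf upgrade --refresh
# after a reboot, confirm the running kernel version
uname -r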

If the problem reoccurs, let us know.

I missed this the first and second time around; Scott caught it.

Note that these are just counters. Every time a particular problem is seen, even if it’s the same problem, the counter is incremented. The flush problems are pretty rare and concerning because Btrfs is particularly sensitive to write order being honored correctly.

I can’t tell whether this might be a different kernel bug, in the NVMe or block drivers, or a drive firmware bug. It could be either, so I suggest working on upgrading both. All flash drives have buggy firmware, but sometimes we can work around it in the kernel; flush and FUA bugs are not in that category. So we need to figure out why the flush counter is being incremented, because that’s not good. Maybe one of these NVMe drives is being periodically reset; newer drives sometimes aren’t immediately supported in current kernels, so again, best to update the kernel.

You can zero the counters if you want with btrfs device stats -z, pointing it at either the unmounted device node or, if the fs is mounted, at any of its mount points such as / or /home.
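For example (per the btrfs-device man page, -z prints the current stats and then resets them to zero):

# print current error counters for all member devices of the mounted fs
btrfs device stats /
# print and then reset them
btrfs device stats -z /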


Thanks for your advice, will definitely follow.
Actually, the flush issue is a red herring: since the system was down, I took a photo of the screen and OCRed it with Google Lens. The actual message says wr 0, rd 0, flush 0; the 6 comes from Google Lens misreading the slashed zero. Of course there’s still the matter of 19 corruptions, but those at least could be software-caused.

Ah, and should btrfs rescue zero-log be done on all devices, or just one? The idea with this FS was to combine a small auxiliary SSD and a large NVMe to get a small RAID-1 volume for resiliency, plus a larger remaining working area that is not redundant.

Yes, as you say, it’s confusing because some properties of those different storage-technology generations carry over and others don’t. badblocks at least shows that the data can be read without any kernel-reported I/O errors. I did check smartctl for pending and reallocated sectors and didn’t see any.
You’re right, though, that firmware diagnostics can’t be fully relied on; I have a collection of failed drives whose SMART status sometimes did not reflect the pitiful state of the storage. I was actually the unlucky owner of a drive with such a weird undetected failure, causing creeping bitrot in a RAID filesystem, that it was investigated by EqualLogic and resulted in a worldwide emergency firmware update.

btrfs rescue need only be run on any one of the member devices in a Btrfs file system. The tool will locate all the constituent devices and update them appropriately.

badblocks is not invalid on NVMe as far as I’m aware, but large-volume consumers of SATA and PCIe flash have told us that these devices rarely, if ever, report bad sectors. They just return zeros or garbage when there’s a problem, rather than a discrete uncorrectable read error as is common with spinning hard drives. Hence the benefit of running Btrfs: everything is checksummed, data and metadata. Any data csum mismatch means the corrupt data never makes it to user space, preventing any use or replication of bad data.

It’s possible the source of the corruption is software; Btrfs, like any file system, is not bug-free. But it’s also possible there are firmware bugs (we see a lot of these) and occasionally bad memory. While I don’t recommend using Btrfs as a memory tester, it does often end up revealing bad memory when present; it shows up as transient bit flips (values off by a single bit). For a while now there has been both a read-time and a write-time tree checker that attempts to catch these kinds of problems before they become bigger issues, so it’s normal for the file system to go read-only in such a case, so that any confusion doesn’t get written to disk and cause more problems.


btrfs rescue zero-log makes things better: it allows me to mount -o ro,rescue=usebackuproot /dev/nvme0n1p3 /mnt and see the files, but the filesystem is still damaged and reverts to read-only when I boot into it.

Unfortunately, btrfs scrub start /mnt seems to abort almost immediately (it returns to the prompt, and btrfs scrub status shows Status: aborted; there aren’t any dmesg messages).

I have seen memory errors when running memtester and the BIOS diagnostics. After re-seating the DIMMs the errors seem to be gone, but the filesystem has more damage that wasn’t fixed by zeroing the log:
btrfs check /dev/nvme0n1p3 shows checksum errors (see below).

Is it time to do btrfs check --init-csum-tree?

What is 330481664? It seems to be repeated, so I wonder if I can figure out which file it is associated with and get rid of the checksum errors by deleting that file. I tried btrfs inspect-internal dump-tree -b 330481664 /dev/nvme0n1p3, but this number doesn’t seem to be a valid logical address.
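For reference, btrfs also has an inspect-internal logical-resolve subcommand that maps a logical address to file paths on a mounted filesystem; I’m not sure it applies to a metadata block like this one, but the invocation would be something like:

# run against the ro-mounted filesystem at /mnt; may not resolve for a metadata block
btrfs inspect-internal logical-resolve 330481664 /mnt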

Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p3
UUID: 8c527aa5-e843-465b-ab56-5827851111d1
[1/7] checking root items
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
Csum didn't match
ERROR: failed to repair root items: Input/output error
[2/7] checking extents
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
Csum didn't match
owner ref check failed [330481664 16384]
ref mismatch on [1550045184 12288] extent item 0, found 1
data extent[1550045184, 12288] referencer count mismatch (root 257 owner 247907 offset 0) wanted 0 have 1
backpointer mismatch on [1550045184 12288]
ref mismatch on [1550057472 8192] extent item 0, found 1
data extent[1550057472, 8192] referencer count mismatch (root 256 owner 4103478 offset 0) wanted 0 have 1
backpointer mismatch on [1550057472 8192]
[........]
ref mismatch on [1551249408 4096] extent item 0, found 1
data extent[1551249408, 4096] referencer count mismatch (root 256 owner 4064970 offset 0) wanted 0 have 1
backpointer mismatch on [1551249408 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
Csum didn't match
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
Csum didn't match
Error going to next leaf -5
csum exists for 1547927552-1550389248 but there is no extent record
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
Csum didn't match
Error looking up extent record -5
csum exists for 1550401536-1551089664 but there is no extent record
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
Csum didn't match
Error looking up extent record -5
csum exists for 1551097856-1551847424 but there is no extent record
ERROR: errors found in csum tree
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 165067415552 bytes used, error(s) found
total csum bytes: 116659132
total tree bytes: 1204207616
total fs tree bytes: 978763776
total extent tree bytes: 80347136
btree space waste bytes: 248615130
file data blocks allocated: 336473149440
 referenced 203036381184

As a follow-up, I have a working hypothesis about what happened here: I think the filesystem corruption resulted from faulty memory, due to overly aggressive memory timings. I added third-party memory to this system, and I think I ran the BIOS diagnostics afterwards, but I can’t remember for sure. Either way, the memory became intermittent, with errors every few minutes during tests.

Anyway, after seeing the BTRFS errors and detecting errors while running memtester, I re-ran the BIOS memory diagnostics, which reported memory issues as well. The Dell BIOS printed a message about seeing memory test issues and launched an extended test (about an hour long), which finally reported success. Since the Dell BIOS, unlike many OEM BIOSes, does not have extended memory-timing setup pages, I think it tweaked the memory timings behind the scenes to address the errors, as a result of the tests I ran.

I noticed that dmidecode reports this about my memory now:

  Speed: 3200 MT/s
 ...
   Configured Memory Speed: 2666 MT/s

I can’t say for sure if this changed recently, but I suspect the BIOS derated the speed to fix the errors.

I re-ran the memtester as well, and this time got no errors.

Unfortunately, that didn’t work:

Reinitialize checksum tree
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
checksum verify failed on 330481664 wanted 0x0079b15b found 0x269da1bb
Csum didn't match
ERROR: checksum tree refilling failed: -5

Is there another repair option I should try?

In related news, and apologies for simping, I realized that btrfs scrub start has to be run on a R/W filesystem, so I tried mount -o remount,rw before starting the scrub, but that didn’t work either:

ERROR: scrubbing /mnt failed for device id 1: ret=-1, errno=5 (Input/output error)
scrub canceled for 8c527aa5-e843-465b-ab56-5827851111d1
Scrub started:    Fri Sep 22 23:48:53 2023
Status:           aborted
Duration:         0:00:04
Total to scrub:   2.09GiB
Rate:             535.83MiB/s
Error summary:    csum=2
  Corrected:      0
  Uncorrectable:  2
  Unverified:     0

Only if you’re ready to lose the file system. Take advantage of the read-only mount, extract what you can to freshen your backups, and then you can try the repair. Repair is irreversible and sometimes makes things worse.

What is 330481664—it seems to be repeated so I wonder if I can figure out which file it is associated with,

Btrfs metadata is not in any particular location; it can be anywhere on the drive, so these numbers don’t mean anything by themselves. But in context it sounds like it’s related to the root tree. It could be any node or leaf, and it’s failing checksum verification, therefore it can’t be trusted.

You can try to mount using -o ro,rescue=all, which will skip over bad roots and data csum mismatches. It’s a lot more tolerant of mounting damaged file systems, but because it’s ignoring data csums, it will allow you to copy damaged files out of the file system. So you might embargo such files and verify they are not corrupt, so that you don’t inadvertently start replicating corrupt files into the future as well as into your backups.
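Concretely, something like this (device node and mount point as used earlier in this thread; just a sketch):

# most tolerant read-only mount: skips log replay, bad roots, and data csum checks
mount -o ro,rescue=all /dev/nvme0n1p3 /mnt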

I think that the filesystem corruption resulted from faulty memory, due to an overly aggressive memory timing.

Common problem. Folks don’t realize the risks. I don’t recommend intentionally using Btrfs as a memory tester but it does tend to discover these kinds of problems.

I realized that btrfs scrub start has to be run on a R/W filesystem

You can scrub a read-only file system using btrfs scrub start -Bdr /dev/
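For example, pointed at one of the member devices from this thread (just a sketch; -B runs the scrub in the foreground, -d prints per-device stats, -r makes it read-only):

# read-only scrub of the whole filesystem, addressed via this member device
btrfs scrub start -Bdr /dev/nvme0n1p3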