System freezes; suspect btrfs transid verify errors are the cause

Hi all,

I recently tried to copy a file over to an application and my whole system froze. (Edit: the freeze turned out to be unrelated; it was due to an actual kernel bug that has since been fixed.) I’m running a Bazzite/Silverblue machine, so I’m also using bootc with my own container. I’m typing this from the system and can access it normally, but I haven’t figured out why it froze. Edit: what is causing these “parent transid verify failed” and similar errors?

Figured I’d check the GNOME logs and noticed some btrfs-related messages after boot.

btrfs messages
17:42:25 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 740589568 mirror 1 wanted 1183 found 2551
17:42:25 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 740589568 mirror 1 wanted 1183 found 2551
17:42:25 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 740589568 mirror 2 wanted 1183 found 2551
17:41:33 kernel: BTRFS error (device dm-3): unable to fixup (regular) error at logical 123462418432 on dev /dev/dm-3 physical 122380288000
17:38:51 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 740589568 mirror 1 wanted 1183 found 2551
17:38:51 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 740589568 mirror 2 wanted 1183 found 2551
17:37:59 kernel: BTRFS error (device dm-3): unable to fixup (regular) error at logical 123462418432 on dev /dev/dm-3 physical 122380288000
17:36:30 kernel: BTRFS error (device dm-3): fixed up error at logical 1202755403776 on dev /dev/dm-2 physical 13020102656
17:36:30 kernel: BTRFS error (device dm-3): fixed up error at logical 1202755403776 on dev /dev/dm-2 physical 13020102656
17:36:29 kernel: BTRFS error (device dm-3): fixed up error at logical 1198084718592 on dev /dev/dm-2 physical 12644384768
17:36:29 kernel: BTRFS error (device dm-3): fixed up error at logical 1198084718592 on dev /dev/dm-2 physical 12644384768
17:36:29 kernel: BTRFS error (device dm-3): fixed up error at logical 1198084718592 on dev /dev/dm-2 physical 12644384768
17:36:29 kernel: BTRFS error (device dm-3): fixed up error at logical 1198084718592 on dev /dev/dm-2 physical 12644384768
17:36:28 kernel: BTRFS error (device dm-3): fixed up error at logical 1166903738368 on dev /dev/dm-2 physical 11528175616
17:36:28 kernel: BTRFS error (device dm-3): fixed up error at logical 1166903738368 on dev /dev/dm-2 physical 11528175616
17:36:28 kernel: BTRFS error (device dm-3): fixed up error at logical 1166903738368 on dev /dev/dm-2 physical 11528175616
17:36:28 kernel: BTRFS error (device dm-3): fixed up error at logical 1166903738368 on dev /dev/dm-2 physical 11528175616
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329291776 on dev /dev/dm-2 physical 5211291648
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329291776 on dev /dev/dm-2 physical 5211291648
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329488384 on dev /dev/dm-2 physical 5211488256
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329488384 on dev /dev/dm-2 physical 5211488256
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329488384 on dev /dev/dm-2 physical 5211488256
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329488384 on dev /dev/dm-2 physical 5211488256
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329881600 on dev /dev/dm-2 physical 5211881472
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329881600 on dev /dev/dm-2 physical 5211881472
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329881600 on dev /dev/dm-2 physical 5211881472
17:36:22 kernel: BTRFS error (device dm-3): fixed up error at logical 676329881600 on dev /dev/dm-2 physical 5211881472
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256814415872 on dev /dev/dm-2 physical 160759808
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256814415872 on dev /dev/dm-2 physical 160759808
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:36:16 kernel: BTRFS error (device dm-3): fixed up error at logical 256784334848 on dev /dev/dm-2 physical 130678784
17:08:54 kernel: BTRFS error (device dm-3): level verify failed on logical 1250673917952 mirror 2 wanted 0 found 1
17:08:54 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 1250673737728 mirror 2 wanted 133346 found 133001
17:08:54 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 1250673704960 mirror 2 wanted 133346 found 133001
17:08:54 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 1250673016832 mirror 2 wanted 133346 found 133001
17:08:54 kernel: BTRFS error (device dm-3): parent transid verify failed on logical 1250671280128 mirror 2 wanted 133346 found 133001

Here is the latest output of dmesg:

[ 1770.686229] BTRFS error (device dm-3): unable to fixup (regular) error at logical 123462418432 on dev /dev/dm-3 physical 122380288000
[ 1770.687652] BTRFS warning (device dm-3): checksum error at logical 123462418432 on dev /dev/dm-3, physical 122380288000, root 257, inode 18573628, offset 0, length 4096, links 1 (path: ostree/deploy/default/var/lib/containers/storage/overlay/d419170eda1698673bca47e8441bc9cbc736e453f8cdf94e58399f5f2919e8a2/diff/usr/lib64/libyaml-0.so.2.0.9)
[ 1822.235902] BTRFS error (device dm-3): parent transid verify failed on logical 740589568 mirror 2 wanted 1183 found 2551
[ 1822.237474] BTRFS error (device dm-3): parent transid verify failed on logical 740589568 mirror 1 wanted 1183 found 2551
[ 1822.240802] BTRFS info (device dm-3): scrub: not finished on devid 1 with status: -5

I have run btrfs scrub and a read-only btrfs check. I’ve also prepared a live ISO, but since I still don’t know exactly where to look after a lot of digging, I’m asking here before I do anything stupid like running --repair without thinking it through.

It would be best to have the entire unfiltered dmesg. The fixup messages suggest one of the two copies of metadata was corrupted for some reason, and Btrfs is detecting this corruption and replacing the bad copy with a good copy.

The inability to fix could mean that both copies of the metadata are corrupt, or that the corruption hit a data block rather than a metadata block - data blocks have only one copy unless you’re using some form of RAID 1 or higher.

The question is why is this happening? Is it related to the del_list corruption messages? Or is it some kind of slow death of the storage? I’m not sure based on the available information.

Here is the dmesg for the day the issue occurred. I ran journalctl -k -b -21 > ~/journalctl-feb27.log

The pastebin expires after 1 day; not sure if there is a better way to share the entire log?
https://paste.centos.org/view/d27e2882


feb 27 13:40:40 bazzite kernel: BTRFS: device label fedora_fedora devid 5 transid 147150 /dev/dm-0 (253:0) scanned by (udev-worker) (1058)
feb 27 13:40:42 bazzite kernel: BTRFS: device label fedora_fedora devid 1 transid 147150 /dev/dm-1 (253:1) scanned by (udev-worker) (1239)
feb 27 13:40:43 bazzite kernel: BTRFS: device label fedora_fedora devid 2 transid 147150 /dev/dm-2 (253:2) scanned by (udev-worker) (1239)
feb 27 13:40:45 bazzite kernel: BTRFS: device label fedora_fedora devid 4 transid 147150 /dev/dm-3 (253:3) scanned by (udev-worker) (1058)

OK, so this is a 4-device Btrfs file system. Also note that devid 2 is dm-2, which we’ll need to know later.

feb 27 13:40:45 bazzite kernel: BTRFS info (device dm-1): first mount of filesystem ac504488-ad0a-4cf8-a973-6acb4938ccb2
feb 27 13:40:45 bazzite kernel: BTRFS info (device dm-1): using crc32c (crc32c-intel) checksum algorithm
feb 27 13:40:45 bazzite kernel: BTRFS info (device dm-1): using free-space-tree
feb 27 13:40:45 bazzite kernel: BTRFS info (device dm-1): bdev /dev/dm-1 errs: wr 0, rd 0, flush 0, corrupt 637, gen 0
feb 27 13:40:45 bazzite kernel: BTRFS info (device dm-1): bdev /dev/dm-2 errs: wr 6828281, rd 642604, flush 610866, corrupt 0, gen 0

dm-1 has checksum mismatches (corruption). These values are just rudimentary counters: every time such a problem is encountered, the counter is incremented by 1. We don’t know whether it’s 637 separate corruptions encountered once each, or the same corruption encountered 637 times.

Same for the write, read, and flush errors on dm-2. That’s a lot of errors and suggests a device problem of some kind.

The dmesg doesn’t show enough info to figure out which physical devices map to dm-1 and dm-2; that should be in journalctl without the -k (i.e. not restricted to kernel messages). But from the dmesg it seems possible that some of these are USB devices and one is NVMe. I don’t see device errors in this dmesg, but they might be in a previous boot’s log (journalctl -b X -k) and give a clue about what’s going on.
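If it helps, the dm-to-physical mapping can also be checked directly; these are generic commands, nothing specific to this setup:

# Show the device tree, i.e. which disks/partitions sit under each dm-X mapping
lsblk -o NAME,KNAME,TYPE,SIZE,MODEL

# Or ask device-mapper which block devices each mapping depends on
sudo dmsetup deps -o devname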

I’ve seen quite a lot of drives in USB enclosures not behaving well unless connected to an externally powered USB hub; the built-in hub on the host bus is often inadequate. In some of these cases I’ve also had success disabling UAS for the USB devices instead. They run a bit slower as a result, but they become more stable.
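For reference, disabling UAS is usually done with a usb-storage quirk; the 152d:0578 vendor:product ID below is just a placeholder, use whatever lsusb reports for your enclosure:

# Find the vendor:product ID of the USB enclosure
lsusb

# Ignore UAS for that ID (the "u" quirk flag), either on the kernel command line:
#   usb-storage.quirks=152d:0578:u
# or via a modprobe option (may need an initramfs regeneration to take effect early):
echo "options usb-storage quirks=152d:0578:u" | sudo tee /etc/modprobe.d/disable-uas.conf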

We also need sudo btrfs fi us $MNT to see the full breakdown of the block groups and profiles.
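That’s the abbreviated form; spelled out, assuming $MNT is wherever the Btrfs filesystem is mounted (probably / here):

# Per-device and per-profile breakdown of data/metadata/system block groups
sudo btrfs filesystem usage $MNT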

feb 28 01:18:32 sabre kernel: BTRFS error (device dm-1): parent transid verify failed on logical 466212061184 mirror 2 wanted 134644 found 134526
feb 28 01:18:32 sabre kernel: BTRFS info (device dm-1): read error corrected: ino 0 off 466212061184 (dev /dev/dm-2 sector 4543424)

mirror 2 is dm-2 (from the earlier snippet, at mount time). The transid failure is on one device and is being fixed up from some other source. I’m confused by the reference to ino 0; I’m pretty sure inode 0 is reserved, but I will check on that.

All similar messages like this are passive checksum checks, failures, and corrections from another mirror with a good copy.

My suggestion is to do a scrub.

sudo btrfs scrub start $MNT runs in the background; use sudo btrfs scrub status $MNT to check on it (can be combined with watch). Or use -Bd to run it in the foreground and print per-device stats at the end of the scrub.
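Put together, something like this, with $MNT being your Btrfs mount point:

# Start the scrub in the background
sudo btrfs scrub start $MNT

# Poll progress every 30 seconds
watch -n 30 sudo btrfs scrub status $MNT

# Or instead: run in the foreground (-B) and print per-device stats (-d) when done
sudo btrfs scrub start -Bd $MNT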

OPTIONALLY, you can zero the stats counters with btrfs device stats --reset $MNT before or after the scrub. The idea is to make it easier to notice whether errors are still being detected. The messages say the problems are detected and fixed. Are they really? Why are they still present? Is this a transient problem?
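For example, again assuming $MNT is the mount point:

# Show the per-device error counters (write / read / flush / corruption / generation)
sudo btrfs device stats $MNT

# Zero the counters so any newly detected errors stand out
sudo btrfs device stats --reset $MNT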

Flash drives tend to fail by transiently returning zeros or garbage. So this sounds like early flash drive death. Eventually it’ll just die without further warning.

Some drives can fail by going read-only, and they don’t report that they’ve done so. The kernel and file system still treat the drive as read-write: the writes go to the drive, the drive does not report any error, but the writes are NOT persistent. The only way you find out is when you get piles of read errors, because what’s on the drive is not what Btrfs is expecting, which most often gets reported as transid errors. The metadata being read is valid (it passes the checksum) but it has the wrong transid (wrong transaction generation).

Anyway, this is going to require some isolation to find out which drive is giving you a hard time. And then replace it.

I recommend using btrfs replace for this, not btrfs device add followed by btrfs device remove. The reason is that replace creates a kind of virtual raid1/mirror between the old and replacement devices and does a scrub between them, forcing replication from one to the other. It’s faster than add/remove, and it’s also more power-fail/crash safe than add/remove.
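As a sketch, assuming devid 2 (dm-2, the device racking up errors) turns out to be the culprit and /dev/sdX is the new drive - both placeholders:

# Replace the old device (referenced by devid) with the new one; runs in the background
sudo btrfs replace start 2 /dev/sdX $MNT

# Check on progress
sudo btrfs replace status $MNT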

NOTE that btrfs replace assumes the new block device is as big as or bigger than the old one. Also, replace does not automatically resize the file system on the replacement device; depending on the block group profile, that may be a good thing, or you might want to resize that one device to use all of the available space. Btrfs is super tolerant of different-sized devices, but it can get idiosyncratic as devices get close to full. So it’s a set of tradeoffs, per usual.
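If you do want the replacement device to contribute its full capacity, the resize takes a devid:amount argument; devid 2 below is again a placeholder for whichever device you replaced:

# Grow device 2's slice of the filesystem to the maximum available space
sudo btrfs filesystem resize 2:max $MNT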
