Any way to repair BTRFS_DEV_STAT_CORRUPTION_ERRS errors?

mettacrawler · October 31, 2022, 11:10am

Background

The top-level issue is that my box gets these ongoing messages in its log:

Oct 31 03:24:39 example.com kernel: BTRFS info (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 192, gen 0

This system has two drives with btrfs on sda2 and sdb2.

Earlier some things that I did:

Screenshot shows an attempt to correct in “single init=/bin/bash” mode with “btrfs check”.

Ultimately correction required translating bytenr values into files. I switched to a LiveCD environment.

In order for btrfs inspect-internal logical-resolve to output the path names of the files which contained blocks that had issues I had to mount the subvols that contained the files that the bytenr values corresponded to. That command will not output any path names if you just mount an arbitrary part of the btrfs. I deleted the files at the ends of the path names which were reported.

I ran btrfs scrub and no errors were reported. I ran a full btrfs balance largely because this downtime provided a good opportunity. It took maybe 3 to 6 hours to complete. Neither of those commands output any error messages on stdout/stderr to the terminal. In the journal there are still errors, see the beginning of this post.

Root cause analysis: the csum values “expected” and “actual” differ only by one bit (see screenshot above). Specifically, 0x97aafd95 and 0x97aafd85 have only one bit which differs. I think there are really RAM issues not SDD ones. This computer’s mobo has a long history of RAM problems.

The kernel documentation and code suggests that BTRFS_DEV_STAT_CORRUPTION_ERRS is not a direct I/O failure as in “didn’t get EIO” from a driver.

Is there some way to repair the “corrupt 192” situation?

Oct 31 03:24:39 example.com kernel: BTRFS info (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 192, gen 0

Kernel Code Details

https://btrfs.wiki.kernel.org/index.php/Data_Structures

The codes below indicate an indirect I/O failure:
BTRFS_DEV_STAT_CORRUPTION_ERRS 0x4
checksum error, bytenr error or contents is illegal: this is an indication that the block was damaged during read or write, or written to wrong location or read from wrong location

Functions that increment the BTRFS_DEV_STAT_CORRUPTION_ERRS value:

linux/fs/btrfs/volumes.c

void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
{
        btrfs_dev_stat_inc(dev, index);

        if (!dev->dev_stats_valid)
                return;
        btrfs_err_rl_in_rcu(dev->fs_info,
                "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
                           rcu_str_deref(dev->name),
                           btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
                           btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
                           btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
                           btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS),
                           btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_GENERATION_ERRS));
}

linux/fs/btrfs/volumes.h

static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
                                      int index)
{
        atomic_inc(dev->dev_stat_values + index);
        /*
         * This memory barrier orders stores updating statistics before stores
         * updating dev_stats_ccnt.
         *
         * It pairs with smp_rmb() in btrfs_run_dev_stats().
         */
        smp_mb__before_atomic();
        atomic_inc(&dev->dev_stats_ccnt);
}

computersavvy · October 31, 2022, 4:51pm

It has been my understanding that mirroring with btrfs may have issues. If so then that could have caused the errors.

mettacrawler · October 31, 2022, 7:32pm

I don’t see this issue on any of my other computers that have btrfs RAID-1.

mettacrawler · December 5, 2022, 6:16pm

I think the log messages about “corrupt 192” were significant.

Oct 31 03:24:39 example.com kernel: BTRFS info (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 192, gen 0

The file system got to the point were it would not mount read-write. I was using SyncThing and there were no signs that corruption was synced to the other nodes. I rebuilt but not on btrfs this time. I also ordered parts that will allow me to upgrade this computer to use ECC RAM.

Topic		Replies	Views
BTRFS drive logging csum failed errors, time to replace? Ask Fedora	2	2632	May 12, 2021
Is BTRFS recovery possible? Ask Fedora f35	2	1314	July 26, 2023
Game installs are corrupting BTRFS on Fedora 40 Ask Fedora btrfs , gaming , steam , f40	13	715	May 24, 2024
BTRFS error on boot Fedora 40 Ask Fedora btrfs , workstation	14	1484	June 13, 2024
Boot fails: btrfs errno=-17 Object already exists Ask Fedora intel	12	2247	December 20, 2022

Any way to repair BTRFS_DEV_STAT_CORRUPTION_ERRS errors?

Related topics