BTRFS drive logging csum failed errors, time to replace?

I’m seeing intermittent btrfs ‘csum failed’ errors in the journal:

kernel: BTRFS warning (device dm-0): csum failed root 256 ino 63694 off 983040 csum 0xa959d91f expected csum 0xb864aaaf mirror 1
kernel: BTRFS error (device dm-0): bdev /dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a errs: wr 0, rd 0, flush 0, corrupt 28, gen 0
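
For anyone wanting to track which files are hit, the warning line above already carries the subvolume root, inode number and byte offset. Here is a minimal sketch in Python that pulls those fields out. The regular expression is modelled only on the one line quoted above, so treat it as an assumption that other kernel versions format the message the same way; `parse_csum_failure` is a made-up helper name.

```python
import re

# Pattern modelled on the single warning line quoted above; other kernel
# versions may word the message differently (assumption).
CSUM_RE = re.compile(
    r"csum failed root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<found>0x[0-9a-f]+) expected csum (?P<expected>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def parse_csum_failure(line):
    """Return the fields of a 'csum failed' line as a dict, or None."""
    match = CSUM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields become ints; the checksums stay as hex strings.
    for key in ("root", "ino", "off", "mirror"):
        fields[key] = int(fields[key])
    return fields


line = ("kernel: BTRFS warning (device dm-0): csum failed root 256 ino 63694 "
        "off 983040 csum 0xa959d91f expected csum 0xb864aaaf mirror 1")
info = parse_csum_failure(line)
print(info["root"], info["ino"], info["off"])  # 256 63694 983040
```

The inode number can then be mapped back to a path with `btrfs inspect-internal inode-resolve`, run against a path inside the affected subvolume, which is one way to check whether the same files keep turning up.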

As you can see above, nearly 30 corruption errors have been logged since I installed Fedora 24 hours ago:

+> sudo btrfs device stats /dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a 
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].write_io_errs    0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].read_io_errs     0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].flush_io_errs    0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].corruption_errs  29
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].generation_errs  0
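
To watch whether the counters keep climbing, the stats output can be turned into numbers and compared between runs. This is only a sketch: it assumes the `[device].counter value` layout shown above, and `parse_device_stats` is a hypothetical helper, not part of btrfs-progs.

```python
def parse_device_stats(text):
    """Map counter name -> int from `btrfs device stats` style output."""
    counters = {}
    for raw in text.splitlines():
        parts = raw.split()
        # Expect "[device].counter_name  value"; skip prompts and blank lines.
        if len(parts) != 2 or not parts[0].startswith("["):
            continue
        if "]." not in parts[0] or not parts[1].isdigit():
            continue
        name = parts[0].rsplit("].", 1)[1]
        counters[name] = int(parts[1])
    return counters


sample = """\
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].write_io_errs    0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].read_io_errs     0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].flush_io_errs    0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].corruption_errs  29
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].generation_errs  0
"""
counters = parse_device_stats(sample)
nonzero = {name: value for name, value in counters.items() if value}
print(nonzero)  # {'corruption_errs': 29}
```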

I tried smartctl -A but it doesn’t report any problems:

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    28,969,210 [14.8 TB]
Data Units Written:                 27,340,668 [13.9 TB]
Host Read Commands:                 152,317,320
Host Write Commands:                222,384,590
Controller Busy Time:               1,011
Power Cycles:                       1,525
Power On Hours:                     729
Unsafe Shutdowns:                   48
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               34 Celsius

I tried btrfs scrub and today it reports ok:

> sudo btrfs scrub start -rB /dev/mapper/luks-...
Scrub started:    Tue May 11 12:30:46 2021
Status:           finished
Duration:         0:00:42
Total to scrub:   65.02GiB
Rate:             1.51GiB/s
Error summary:    no errors found

But yesterday the same command reported differently:

Scrub started:    Mon May 10 14:03:54 2021
Status:           finished
Duration:         0:00:36
Total to scrub:   63.02GiB
Rate:             1.73GiB/s
Error summary:    csum=1
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

So, from what I can tell, these checksum failures come and go and strike files at random.

Is now the time to replace this drive and is there anything else I can do to investigate before I give up on it?

From what I understand there is no way to tell btrfs to avoid badblocks even if I could identify some (btrfs wiki).


I had a similar problem on one of my hosts a few months ago. In my case, the csum errors did not spontaneously go away, so I am puzzled that they do in yours.

I restored the filesystem from my (good) backups, but the csum problem remained. I recreated the disk partition and btrfs filesystem, restored, and the problem still did not go away. Finally, I recreated the disk partition and overwrote it with random data, then made a new btrfs filesystem and restored the data. The csum errors were gone, and I’m still using it without problems.

So, I do not think it is time yet to consider replacing your drive.


Thanks @blueshurricane4 for sharing your experience. Yesterday I kept trying to spot a pattern in the failures. Then by chance I saw a bit flip in a file, and when I checked the journal there was NO btrfs issue logged. That made me think this must be something other than a filesystem or storage problem.

I booted to Fedora live, ran Memtest86+, and it found about 50 memory errors in a specific (high) region of RAM. I reran the test and it found the same errors in the same region. I then ran stress -c to check whether the CPU was OK when tested in isolation, and there were no errors there.

So I’m pretty confident now that I’ve got a busted RAM chip or two. The region is high enough that it’s mostly used by cache rather than programs, which I think is why the system is mostly stable except when I’m doing lots of file copying - which is how I spotted the bit flip. In the interim I’m booting the kernel with mem=14G to exclude the dodgy region, and so far so good.
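
If you want to confirm that a damaged copy really differs by a single flipped bit, comparing it against a known-good copy byte by byte is enough. A rough sketch, with `flipped_bits` as a made-up helper name and a toy in-memory buffer standing in for the two files:

```python
def flipped_bits(original, copy):
    """Return (byte_offset, bit_index) for every bit that differs."""
    if len(original) != len(copy):
        raise ValueError("buffers differ in length")
    flips = []
    for offset, (a, b) in enumerate(zip(original, copy)):
        diff = a ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((offset, bit))
    return flips


good = b"hello world"
bad = bytearray(good)
bad[4] ^= 0x08  # simulate one flipped bit, as faulty RAM might produce
print(flipped_bits(good, bytes(bad)))  # [(4, 3)]
```

A single isolated entry in the result points towards a bit flip rather than a torn or truncated write, which would normally damage a whole run of bytes.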

I think it goes to show the kind of checksum work that BTRFS does can highlight all sorts of hardware issues early!
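
As a small illustration of why a per-block checksum catches this kind of fault, the sketch below flips every bit of a buffer in turn and checks that the checksum changes each time. Note the assumption: it uses `zlib.crc32` from the Python standard library, whereas btrfs defaults to crc32c, a different polynomial, so this shows the principle rather than the filesystem's actual checksum.

```python
import zlib


def detects_every_single_bit_flip(data):
    """True if flipping any one bit of data changes its CRC-32."""
    reference = zlib.crc32(data)
    for offset in range(len(data)):
        for bit in range(8):
            damaged = bytearray(data)
            damaged[offset] ^= 1 << bit  # flip exactly one bit
            if zlib.crc32(bytes(damaged)) == reference:
                return False
    return True


print(detects_every_single_bit_flip(b"btrfs data block"))  # True
```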
