I’m seeing intermittent btrfs ‘csum failed’ errors in the journal:
kernel: BTRFS warning (device dm-0): csum failed root 256 ino 63694 off 983040 csum 0xa959d91f expected csum 0xb864aaaf mirror 1
kernel: BTRFS error (device dm-0): bdev /dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a errs: wr 0, rd 0, flush 0, corrupt 28, gen 0
As you can see above, nearly 30 corruption errors have already been logged since I installed Fedora 24 hours ago:
> sudo btrfs device stats /dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].write_io_errs 0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].read_io_errs 0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].flush_io_errs 0
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].corruption_errs 29
[/dev/mapper/luks-3bd851cd-7ec1-4522-bda8-07d435ed116a].generation_errs 0
I tried smartctl -A, but it doesn’t seem to think there are any problems:
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 29 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 2%
Data Units Read: 28,969,210 [14.8 TB]
Data Units Written: 27,340,668 [13.9 TB]
Host Read Commands: 152,317,320
Host Write Commands: 222,384,590
Controller Busy Time: 1,011
Power Cycles: 1,525
Power On Hours: 729
Unsafe Shutdowns: 48
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 29 Celsius
Temperature Sensor 2: 34 Celsius
I tried btrfs scrub, and today it reports no errors:
> sudo btrfs scrub start -rB /dev/mapper/luks-...
Scrub started: Tue May 11 12:30:46 2021
Status: finished
Duration: 0:00:42
Total to scrub: 65.02GiB
Rate: 1.51GiB/s
Error summary: no errors found
But yesterday the same command reported differently:
Scrub started: Mon May 10 14:03:54 2021
Status: finished
Duration: 0:00:36
Total to scrub: 63.02GiB
Rate: 1.73GiB/s
Error summary: csum=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
So from what I can tell, these checksum failures come and go, and they seem to strike files at random.
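To check whether the failures really hit random files, the ‘csum failed’ lines could be tallied per inode. This is only a minimal sketch, assuming the journal lines keep the exact format shown above; the function name tally_csum_failures is made up for illustration and is not part of any btrfs tooling.

```python
import re
from collections import Counter

# Matches the "csum failed" warning format from the journal excerpt above.
# Assumption: other kernel versions may word this line differently.
CSUM_RE = re.compile(
    r"csum failed root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<got>0x[0-9a-f]+) expected csum (?P<want>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def tally_csum_failures(lines):
    """Count csum failures per (root, inode) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = CSUM_RE.search(line)
        if match:
            key = (int(match["root"]), int(match["ino"]))
            counts[key] += 1
    return counts


def format_report(counts):
    """Render the tally as text lines, most frequently hit inodes first."""
    return [
        f"root {root} ino {ino}: {n} failure(s)"
        for (root, ino), n in counts.most_common()
    ]
```

Feeding it the kernel log lines (for example, the output of journalctl -k saved to a file and read line by line) would show whether the same few inodes keep reappearing or the hits are spread out. An inode that shows up repeatedly could then be mapped to a path with btrfs inspect-internal inode-resolve, run against the mounted subvolume.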
Is now the time to replace this drive, and is there anything else I can do to investigate before I give up on it?
From what I understand, there is no way to tell btrfs to avoid bad blocks even if I could identify some (btrfs wiki).