How to test a disk?

Continuing the discussion from Boot freezes after starting Grub successfully | btrfs filesystem errors:

Do you have a recommendation on how to test the disk after reformatting and before installation?

I use f3, which is in the Fedora repos. Short how-to:

Optional: I start with blkdiscard on the whole device, e.g. /dev/sda or /dev/nvme0n1, to totally wipe it. No file systems, no partition map. Nothing. This results in complete data loss. I also run this command again at the end of testing so the device is erased before reprovisioning/reinstallation.

  1. Format the device (partitioning is not necessary) with any file system. Btrfs, ext[234], xfs, f2fs, fat, makes no difference.
  2. Mount it normally, e.g. /mnt
  3. f3write /mnt
  4. f3read /mnt
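
The steps above could be scripted roughly as follows. This is only a sketch: DEV is a placeholder device name, mkfs.ext4 is just one example of "any file system", and the run helper is made up for illustration. By default it only prints the commands; it executes them only when RUN=1 is set. f3write and f3read are the binaries the f3 package installs.

```shell
#!/bin/sh
# Sketch: print (or, with RUN=1, execute) the f3 test sequence for one device.
# WARNING: with RUN=1 everything on DEV is destroyed.
DEV="${DEV:-/dev/sdX}"   # placeholder, replace with the real device
MNT="${MNT:-/mnt}"

run() {
    # Hypothetical helper: echo the command, execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

f3_plan() {
    # Each step runs only if the previous one succeeded.
    run blkdiscard "$DEV" &&
    run mkfs.ext4 -F "$DEV" &&
    run mount "$DEV" "$MNT" &&
    run f3write "$MNT" &&
    run f3read "$MNT" &&
    run umount "$MNT" &&
    run blkdiscard "$DEV"
}

f3_plan
```

Run it once as-is to review the printed commands, then as root with e.g. RUN=1 DEV=/dev/sda once you are sure the device is the right one.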

That’s it. The man page has much more info on how to read the results, the rationale behind f3, and the specific issues it’s designed to detect.


SMART tests different things and it’s a bit of a black box. But I think it’s OK to do a smartctl -t short test, since it should test things that file systems can’t. The long test is primarily a read-sectors test, and I think we’re better off writing patterns over the entire block device and reading them back to verify with F/OSS tools than relying entirely on the firmware.

The SMART test can be done anytime; the order doesn’t matter. It isn’t affected by the steps above, nor will it affect them.

Use smartctl -x or smartctl -a to show the self-test log with the results/status, e.g.

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Completed without error                  82            -     -   -   -    -

If there are errors with any tests, replace the drive - hopefully under warranty.
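
As a rough sketch, the SMART part could look like this. The selftest_ok helper is hypothetical, and its pattern only matches the NVMe log layout shown in the sample above; ATA drives format their self-test log differently, so treat the pattern as an assumption and read the log yourself when in doubt.

```shell
#!/bin/sh
# Sketch: check the newest entry of an NVMe self-test log.
DEV="${DEV:-/dev/nvme0n1}"   # placeholder device name

selftest_ok() {
    # Reads smartctl output on stdin; succeeds only if entry 0 is a short
    # test that reports "Completed without error".
    grep -E '^ *0 +Short' | grep -q 'Completed without error'
}

# Typical use (as root):
#   smartctl -t short "$DEV"      # start the short self-test
#   smartctl -x "$DEV" | selftest_ok && echo "short self-test passed"
```

The test runs in the drive firmware, so wait until the log no longer shows a test in progress before checking the result.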


Advanced

I’m aware some flash devices may have an opportunistic dedup of identical blocks during concurrent writes. See Btrfs on dm-crypt, improving chance of self-healing.

Could a user space test tool like f3 be fooled into thinking a device is more reliable than it is because of this? I suppose it’s possible, I don’t know enough about the prevalence of such dedup or to what degree the patterns used by f3 could be deduped.

But there is a simple way to sidestep the question and just thwart it. The XTS component of aes-xts-plain64 ensures that identical plaintext blocks written to different locations become non-identical ciphertext blocks. Therefore the flash firmware can’t dedup them.

See the cryptsetup FAQ (FrequentlyAskedQuestions · Wiki · cryptsetup / cryptsetup · GitLab) for details. This comes from question 2.19, "How can I wipe a device with crypto-grade randomness?" We can repurpose it so that the steps are now:

  1. cryptsetup open --type plain -d /dev/urandom /dev/sda target
  2. Format the dm-crypt device, /dev/mapper/target
  3. Mount /dev/mapper/target normally to /mnt
  4. f3write /mnt
  5. f3read /mnt

When you’re finished testing, unmount /mnt and then run cryptsetup close target. This will drop the unsaved random key.

NOTE: writing to the device via dm-crypt and dropping the key at the end means complete data loss on the drive.
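
The dm-crypt variant could be sketched the same way. Again DEV is a placeholder, mkfs.ext4 is an arbitrary choice of file system, and the run helper is made up for illustration; nothing is executed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: print (or, with RUN=1, execute) the f3 test through a throwaway
# plain dm-crypt mapping. WARNING: with RUN=1 everything on DEV is destroyed.
DEV="${DEV:-/dev/sdX}"     # placeholder, replace with the real device
NAME="${NAME:-target}"     # mapping name, as in the steps above
MNT="${MNT:-/mnt}"

run() {
    # Hypothetical helper: echo the command, execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

crypt_plan() {
    # Each step runs only if the previous one succeeded.
    run cryptsetup open --type plain -d /dev/urandom "$DEV" "$NAME" &&
    run mkfs.ext4 "/dev/mapper/$NAME" &&
    run mount "/dev/mapper/$NAME" "$MNT" &&
    run f3write "$MNT" &&
    run f3read "$MNT" &&
    run umount "$MNT" &&
    run cryptsetup close "$NAME"
}

crypt_plan
```

The umount comes before cryptsetup close because the mapping cannot be removed while the file system on it is still mounted.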