Continuing the discussion from I want to change a btrfs LUKS encrypted partition to ext4:
These days Btrfs defaults to the DUP profile for metadata, i.e. two copies of the file system metadata. Originally the default was the single profile when on an SSD. Why?
Part of the rationale for that choice at the time was awareness that many SSDs, even cheap ones, do an opportunistic dedup during concurrent writes. That is, if the SSD firmware detects the exact same data in two (and maybe more) blocks of a concurrent write, it can very cheaply dedup them and gain a storage and performance advantage. The second copy then buys nothing.
We do see evidence of this from time to time. Btrfs will detect corruption, as designed, but the corruption identically affects both copies of the metadata, and then the file system tends to go read-only to protect the file system from more confusion. Of course, a suddenly read-only root file system will confuse user space. And users.
There are explanations other than opportunistic dedup during concurrent writes, but dedup is the more likely one, because things like the write time tree checker make the alternatives less likely.
OK so what about encryption? Well, the XTS mode used for full disk encryption means identical plaintext blocks written to different locations become non-identical ciphertext blocks. And the SSD only sees the ciphertext blocks. Therefore it can’t dedup them.
This means some portion of otherwise uncorrectable corruptions become correctable. Btrfs can self-heal if only one of the copies is corrupted. Btrfs detects it, complains loudly and repeatedly even for a single detection, automatically finds the good copy, uses that, and fixes the bad copy. All of that gets logged to dmesg.
So in this example, Btrfs can be more reliable if on dm-crypt and if the metadata profile is DUP.
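Besides dmesg, the per-device error counters are another place to look for detected corruption. A minimal sketch, assuming the file system is mounted at / and the commands are run as root. The helper name nonzero_stats is made up for illustration; it only assumes that each line of btrfs device stats ends in a numeric counter value:

```shell
# Hypothetical helper: reads `btrfs device stats` output on stdin and
# prints only the lines whose counter (the last field) is not zero.
nonzero_stats() {
    awk '$NF ~ /^[0-9]+$/ && $NF != 0 { print }'
}

# Typical use (needs root and a mounted Btrfs file system):
#   btrfs device stats / | nonzero_stats
#   dmesg | grep -i btrfs
```

No output from the helper means all counters are zero. The counters are cumulative, so a non-zero corruption_errs value says something was detected at some point, not that it is still broken.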
To check what the metadata profile is:
# btrfs fi us /
...[snip]...
Data,single: Size:35.00GiB, Used:33.83GiB (96.66%)
/dev/mapper/luks-10d575f0... 35.00GiB
Metadata,single: Size:2.00GiB, Used:569.75MiB (27.82%)
/dev/mapper/luks-10d575f0... 2.00GiB
System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
/dev/mapper/luks-10d575f0... 32.00MiB
In this example, the data, metadata, and system block groups all use the single profile. One copy.
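If you only want the profiles and not the whole report, the usage output can be filtered. This is just a sketch: block_group_profiles is a hypothetical helper name, and it assumes the block group lines look like the ones above (type, comma, profile, colon):

```shell
# Hypothetical helper: reads `btrfs filesystem usage` output on stdin and
# prints one "Type=profile" line per block group type.
block_group_profiles() {
    awk -F'[,:]' '/^(Data|Metadata|System),/ { print $1 "=" $2 }'
}

# Typical use (needs root and a mounted Btrfs file system):
#   btrfs filesystem usage / | block_group_profiles
```

For the output shown above this would print Data=single, Metadata=single and System=single, one per line.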
It can be converted while mounted to DUP:
# btrfs balance start -mconvert=dup /
# btrfs fi us /
...[snip]...
Data,single: Size:35.00GiB, Used:33.83GiB (96.66%)
/dev/mapper/luks-10d575f0... 35.00GiB
Metadata,DUP: Size:2.00GiB, Used:569.75MiB (27.82%)
/dev/mapper/luks-10d575f0... 4.00GiB
System,DUP: Size:32.00MiB, Used:16.00KiB (0.05%)
/dev/mapper/luks-10d575f0... 64.00MiB
The m means metadata and includes the system block group, so -mconvert restricts the conversion to the metadata and system block groups only.
Done.
Now there are two copies of the file system metadata. Data is still single profile.
This does not protect against significant failure of the SSD, but it can help with transient problems and silent corruption isolated to just one copy of the metadata, or even one block. These are typical pre-failure behaviors of SSDs, in which they just start returning zeros or garbage without any other device error, and with no indication of a problem reported by smartctl.
Does this help folks who are not using LUKS/dm-crypt? I don’t know how common this undocumented and unmarketed dedup feature is in cheap SSDs. We only know about it by inference. It’s a firmware feature and not one manufacturers talk about much at all.
But in aggregate, yes, the chance of self-heal is higher with DUP, which is why the default was changed from single to DUP at mkfs time.