Btrfs on dm-crypt, improving chance of self-healing

Continuing the discussion from I want to change a btrfs LUKS encrypted partition to ext4:

By default these days Btrfs uses the DUP profile for metadata, i.e. two copies of the file system metadata. Originally it was the single profile when on SSD. Why?

Part of the rationale for that choice at the time was awareness that many SSDs, even cheap ones, do opportunistic deduplication during concurrent writes. That is, if the SSD firmware detects the exact same data in two (and maybe more) blocks within a concurrent write, it can very cheaply dedup them, and gain a storage and performance advantage.

We do see evidence of this from time to time. Btrfs detects the corruption, as designed, but the corruption identically affects both copies of the metadata, and the file system then tends to go read-only to protect itself from further damage. Of course, a suddenly read-only root file system will confuse user space. And users.

There are other possible explanations besides opportunistic dedup during concurrent writes, but dedup is more likely than the alternatives, because things like the write-time tree checker would catch most corruption generated by Btrfs itself before it is ever written out.

OK, so what about encryption? The XTS mode used for full-disk encryption tweaks the cipher with each sector’s number, so identical plaintext blocks become non-identical ciphertext blocks. And the SSD only sees the ciphertext blocks, so it can’t dedup them.
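A quick way to see this effect yourself, if you’re curious, is a throwaway loop device with a plain dm-crypt mapping (the loop device, file names, and mapping name below are all hypothetical, and the demo overwrites whatever is in the image file):

# truncate -s 64M /tmp/xts-demo.img
# losetup /dev/loop7 /tmp/xts-demo.img
# cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 512 --key-file /dev/urandom /dev/loop7 xtsdemo
# yes A | head -c 4096 > /tmp/block
# dd if=/tmp/block of=/dev/mapper/xtsdemo bs=4096 seek=0 conv=fsync
# dd if=/tmp/block of=/dev/mapper/xtsdemo bs=4096 seek=1 conv=fsync
# cmp <(dd if=/dev/loop7 bs=4096 skip=0 count=1 status=none) \
      <(dd if=/dev/loop7 bs=4096 skip=1 count=1 status=none)
# cryptsetup close xtsdemo && losetup -d /dev/loop7

cmp reports a difference: the same 4 KiB of plaintext, written to two different sectors, comes out as two completely different ciphertext blocks on the underlying device, so the SSD has nothing to dedup.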

This means some portion of otherwise uncorrectable corruptions become correctable. Btrfs can self-heal if only one of the copies is corrupted: it detects the corruption, complains loudly (many times, even for a single detection), automatically finds the good copy, uses it, and repairs the bad copy. All of that gets logged to dmesg.
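If you want to check whether this has ever happened on a system of yours, a couple of places to look (a scrub forces Btrfs to read and verify every copy, repairing from the good one wherever a second copy exists):

# btrfs device stats /
# dmesg | grep -iE 'btrfs.*(csum|corrupt|fixed)'
# btrfs scrub start -B /

The device stats counters (corruption_errs, generation_errs, and so on) persist across reboots, so they are a decent record of past silent corruption even if the dmesg history is long gone.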

So in this example, Btrfs can be more reliable if on dm-crypt and if the metadata profile is DUP.

To check what the metadata profile is:

# btrfs fi us /
...[snip]...
Data,single: Size:35.00GiB, Used:33.83GiB (96.66%)
   /dev/mapper/luks-10d575f0...      35.00GiB

Metadata,single: Size:2.00GiB, Used:569.75MiB (27.82%)
   /dev/mapper/luks-10d575f0...       2.00GiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/mapper/luks-10d575f0...      32.00MiB


In this example the data, metadata, and system block groups all use the single profile. One copy.

The metadata can be converted to DUP while the file system is mounted:

# btrfs balance start -mconvert=dup /
# btrfs fi us /
...[snip]...
Data,single: Size:35.00GiB, Used:33.83GiB (96.66%)
   /dev/mapper/luks-10d575f0...       35.00GiB

Metadata,DUP: Size:2.00GiB, Used:569.75MiB (27.82%)
   /dev/mapper/luks-10d575f0...        4.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/mapper/luks-10d575f0...       64.00MiB

-m means metadata and also covers the system block group; so -mconvert restricts the conversion to the metadata and system block groups, leaving data untouched.
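On a big or busy file system the conversion can take a while; it is safe to keep using the system in the meantime, and progress can be checked from another terminal with:

# btrfs balance status /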

Done.

Now there are two copies of the file system metadata.

This does not protect against significant failure of the SSD, but it can help with transient problems and silent corruption isolated to just one copy of the metadata, or even a single block. These are typical pre-failure behaviors of SSDs: they just start returning zeros or garbage, without any other device error and with no indication of a problem reported by smartctl.
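This is also why it’s worth comparing what the drive admits to with what Btrfs itself has counted (drive name hypothetical; point smartctl at the underlying device, not the dm-crypt mapping):

# smartctl -H /dev/sda
# smartctl -A /dev/sda
# btrfs device stats /

It matches the pre-failure behavior described above when SMART reports PASSED while corruption_errs in btrfs device stats keeps climbing.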

Does this help folks who are not using LUKS/dm-crypt? I don’t know how common the undocumented and unmarketed dedup feature is in cheap SSDs. We only know about it by inference. It’s a firmware feature and not one manufacturers talk about much at all.

But in aggregate, yes, the chance of self-healing is higher with DUP, which is why the default was changed from single to DUP at mkfs time.
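For a brand new file system there is nothing to convert; a reasonably recent mkfs.btrfs already defaults to DUP metadata on a single device, and it can be requested explicitly too (device path hypothetical):

# mkfs.btrfs -m dup -d single /dev/mapper/cryptroot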


This is highly interesting

Data,single: Size:244.00GiB, Used:221.60GiB (90.82%)
   /dev/mapper/luks-0140d69f-4781-4439-9339-94a101794e2e	 244.00GiB

Metadata,DUP: Size:6.00GiB, Used:2.97GiB (49.59%)
   /dev/mapper/luks-0140d69f-4781-4439-9339-94a101794e2e	  12.00GiB

System,DUP: Size:32.00MiB, Used:48.00KiB (0.15%)
   /dev/mapper/luks-0140d69f-4781-4439-9339-94a101794e2e	  64.00MiB

Unallocated:
   /dev/mapper/luks-0140d69f-4781-4439-9339-94a101794e2e	 208.68GiB

I guess this depends on the failure model that the controller manufacturers have. As long as they do not consider bit flips, deduplication of identical blocks is an easy way for them to reduce writes and extend the life of flash cells. It benefits the SSD manufacturers, not the users, so why would they publish that? I would assume that the SSD manufacturers that buy the controllers get this information under NDA. (And the same applies to companies like Samsung or WD that develop their own controllers in-house; it’s a feature that benefits only them.)

Would truly dup metadata improve btrfs’ crash resilience? There are various anecdotal reports of systems rendered unbootable after hard shutdowns.


Would truly dup metadata improve btrfs’ crash resilience?

It shouldn’t matter because the drive should always honor flush/fua.

If it always honors flush/fua, then the write order Btrfs expects is preserved, and the file system is always consistent (can be mounted).

There are various anecdotal reports of systems rendered unbootable after hard shutdowns.

If the drive fails to preserve write order and at the same time there’s a crash or power failure, the file system will be inconsistent. It takes two strokes of bad luck to end up in that situation.
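One thing that can at least be checked is whether the kernel treats the drive’s cache as volatile, i.e. whether it sends flushes at all (device name hypothetical):

# cat /sys/block/sda/queue/write_cache

This prints either "write back" (the kernel issues flush/FUA, the normal case) or "write through" (no volatile cache assumed, so no flushes are needed). Whether the firmware actually honors those flushes is the part you can’t see from here.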

I’ve made hundreds of attempts to sabotage Btrfs on HDD and SSD, during heavy writing, by forcing power off on the computer or the drive. Real hardware. Many more times with virtual machines.

I have experienced no inconsistencies with Btrfs in these attempts to sabotage it. It always mounts without error following this form of sabotage - which I do not recommend because it’s probably not great for the hardware. Will my luck run out? Probably, and I do keep multiple backups.

Every kernel release, including the release candidates, thousands of tests are run against Linux file systems via fstests. These include write-order tests using dm-log-writes to make sure the file system behaves as expected in case of power failure.
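fstests is public, so anyone can run the same tests; a rough sketch of a run against Btrfs on two scratch devices (device names and mount points are hypothetical, both devices get wiped, and the build needs the usual development packages installed):

# git clone https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
# cd xfstests-dev && make
# mkdir -p /mnt/test /mnt/scratch
# mkfs.btrfs -f /dev/vdb
# cat > local.config <<'EOF'
export FSTYP=btrfs
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch
EOF
# ./check -g quick

The dm-log-writes power-failure tests live among the generic tests, so they are included when the full suite is run.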

Meta is running millions of instances of Btrfs on consumer drives. They have told us Btrfs doesn’t fall over more often than other file systems. It’s true they also tend not to repair file systems when there are problems; it’s easier for them to reprovision.

Is it possible Btrfs is more difficult to repair than other file systems if something goes wrong with the hardware? Yes. Btrfs COW adds more protection, but it also makes repair harder: without fixed locations for file system metadata, repair tools can’t make the assumptions they can make elsewhere. And Btrfs doesn’t have the longer history of other file systems, a history that includes a lot of accumulated experience fixing problems.


Pure anecdotes. Correlation is not causation. It could be something else screwing people up, EVEN if those aren’t anecdotes.

100%. I was actually pleasantly surprised a few years ago when I switched to BTRFS. Before, there was always a latent sense of fear that something would go seriously awry one day when a hard poweroff happened. I don’t even know what was causing it, because I was on ext4 before… I guess I had some problems at one point even with ext4, or maybe it was an artifact of prior Windows use, and it just sort of stayed ingrained.

BTRFS set me free! Haha. No more fear! I’ve gone through so many hard poweroffs now and never once had an issue. There’s no longer a nagging feeling in the back of your head that you might lose data or turn your computer into a brick or whatever. It’s robust as heck. And the features on top of that: such awesome, very geek, wow.

Why have a “chance” of self-healing when you can just turn on DUP for actual data too and have 100% chance of self-healing?

Not OP, but I personally do this: LUKS + BTRFS with both data and metadata DUP.
But both the capacity drop and the financial cost of DUP or even RAID are quite high with flash storage.

Amen. And so what? Just architect your storage smarter. I bet you don’t want to have Data, DUP for your /tmp folder, etc. But it’s essential for important things. So no, not costly at all.

It’s a valid option, but it comes at the cost of slower writes and more space consumption. It also can’t protect you if both copies are missing or corrupt; any failure that takes out both copies isn’t recoverable. Cost- and performance-wise, this use case is probably better served by two devices and raid1.

I have used DUP data myself, but mainly for experimentation and testing.
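For reference, both of the setups being weighed here (DUP data on one device vs. raid1 across two) are just balance/convert operations on a mounted file system (the second device path is hypothetical):

# btrfs balance start -dconvert=dup /

gives two copies of data on the same device, while

# btrfs device add /dev/sdb /
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /

puts each copy on a different device, which is what protects against a whole-drive failure rather than just bad blocks.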

I don’t think that’s accurate. You first lay the first copy down at regular speed, and then you write the second copy in the background. Why would that be “slower writes”? I think you would only talk about something like that if you are writing lots of data sequentially or something?

DUP writes are concurrent.

It’s a separate question whether the user notices, as there are tricks like delayed allocation that apply to all writes that aren’t fsync’d.


If that’s so, that’d be an interesting idea for the BTRFS people… maybe add a setting that allows sequencing DUP writes a bit to take off that write-speed hit?

The hit is still there; you are just changing the timing, maybe.
Intuition about performance is wrong far more often than it’s right.
To be sure you are onto a real improvement, benchmarking is required.

Often disk subsystems work best when given a lot of work to do at once; the hardware then optimises the writes, etc.

The only way to know for sure would be to implement both schemes and benchmark over a range of hardware and use cases.
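A rough starting point for such a comparison, assuming two scratch partitions that can be reformatted (device paths, mount points, and job parameters below are all hypothetical/arbitrary): build one file system with single data and one with DUP data, then run an identical fio job against each.

# mkfs.btrfs -f -d single -m dup /dev/sdb1
# mkfs.btrfs -f -d dup -m dup /dev/sdb2
# mkdir -p /mnt/single /mnt/dup
# mount /dev/sdb1 /mnt/single && mount /dev/sdb2 /mnt/dup
# fio --name=seqwrite --directory=/mnt/single --size=4G --bs=128k --rw=write --fsync=32
# fio --name=seqwrite --directory=/mnt/dup --size=4G --bs=128k --rw=write --fsync=32

The --fsync option matters here: without it, delayed allocation and the page cache hide much of the difference, which is exactly the point made above about writes that aren’t fsync’d.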