I didn’t realize btrfs snapshots could take this much metadata space. I deleted a couple of snapshots, and the metadata usage went down by a couple of hundred MB. That’s crazy! Can you run out of space just by snapshotting, even with very little change between snapshots? I’m not talking about file diffs, I’m talking about duplicated metadata, because there’s very little actual change between my snapshots. I haven’t done thorough research, but is this a realistic scenario?
Also, I assume btrfs allocates more metadata space if you run out of it?
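(For anyone wanting to check this on their own system: something along these lines shows how much metadata space is allocated vs. actually used, and how much unallocated space is left for btrfs to carve new metadata chunks out of. The mountpoint is just an example.)

```
# Per-type allocation: as "used" approaches "total" for Metadata, btrfs
# allocates another metadata chunk from the unallocated pool (if any is left)
sudo btrfs filesystem df /

# Fuller picture, including "Device unallocated" and a per-device breakdown
sudo btrfs filesystem usage /
```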
So does this basically mean that btrfs is NOT copy-on-write as far as snapshot functionality goes? That seems like something that should be discussed, and users should be informed about it.
How do you come to that conclusion? It is clearly stated:
A snapshot is not a backup: snapshots work by use of BTRFS’ copy-on-write behaviour. A snapshot and the original it was taken from initially share all of the same data blocks.
That’s probably wrong. I just copied a folder on a btrfs subvolume and ran btrfs fi usage . right away, and my metadata usage went up by 115MB! I deleted that duplicated folder and the metadata usage went back down by 115MB.
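Something along these lines should reproduce it (directory names are placeholders; on newer coreutils plain cp already attempts a reflink, --reflink=always just makes it explicit):

```
# Metadata usage before the copy
sudo btrfs filesystem usage .

# Duplicate the folder as a reflink copy (data extents shared, new metadata written)
cp -a --reflink=always ./some-container ./some-container-copy
sync

# Metadata usage after: compare the Metadata "Used" value with the baseline
sudo btrfs filesystem usage .
```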
So that means that on the backend, btrfs CoW has major overhead that diverges from pure CoW behavior, and, I dare say, it destroys the CoW advantage for people who want to keep many snapshot backups, because metadata use can go up significantly!
You’re probably wrong in this case. That’s the official BTRFS documentation site. CoW functionality is about not copying the data, not about not copying the metadata. How would you propose to reflink a single dataset from two different locations without increasing the number of unique references pointing from those two locations to that data?
Yaa, ooook. (A total Miss Swan moment.) I would work on a way to do it, because as it stands right now I have a 10MB diff in data and a 115MB diff in metadata per snapshot. That kind of defeats the BTRFS snapshot advantage based on the CoW principle. Metadata should only be duplicated if the data gets diffed, or something like that.
Please join. And as an end user, there is 100% understanding of what is happening: a 10MB data diff with a 115MB metadata diff on each snapshot is NOT how it should work.
Probably there is 115MB of metadata related to the file system, especially if there is a lot of data on the system. A snapshot would need a FULL copy of the metadata that describes the actual data.
I really do not understand the concern, since with most drives now being multiple TB in size, the 115MB is only about 0.01% of a TB.
That’s actually a presumptuous thing to say. Many people are still on 128-256GB drives and might not be able to afford new ones, and some backup scenarios involve smaller targets. It’s just an unexpectedly significant overhead hit for some people. I still don’t understand why there can’t be a better way to do this. For example, if you duplicate a folder with lots of content, just point to the top folder. If you then start changing anything inside the duplicated folder, duplicate only the metadata along the path to whatever is being changed, and create top-level links (non-recursively, on the fly) to the other stuff that hasn’t changed. Wouldn’t that save a lot of metadata overhead in snapshots and in general?
It would take 10 snapshots of that size to use 1GB of storage.
Compare that to 1 full backup of all the data on the drive and the space that would be used.
The metadata handling with COW already does approximately what you describe.
If your idea has merit, you should be able to show how it is better; then improvements may be made.
This idea sounds great, but it would involve the expertise of the developers of btrfs and other file systems to implement.
Make a change proposal to the developers instead of arguing for or against it here, where we are (almost) all users and not programming experts.
What’s the exact command you used, with complete paths? For example, rsync does not attempt to use the FICLONE ioctl (reflink copy), therefore it duplicates the data, which would be consistent with your description.
Whereas cp and Files (GNOME Shell) do try to use FICLONE, which results in new metadata writes but no new data writes, as the data extents are shared (reflink copy). The new metadata writes come from creating several new items in a leaf block to create the file (which has its own inode, permissions, timestamps, all of that). That’s not much, maybe a few hundred bytes, but due to COW (no overwriting) a bunch of writes happen: a new leaf containing the added items is written, that’s 16KiB; at least one node block is written to point to that leaf, another 16KiB; and so on all the way up to the file tree root node, then the root tree, and then the super blocks. It’s still pretty small.
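As a rough illustration (the file names here are made up), a reflink copy gets its own inode and metadata items even though the data extents are shared:

```
# Make a test file and a reflink copy of it
dd if=/dev/urandom of=orig.bin bs=1M count=100
cp --reflink=always orig.bin clone.bin
sync

# Different inode numbers: each file has its own inode item, permissions,
# timestamps, etc., and those new items are written copy-on-write
stat -c '%n  inode=%i  size=%s' orig.bin clone.bin

# But the data is shared: the Exclusive column should be near zero for both
# files, with the bulk of the space showing up as shared
sudo btrfs filesystem du orig.bin clone.bin
```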
A snapshot is more efficient because it just creates a new btree that points to the “parent” subvolume and all of its contents. You won’t get any copy-on-write unless changes are made to either the original subvolume or its snapshot. If there are changes, there are potentially quite a lot of writes, because now one or the other file btree needs to be filled out rather than merely pointing to its “parent”. Once changes are happening, COW does apply, and it can be a little expensive relative to things like the initial snapshot. Most Fedora users might be looking at less than 5MiB to make and minimally change a snapshot or its parent subvolume.
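A quick way to see how cheap the snapshot itself is (the subvolume paths are placeholders):

```
# Metadata usage before
sudo btrfs filesystem usage /mnt/pool

# Read-only snapshot of an existing subvolume: a new btree root pointing
# at the parent subvolume's existing blocks
sudo btrfs subvolume snapshot -r /mnt/pool/data /mnt/pool/data-snap

# Metadata usage after: the difference should be tiny until either side changes
sudo btrfs filesystem usage /mnt/pool
```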
Here’s an example where cp produces shared extents even across subvolumes (root and home subvolumes in my case):
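(The paths below are placeholders; filefrag -v prints each file’s physical extent offsets, and for a reflink copy those offsets line up because both files point at the same extents.)

```
# Reflink-copy a file that lives on the home subvolume into a directory
# on the root subvolume (same btrfs filesystem, different subvolumes)
sudo cp --reflink=always /home/user/big.iso /var/tmp/big.iso

# Compare the physical_offset columns: matching offsets mean shared extents
sudo filefrag -v /home/user/big.iso /var/tmp/big.iso
```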
By contrast, if a copied file shows different physical offsets, the extents are not shared: the copy duplicated the data, so more space has been used for the operation, as well as more IO to read the file contents and write them out somewhere else.
No, I understand. I did, of course, use cp for the duplication. The folder is an nspawn container, so I do get the benefits of CoW: duplicating a folder that’s several GBs increases disk usage by only 115MB of metadata. Of course, snapshotting is still a huge improvement for backups. But snapshotting has a serious drawback that I don’t see anyone talk about - metadata bloat. And it even seems this metadata bloat could be gotten rid of by not duplicating the metadata of unmodified files - there are ways to do it, maybe even snapshotting the metadata upon duplication and tracking modifications to update those metadata snapshots. That part is very technical, but it is not technical to inform the Fedora community of this BTRFS drawback.
Snapshots don’t have any bloat. They aren’t free, but they are very, very cheap on Btrfs - that was the intent of the original design.
Reflink copies made with cp are cheaper than full file metadata and data duplication. But the “metadata bloat” is the same for XFS as for Btrfs.
The current design doesn’t have a way of deduplicating metadata created with cp reflinks. There is deduplication (initially) of metadata with snapshots, however.
Depending on the workload, it’s possible you’re experiencing a side effect of something called bookend extents. That’s not metadata bloat, but a limitation of how data extents are handled when portions of those extents are no longer referenced by any files. Ideally the unused portion of an extent would be broken up into its own extent and then that space freed. That isn’t currently possible, but it’s planned work.
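A rough sketch of how this can bite (the file name is made up; exact numbers depend on mount options and how the extents happen to be laid out):

```
# Write a file that btrfs stores as a few large extents
dd if=/dev/urandom of=big.bin bs=1M count=256 conv=fsync

# Keep only the first 1MiB referenced; the rest of the partially-referenced
# extent becomes a "bookend" that stays allocated on disk
truncate -s 1M big.bin
sync

# compsize (a separate package) reports "Disk usage" vs. "Referenced"; a large
# gap after an edit like this is the bookend effect - space pinned by an extent
# that the file only partially references
sudo compsize big.bin
```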
The problem of bookend extents does tend to show up in container workflows (e.g. moby, podman, maybe also nspawn) leveraging Btrfs snapshots instead of overlayfs. It’s contingent on how the container is modified though - it’s not a given you’ll always see bookend extents.
As I see it, these container tools don’t have automatic clean-up of stale containers; that task is left up to the user to manage manually. Therefore it might become more important to keep track of stale containers when using a Btrfs graph driver for certain workloads.