Btrfs copy with deduplication

This question might be based on some misunderstanding of btrfs, so please correct me and/or link to any clear documentation if I’m misunderstanding things.

I want to do a large recursive copy (often a whole subvolume) from one partition to another with behavior equivalent to “inband” deduplication. I don’t care that the copy operation might take several times as long as a normal copy. I do care that it doesn’t even temporarily take as much space as would be required if I did an ordinary copy followed by “out of band” deduplication.

As an example, consider subvolume X in one btrfs filesystem and subvolume Y in another, where X and Y have mostly the same contents (even if file creation and modification dates differ for otherwise matching files). The goal is to create subvolume Z in the same btrfs filesystem as X, with contents (including directory info) matching Y. Because Z’s file contents will mostly match X’s file contents, the incremental space used by Z should be small.

That example might have a good “backwards” solution and/or some special method I haven’t thought of. I’m interested both in that example and in some less simple cases, so if that example has a better answer than the general answer, I’d like to know both answers. By backwards, I mean snapshot X into Z and then apply all the differences from Y into Z (I don’t know a good tool for applying all the differences in that situation).

I’ve installed and tried duperemove and read its documentation, but I don’t understand enough about what it is really doing to know whether it would be practical for an incremental kludge: copy a small fraction at a time and then deduplicate that fraction against a pre-existing database of what was there before. Hopefully, combined with temporarily tweaking the delayed write features of the destination fs, one could avoid ever actually writing the majority of contents that are immediately removed as duplicate.
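
To make the kludge concrete, here is a rough Python sketch of what I have in mind. It is only a sketch under assumptions: `plan_batches` and `copy_and_dedupe` are names I made up, the size budget is arbitrary, and the duperemove flags (`-d` to dedupe, `-r` to recurse, `--hashfile=` to reuse hashes between runs) are how I read its man page. I have not checked that this really bounds the temporary space used.

```python
import os
import shutil
import subprocess


def plan_batches(files, budget_bytes):
    """Group (relative_path, size) pairs into batches of at most budget_bytes.

    A single file larger than the budget ends up in a batch of its own.
    """
    batches, current, used = [], [], 0
    for path, size in files:
        if current and used + size > budget_bytes:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        batches.append(current)
    return batches


def copy_and_dedupe(src_root, dst_root, scan_roots, batches, hashfile):
    """Copy one batch at a time, running duperemove after each batch.

    scan_roots (hypothetical parameter) lists the directories duperemove
    should scan, e.g. the existing data plus the new copy. The hashfile is
    reused so that unchanged files should not need rehashing on every pass.
    """
    for batch in batches:
        for rel in batch:
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(os.path.join(src_root, rel), dst)
        subprocess.run(
            ["duperemove", "-d", "-r", f"--hashfile={hashfile}", *scan_roots],
            check=True,
        )
```

With a budget of, say, a few GB per batch, the hope is that the extra space in use at any moment stays near one batch rather than the whole copy.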

I understand “in band” deduplication exists. I think I understand I would need to rebuild the kernel myself to get that feature. I think I understand that it would be a bad feature to have turned on all the time, maybe even a bad feature to have turned on for other use of the fs that occurs at the same time as the copy operation that wants it. But maybe turning that on during the copy operation is my best answer.

https://btrfs.wiki.kernel.org/index.php/Deduplication

In-band deduplication

Inband / synchronous / inline deduplication is deduplication done in the write path, so it happens as data is written to the filesystem. This typically requires large amounts of RAM to store the lookup table of known block hashes and adds IO overhead to store the hashes. The feature is not actively developed; some patches have been posted. See the User notes on dedupe page for more details.

This page lists 3 tools:

  • duperemove
  • bees
  • dduper

https://btrfs.readthedocs.io/en/latest/Deduplication.html
This page only lists bees and duperemove.

I personally have tried bees. It works like this:

  • For an existing btrfs filesystem, it scans each “block” and stores a hash value.
  • When two “blocks” with the same hash are found, it further tests whether they are actually identical. If they are, it dedupes them. Otherwise, it skips them.
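
The two steps above can be illustrated with a small Python sketch. This is not how bees is actually implemented (it works on extents through kernel ioctls); the function name, the 4096-byte block size, and the use of SHA-256 are my own assumptions, chosen only to show the "hash first, then verify byte for byte" idea on an in-memory buffer.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # illustrative block size, not a bees setting


def find_duplicate_blocks(data, block_size=BLOCK_SIZE):
    """Return (first_offset, duplicate_offset) pairs for verified duplicates."""
    seen = defaultdict(list)  # hash -> offsets of distinct blocks with that hash
    duplicates = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        for earlier in seen[digest]:
            # Same hash: compare the bytes before treating it as a duplicate.
            if data[earlier:earlier + block_size] == block:
                duplicates.append((earlier, offset))
                break
        else:
            # No verified match, so remember this block for later lookups.
            seen[digest].append(offset)
    return duplicates
```

A real deduplicator would then ask the filesystem to share the storage for each verified pair instead of just reporting the offsets.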

https://github.com/Zygo/bees/blob/master/docs/btrfs-kernel.md#unfixed-kernel-bugs says:
“The kernel does not permit btrfs send and dedupe to run at the same time. Recent kernels no longer crash, but now refuse one operation with an error if the other operation was already running.”

I did read all that documentation before asking the question.

Sorry for overlooking that.

Please do share back if you manage to patch the current kernel with the “in-band” dedup feature.

I don’t think I’m going to try rebuilding the kernel.

I mentioned that originally, more in hopes that I misunderstood the documentation I read and/or it was obsolete. I was hoping (not expecting) to be told there is an easier way.

I shouldn’t have given that terse reply to two attempts at helping. I didn’t mean to imply you should have guessed which documentation I had already read. I certainly might have missed some important documentation that does answer my question. So quoting or linking documentation is a reasonable attempt at helping.

I only meant that I had read those specific items of documentation before asking and so far as I understand, none of that answers my question.