Btrfs receive: "ERROR: crc32 mismatch in command". How to proceed?

Hi,

today i wanted to repartition my system, so i decide to back up the data with sending the existing data from all volumes with btrfs send to an external XFS formatted drive and saved the streams as files.

After repartitioning the drive, i wanted to restore my Files using btrfs receive, but it fails on a specific file which i do not need necessarily, but i can’t get any further as the receive stops immediately after hitting the crc32 error.

is there anything i can try to get past the crc32 error and try to recover the remaining data in the stream?

Edit: found the --max-errors option. Hope this will restore the rest of the file.

one lesson learned: run btrfs-receive --dump to test the file before deleting the source, as btrfs send does not verify it’s stream written to a file … :smiley:

Sidenote: i think the corruption arises because the system went to standby during btrfs send so i expect that there is only one corrupted file in the stream. Keeping my fingers crossed…

Lots of errors after that said point. ~ 500gb lost… and lots of orphanded files in the root of the home subvol…

Can you file a bug at RHBZ (you can use Fedora Account if you don’t already have an RHBZ account)?

Please include the btrfs-progs and kernel versions, and the exact btrfs send command used, and any messages it generated on the command line following wake from suspend.

And journalctl --no-hostname -o short-monotonic > journal.log

You can use -b flag to pick a prior boot if necessary.

I’m not sure the command can resume following wake from suspend, but it should report that the output file is incomplete due to the stream being interrupted.

Hi @chrismurphy and thanks for your reply

The strange thing is that the btrfs send transfer continued normally when the computer was reactivated from standby: there were no warnings in the terminal, and the transfer exited without error. I think there was some sort of sync issue while transferring to the destination disk because my USB drive was connected to the integrated U3425WE monitor hub. Unfortunately, there are no logs as I did all the work from the F43 Live ISO. In general, though, GNOME should not go into standby mode while a command is being executed in a terminal, so I think i should better file a bug against Ptyxis, shouldn’t I?

I don’t have time to try to reproduce this issue at the moment, but I’m happy to do so in the next few weeks.

Destination Disk is XFS Formatted, the command i used was:

btrfs send mysubvol > /path/to/USB-Disk/mysubvol.btrfs

i think, this was my fault, i should better use the ‘-f’ option instead using the redirection of stdout to a file. And perhaps make use of protocol v2? i just saw in the man pages, that btrfs send defaults to proto v1.

Today i examined all the data loss and it turns out, that i’ve only lost Stream’s Compatdata and all my VM Disks - nothing serious.

Cheers
Dave

>GNOME should not go into standby mode while a command is being executed in a terminal, so I think i should better file a bug against Ptyxis, shouldn’t I?

I’m not sure.

I think systemd-inhibit applies here, to allow the process to explicitly inhibit suspend/reboot/shutdown.

But if inhibit isn’t used then either the process resumes upon wake or it’s terminated. I don’t know the rules for that though.

Hi There,

I’ve had some spare time recently for testing, and I’ve been able to reproduce this issue consistently on my Dell laptop:

# create a backup
liveuser@localhost-live:/run/media/liveuser/dell-fedora$ sudo btrfs send --compressed-data --proto 2 -f /run/media/liveuser/Backup-D01/root.btrfs root
At subvol root

#(initiate Standby on 13:43:00 UTC)
# Wait for finish disk activity: ~13:43:42 (UTC)
# and woke up the system on ~13:43:55 (UTC)
 
# now let's check backup file with sudo btrfs receive --dump -f /run/media/liveuser/Backup-D01/root.btrfs root:
(lot of Output)
write           ./root/ostree/repo/objects/2a/a7f572b6a045e5feb8ebcacaa66a655806970dc491e54cf8a695273968a0bd.file offset=66191360 len=131072
write           ./root/ostree/repo/objects/2a/a7f572b6a045e5feb8ebcacaa66a655806970dc491e54cf8a695273968a0bd.file offset=66322432 len=131072
write           ./root/ostree/repo/objects/2a/a7f572b6a045e5feb8ebcacaa66a655806970dc491e54cf8a695273968a0bd.file offset=66453504 len=131072
write           ./root/ostree/repo/objects/2a/a7f572b6a045e5feb8ebcacaa66a655806970dc491e54cf8a695273968a0bd.file offset=66584576 len=131072
ERROR: crc32 mismatch in command
ERROR: failed to dump the send stream: Invalid argument
liveuser@localhost-live:/run/media/liveuser/dell-fedora$ 

# Backup is damaged - lets see the Kernel Logs:

Nov 16 13:43:01 kernel: PM: suspend entry (s2idle)
Nov 16 13:43:42 kernel: Filesystems sync: 41.307 seconds
Nov 16 13:43:55 kernel: Freezing user space processes
Nov 16 13:43:55 kernel: Freezing user space processes completed (elapsed 0.002 seconds)
Nov 16 13:43:55 kernel: OOM killer disabled.
Nov 16 13:43:55 kernel: Freezing remaining freezable tasks
Nov 16 13:43:55 kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
Nov 16 13:43:55 kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Nov 16 13:43:55 kernel: PM: suspend devices took 0.315 seconds
Nov 16 13:43:55 kernel: ACPI: EC: interrupt blocked
Nov 16 13:43:55 kernel: ACPI: EC: interrupt unblocked
Nov 16 13:43:55 kernel: pci 0000:00:08.0: Setting to D3hot
Nov 16 13:43:55 kernel: intel-ipu6 0000:00:05.0: IPU6 in secure mode
Nov 16 13:43:55 kernel: pcieport 10000:e0:06.1: can't derive routing for PCI INT A
Nov 16 13:43:55 kernel: nvme 10000:e1:00.0: PCI INT A: no GSI
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT0: GuC firmware i915/mtl_guc_70.bin version 70.49.4
Nov 16 13:43:55 kernel: i8042 kbd 00:03: activated
Nov 16 13:43:55 kernel: i8042 aux 00:04: activated
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
Nov 16 13:43:55 kernel: nvme nvme0: D3 entry latency set to 10 seconds
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT0: GUC: RC enabled
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT1: GuC firmware i915/mtl_guc_70.bin version 70.49.4
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT1: HuC firmware i915/mtl_huc_gsc.bin version 8.5.4
Nov 16 13:43:55 kernel: nvme nvme0: 22/0/0 default/read/poll queues
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT1: GUC: submission enabled
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT1: GUC: SLPC enabled
Nov 16 13:43:55 kernel: i915 0000:00:02.0: [drm] GT1: GUC: RC enabled
Nov 16 13:43:55 kernel: PM: resume devices took 0.339 seconds
Nov 16 13:43:55 kernel: OOM killer enabled.
Nov 16 13:43:55 kernel: Restarting tasks: Starting
Nov 16 13:43:55 kernel: Restarting tasks: Done
Nov 16 13:43:55 kernel: random: crng reseeded on system resumption
Nov 16 13:43:55 kernel: mei_gsc_proxy 0000:00:16.0-0f73db04-97ab-4125-b893-e904ad0d5464: bound 0000:00:02.0 (ops i915_gsc_proxy_component_ops [i915])
Nov 16 13:43:55 kernel: PM: suspend exit
lines 1437-1476/1476 (END)
# some lines (mostly wifi auth stuff) removed

After waking the system from S2idle, the btrfs send command resumes without interruption or error. However, the resulting file is damaged. For this test, I connected the USB HDD directly to a USB port on my laptop instead of the monitor hub.

I also ran the same test on a completely different machine with a different USB disk, and got the same result.

2nd Test
Nov 16 14:38:17 kernel: PM: suspend entry (deep)
Nov 16 14:39:05 kernel: Filesystems sync: 48.225 seconds
Nov 16 14:39:24 kernel: Freezing user space processes
Nov 16 14:39:24 kernel: Freezing user space processes completed (elapsed 0.001 seconds)
Nov 16 14:39:24 kernel: OOM killer disabled.
Nov 16 14:39:24 kernel: Freezing remaining freezable tasks
Nov 16 14:39:24 kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
Nov 16 14:39:24 kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Nov 16 14:39:24 kernel: serial 00:04: disabled
Nov 16 14:39:24 kernel: sd 7:0:0:0: [sda] Synchronizing SCSI cache
Nov 16 14:39:24 kernel: PM: suspend devices took 0.282 seconds
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
Nov 16 14:39:24 kernel: ACPI: PM: Preparing to enter system sleep state S3
Nov 16 14:39:24 kernel: ACPI: PM: Saving platform NVS memory
Nov 16 14:39:24 kernel: Disabling non-boot CPUs ...
Nov 16 14:39:24 kernel: smpboot: CPU 15 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 14 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 13 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 12 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 11 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 10 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 9 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 8 is now offline
Nov 16 14:39:24 kernel: Spectre V2 : Update user space SMT mitigation: STIBP off
Nov 16 14:39:24 kernel: smpboot: CPU 7 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 6 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 5 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 4 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 3 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 2 is now offline
Nov 16 14:39:24 kernel: smpboot: CPU 1 is now offline
Nov 16 14:39:24 kernel: ACPI: PM: Low-level resume complete
Nov 16 14:39:24 kernel: ACPI: PM: Restoring platform NVS memory
Nov 16 14:39:24 kernel: AMD-Vi: Virtual APIC enabled
Nov 16 14:39:24 kernel: AMD-Vi: Virtual APIC enabled
Nov 16 14:39:24 kernel: LVT offset 0 assigned for vector 0x400
Nov 16 14:39:24 kernel: Enabling non-boot CPUs ...
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
Nov 16 14:39:24 kernel: CPU1 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 2 APIC 0x4
Nov 16 14:39:24 kernel: CPU2 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 3 APIC 0x6
Nov 16 14:39:24 kernel: CPU3 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 4 APIC 0x8
Nov 16 14:39:24 kernel: CPU4 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 5 APIC 0xa
Nov 16 14:39:24 kernel: CPU5 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 6 APIC 0xc
Nov 16 14:39:24 kernel: CPU6 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 7 APIC 0xe
Nov 16 14:39:24 kernel: CPU7 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 8 APIC 0x1
Nov 16 14:39:24 kernel: Spectre V2 : Update user space SMT mitigation: STIBP always-on
Nov 16 14:39:24 kernel: CPU8 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 9 APIC 0x3
Nov 16 14:39:24 kernel: Spurious APIC interrupt (vector 0xFF) on CPU#9, should never happen.
Nov 16 14:39:24 kernel: CPU9 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 10 APIC 0x5
Nov 16 14:39:24 kernel: Spurious APIC interrupt (vector 0xFF) on CPU#10, should never happen.
Nov 16 14:39:24 kernel: CPU10 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 11 APIC 0x7
Nov 16 14:39:24 kernel: Spurious APIC interrupt (vector 0xFF) on CPU#11, should never happen.
Nov 16 14:39:24 kernel: CPU11 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 12 APIC 0x9
Nov 16 14:39:24 kernel: Spurious APIC interrupt (vector 0xFF) on CPU#12, should never happen.
Nov 16 14:39:24 kernel: CPU12 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 13 APIC 0xb
Nov 16 14:39:24 kernel: Spurious APIC interrupt (vector 0xFF) on CPU#13, should never happen.
Nov 16 14:39:24 kernel: CPU13 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 14 APIC 0xd
Nov 16 14:39:24 kernel: Spurious APIC interrupt (vector 0xFF) on CPU#14, should never happen.
Nov 16 14:39:24 kernel: CPU14 is up
Nov 16 14:39:24 kernel: smpboot: Booting Node 0 Processor 15 APIC 0xf
Nov 16 14:39:24 kernel: Spurious APIC interrupt (vector 0xFF) on CPU#15, should never happen.
Nov 16 14:39:24 kernel: CPU15 is up
Nov 16 14:39:24 kernel: ACPI: PM: Waking up from system sleep state S3
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: PCIE GART of 512M enabled (table at 0x00000083DAB00000).
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
Nov 16 14:39:24 kernel: serial 00:04: activated
Nov 16 14:39:24 kernel: nvme nvme0: 16/0/0 default/read/poll queues
Nov 16 14:39:24 kernel: xhci_hcd 0000:0e:00.0: xHC error in resume, USBSTS 0x401, Reinit
Nov 16 14:39:24 kernel: usb usb1: root hub lost power or was reset
Nov 16 14:39:24 kernel: usb usb2: root hub lost power or was reset
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000002e, smu fw if version = 0x00000032, smu fw program = 0, smu fw version = 0x00684a00 (104.74.0)
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: program CP_MES_CNTL : 0x4000000
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: program CP_MES_CNTL : 0xc000000
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x0A000400
Nov 16 14:39:24 kernel: ata4: SATA link down (SStatus 0 SControl 300)
Nov 16 14:39:24 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Nov 16 14:39:24 kernel: ata3: SATA link down (SStatus 0 SControl 300)
Nov 16 14:39:24 kernel: ata2: SATA link down (SStatus 0 SControl 300)
Nov 16 14:39:24 kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd
Nov 16 14:39:24 kernel: r8152 6-1.2:1.0 enp16s0f4u1u2: carrier on
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 8 on hub 0
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 9 on hub 0
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
Nov 16 14:39:24 kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
Nov 16 14:39:24 kernel: usb 1-6: WARN: invalid context state for evaluate context command.
Nov 16 14:39:24 kernel: usb 1-6: reset full-speed USB device number 3 using xhci_hcd
Nov 16 14:39:24 kernel: usb 2-5: reset SuperSpeed USB device number 2 using xhci_hcd
Nov 16 14:39:24 kernel: PM: resume devices took 1.164 seconds
Nov 16 14:39:24 kernel: OOM killer enabled.
Nov 16 14:39:24 kernel: Restarting tasks: Starting
Nov 16 14:39:24 kernel: Restarting tasks: Done
Nov 16 14:39:24 kernel: random: crng reseeded on system resumption
Nov 16 14:39:24 kernel: PM: suspend exit
Nov 16 14:39:24 kernel: r8152 6-1.2:1.0 enp16s0f4u1u2: carrier on

So, the question is: is it the kernel’s fault? Or is it a faulty driver? Or is it the BTRFS utilities? I’m not quite sure where to report my findings.

Thanks!

I’m not sure it’s a btrfs question. I think it’s really a generic shell question: if shell is running a process redirecting stdout to a file, should the system suspend or be inhibited?

Should the Terminal program set inhibit for all commands? Or is the user expected to do that?

It might be a question for the Fedora devel@ mailing list. Subscription is necessary to post, anyone can join.

stdout was not used during the test.

Unfortunately terminals do not set inhibit. So it’s better to start the btrfs send process with systemd-inhibit, this should prevent system from suspending.

systemd-inhibit sudo  btrfs send ....

use systemd-inhibit --list to check the current list

$ systemd-inhibit  sleep 120  &
$ systemd-inhibit --list
WHO            UID  USER   PID    COMM            WHAT                        WHY           MODE
sleep 120      1000 mark 280374 systemd-inhibit shutdown:sleep:idle   Unknown reason    block
1 Like

OK good point. I glossed over that.

I still don’t know whether suspend should be inhibited in this case. I think the question is valid for -f as well as stdout though.

Is the user expected to request inhibit? Should the terminal program set inhibit for all shell programs? I’m not sure. And therefore I’m not sure if it’s expected the resulting file is bad.

In the case of btrbk it will do a btrfs send and receive, locally as well as remotely via ssh. So it’s the same concern. What if a replication event is in-progress when suspend is initiated, whether manually or on a timer? What happens and what should happen may not be aligned.

In general, terminal applications don’t set inhibitors for running processes automatically. Applications and services need to explicitly set the inhibitor themselves.

1 Like

1. I have straced btrfs send and suspended/resumed the system in the beginning of btrfs send:

ioctl(5, BTRFS_IOC_SEND, {send_fd=7, clone_sources_count=0, clone_sources=NULL, parent_root=0, flags=BTRFS_SEND_FLAG_VERSION}) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
ioctl(5, BTRFS_IOC_SEND, {send_fd=7, clone_sources_count=0, clone_sources=NULL, parent_root=0, flags=BTRFS_SEND_FLAG_VERSION}) = 0

For comparison: strace btrfs send w/o suspend/resume:

ioctl(5, BTRFS_IOC_SEND, {send_fd=7, clone_sources_count=0, clone_sources=NULL, parent_root=0, flags=BTRFS_SEND_FLAG_VERSION}) = 0

The output file for suspend is longer than the file w/o suspend:

-rw-------. 1 root root 9006882308 Nov 19 10:17 susp
-rw-------. 1 root root 7878804653 Nov 19 10:16 nosusp

2. I have redone the test, this time I resized the terminal close to the end of btrfs send:

ioctl(5, BTRFS_IOC_SEND, {send_fd=7, clone_sources_count=0, clone_sources=NULL, parent_root=0, flags=BTRFS_SEND_FLAG_VERSION}) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
ioctl(5, BTRFS_IOC_SEND, {send_fd=7, clone_sources_count=0, clone_sources=NULL, parent_root=0, flags=BTRFS_SEND_FLAG_VERSION}) = 0
-rw-------. 1 root root 12451238907 Nov 19 11:27 win
-rw-------. 1 root root  7878804653 Nov 19 11:25 nowin

Seems like btrfs send executes one giant ioctl, which sends the data of the snaphot to user space. The ioctl terminates and gets restarted from the beginning during signal delivery caused by suspend/resume/window change. The output file does not get truncated, causing the CRC error during btrfs receive.

You could post your findings to the upstream Btrfs mailing list, linux-btrfs@vger.kernel.org, subscription is not required but sometimes not everyone will reply-all.

I think the gotcha with btrfs send providing inhibit by default is that some snapshots are huge, and replication could take hours. What if this results in draining the life of the laptop rather than just suspending and leaving the replicated snapshot in a bad state? It’s really up to the specific end user to decide which outcome is less bad, and choose accordingly.

But it’s good to have an upstream discussion about it.

See (sig)STOPing and (sig)CONTinuing a send|receive pipe causes: ERROR: crc32 mismatch in command · Issue #719 · kdave/btrfs-progs · GitHub

2 Likes

I have a laptop with a fresh install of F43 of GNOME from Everything, updates, and wake-on-wlan configured. In the past, with earlier Fedora versions, I could wake the laptop with ether-wake from another building and connect using ssh or cockpit. Now the system goes to sleep almost immediately, so the issue isn’t limited to Ptyxis. I’m wondering if sleep is watching for keyboard or mouse activity.

I have filed a new issue: btrfs send: suspend/resume or resize terminal corrupts output (crc32 mismatch) · Issue #1063 · kdave/btrfs-progs · GitHub

1 Like

Good find.