BTRFS scrub crashes my system

I found out recently that if you run your system on BTRFS you are supposed to run a scrub periodically.

So I installed BTRFS Assistant and tried to run one.

Whenever I start a scrub, my whole system crashes and I can’t even use REISUB; I have to power off my PC manually.

The only log lines left in journalctl after the reboot, from the moment I started the scrub, are:

Aug 12 11:10:56 littlebeast kernel: BTRFS info (device nvme1n1p3): scrub: started on devid 1
Aug 12 11:11:02 littlebeast systemd[1]: snapperd.service: Deactivated successfully.
Aug 12 11:11:02 littlebeast audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=snapperd comm="systemd" exe="/usr/lib/syste>
Aug 12 11:11:03 littlebeast systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Aug 12 11:11:03 littlebeast audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/>
Aug 12 11:11:03 littlebeast audit: BPF prog-id=102 op=UNLOAD
Aug 12 11:11:03 littlebeast audit: BPF prog-id=101 op=UNLOAD
Aug 12 11:11:03 littlebeast audit: BPF prog-id=100 op=UNLOAD

But I’m sure there was more, since I was following dmesg while the scrub was running; those messages were never saved to the log.

What should I do to debug this situation? Should I run the scrub on a part of the device?
Thanks in advance.

I tried to run the scrub only on ~, but it crashed anyway.

We can’t help without more details. Please provide the output of inxi -Fzxx.
When reporting problems it is best to ensure that your system is fully updated so it is easy for others to reproduce the configuration and also so you aren’t chasing a bug that has been fixed.

  • Have you run btrfs check (on an unmounted filesystem, e.g., by booting a Fedora Installer Live USB)?
  • Have you checked the drive health (e.g., with Gnome Disks or smartmontools)?
  • dmesg uses a ring buffer, but the output should be available using journalctl.
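To illustrate the last point, here is a minimal sketch for pulling the storage-related kernel messages of the previous boot out of the journal. It assumes the journal is persistent (the Fedora default); the function names are made up for this example. If the machine locks up hard, the final messages may still never reach the disk.

```shell
# Keep only lines that mention Btrfs or NVMe, ignoring case.
filter_storage_lines() {
    grep -iE 'btrfs|nvme'
}

# Show storage-related kernel messages from the previous boot (-b -1).
# Assumes a persistent journal; otherwise there is no previous boot to read.
previous_boot_storage_log() {
    journalctl -k -b -1 --no-pager | filter_storage_lines
}
```

After rebooting from the crash, `previous_boot_storage_log` should print whatever the kernel managed to log before the freeze. Following `journalctl -kf` over SSH from a second machine while the scrub runs is another way to catch lines that are never written to disk.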

I’ve booted a live session!

This is my system:

System:
  Kernel: 6.5.6-300.fc39.x86_64 arch: x86_64 bits: 64 compiler: gcc
    v: 2.40-13.fc39
  Desktop: GNOME v: 45.0 tk: GTK v: 3.24.38 wm: gnome-shell dm: GDM
    Distro: Fedora Linux 39 (Workstation Edition)
Machine:
  Type: Desktop Mobo: ASRock model: X570 Phantom Gaming 4
    serial: <superuser required> UEFI: American Megatrends v: P5.01
    date: 01/18/2023
CPU:
  Info: 6-core model: AMD Ryzen 5 3600 bits: 64 type: MT MCP arch: Zen 2
    rev: 0 cache: L1: 384 KiB L2: 3 MiB L3: 32 MiB
  Speed (MHz): avg: 2700 high: 4067 min/max: 2200/4208 boost: enabled cores:
    1: 2200 2: 2200 3: 2200 4: 2200 5: 2200 6: 3600 7: 4067 8: 2200 9: 2140
    10: 3600 11: 2200 12: 3600 bogomips: 86236
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: AMD Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT]
    vendor: Micro-Star MSI driver: amdgpu v: kernel arch: RDNA-2 pcie:
    speed: 16 GT/s lanes: 16 ports: active: DP-3 empty: DP-1,DP-2,HDMI-A-1
    bus-ID: 0c:00.0 chip-ID: 1002:73df
  Device-2: Sunplus Innovation Aukey-PC-LM1E Camera
    driver: snd-usb-audio,uvcvideo type: USB rev: 2.0 speed: 480 Mb/s lanes: 1
    bus-ID: 1-2.2:5 chip-ID: 1bcf:0215
  Display: wayland server: X.org v: 1.20.14 with: Xwayland v: 23.2.1
    compositor: gnome-shell driver: gpu: amdgpu display-ID: 0
  Monitor-1: DP-3 model: AOC Q27G2WG4 res: 2560x1440 dpi: 109
    diag: 685mm (27")
  API: OpenGL v: 4.6 vendor: amd mesa v: 23.2.1 glx-v: 1.4 es-v: 3.2
    direct-render: yes renderer: AMD Radeon RX 6700 XT (navi22 LLVM 16.0.6 DRM
    3.54 6.5.6-300.fc39.x86_64) device-ID: 1002:73df display-ID: :0.0
  API: EGL Message: EGL data requires eglinfo. Check --recommends.
Audio:
  Device-1: AMD Navi 21/23 HDMI/DP Audio driver: snd_hda_intel v: kernel pcie:
    speed: 16 GT/s lanes: 16 bus-ID: 0c:00.1 chip-ID: 1002:ab28
  Device-2: AMD Starship/Matisse HD Audio vendor: ASRock
    driver: snd_hda_intel v: kernel pcie: speed: 16 GT/s lanes: 16
    bus-ID: 0e:00.4 chip-ID: 1022:1487
  Device-3: Sunplus Innovation Aukey-PC-LM1E Camera
    driver: snd-usb-audio,uvcvideo type: USB rev: 2.0 speed: 480 Mb/s lanes: 1
    bus-ID: 1-2.2:5 chip-ID: 1bcf:0215
  API: ALSA v: k6.5.6-300.fc39.x86_64 status: kernel-api
  Server-1: JACK v: 1.9.22 status: off
  Server-2: PipeWire v: 0.3.80 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
Network:
  Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel pcie: speed: 5 GT/s
    lanes: 1 bus-ID: 04:00.0 chip-ID: 8086:2723
  IF: wlp4s0 state: down mac: <filter>
  Device-2: Intel I211 Gigabit Network vendor: ASRock driver: igb v: kernel
    pcie: speed: 2.5 GT/s lanes: 1 port: f000 bus-ID: 05:00.0 chip-ID: 8086:1539
  IF: enp5s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Bluetooth:
  Device-1: Intel AX200 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 bus-ID: 1-6:7 chip-ID: 8087:0029
  Report: btmgmt ID: hci0 rfk-id: 2 state: up address: <filter> bt-v: 5.2
    lmp-v: 11
Drives:
  Local Storage: total: 1.71 TiB used: 427.79 GiB (24.4%)
  ID-1: /dev/nvme0n1 vendor: Toshiba model: N/A size: 238.47 GiB
    speed: 31.6 Gb/s lanes: 4 serial: <filter> temp: 55.9 C
  ID-2: /dev/sda vendor: Samsung model: ST1000LM024 HN-M101MBB
    size: 931.51 GiB speed: 6.0 Gb/s serial: <filter> temp: 39 C
  ID-3: /dev/sdb vendor: Samsung model: SSD 870 EVO 500GB size: 465.76 GiB
    speed: 6.0 Gb/s serial: <filter> temp: 35 C
  ID-4: /dev/sdc vendor: Kingston model: DataTraveler 3.0 size: 115.47 GiB
    type: USB rev: 3.2 spd: 5 Gb/s lanes: 1 serial: <filter>
Partition:
  ID-1: / size: 7.36 GiB used: 6.13 GiB (83.3%) fs: ext4 dev: /dev/dm-1
    mapped: live-rw
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) priority: 100
    dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 57.9 C mobo: N/A gpu: amdgpu temp: 54.0 C
    mem: 46.0 C
  Fan Speeds (rpm): N/A gpu: amdgpu fan: 0
Info:
  Memory: total: 16 GiB available: 15.52 GiB used: 2.91 GiB (18.8%)
  Processes: 409 Power: uptime: 16m wakeups: 0 Init: systemd v: 254
    target: graphical (5) default: graphical
  Packages: Compilers: N/A Shell: Bash v: 5.2.15 running-in: tmux:
    inxi: 3.3.34

smartctl found something slightly wrong:

liveuser@localhost-live:~$ sudo smartctl /dev/nvme1 -a
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.6-300.fc39.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Sabrent
Serial Number:                      A44907961B8901243611
Firmware Version:                   ECFM12.3
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 21c1243131
Local Time is:                      Mon Aug 12 10:04:27 2024 EDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0054):     DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.12W       -        -    0  0  0  0        0       0
 1 +     6.40W       -        -    1  1  1  1        0       0
 2 +     5.54W       -        -    2  2  2  2        0       0
 3 -   0.0490W       -        -    3  3  3  3     2000    2000
 4 -   0.0018W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    3%
Data Units Read:                    41,423,892 [21.2 TB]
Data Units Written:                 26,158,326 [13.3 TB]
Host Read Commands:                 398,977,070
Host Write Commands:                212,503,832
Controller Busy Time:               1,536
Power Cycles:                       1,681
Power On Hours:                     3,132
Unsafe Shutdowns:                   126
Media and Data Integrity Errors:    0
Error Information Log Entries:      3,638
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       3638     0  0x0014  0x4004  0x028            0     0     -  Invalid Field in Command

Self-tests not supported

btrfs check:

liveuser@localhost-live:~1$ sudo btrfs check /dev/nvme1n1p3 
Opening filesystem to check...
Checking filesystem on /dev/nvme1n1p3
UUID: 7293325f-994a-48ad-8db5-8eb3b72a31c0
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 451761180672 bytes used, no error found
total csum bytes: 437125952
total tree bytes: 2794307584
total fs tree bytes: 2130624512
total extent tree bytes: 180961280
btree space waste bytes: 574235285
file data blocks allocated: 1025429245952
 referenced 483452248064

I ran the scrub directly here and got some errors:

[liveuser@localhost-live ~]$ sudo btrfs scrub status /dev/nvme1n1p3
UUID:             7293325f-994a-48ad-8db5-8eb3b72a31c0
Scrub started:    Mon Aug 12 10:09:21 2024
Status:           aborted
Duration:         0:00:46
Total to scrub:   420.74GiB
Rate:             663.55MiB/s
Error summary:    read=512
  Corrected:      0
  Uncorrectable:  512
  Unverified:     0

I didn’t abort it manually.
It looks like something is wrong here too! :pensive:
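For anyone who wants to check a scrub result from a script, here is a small sketch that picks the status and the uncorrectable count out of the `btrfs scrub status` output shown above. The function names are hypothetical, and the field layout is assumed to match the output pasted in this post.

```shell
# Print the value of the "Status:" field from `btrfs scrub status` output on stdin.
scrub_state() {
    awk '$1 == "Status:" { print $2 }'
}

# Print the value of the "Uncorrectable:" field from the same output.
scrub_uncorrectable_count() {
    awk '$1 == "Uncorrectable:" { print $2 }'
}
```

Usage would look like `sudo btrfs scrub status /dev/nvme1n1p3 | scrub_state`. An "aborted" status that nobody requested is worth cross-checking against the kernel log before concluding that the data itself is bad.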

Did you by chance try running btrfs scrub when the system was booted and the devices mounted and active?

Yes, in / and /home as suggested by BTRFS Assistant!

Did I do any damage?
Can I find out which files have uncorrectable errors?

Did you see this error using nvme1? [#1741 (nvme self test commands fail when specifying a namespace) – smartmontools] discusses this error when using nvme1nN (namespace N).

Yes, the command was smartctl /dev/nvme1 -a.


I reran the scrub in a live session and looked at journalctl to find the corrupted files:

liveuser@localhost-live:~1$ journalctl -k | grep -E 'btrfs|nvme'
Aug 13 04:50:16 fedora kernel: nvme nvme1: pci function 0000:03:00.0
Aug 13 04:50:16 fedora kernel: nvme nvme0: pci function 0000:09:00.0
Aug 13 04:50:16 fedora kernel: nvme nvme0: missing or invalid SUBNQN field.
Aug 13 04:50:16 fedora kernel: nvme nvme0: Shutdown timeout set to 10 seconds
Aug 13 04:50:16 fedora kernel: nvme nvme0: 8/0/0 default/read/poll queues
Aug 13 04:50:16 fedora kernel: nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
Aug 13 04:50:16 fedora kernel: nvme nvme0: Ignoring bogus Namespace Identifiers
Aug 13 04:50:16 fedora kernel:  nvme0n1: p1 p2 p3
Aug 13 04:50:16 fedora kernel: nvme nvme1: 8/0/0 default/read/poll queues
Aug 13 04:50:16 fedora kernel:  nvme1n1: p1
Aug 13 04:50:16 fedora kernel: BTRFS: device label fedora_localhost-live devid 1 transid 167224 /dev/nvme0n1p3 scanned by (udev-worker) (862)
Aug 13 04:50:16 fedora kernel: nvme nvme1: pci function 0000:03:00.0
Aug 13 04:50:16 fedora kernel: nvme nvme0: pci function 0000:09:00.0
Aug 13 04:50:16 fedora kernel: nvme nvme0: missing or invalid SUBNQN field.
Aug 13 04:50:16 fedora kernel: nvme nvme0: Shutdown timeout set to 10 seconds
Aug 13 04:50:16 fedora kernel: nvme nvme0: 8/0/0 default/read/poll queues
Aug 13 04:50:16 fedora kernel: nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
Aug 13 04:50:16 fedora kernel: nvme nvme0: Ignoring bogus Namespace Identifiers
Aug 13 04:50:16 fedora kernel:  nvme0n1: p1 p2 p3
Aug 13 04:50:16 fedora kernel: nvme nvme1: 8/0/0 default/read/poll queues
Aug 13 04:50:16 fedora kernel:  nvme1n1: p1
Aug 13 04:50:16 fedora kernel: BTRFS: device label fedora_localhost-live devid 1 transid 167224 /dev/nvme0n1p3 scanned by (udev-worker) (862)
Aug 13 04:52:32 localhost-live kernel: BTRFS info (device nvme0n1p3): using crc32c (crc32c-intel) checksum algorithm
Aug 13 04:52:32 localhost-live kernel: BTRFS info (device nvme0n1p3): disk space caching is enabled
Aug 13 04:52:32 localhost-live kernel: BTRFS info (device nvme0n1p3): enabling ssd optimizations
Aug 13 04:52:32 localhost-live kernel: BTRFS info (device nvme0n1p3): auto enabling async discard
Aug 13 04:53:19 localhost-live kernel: BTRFS info (device nvme0n1p3): scrub: started on devid 1
Aug 13 04:53:19 localhost-live kernel: BTRFS error (device nvme0n1p3): super block at physical 67108864 devid 1 has bad generation 167223 expect 167224
Aug 13 04:53:19 localhost-live kernel: BTRFS error (device nvme0n1p3): super block at physical 274877906944 devid 1 has bad generation 167223 expect 167224
Aug 13 04:54:25 localhost-live kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Aug 13 04:54:25 localhost-live kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Aug 13 04:54:25 localhost-live kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Aug 13 04:54:25 localhost-live kernel: nvme 0000:09:00.0: enabling device (0000 -> 0002)
Aug 13 04:54:25 localhost-live kernel: nvme nvme0: Disabling device after reset failure: -19
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 5, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 6, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 7, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 8, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 9, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 0, rd 10, flush 0, corrupt 0, gen 0
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735820288 on dev /dev/nvme0n1p3 physical 80481026048
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735885824 on dev /dev/nvme0n1p3 physical 80481091584
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735885824 on dev /dev/nvme0n1p3 physical 80481091584
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735820288 on dev /dev/nvme0n1p3 physical 80481026048
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127736082432 on dev /dev/nvme0n1p3 physical 80481288192
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735689216 on dev /dev/nvme0n1p3 physical 80480894976
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735754752 on dev /dev/nvme0n1p3 physical 80480960512
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735820288 on dev /dev/nvme0n1p3 physical 80481026048
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735885824 on dev /dev/nvme0n1p3 physical 80481091584
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3): unable to fixup (regular) error at logical 1127735951360 on dev /dev/nvme0n1p3 physical 80481157120
Aug 13 04:54:25 localhost-live kernel: BTRFS info (device nvme0n1p3): scrub: not finished on devid 1 with status: -5
Aug 13 04:54:25 localhost-live kernel: BTRFS: error (device nvme0n1p3) in btrfs_commit_transaction:2500: errno=-5 IO failure (Error while writing out transaction)
Aug 13 04:54:25 localhost-live kernel: BTRFS info (device nvme0n1p3: state E): forced readonly
Aug 13 04:54:25 localhost-live kernel: BTRFS warning (device nvme0n1p3: state E): Skipping commit of aborted transaction.
Aug 13 04:54:25 localhost-live kernel: BTRFS: error (device nvme0n1p3: state EA) in cleanup_transaction:1997: errno=-5 IO failure
Aug 13 04:54:25 localhost-live kernel: BTRFS warning (device nvme0n1p3: state EA): Skipping commit of aborted transaction.
Aug 13 04:54:25 localhost-live kernel: BTRFS: error (device nvme0n1p3: state EA) in cleanup_transaction:1997: errno=-5 IO failure
Aug 13 04:54:25 localhost-live kernel: BTRFS error (device nvme0n1p3: state EA): scrub: failed to commit transaction to fix super block errors: -5
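On the question of which files are affected: the "unable to fixup" lines carry logical addresses, and `btrfs inspect-internal logical-resolve` can map a logical address to file paths on a mounted filesystem. Below is a rough sketch with hypothetical function names; the mount point passed in is whatever path the filesystem is mounted on. Note that in this log the read errors arrive in the same second as "controller is down", so the addresses may simply be what was being read when the drive dropped out, not proof that those files are damaged.

```shell
# Print each distinct logical address found in "unable to fixup" lines on stdin.
extract_logical_addresses() {
    sed -n 's/.*unable to fixup (regular) error at logical \([0-9][0-9]*\) on dev.*/\1/p' \
        | sort -un
}

# Read logical addresses from stdin and resolve each one to file paths.
# $1 is the mount point of the Btrfs filesystem (it must be mounted).
resolve_logical_addresses() {
    mountpoint=$1
    while read -r logical; do
        echo "== logical $logical"
        sudo btrfs inspect-internal logical-resolve "$logical" "$mountpoint" </dev/null
    done
}
```

With the filesystem mounted at, say, /mnt (an assumed path), this could be run as `journalctl -k | extract_logical_addresses | resolve_logical_addresses /mnt`.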

It didn’t help me :flushed:
I’m just wondering whether I should find a replacement for this disk in the meantime…

Thanks everyone for the support by the way!

Before replacing the SSD, consider whether the PCI implementation in your host system may not be up to current standards (this is one of the reasons topics should include hardware details, e.g., inxi -Fzxx output). See: Any undesirable side effects of pci=nommconf?

The LHDB sometimes has user comments that offer advice for configuring around hardware limitations.
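The kernel log posted earlier also contains its own hint: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off". On Fedora, kernel parameters are usually added with grubby. The sketch below only prints the commands (a dry run) so they can be reviewed first; the variable and function names are made up for this example, and whether these parameters actually cure the controller reset on this board is untested.

```shell
# Parameters copied verbatim from the kernel's hint in the log.
NVME_WORKAROUND_ARGS='nvme_core.default_ps_max_latency_us=0 pcie_aspm=off'

# Print the grubby command that would add the given parameters to all installed kernels.
print_grubby_add() {
    printf 'sudo grubby --update-kernel=ALL --args="%s"\n' "$1"
}

# Print the grubby command that would remove the same parameters again.
print_grubby_remove() {
    printf 'sudo grubby --update-kernel=ALL --remove-args="%s"\n' "$1"
}
```

Running `print_grubby_add "$NVME_WORKAROUND_ARGS"` shows the command to execute on the installed system. After a reboot, `cat /proc/cmdline` should list the new parameters, and a scrub can then be retried to see whether the drive still drops out.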