BTRFS system went read-only overnight

Merritt · May 13, 2024, 7:46pm

Yesterday this Fedora 39 system was working normally; this morning the whole filesystem appears to be read-only. The logs contain multiple mentions of a btrfs "incorrect extent count" error.

I followed the first part of the openSUSE BTRFS troubleshooting section as that was the best-looking info I could find, booted into a live Fedora 40 USB and:

SMART status on the drive (an SSD) passes.

I can successfully mount the partition without errors:

root@localhost-live:~# mount /dev/sda3 /mnt

btrfs scrub shows uncorrectable errors:

root@localhost-live:~# btrfs scrub start /mnt
scrub started on /mnt, fsid 24e4c96e-7bc1-4317-8111-bda90121e386 (pid=3647)
Starting scrub on devid 1
root@localhost-live:~# btrfs scrub status /mnt
UUID:             24e4c96e-7bc1-4317-8111-bda90121e386
Scrub started:    Mon May 13 15:13:13 2024
Status:           running
Duration:         0:00:10
Time left:        0:06:00
ETA:              Mon May 13 15:19:24 2024
Total to scrub:   79.00GiB
Bytes scrubbed:   2.13GiB  (2.70%)
Rate:             218.49MiB/s
Error summary:    no errors found
root@localhost-live:~# ERROR: there are 1 uncorrectable errors
liveuser@localhost-live:~$ sudo btrfs scrub status /mnt
UUID:             24e4c96e-7bc1-4317-8111-bda90121e386
Scrub started:    Mon May 13 15:13:13 2024
Status:           finished
Duration:         0:03:39
Total to scrub:   78.98GiB
Rate:             369.28MiB/s
Error summary:    csum=2
  Corrected:      0
  Uncorrectable:  2
  Unverified:     0

inxi from the live system:

liveuser@localhost-live:~$ inxi -Fzxx
System:
  Kernel: 6.8.5-301.fc40.x86_64 arch: x86_64 bits: 64 compiler: gcc
    v: 2.41-34.fc40
  Desktop: KDE Plasma v: 6.0.3 tk: Qt v: N/A wm: kwin_wayland dm: SDDM
    Distro: Fedora Linux 40 (KDE Plasma)
Machine:
  Type: Desktop Mobo: ASUSTeK model: P8P67 LE v: Rev X.0x
    serial: <superuser required> part-nu: SKU UEFI: American Megatrends v: 3801
    date: 09/12/2013
Battery:
  Device-1: hidpp_battery_0 model: Logitech MX Vertical Advanced Ergonomic
    Mouse serial: <filter> charge: 55% (should be ignored) status: discharging
  Device-2: hidpp_battery_1 model: Logitech MK700 serial: <filter>
    charge: 70% (should be ignored) status: discharging
CPU:
  Info: quad core model: Intel Core i5-2320 bits: 64 type: MCP
    arch: Sandy Bridge rev: 7 cache: L1: 256 KiB L2: 1024 KiB L3: 6 MiB
  Speed (MHz): avg: 1612 high: 1648 min/max: 1600/3300 cores: 1: 1600
    2: 1600 3: 1648 4: 1600 bogomips: 24724
  Flags: avx ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3
Graphics:
  Device-1: NVIDIA GM107 [GeForce GTX 750 Ti] vendor: eVga.com.
    driver: nouveau v: kernel arch: Maxwell pcie: speed: 2.5 GT/s lanes: 16
    ports: active: DVI-I-1 empty: DP-1,HDMI-A-1 bus-ID: 01:00.0
    chip-ID: 10de:1380 temp: 30.0 C
  Display: wayland server: Xwayland v: 23.2.4 compositor: kwin_wayland
    driver: N/A display-ID: 0
  Monitor-1: DVI-I-1 res: 1440x900 size: N/A
  API: EGL v: 1.5 platforms: device: 0 drv: nouveau device: 1 drv: swrast
    gbm: drv: nouveau surfaceless: drv: nouveau wayland: drv: nouveau x11:
    drv: nouveau
  API: OpenGL v: 4.5 compat-v: 4.3 vendor: mesa v: 24.0.5 glx-v: 1.4
    direct-render: yes renderer: NV117 device-ID: 10de:1380 display-ID: :0.0
  API: Vulkan v: 1.3.275 surfaces: xcb,xlib,wayland device: 0 type: cpu
    driver: N/A device-ID: 10005:0000
Audio:
  Device-1: Intel 6 Series/C200 Series Family High Definition Audio
    vendor: ASUSTeK P8H67 driver: snd_hda_intel v: kernel bus-ID: 00:1b.0
    chip-ID: 8086:1c20
  Device-2: NVIDIA GM107 High Definition Audio [GeForce 940MX]
    vendor: eVga.com. driver: snd_hda_intel v: kernel pcie: speed: 2.5 GT/s
    lanes: 16 bus-ID: 01:00.1 chip-ID: 10de:0fbc
  API: ALSA v: k6.8.5-301.fc40.x86_64 status: kernel-api
  Server-1: PipeWire v: 1.0.4 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    4: pw-jack type: plugin
Network:
  Device-1: Realtek RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet
    vendor: ASUSTeK P8P67 and other motherboards driver: r8169 v: kernel pcie:
    speed: 2.5 GT/s lanes: 1 port: c000 bus-ID: 05:00.0 chip-ID: 10ec:8168
  IF: enp5s0 state: up speed: 100 Mbps duplex: full mac: <filter>
Bluetooth:
  Device-1: Broadcom BCM20702A0 Bluetooth 4.0 driver: btusb v: 0.8 type: USB
    rev: 2.0 speed: 12 Mb/s lanes: 1 bus-ID: 4-1.2:3 chip-ID: 0a5c:21e8
  Report: btmgmt ID: hci0 rfk-id: 0 state: up address: <filter> bt-v: 4.1
    lmp-v: 7
Drives:
  Local Storage: total: 34.62 TiB used: 86.17 GiB (0.2%)
  ID-1: /dev/sda vendor: SanDisk model: SDSSDA240G size: 223.57 GiB
    speed: 6.0 Gb/s serial: <filter>
  ID-2: /dev/sdb vendor: Western Digital model: WD30EFRX-68EUZN0
    size: 2.73 TiB speed: 6.0 Gb/s serial: <filter>
  ID-3: /dev/sdc vendor: Western Digital model: WD5000AAKX-00ERMA0
    size: 465.76 GiB speed: 3.0 Gb/s serial: <filter>
  ID-4: /dev/sdd vendor: Western Digital model: WD180EDGZ-11BLDS0
    size: 16.37 TiB speed: 6.0 Gb/s serial: <filter>
  ID-5: /dev/sde vendor: Samsung model: Type-C size: 239.02 GiB type: USB
    rev: 2.1 spd: 480 Mb/s lanes: 1 serial: <filter>
  ID-6: /dev/sdf vendor: Seagate model: ST8000DM004-2CX188 size: 7.28 TiB
    type: USB rev: 3.0 spd: 5 Gb/s lanes: 1 serial: <filter>
  ID-7: /dev/sdg vendor: Western Digital model: WD80EDAZ-11TA3A0
    size: 7.28 TiB type: USB rev: 3.1 spd: 5 Gb/s lanes: 1 serial: <filter>
  ID-8: /dev/sdh vendor: Kingston model: DataTraveler 3.0 size: 58.61 GiB
    type: USB rev: 2.1 spd: 480 Mb/s lanes: 1 serial: <filter>
Partition:
  ID-1: / size: 8.56 GiB used: 6.92 GiB (80.9%) fs: ext4 dev: /dev/dm-0
    mapped: live-rw
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) priority: 100
    dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 38.0 C mobo: N/A gpu: nouveau temp: 30.0 C
  Fan Speeds (rpm): N/A
Info:
  Memory: total: 16 GiB available: 15.58 GiB used: 2.83 GiB (18.2%)
  Processes: 302 Power: uptime: 24m wakeups: 0 Init: systemd v: 255
    target: graphical (5) default: graphical
  Packages: Compilers: N/A Shell: Bash v: 5.2.26 running-in: konsole
    inxi: 3.3.34

One other thing I noticed: it doesn’t appear to be set read-only immediately. I tried booting in, then opened a console and did touch test.txt, which created the file as expected. I tried again 2 minutes later and the filesystem was read-only again.

I am not at all familiar with btrfs and am unsure what to do next, since the further steps on that wiki warn that they can cause damage.

barryascott · May 13, 2024, 9:02pm

I would check for disk hardware failure.
What does smartctl have to say?
Do you see disk errors in dmesg?
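
A rough way to check, as a sketch rather than a complete diagnostic: filter the kernel log for messages that usually accompany drive or SATA link trouble. The helper name `disk_errors` and its pattern list are illustrative assumptions, so adjust the pattern to your hardware.

```shell
# Hypothetical helper: keep only kernel log lines that commonly accompany
# disk or SATA link problems. The pattern list is an assumption and is
# not exhaustive.
disk_errors() {
    grep -Ei 'I/O error|medium error|ata[0-9]+(\.[0-9]+)?: .*(error|failed|reset)'
}

# Usage (reading the kernel log needs root on Fedora):
#   sudo dmesg | disk_errors
```

No output means none of these patterns matched; it does not prove the drive is healthy.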

Merritt · May 13, 2024, 10:25pm

SMART tests all say they’ve passed.

I ran a new long test:

liveuser@localhost-live:~$ sudo smartctl -t long /dev/sda

Results:

liveuser@localhost-live:~$ sudo smartctl -x /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.5-301.fc40.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     SanDisk SDSSDA240G
Serial Number:    154473402901
LU WWN Device Id: 5 001b44 f02ea0e15
Firmware Version: Z22000RL
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      1.8 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May 13 18:00:56 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Disabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x71) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Retired_Block_Count     -O--CK   100   100   000    -    0
  9 Power_On_Hours_and_Msec -O--CK   241   100   000    -    12529h+00m+00.000s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    184
166 Unknown_Attribute       -O--CK   100   100   000    -    5226
167 Unknown_Attribute       -O--CK   100   100   000    -    0
168 Unknown_Attribute       -O--CK   100   100   000    -    5345
169 Unknown_Attribute       -O--CK   100   100   000    -    25
170 Reserve_Block_Count     -O--CK   100   100   000    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Unknown_SandForce_Attr  -O--CK   100   100   ---    -    5304
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    62
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   063   100   000    -    37 (Min/Max 0/52)
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    1
230 Life_Curve_Status       -O--CK   100   100   000    -    177
232 Available_Reservd_Space PO--CK   100   100   004    -    100
233 SandForce_Internal      -O--CK   100   100   000    -    1043972
241 Lifetime_Writes_GiB     ----CK   253   253   000    -    83030
242 Lifetime_Reads_GiB      ----CK   253   253   000    -    31367
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01       GPL,SL  R/O      1  Summary SMART error log
0x02       GPL,SL  R/O      1  Comprehensive SMART error log
0x03       GPL,SL  R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06       GPL,SL  R/O      1  SMART self-test log
0x07       GPL,SL  R/O      1  Extended self-test log
0x09       GPL,SL  R/W      1  Selective self-test log
0x10       GPL,SL  R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 3
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] log entry is empty
Error 2 [1] log entry is empty
Error 1 [0] log entry is empty
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Reserved (0x1b)     Completed without error       00%     12528         -
# 2  Vendor (0x44)       Self-test routine in progress 20%     12528         -
# 3  Short offline       Unknown status (0xb)          10%      4694         -
# 4  Short offline       Completed without error       00%      4685         -
# 5  Short offline       Completed without error       00%      3698         -
# 6  Short offline       Completed without error       00%      3694         -
# 7  Short offline       Completed without error       00%      3680         -
# 8  Reserved (0x1b)     Completed without error       00%      3677         -
# 9  Vendor (0x44)       Self-test routine in progress 20%      3611         -
#10  Short offline       Unknown status (0xb)          10%     62020         -
#11  Short offline       Completed without error       00%     45349         -
#12  Short offline       Completed without error       00%      3606         -
#13  Short offline       Completed without error       00%      3589         -
#14  Short offline       Completed without error       00%      3585         -
#15  Short offline       Completed without error       00%      3368         -
#16  Short offline       Completed without error       00%      3366         -
#17  Short offline       Completed without error       00%      3364         -
#18  Short offline       Completed without error       00%      3353         -
#19  Short offline       Completed without error       00%      3351         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4             184  ---  Lifetime Power-On Resets
0x01  0x010  4           12529  ---  Power-on Hours
0x01  0x018  6           83030  ---  Logical Sectors Written
0x01  0x020  6       662771368  ---  Number of Write Commands
0x01  0x028  6           31367  ---  Logical Sectors Read
0x01  0x030  6      1376711810  ---  Number of Read Commands
0x02  =====  =               =  ===  == Free-Fall Statistics (empty) ==
0x03  =====  =               =  ===  == Rotating Media Statistics (empty) ==
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4             205  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              37  ---  Current Temperature
0x05  0x010  1            -108  ---  Average Short Term Temperature
0x05  0x018  1             -81  ---  Average Long Term Temperature
0x05  0x020  1              51  ---  Highest Temperature
0x05  0x028  1              10  ---  Lowest Temperature
0x05  0x030  1              -1  ---  Highest Average Short Term Temperature
0x05  0x038  1               0  ---  Lowest Average Short Term Temperature
0x05  0x040  1              -1  ---  Highest Average Long Term Temperature
0x05  0x048  1               0  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1             100  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            2931  ---  Number of Hardware Resets
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1             176  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           12  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

Even though the tests pass, it does seem to think the drive is getting pretty old.

Percentage Used Endurance Indicator: 176

Also:

ATA Error Count: 3

Though from what I could see, that doesn’t necessarily mean anything unless the count is going up.

So I am not sure if I should consider it dead enough to just replace, if that caused the issue…
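
To see whether those two numbers change between runs, they can be pulled out of the `smartctl -x` output. This is a minimal sketch that assumes the output layout shown above; the helper names are made up for illustration.

```shell
# Hypothetical helpers: both read `smartctl -x` output on stdin.

# Print the "Percentage Used Endurance Indicator" value (4th column
# of the Device Statistics table).
endurance_used() {
    awk '/Percentage Used Endurance Indicator/ { print $4 }'
}

# Print the number from the "Device Error Count: N" line.
device_error_count() {
    awk '/^Device Error Count:/ { print $4 }'
}

# Usage:
#   sudo smartctl -x /dev/sda | endurance_used
#   sudo smartctl -x /dev/sda | device_error_count
```

Recording both values with a date and comparing them after a few days of use shows whether the error count is actually climbing.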

barryascott · May 14, 2024, 6:38am

Taken at face value the drive is beyond its endurance limit by a large margin.
If you can find a specification for your drive you can check its endurance.
Not all vendors list endurance, by the way.

Seems to indicate a bad problem.

I would be replacing the drive given the evidence.

computersavvy · May 14, 2024, 11:57am

I understand that the firmware in some SSDs is designed to force read-only mode when past the projected lifetime of the device. It certainly is the case with many USB devices and SD cards.

As such this issue could easily force replacement as suggested by Barry.

chrismurphy · May 14, 2024, 9:21pm

root@localhost-live:~# ERROR: there are 1 uncorrectable errors

The specific error details will be in dmesg so we can see if this is a metadata problem or a data problem. If data is affected, the dmesg error will include the path to the affected file, which you may not want to share depending on the file. If it’s not an important file, you could ignore it, but it might be best to find all copies (in all snapshots, if any) and remove them.
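
As a sketch of how to list the affected files after a scrub: this assumes the kernel's scrub warnings contain the text "checksum error" and end with a `(path: ...)` suffix, which may vary between kernel versions. The helper name is hypothetical.

```shell
# Hypothetical helper: read kernel log text on stdin and print the unique
# file paths named in Btrfs scrub checksum warnings.
# Assumes each such line ends with "(path: <file>)".
scrub_bad_paths() {
    grep -i 'btrfs.*checksum error' |
        sed -n 's/.*(path: \(.*\))$/\1/p' |
        sort -u
}

# Usage, after `btrfs scrub` has finished:
#   sudo dmesg | scrub_bad_paths
```

If this prints nothing while scrub still reports uncorrectable errors, the damage may be in metadata rather than in file data.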

If the error is in metadata it may or may not be fixable. You can try btrfs check --readonly to see what the problem is. I do not recommend running --repair. You should recover important files first, while it’s still possible to mount the file system.
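
The order of operations above could look roughly like this. It is a sketch only: the function name, the device, the mountpoint, and the backup destination are placeholders, and it assumes `rsync` is available on the live system.

```shell
# Hypothetical helper. Arguments: Btrfs device, mountpoint, backup destination.
rescue_then_check() {
    dev=$1 mnt=$2 dest=$3

    # Mount read-only so nothing new is written to the damaged file system.
    sudo mount -o ro "$dev" "$mnt" || return 1

    # Copy everything that is still readable. rsync exits non-zero when
    # some files cannot be read, so report that and carry on.
    sudo rsync -aHAX "$mnt/" "$dest/" ||
        echo "rsync reported errors; review its output" >&2

    # btrfs check needs the file system unmounted.
    sudo umount "$mnt" || return 1

    # Read-only check: reports problems without modifying anything.
    sudo btrfs check --readonly "$dev"
}

# Usage (placeholder paths):
#   rescue_then_check /dev/sda3 /mnt /run/media/liveuser/backup
```

Nothing here runs `--repair`; the check only reports what it finds.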

Merritt · May 14, 2024, 10:38pm

I ran a check on the RAM overnight, to be sure that faulty memory wasn’t causing the issue. It came back clean.

I tried a few different hard drives; seems all my drives are quite old and reporting pre-failure, but I reinstalled on an old-but-working non-SSD, and restored my backup files.

Got the docker containers spun back up, and everything seems to be working alright. Seems reasonable to presume this was down to the quite old SSD, and rather than btrfs acting up it probably was trying to save my data by going read-only.

I will have to look at picking up a new SSD before this old drive dies too.

Thanks all for the help, much appreciated and much love!

hamrheadcorvette · May 14, 2024, 10:54pm

We’ve had several BTRFS read-only mounts here this week. This can’t be a coincidence . . .

gnwiii · May 14, 2024, 11:20pm

Lots of users are updating or installing to get F40. That exercises storage devices more heavily than most see in normal use, so can push old devices over the edge.

hamrheadcorvette · May 14, 2024, 11:28pm

Fair point.

It makes me a bit uneasy, to be honest. Could it be BTRFS, old devices, or SSDs not rated properly? I’m still very skeptical. SSDs going through this is odd; a 10-year-old HDD with a long power-on life, maybe. . .

gnwiii · May 14, 2024, 11:37pm

With btrfs, it could be a lack of periodic maintenance, e.g., not using the btrfsmaintenance scripts.

chrismurphy · May 15, 2024, 2:55am

At least ext4 and XFS will go read-only when confused. Like Btrfs, this is to avoid confusion from ending up on disk. Unlike other file systems, Btrfs checksums all data in addition to metadata. Since there’s orders of magnitude more data, Btrfs is more likely to detect errors that result in corruption.

To better understand what’s going on, it’s necessary to parse the btrfs kernel messages found in dmesg.

geraldosimiao · May 15, 2024, 4:18am

One thing that is good to check from time to time is the filesystem usage: run sudo btrfs filesystem usage / to make sure btrfs has not allocated the whole device. A partial balance is good maintenance practice too, for example sudo btrfs balance start -musage=xx / to balance only metadata, or -dusage=xx for data.
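
As a sketch of that check: the helper below (a made-up name) reads `btrfs filesystem usage -b` output and prints how much of the device is allocated, as a whole-number percentage. The `-b` flag prints raw bytes so the two lines share a unit. The `50` in the balance example is an arbitrary illustration, not a recommended value for xx.

```shell
# Hypothetical helper: read `btrfs filesystem usage -b` output on stdin
# and print allocated space as a percentage of the device size.
allocated_percent() {
    awk '/Device size:/      { size = $3 }
         /Device allocated:/ { alloc = $3 }
         END { if (size > 0) printf "%d\n", alloc * 100 / size }'
}

# Usage:
#   sudo btrfs filesystem usage -b / | allocated_percent
#
# If nearly all of the device is allocated while the data chunks are far
# from full, a filtered balance can compact them. Example value only:
#   sudo btrfs balance start -dusage=50 /
```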

Merritt · May 15, 2024, 2:53pm

FWIW, this system hadn’t been updated yet - was waiting a few weeks to ensure any upgrade bugs were ironed out first.

If you think it would be helpful, let me know specifically what command would produce the desired output and I’ll hook the drive back up to try and get the error logs.

Merritt · May 15, 2024, 2:55pm

I hope that anything important like this runs automatically, because I certainly have not been doing it. I expect most users do nothing manual to the filesystem on a regular basis.

gnwiii · May 15, 2024, 11:13pm

The btrfsmaintenance scripts run automatically using systemd and have sensible defaults, but with the wide variety of use cases, e.g., systems that run continuously vs. those that are only on for brief periods each day, users may prefer to do the maintenance manually when it won’t interfere with their normal usage.

chrismurphy · May 21, 2024, 4:28pm

A majority of workloads can depend on the default behavior of the file system. An active area of development is something called “automatic reclaim” which is slowly being rolled out in newer kernels. There are still some limitations, and further enhancements are pending. A long-standing workaround is installing the btrfsmaintenance package and enabling the btrfs-balance.timer; that’s it. It will run the proper service periodically. It is not installed or enabled by default.
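
On Fedora that workaround could be sketched as follows. The function name is made up, and `dnf` is assumed as the package manager.

```shell
# Hypothetical helper: install btrfsmaintenance and turn on the periodic
# balance timer named above.
enable_btrfs_balance_timer() {
    sudo dnf install -y btrfsmaintenance &&
        sudo systemctl enable --now btrfs-balance.timer
}

# Usage:
#   enable_btrfs_balance_timer
#   systemctl list-timers 'btrfs-*'   # confirm the timer is scheduled
```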

If you think it would be helpful, let me know specifically what command would produce the desired output and I’ll hook the drive back up to try and get the error logs.

Mount the file system and provide the following:

dmesg | grep -i btrfs
btrfs fi usage $mnt
journalctl -D $mnt/root/var/log/journal/$uuid -b | grep -i btrfs
journalctl -D $mnt/root/var/log/journal/$uuid -b-1 | grep -i btrfs
journalctl -D $mnt/root/var/log/journal/$uuid -b-2 | grep -i btrfs

Where $mnt is the path to the mountpoint for this Btrfs file system. $uuid can be autocompleted with the tab key; it’s actually the machine ID. The different commands cover different boots, so we can see whether some logged event explains what’s going on.
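
The three journalctl commands can also be run in a loop. This is only a sketch: the function name is made up, and it assumes the top level of the file system is mounted at the given path so the journal lives under `root/var/log/journal/`, as in the commands above. It uses a glob instead of typing the machine ID.

```shell
# Hypothetical helper. Argument: mountpoint of the Btrfs file system.
# Prints Btrfs-related journal lines for the last three boots.
collect_btrfs_logs() {
    mnt=$1
    for dir in "$mnt"/root/var/log/journal/*/; do
        # Skip the unexpanded pattern when no journal directory exists.
        [ -d "$dir" ] || continue
        for boot in 0 -1 -2; do
            echo "== $dir boot $boot =="
            sudo journalctl -D "$dir" "-b$boot" | grep -i btrfs
        done
    done
}

# Usage (placeholder mountpoint):
#   collect_btrfs_logs /mnt
```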

The gist of what we’re looking for is how full the file system is. That’s not immediately obvious with df alone because Btrfs allocates different kinds of space and it’s possible for one kind of space to fill up before other kinds of space. So we need to see the details of how the space is being used.

Also you can find me (cmurf) in the Fedora room on Matrix, https://matrix.to/#/#fedora:fedoraproject.org which is often easier for back and forth.

Merritt · May 23, 2024, 3:25pm

I can’t seem to boot off the drive anymore, so I hooked it up as an external drive to mount and get the logs:

❯ sudo dmesg | grep -i btrfs
[sudo] password for merritt:
[    0.946157] Btrfs loaded, zoned=yes, fsverity=yes
[    3.038103] BTRFS: device label fedora devid 1 transid 226 /dev/sda3 scanned by mount (436)
[    3.038576] BTRFS info (device sda3): first mount of filesystem a1fc8ea5-2e4b-4e03-9b14-cb642f96e34f
[    3.038589] BTRFS info (device sda3): using crc32c (crc32c-generic) checksum algorithm
[    3.038597] BTRFS info (device sda3): using free-space-tree
[    5.942258] BTRFS info (device sda3: state M): use zstd compression, level 1
[  276.549802] BTRFS: device label fedora devid 1 transid 22886 /dev/sdg3 scanned by pool-udisksd (2699)
[  276.551102] BTRFS info (device sdg3): first mount of filesystem 3faa1aa5-8d62-4dae-9ced-c3f5303eb78a
[  276.551115] BTRFS info (device sdg3): using crc32c (crc32c-intel) checksum algorithm
[  276.551120] BTRFS info (device sdg3): using free-space-tree
[61322.698545] BTRFS info (device sdg3): last unmount of filesystem 3faa1aa5-8d62-4dae-9ced-c3f5303eb78a
[61444.965187] BTRFS: device label fedora_localhost-live devid 1 transid 1519632 /dev/sdg3 scanned by pool-udisksd (1666129)
[61444.967186] BTRFS info (device sdg3): first mount of filesystem 24e4c96e-7bc1-4317-8111-bda90121e386
[61444.967203] BTRFS info (device sdg3): using crc32c (crc32c-intel) checksum algorithm
[61444.967208] BTRFS info (device sdg3): using free-space-tree
[61444.978253] BTRFS info (device sdg3): bdev /dev/sdg3 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[61453.394772] BTRFS error (device sdg3): incorrect extent count for 120289492992; counted 1340, expected 1338
[61475.626897] BTRFS error (device sdg3): incorrect extent count for 120289492992; counted 1339, expected 1337
[61475.626912] BTRFS error (device sdg3: state A): Transaction aborted (error -5)
[61475.626916] BTRFS: error (device sdg3: state A) in convert_free_space_to_extents:471: errno=-5 IO failure
[61475.626923] BTRFS info (device sdg3: state EA): forced readonly
[61475.626926] BTRFS: error (device sdg3: state EA) in add_to_free_space_tree:1057: errno=-5 IO failure
[61475.626929] BTRFS: error (device sdg3: state EA) in do_free_extent_accounting:3012: errno=-5 IO failure
[61475.626933] BTRFS error (device sdg3: state EA): failed to run delayed ref for logical 120982568960 num_bytes 16384 type 176 action 2 ref_mod 1: -5
[61475.626938] BTRFS: error (device sdg3: state EA) in btrfs_run_delayed_refs:2249: errno=-5 IO failure

❯ sudo btrfs fi usage /run/media/merritt/fedora_localhost-live/
Overall:
    Device size:                 221.98GiB
    Device allocated:            170.07GiB
    Device unallocated:           51.91GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         78.98GiB
    Free (estimated):             70.85GiB      (min: 44.90GiB)
    Free (statfs, df):            70.85GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              274.28MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:92.01GiB, Used:73.07GiB (79.41%)
   /dev/sdg3      92.01GiB

Metadata,DUP: Size:39.00GiB, Used:2.96GiB (7.58%)
   /dev/sdg3      78.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/sdg3      64.00MiB

Unallocated:
   /dev/sdg3      51.91GiB

❯ sudo journalctl -D /run/media/merritt/fedora_localhost-live/root/var/log/journal/f2c38cd118664f49b8d0979c154a0926 -b | grep -i btrfs
May 13 10:55:08 htpc kernel: Btrfs loaded, zoned=yes, fsverity=yes
May 13 10:55:08 htpc dracut-cmdline[308]: Using kernel command line parameters:  rd.driver.pre=btrfs   BOOT_IMAGE=(hd4,gpt2)/vmlinuz-6.8.9-200.fc39.x86_64 root=UUID=24e4c96e-7bc1-4317-8111-bda90121e386 ro rootflags=subvol=root initcall_blacklist=simpledrm_platform_driver_init rhgb quiet selinux=0 initcall_blacklist=simpledrm_platform_driver_init nvidia-drm.modeset=1 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau fsck.mode=force fsck.repair=yes
May 13 10:55:11 htpc kernel: BTRFS: device label fedora_localhost-live devid 1 transid 1519615 /dev/sda3 scanned by mount (507)
May 13 10:55:11 htpc kernel: BTRFS info (device sda3): first mount of filesystem 24e4c96e-7bc1-4317-8111-bda90121e386
May 13 10:55:11 htpc kernel: BTRFS info (device sda3): using crc32c (crc32c-intel) checksum algorithm
May 13 10:55:11 htpc kernel: BTRFS info (device sda3): using free-space-tree
May 13 10:55:11 htpc kernel: BTRFS info (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
May 13 10:55:11 htpc kernel: BTRFS info (device sda3): start tree-log replay
May 13 14:55:15 htpc kernel: BTRFS info (device sda3: state M): use zstd compression, level 1
May 13 14:55:18 htpc containerd[1196]: time="2024-05-13T14:55:18.597857277-04:00" level=info msg="loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." type=io.containerd.snapshotter.v1
May 13 14:55:24 htpc dockerd[1277]: time="2024-05-13T14:55:24.830140973-04:00" level=info msg="[graphdriver] using prior storage driver: btrfs"
SIGBUS handling failed: Value too large for defined data type
[1]    1687394 IOT instruction  sudo journalctl -D  -b |
       1687395 done             grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -i btrfs

❯ sudo journalctl -D /run/media/merritt/fedora_localhost-live/root/var/log/journal/f2c38cd118664f49b8d0979c154a0926 -b-1 | grep -i btrfs
Specifying boot ID or boot offset has no effect, no persistent journal was found.

❯ sudo journalctl -D /run/media/merritt/fedora_localhost-live/root/var/log/journal/f2c38cd118664f49b8d0979c154a0926 -b-2 | grep -i btrfs
Specifying boot ID or boot offset has no effect, no persistent journal was found.

gnwiii · May 24, 2024, 7:19pm

I’m guessing you used Gnome Disks. Try using the “Device” name from Gnome Disks to mount with sudo, e.g.:

$ sudo mount <Device name> /mnt
