System went read-only after inactivity | BTRFS

Earlier today my system went read-only, which from my experience is not a good thing. I shut down my laptop and booted into rescue mode from a USB stick to check for errors. Running btrfs check /dev/sdb4 gave me this:

[1/7] checking root items
[2/7] checking extents
ref mismatch on [10116931584 36864] extent item 4194305, found 1
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 16488804352 bytes used, error(s) found
total csum bytes: 15410108
total tree bytes: 575078400
total fs tree bytes: 526319616
total extent tree bytes: 26918912
btree space waste bytes: 118175279
file data blocks allocated: 54143864832
  referenced 22470471680

This is a new solid state drive by the way (bought it two months ago) which is what’s confusing me the most!

$ printf '%x\n' 4194305
400001

Extent items essentially never fall on a boundary that's off by 1 byte. This is evidence of a bitflip, which generically suggests bad RAM. But it could also be the power supply, cooling, or an overclocked CPU.
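If you want to see the arithmetic behind that, here's a quick sketch; nothing below is specific to your system, it just restates the number from the check output:

printf '%x\n' 4194305            # -> 400001
printf '%x\n' $((4194305 - 1))   # -> 400000, exactly the 4 MiB boundary
echo $((4194305 ^ 4194304))      # -> 1, i.e. the two values differ by a single low bit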

So if any of those things sounds suspicious, for example you're playing games with the CPU overclocked (super common), reset that back to a proper default.

Here’s what I’d like to see:

Boot from the USB stick, and try to mount the drive normally:

mount /dev/sdb4 /mnt
umount /mnt
mount -o ro,rescue=all /dev/sdb4 /mnt
journalctl -k -o short-monotonic --no-hostname > dmesg0.txt

Post dmesg0.txt somewhere for us to download; I'd like to see what the kernel says about any issues mounting this file system.


Next, you will also need to run some memory testers to confirm or rule out a memory problem. If you were overclocking a lot, you could probably safely assume that's the problem and skip the testing, but I personally would still test.

memtest86 and memtest86+ run in the pre-boot environment, which not everyone is familiar with, but they have the benefit of taking up very little memory themselves, so essentially all of the memory can be tested. The downside is that it can take days: roughly as many folks report memory defects discovered in minutes as report defects discovered only after days of testing. Many passes, zero errors, then suddenly on day 4, one bitflip. That's tedious, but hopefully you won't experience it. Test until you can't stand it anymore.

memtester is in the Fedora repos. It's a user space memory tester, so the kernel's large footprint, plus memory in use by the rest of user space, prevents a lot of memory from being tested. You can minimize this (and maximize the memory tested) by booting with the kernel boot parameter 3, which means a non-graphical boot, and then running memtester from the command line.
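A minimal sketch of the invocation, assuming roughly 10 GiB is free after a text-mode boot (adjust the size to what your machine actually has free; memtester tries to lock the test region in memory, which generally requires root):

sudo memtester 10G 5    # test ~10 GiB for 5 loops; failures are printed per test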

Some folks swear by running multiple concurrent compilations of the Linux kernel. Supposedly gcc is very memory intensive. I haven't ever tried this, and I'm not sure how to maximize the amount of memory being tested or how to interpret the results. But there you go.
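For reference, a compile-stress run generally looks something like this; the kernel.org URL and job count are just the generic recipe, not something I've validated as a memory test:

git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
make defconfig
make -j"$(nproc)"    # repeat in several separate copies of the tree to push memory usage higher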

It still would be useful to know the computer make/model, the drive make/model, the kernel version, and the btrfs-progs version.

Also, not required, but back-and-forth conversation is easier for me (cmurf) on Matrix.

Thanks! I will get back to you tomorrow.

Here is dmesg0.txt: https://0x0.st/X9x3.txt

[ 404.944029] kernel: BTRFS info (device sdb4): bdev /dev/sdb4 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

So at some point the file system encountered a checksum mismatch twice, hence the corrupt 2. These statistics are counters, so we don't know whether it's the same corruption encountered twice or two different mismatches each encountered once. It's not a lot to go on, but it does suggest an earlier problem of some kind, and it might have been logged.
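As an aside, those per-device counters can also be read directly with btrfs device stats; assuming the file system is mounted at /mnt, something like:

sudo btrfs device stats /mnt    # prints write_io_errs, read_io_errs, flush_io_errs, corruption_errs, generation_errs per device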

It's probably safe to run the file system read-write; if it encounters the block with the bad (corrupt) extent item, the file system will bail out and go read-only again to avoid any more confusion getting onto the drive. I would take the opportunity to freshen backups of anything important, just in case things get worse. It's probably bad RAM, but that is so far unconfirmed, so better safe than sorry.

Test the RAM over the weekend, see if any errors are reported.

And at some point it might be interesting to see the result of the following command (mount sdb4 at /mnt first, either normally or with the -o ro option if you prefer to keep it mounted read-only for now):

journalctl -D /mnt/root/var/log/journal/$machineid --since=-10d -o short-monotonic --no-hostname | grep "Linux version\| ata\|Btrfs\|BTRFS\|] hd\| scsi\| sd\| sdhci\| mmc\| nvme\| usb\| vd" > dmesg1.txt

$machineid is a directory named with a UUID that systemd creates during initial installation to uniquely identify this machine. It's just a random number, and it's the only directory in the journal/ directory. You can hit the [TAB] key after journal/ and bash will autocomplete the name of the directory, i.e. you are not literally typing in $machineid.
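If tab completion isn't cooperating, you can also just look the value up; assuming the installed system's root ends up at /mnt/root as in the command above:

ls /mnt/root/var/log/journal/    # the single directory listed is the machine ID
cat /mnt/root/etc/machine-id     # same value, as recorded by systemd at install time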

So this will look for very specific kernel error messages over the past 10 days. Maybe there's some device error or other btrfs error that tells us more about what happened. NOTE: If you have done a btrfs scrub in the past 10 days, and if any corrupt files were found, the entire path to each such file will be included in the text file resulting from the command. It's a potential privacy leak. If there is such a path, you can just delete the file name from the file, but keep the rest of that line intact.

I have run that command: https://0x0.st/X96B.txt

As for the RAM, I've run memtester with boot parameter 3 for a few hours, but so far nothing. I will continue testing in case something pops up.

[ 3.306789] kernel: BTRFS info (device sda4): bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

OK, so they've been there at least 10 days. I don't think there's much useful knowledge to be gained, but if you want you could expand it to 100 days. I also removed the monotonic timestamp in favor of the default date timestamp, so that it's possible to see what day this might have happened.

journalctl -D /mnt/root/var/log/journal/$machineid --since=-100d --no-hostname | grep "Linux version\| ata\|Btrfs\|BTRFS\|] hd\| scsi\| sd\| sdhci\| mmc\| nvme\| usb\| vd" > dmesg2.txt

Extended to 100 days: https://0x0.st/X964.txt

Also, apparently I didn't run memtester with sudo, and the moment I did, I saw bitflip errors among others. So maybe it's a RAM issue… which is cheaper to replace than storage, I guess?

Jul 14 18:33:59 kernel: BTRFS info (device sdb4): bdev /dev/sdb4 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

OK, I guess your journal doesn't go back very far. shrug So we don't really know how long the errors have been there, but it's at least 5 days, and there have been no changes in those 5 days either.

I saw bitflip errors among others, so maybe it’s a ram issue

Can you post some excerpts of this?

which is cheaper to replace than storage I guess?

I would take the consumer route. All reputable memory manufacturers offer lifetime warranties on memory. That is not the case with memory soldered in by laptop manufacturers, and I take a very dim view of that myself; I would seriously request a warranty exception for the defective memory. Expressly ask for the exception on the basis that "all reputable memory manufacturers offer lifetime warranties to consumers". I would absolutely be polite about it, but I would not accept any answer but satisfaction.


Here are two loops of me memtesting 10 gigabytes (95% of my RAM): https://0x0.st/X9RN.txt

Fake SSDs are a known problem. Reliability engineering has the concept of a "bathtub" curve for failure rates.

Your first priority should be to make a backup of important data in case rescue attempts fail. Then you can mount the root partition of the failed system and use journalctl’s -D <path> option in a Live USB system to look for details of the failure.

Interesting. It's consistently finding two defects, at offsets 0xac856830 and 0x4553d8e0, in each loop, but not in every test.

I think the test result is consistent with the Btrfs error. Btrfs is not intended to be a memory test utility, not least because the region of memory used by file system metadata and data is pretty small, but it does tend to eventually catch these kinds of issues.

Many non-reproducible glitches are ignored by users (when they should investigate and maybe run memory tests), but btrfs going read-only can’t be ignored.

I checked and the SSD is real. I will buy new RAM sticks for my laptop or try the consumer route. Anything I should do once that's done?

I would make a backup before switching to read/write mode so you have a "redo" option if the system still has btrfs issues.

Freshen backups while mounted with -o ro. Attempting repair can make things worse, so definitely back up important things.

And then you can see if btrfs check --repair can fix the problem. The file system should not be mounted at the time you attempt the repair. You might consider dnf update btrfs-progs if you’re running the repair from a live USB stick with Fedora 39 or older.

Note that the warning in man btrfs check does say not to use --repair unless advised. That advice is given on a case-by-case basis; it is not generally recommended, and other recovery attempts should be tried first. You may consider yourself so advised.

Edit: After the backup, check journalctl -k | grep -i btrfs for errors related to the backup time. It’s possible the bitflip prevents some files from being backed up. If this happens, put the repair on hold, let me know, and we’ll consider the next step.
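Putting the steps above together, a hedged sketch of the sequence (device and mount point assumed to match what we've been using so far):

sudo mount -o ro /dev/sdb4 /mnt        # mount read-only and freshen backups from /mnt
journalctl -k | grep -i btrfs          # check for btrfs errors logged while backing up
sudo umount /mnt                       # the file system must be unmounted for the repair
sudo btrfs check --repair /dev/sdb4    # only after backups, and only as advised above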

I might also recommend some popular memory testing tools from the Windows side.

I've run Memtest86+ (the open-source one) overnight on a few machines with no errors (I almost thought it was fake up until recently, when I saw it actually show errors on a known-bad laptop :p), but overnight runs of HCI's MemTest on a couple of those same machines did turn up errors. It's possible Memtest86+ could have found them too with more time, but in my case it was the paid tool that found the errors.