System went read-only after inactivity | BTRFS

Earlier today my system went read-only, which from my experience is not a good thing. I shut down my laptop and booted into rescue mode from a USB stick to check for errors. Running btrfs check /dev/sdb4 gave me this:

[1/7] checking root items
[2/7] checking extents
ref mismatch on [10116931584 36864] extent item 4194305, found 1
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 16488804352 bytes used, error(s) found
total csum bytes: 15410108
total tree bytes: 575078400
total fs tree bytes: 526319616
total extent tree bytes: 26918912
btree space waste bytes: 118175279
file data blocks allocated: 54143864832
  referenced 22470471680

This is a new solid state drive by the way (bought it two months ago) which is what’s confusing me the most!

$ printf '%x\n' 4194305
400001

Extent items essentially never fall on a boundary that's off by 1 byte. This is evidence of a bitflip, which generically suggests bad RAM. But it could also be the power supply, cooling, or an overclocked CPU.
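If you want to see the arithmetic behind that, here's a quick sketch; nothing below is specific to your system, it just restates the number from the check output:

printf '%x\n' 4194305            # -> 400001
printf '%x\n' $((4194305 - 1))   # -> 400000, exactly the 4 MiB boundary
echo $((4194305 ^ 4194304))      # -> 1, i.e. the two values differ by a single low bit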

So if any of those things sounds suspicious, for example you're playing games with the CPU overclocked (super common), reset that back to a proper default.

Here’s what I’d like to see:

Boot from the USB stick, and try to mount the drive normally:

mount /dev/sdb4 /mnt
umount /mnt
mount -o ro,rescue=all /dev/sdb4 /mnt
journalctl -k -o short-monotonic --no-hostname > dmesg0.txt

Post dmesg0.txt somewhere for us to download; I'd like to see what the kernel says about any issues mounting this file system.


Next, you will also need to run some memory testers to confirm or rule out a memory problem. If you were overclocking a lot, you could probably safely assume that's the problem and skip the testing, but I personally would still test.

memtest86 and memtest86+ run in the pre-boot environment, which not everyone is familiar with, but they have the benefit of taking up very little memory themselves, so essentially all of the memory can be tested. The downside is that it can take days: roughly as many folks report memory defects discovered in minutes as report defects discovered only after days of testing. Many passes, zero errors, then suddenly on day 4, one bitflip. That's tedious, but hopefully you won't experience it. Test until you can't stand it anymore.

memtester is in the Fedora repos. It's a user space memory tester, so the kernel's large footprint, plus memory in use by the rest of user space, prevents a lot of memory from being tested. You can minimize this (and maximize the memory tested) by booting with the kernel boot parameter 3, which means a non-graphical boot, and then running memtester from the command line.
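A minimal sketch of the invocation, assuming roughly 10 GiB is free after a text-mode boot (adjust the size to what your machine actually has free; memtester tries to lock the test region in memory, which generally requires root):

sudo memtester 10G 5    # test ~10 GiB for 5 loops; failures are printed per test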

Some folks swear by running multiple concurrent compilations of the Linux kernel. Supposedly gcc is very memory intensive. I haven't ever tried this, and I'm not sure how to maximize the amount of memory being tested or how to interpret the results. But there you go.
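For reference, a compile-stress run generally looks something like this; the kernel.org URL and job count are just the generic recipe, not something I've validated as a memory test:

git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
make defconfig
make -j"$(nproc)"    # repeat in several separate copies of the tree to push memory usage higher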

It still would be useful to know the computer make/model, the drive make/model, the kernel version, and the btrfs-progs version.

Also, not required, but back-and-forth conversation is easier for me (cmurf) on Matrix.

Thanks! I will get back to you tomorrow.

Here is dmesg0.txt: https://0x0.st/X9x3.txt

[ 404.944029] kernel: BTRFS info (device sdb4): bdev /dev/sdb4 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

So at some point the file system encountered a checksum mismatch twice, hence the corrupt 2. These statistics are counters, so we don't know whether it's the same corruption encountered twice or two different mismatches each encountered once. It's not a lot to go on, but it does suggest an earlier problem of some kind, and it might have been logged.
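As an aside, those per-device counters can also be read directly with btrfs device stats; assuming the file system is mounted at /mnt, something like:

sudo btrfs device stats /mnt    # prints write_io_errs, read_io_errs, flush_io_errs, corruption_errs, generation_errs per device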

It's probably safe to run the file system read-write; if it encounters the block with the bad (corrupt) extent item, the file system will bail out and go read-only again to avoid any more confusion getting onto the drive. I would take the opportunity to freshen backups of anything important, just in case things get worse. It's probably bad RAM, but that is so far unconfirmed, so better safe than sorry.

Test the RAM over the weekend, see if any errors are reported.

And at some point it might be interesting to see the result of the following command (mount sdb4 at /mnt first, either normally or with the -o ro option if you prefer to keep it mounted read-only for now):

journalctl -D /mnt/root/var/log/journal/$machineid --since=-10d -o short-monotonic --no-hostname | grep "Linux version\| ata\|Btrfs\|BTRFS\|] hd\| scsi\| sd\| sdhci\| mmc\| nvme\| usb\| vd" > dmesg1.txt

$machineid is a directory named with a UUID that systemd creates during initial installation to uniquely identify this machine. It's just a random number, and it's the only directory in the journal/ directory. You can hit the [TAB] key after journal/ and bash will autocomplete the name of the directory, i.e. you are not literally typing in $machineid.
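If tab completion isn't cooperating, you can also just look the value up; assuming the installed system's root ends up at /mnt/root as in the command above:

ls /mnt/root/var/log/journal/    # the single directory listed is the machine ID
cat /mnt/root/etc/machine-id     # same value, as recorded by systemd at install time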

So this will look for very specific kernel error messages over the past 10 days. Maybe there's some device error or other btrfs error that tells us more about what happened. NOTE: If you have done a btrfs scrub in the past 10 days, and if any corrupt files were found, the entire path to each such file will be included in the text file resulting from the command. It's a potential privacy leak. If there is such a path, you can just delete the file name from the file, but keep the rest of that line intact.

I have run that command: https://0x0.st/X96B.txt

As for the RAM, I've run memtester with boot parameter 3 for a few hours, but so far nothing. I will continue testing in case something pops up.

[ 3.306789] kernel: BTRFS info (device sda4): bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

OK, so they've been there at least 10 days. I don't think there's much useful knowledge to be gained, but if you want you could expand it to 100 days. I also removed the monotonic timestamp in favor of the default date timestamp, so that it's possible to see what day this might have happened.

journalctl -D /mnt/root/var/log/journal/$machineid --since=-100d --no-hostname | grep "Linux version\| ata\|Btrfs\|BTRFS\|] hd\| scsi\| sd\| sdhci\| mmc\| nvme\| usb\| vd" > dmesg2.txt

Extended to 100 days: https://0x0.st/X964.txt

Also, apparently I didn't run memtester with sudo, and the moment I did, I saw bitflip errors among others. So maybe it's a RAM issue… which is cheaper to replace than storage, I guess?

Jul 14 18:33:59 kernel: BTRFS info (device sdb4): bdev /dev/sdb4 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

OK, I guess your journal doesn't go back very far. shrug So we don't really know how long the errors have been there, but it's at least 5 days, and there have been no changes in those 5 days either.

I saw bitflip errors among others, so maybe it’s a ram issue

Can you post some excerpts of this?

which is cheaper to replace than storage I guess?

I would take the consumer route. All reputable memory manufacturers offer lifetime warranties on memory. That is not the case with memory soldered in by laptop manufacturers, and I take a very dim view of that myself; I would seriously request a warranty exception for the defective memory. Expressly ask for the exception on the basis that "all reputable memory manufacturers offer lifetime warranties to consumers". I would absolutely be polite about it, but I would not accept any answer but satisfaction.


Here are two loops of me memtesting 10 gigabytes (95% of my RAM): https://0x0.st/X9RN.txt

Fake SSDs are a known problem. Reliability engineering has the concept of a "bathtub" curve for failure rates.

Your first priority should be to make a backup of important data in case rescue attempts fail. Then you can mount the root partition of the failed system and use journalctl’s -D <path> option in a Live USB system to look for details of the failure.

Interesting. It's consistently finding two defects, at offsets 0xac856830 and 0x4553d8e0, in each loop, but not in every test.

I think the test result is consistent with the Btrfs error. Btrfs is not intended to be a memory test utility, not least because the region of memory used by file system metadata and data is pretty small, but it does tend to eventually catch these kinds of issues.

Many non-reproducible glitches are ignored by users (when they should investigate and maybe run memory tests), but btrfs going read-only can’t be ignored.

I checked and the SSD is real. I will buy new RAM sticks for my laptop or try the consumer route. Anything I should do once that's done?

I would make a backup before switching to read/write mode so you have a "redo" option if the system still has btrfs issues.

Freshen backups while mounted with -o ro. Attempting repair can make things worse, so definitely back up important things.

And then you can see if btrfs check --repair can fix the problem. The file system should not be mounted at the time you attempt the repair. You might consider dnf update btrfs-progs if you’re running the repair from a live USB stick with Fedora 39 or older.

Note that the warning in man btrfs check does say not to use --repair unless advised. That advice is given on a case-by-case basis; it is not generally recommended, and other recovery attempts should be tried first. You may consider yourself so advised.

Edit: After the backup, check journalctl -k | grep -i btrfs for errors related to the backup time. It’s possible the bitflip prevents some files from being backed up. If this happens, put the repair on hold, let me know, and we’ll consider the next step.
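Putting the steps above together, a hedged sketch of the sequence (device and mount point assumed to match what we've been using so far):

sudo mount -o ro /dev/sdb4 /mnt        # mount read-only and freshen backups from /mnt
journalctl -k | grep -i btrfs          # check for btrfs errors logged while backing up
sudo umount /mnt                       # the file system must be unmounted for the repair
sudo btrfs check --repair /dev/sdb4    # only after backups, and only as advised above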

I might also recommend some popular memory testing tools from the Windows side.

I've run Memtest86+ (the open-source one) overnight on a few machines with no errors (I almost thought it was fake up until recently, when I saw it actually show errors on a known-bad laptop :p), but overnight runs of HCI's MemTest on a couple of those same machines did turn up errors. It's possible Memtest86+ could have found them too with more time, but in my case it was the paid tool that found the errors.