Filesystem in Read Only Mode

Hello!

This is my first post on this site, so hopefully I do a good job lol

As mentioned in the subject, I am having issues with Fedora being in Read Only mode. I followed the guidance from another post and I have the dmesg and journalctl logs. I will include them.

For more context: I was using my laptop and it suddenly went into read only mode while I was installing Docker. I wasn't really sure what to do, so I followed a post on here and also used ChatGPT, which seems to have been a mistake. I ended up running fsck -f and it ran for roughly 24 hours. During that process it found inodes which didn't have references(?) and tried to fix them, eventually moving them to lost+found. I then learned I wasn't supposed to do that, so I cancelled fsck, booted Fedora from a live USB, mounted the drive, and collected SMART and dmesg logs. I ran smartctl -t long and it reported no problems.

I also tested the RAM with Memtest86+. I had 2x8GB Crucial sticks, found one of them was bad, and started the process of getting them replaced. I still had the original Samsung RAM the laptop came with, so I put that back in, thinking I had solved the issue. Unfortunately that only worked for a couple of hours before it went back into read only mode. From that session I collected the dmesg, journalctl, and SMART logs. I saw there was an inode error in the dmesg log, so I'm guessing it's a problem with the disk now (although SMART says it's fine). My guess is that the original problem was the RAM, and I then damaged the disk inodes with fsck. So there's no faulty hardware anymore, but the disk entries need to be fixed?

dmesg before ram fix
smartctl before ram fix

dmesg after ram fix
smartctl after ram fix
journalctl after ram fix (Google Drive Link as file is too large)

The journalctl output is six weeks' worth, which is why it's so long - it goes back to the start of November.

Skipping to the end though, you have a corrupt filesystem:

Dec 23 16:49:29 fedora kernel: BTRFS critical (device dm-0): failed to delete reference to 7C61B5E53DACDD3F7C3BCC5D55F796C65BF9C8C9, root 257 inode 578564 parent 622
Dec 23 16:49:29 fedora kernel: ------------[ cut here ]------------
Dec 23 16:49:29 fedora kernel: BTRFS: Transaction aborted (error -2)

This is the cause of the RO filesystem, and it needs to be corrected before anything else. If you have important stuff on this machine, I advise you to boot from a live USB stick and get a copy of anything you cannot afford to lose, assuming you don't have any recent backups.
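
From the live USB that would look roughly like this (the partition, the "mydisk" mapping name, and the destination path are all just examples - check yours with lsblk first):

sudo cryptsetup open /dev/nvme0n1p3 mydisk        # unlock the encrypted root partition (device name is a guess)
sudo mount -o ro /dev/mapper/mydisk /mnt          # mount read-only so nothing gets written to the damaged fs
sudo rsync -aHAX /mnt/home/ /run/media/liveuser/backup-drive/home-backup/   # copy anything important to an external drive
sudo umount /mnt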

Note that since you have already run fsck -f, which can be destructive, it may not be possible to recover everything from the filesystem. The inodes that were moved to lost+found may no longer exist as part of the filesystem, and that data might be permanently missing.

The suggestion above seems best as a start.

In the future, please consider non-destructive recovery options first rather than diving in without understanding the potential consequences.

Also be aware that AI tools such as ChatGPT are by no means 100% correct in their suggested fixes. The AI learns by skimming input from many sources and may or may not produce a correct solution for your problem. It may also give different results if the question is phrased differently, so buyer beware.


For your case, since the error relates to btrfs, it could be better to ask for help on the btrfs mailing list: linux-btrfs@vger.kernel.org.

Btrfs doesn't break easily, but it also can't be easily fixed, and btrfs check should only be used once you understand why it broke, so you might want the help of real experts.

Okay, done. What are the next steps?

Yeah, I've learned my lesson about just following what the AI says. Sometimes it is helpful and sometimes it isn't…

I will keep this in mind

Boot from a live USB so that your filesystem is unmounted and run a btrfs check on the device which is reporting issues. I guess the first thing is to see if there's actually any damage (and from the sound of it, there is) and then see how much of it is recoverable.

This will be a read-only scan so it won’t fix anything - it’s just to ascertain what the extent of the issue is and how much of the filesystem is kaputt.
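
Since a default Fedora install has the root filesystem inside LUKS, you'll need to unlock it from the live session first. Roughly (the partition path here is an example; adjust to yours):

sudo cryptsetup open /dev/nvme0n1p3 mydisk         # unlock; "mydisk" is just the name you give the mapping
sudo btrfs check --readonly /dev/mapper/mydisk     # read-only is the default mode; the flag just makes it explicit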


Okay, here is the output:

sudo btrfs check /dev/mapper/mydisk
Opening filesystem to check...
Checking filesystem on /dev/mapper/mydisk
UUID: 24274481-9103-4f2c-95a7-7091d6343051
[1/8] checking log skipped (none written)
[2/8] checking root items
[3/8] checking extents
[4/8] checking free space tree
[5/8] checking fs roots
	unresolved ref dir 622 index 0 namelen 40 name 7C61B5E53DACDD3F7C3BCC5D55F796C65BF9C8C9 filetype 1 errors 6, no dir index, no inode ref
root 257 inode 578566 errors 1, no inode item
	unresolved ref dir 578565 index 3 namelen 12 name .metadata-v2 filetype 0 errors 3, no dir item, no dir index
root 257 inode 578567 errors 2000, link count wrong
	unresolved ref dir 578565 index 4 namelen 2 name ls filetype 0 errors 3, no dir item, no dir index
ERROR: errors found in fs roots
found 33880698880 bytes used, error(s) found
total csum bytes: 31358036
total tree bytes: 1155760128
total fs tree bytes: 1068728320
total extent tree bytes: 47022080
btree space waste bytes: 207533249
file data blocks allocated: 47729221632
 referenced 50779336704

I'm not sure if this damage was done by the forced fsck, or whether it is fundamental damage caused by running with broken RAM and THEN running an fsck on it. Whichever way you slice it, this filesystem appears to have fundamental metadata damage within it.

I see three options.

One - wipe the entire thing, reinstall, and put your crucial, backed-up data back into place. This would be my preferred option if I were in this situation.

Two - let btrfs check try to repair it: run the same check command from a USB drive, but this time also passing in the --repair flag (rough example at the end of this post).

Three - wait a while and try to summon Chris Murphy (@chrismurphy), who knows more about btrfs than I do and may have a better way of repairing the damage.

What I can say is that you should double-check that you have all the required data that lives on this drive saved somewhere safe, and that it all looks sensible internally. I suspect you're going to be needing it.
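
For the record, option two would look roughly like the following, run from the live USB against the unmounted device - and only once you have that backup (or a full image of the drive), since --repair can make things worse:

sudo btrfs check --repair /dev/mapper/mydisk   # DESTRUCTIVE: unmounted filesystem only, backups in hand first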

You can use a Live USB installer to create an image of the corrupt drive on another drive (or even make a clone of the corrupt drive). You can loop mount a partition from the image to experiment with recovery. If that fails you still have the original and can try different recovery steps on a new image.
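
Roughly, assuming the internal drive is /dev/nvme0n1 and a large external drive is mounted at /mnt/backup (both paths are just examples):

sudo dd if=/dev/nvme0n1 of=/mnt/backup/laptop.img bs=4M conv=sync,noerror status=progress   # raw image of the whole drive
sudo losetup -fP --show /mnt/backup/laptop.img   # attach the image; prints e.g. /dev/loop0, with partitions at /dev/loop0p1, p2, ...
sudo cryptsetup open /dev/loop0p3 imgroot        # unlock the copied LUKS partition (partition number is a guess)
sudo btrfs check /dev/mapper/imgroot             # experiment on the copy; the original stays untouched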


I wonder if I'm missing anything here. The root filesystem is btrfs, and running fsck on a btrfs filesystem shouldn't do anything (fsck.btrfs is just a copy of fsck.xfs, which is a NOP placeholder), and lost+found should be an ext4 thing. Is the ext4 filesystem (maybe /boot) also damaged?


It looks similar to an issue I have seen another Fedora user hit, which later led to a fix that resolved the problem.

I strongly recommend backing up your data before proceeding with the repair!

EDIT (clarification): What I want to clarify is that the issue you are experiencing is not caused by a bug in Btrfs, but rather by a RAM-related problem (bit flip). The patch only addresses the repair/recovery and does not fix the underlying cause.


As long as there isn't a hardware failure anymore, I am happy to do whatever. I don't have anything important on this computer that isn't already backed up, so I am fine with doing a reinstall. I just wanted to make sure there wasn't anything else that could be causing the RO.

With that in mind, option 2 seems kinda fun so I could try that to see what happens lol

The existing btrfs-progs 6.17.1 likely won't fix this problem. The fix hasn't been made generic enough yet, and isn't merged into btrfs-progs.

The btrfs check looks sufficiently different from the previous case that I think it needs a follow-up report.

Option 1:

Option 2:

You could do a “me too” with this comment in the existing thread:

Find the reply instructions by scrolling down. I prefer the mbox method for replying to the list. The point of doing it this way is that it instructs your MUA to reply with headers that cause your email to post in the existing thread.

It’s also fine to just start a new thread, and refer to the original:

Basically you report your kernel and btrfs-progs version, say you think you're having the same problem as reported here, and then post the result of btrfs check.
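
Getting the versions is just:

uname -r          # kernel version
btrfs --version   # btrfs-progs version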

I suggest also posting the results from btrfs check --mode=lowmem because it sometimes finds different things.
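
That would be, against the unmounted device (same mapper path as in your check above):

sudo btrfs check --mode=lowmem /dev/mapper/mydisk   # still a read-only check; it just walks the metadata differently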

And best if you can run btrfs-image -c9 -t4 on the unmounted (plaintext) device, which will make a compressed binary file of the filesystem (metadata only) for the devs to look at. In this case, it's not possible to hide filenames with -ss because one of the problems is with a filename. So if it's hidden, the image isn't helpful.
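
Something like this (the output filename and location are up to you):

sudo btrfs-image -c9 -t4 /dev/mapper/mydisk /run/media/liveuser/backup-drive/metadata.img   # -c9 = max compression, -t4 = use 4 threads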

If you’re willing to share the btrfs-image, then include a link for it in the upstream post.

Note that the btrfs mailing list is plaintext only; anything else will get bounced. And attachments get dropped if they're too big - I'm not even sure it'll accept 32KiB attachments.

Oh yeah, it needs to be confirmed whether the problem was caused by bad RAM, and that problem needs to be fixed first. Memory-induced bit flips can make things worse during repair.

There are entire threads about memory testing. The super short summary:

memtest86+ can take a long time to find memory problems - as in, days of continuous running. But its memory footprint is very small, so it tests most of the memory.

memtester: this seems to find problems much faster than memtest86+, but it runs in user space, so a ton of memory is not testable. You can increase what can be tested by booting with the single boot parameter and running it from a shell (example below).

gcc: Some folks swear by running multiple instances of compiling things like the kernel or webkitgtk. This is a bit like using btrfs as a memory tester, though. Yeah, it'll run into the problem eventually, but you aren't going to get any sort of discrete memory-failure error; what happens is that something fails, and you're just inferring it's memory related.
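
For the memtester route, a rough example (the amount to test depends on how much RAM is free on your system; adjust the size):

sudo memtester 12G 3   # lock and test 12 GiB of RAM, 3 passes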

I can confirm that one of my sticks of RAM was bad, and that the one I’m currently using is good. I tested them individually with memtest86+, ~12 passes each, in different slots.

You were right to question that. I also checked the fsck manpage and found that it does nothing for btrfs filesystems. I went back through the conversation I had with ChatGPT and found that it actually told me to use btrfs check --repair. Good catch!


I had a similar issue in 2024: my home "server" handles a lot of household data and developed several bad DRAM sticks. The root filesystem (btrfs) would only mount read-only, which gave systemd fits as it tried, unsuccessfully, to write to /var.

Be aware that any file or metadata on your drive(s) could be corrupt and cannot be trusted. SATA commands, for example, are checksummed (CRC) to prevent errors in transit, but this does not prevent the SATA controller from reading a corrupt command out of DRAM. Your system may have scribbled anywhere on your storage, even if mounted read-only! For this reason I would not boot that filesystem ever again, nor use any executables on it.

The only reasonable course of action is to install/test new DRAM, boot known-good media (perhaps a Fedora install .iso), then back up whatever you want to try to recover (individual files or the entire filesystem). Finally, you should scrub the drive, even the partition table, and reinstall. I keep extensive backups, so recovery was pretty simple for me.
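
Scrubbing the drive before the reinstall can be as simple as this, run from the live media once everything you care about is backed up (irreversible - double-check the device name with lsblk first; the path here is an example):

sudo wipefs -a /dev/nvme0n1   # wipe all filesystem and partition-table signatures from the whole drive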

You'll want to try "btrfs {check|rescue}" and btrfsck (a version of 'check'), but avoid modifying the filesystem until you see what these will do - see their options first.
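
A couple of read-only starting points, using the same device path as earlier in the thread (neither of these writes to the filesystem):

sudo btrfs check --readonly /dev/mapper/mydisk                  # consistency check only, no changes made
sudo btrfs restore --dry-run /dev/mapper/mydisk /mnt/recovery   # list what btrfs restore could copy out (destination path is a placeholder)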

In my case the various btrfs tools were able to restore access to several hundred files, but one of the frequently written directories (/var/lib, IIRC) was unrecoverable and un-deletable, and the directory owner, permissions, and entries were corrupt and useless.