Stuck in Emergency Mode

Hello,

I am new to Linux and I'm having a problem booting into Fedora. My laptop hard-froze, and now when I try to boot, it enters emergency mode: a terminal-looking screen with the prompt:

Generating “/run/initramfs/rdsosreport.txt”
Entering emergency mode. Exit the shell to continue.
Type “journalctl” to view system logs.
You might want to save “/run/initramfs/rdsosreport.txt” to a USB stick or /boot
after mounting them and attach it to a bug report.
Press Enter for maintenance
(or press Control-D to continue):

When I press Ctrl-D, I just get the same prompt again.

I looked at journalctl and found some errors in it:

BTRFS error (device nvme0n1p3): bad tree block start, mirror 1 want 14291976192 have 14287765504
BTRFS error (device nvme0n1p3): bad tree block start, mirror 1 want 14291976192 have 14287781888
BTRFS error (device nvme0n1p3): failed to read block groups: -5
BTRFS error (device nvme0n1p3): open_ctree failed

I already did some research on the web and tried several of the suggested solutions, but none of them worked and I still don't really know what to do.

I also tried the following command, without success: btrfs rescue zero-log /dev/nvme0n1p3

Does anyone know what all of that means and how to solve it? I have absolutely no clue what is happening, because I'm completely new to Linux; my laptop was working just fine until now.

Thank you so much!

Do you mean you had to hard power off the PC? The shutdown could have happened before btrfs was finished with the storage device. If you still have the installer media, try booting it, and in the Troubleshooting menu pick the option to rescue the existing system.

Yeah, I had to do a hard shutdown… When I boot the installer media, I'm only able to start the live version of Fedora, and in the Troubleshooting menu I don't have the option to rescue the existing system.
But I tried to mount my SSD with the corrupted files in Fedora live and got an error message saying can't read superblock on /dev/nvme0n1p3 (udisks-error-quark, 0)

This problem seems to be widespread, but none of the solutions is working for me at the moment. :frowning:

Can you use the btrfs utils from the live installer media? Start a live session, then in a terminal use the btrfs commands to do an fsck on the drive without even mounting it, just against the device: btrfsck <path to device>
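For example (a minimal sketch, assuming the Btrfs partition is /dev/nvme0n1p3, matching the journal errors above; confirm the device name with lsblk first):

    # Confirm which partition holds the Btrfs file system
    lsblk -f
    # Offline check against the raw device; btrfsck is an alias for btrfs check
    sudo btrfsck /dev/nvme0n1p3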


I started a live session and tried to use btrfsck /dev/nvme0n1p3. I got the following output:

Opening filesystem to check...
bad tree block 14291976192, bytenr mismatch, want 14291976192 have 14287765504
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system

Possibly try btrfs check -p <path to device> to get a progress indicator. Don't try to repair the file system yet, though, until we know what may be wrong. That will just print output during the check and run in read-only mode, which is the default for the command. It should give us something for digging further into the problem.
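Spelled out for this device (a sketch; --readonly only makes the default behavior explicit):

    # Read-only check with progress; nothing is written to the device
    sudo btrfs check -p --readonly /dev/nvme0n1p3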

Hey, using the command btrfs check -p <path to device> I get the same output as above:

Opening filesystem to check...
bad tree block 14291976192, bytenr mismatch, want 14291976192 have 14287765504
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system

I think this is recoverable, but I have to look at it tomorrow, can’t right now. Maybe @chrismurphy has some idea about this error.

Bit flip.

    14291976192  (want)
    1101010011110111100100000000000000
    14287765504  (have, first read)
    1101010011100111100000000000000000
    14287781888  (have, second read)
    1101010011100111100100000000000000
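You can see the flipped bits directly by XORing the expected and observed values; each set bit in the result marks one flipped bit (a quick sketch using bash arithmetic and bc):

    # want XOR have: set bits show where the two reads disagree
    echo "obase=2; $(( 14291976192 ^ 14287765504 ))" | bc   # two flipped bits
    echo "obase=2; $(( 14291976192 ^ 14287781888 ))" | bc   # one flipped bit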

You should do a memory check before attempting repairs to the file system, or the repairs themselves could be faulty and make the file system problem worse.

Typical sources of bitflips are non-ECC RAM, meaning both the main logic board memory and the DRAM in the storage device (used as cache and for running garbage collection). You need to isolate the fault so you know which one to replace. If it's under warranty, there's no question about replacement. If it's not under warranty, I'd make a strong case for a warranty exception, i.e. it's the right thing to do.

There's a variety of ways to do memory tests. memtest86 and its variants might locate this; they're relatively easy to use. But we sometimes see certain memory defects elude detection, taking days of testing with this method. If the file system has been around for a while and this is only showing up now, why? CPU (up)clocking has also resulted in undetected bitflips for some reason. But maybe it's just a rare and transient event.

The odd thing in this case, though, is that the same copy seems to have been read twice with a different result back to back (?). I'm not sure I'm interpreting this correctly. The mirror 1 tells us it's only checking one copy, two times, but with different have values, so yes, it seems the two reads produced different results. That makes me wonder if it could be the DRAM in the drive, or even possibly a problem between the drive controller and the logic board controller. I'm not too sure of those details.

I'm pretty sure btrfs check --repair can fix these kinds of bitflips. But again, it's best to avoid making the problem worse by repairing before fixing the source of the bitflips. You might also consider refreshing backups first; see the sketch after the mount options below. It should be possible to back up important data by mounting normally with the -o ro option.

Or if that fails, try -o ro,rescue=usebackuproot,nologreplay,ibadroots.

And if that fails, try -o ro,rescue=all, which then includes idatacsums. I'm initially leaving that off because, well, it's nice to know whether the data you're backing up is corrupt. By default, with checksumming enabled, btrfs does not permit corrupt data to reach user space, thereby preventing replication of corrupt data. Using idatacsums means it's possible the backup contains corrupt data, so you'd want to take precautions (a separate backup). It's still better to have a suspect backup than no backup.
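A minimal backup sketch, assuming a live session with a destination drive mounted at /run/media/liveuser/backup (both paths are placeholders; adjust them to your setup):

    # Plain read-only mount first; fall back to the rescue options above if it fails
    sudo mount -o ro /dev/nvme0n1p3 /mnt
    # Copy the important data off, preserving permissions and attributes
    sudo rsync -aHAX /mnt/home/ /run/media/liveuser/backup/home/
    sudo umount /mnt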

Alternative memory testers: memtester is a user space memory tester in the Fedora repos. The drawback of this is that it can't test a large chunk of memory: whatever is used by the kernel and all of user space is excluded. So you want to make user space as small as possible, i.e. no graphical boot. I think it'll run in single user mode, ensuring most memory is being tested. Still, it's a bigger memory footprint than the memtest86 variants, which run in the preboot environment.
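For instance (a sketch; the 12G figure is a placeholder, so size the test below your total RAM to leave headroom for the kernel and shell):

    # Install memtester from the Fedora repos
    sudo dnf install memtester
    # Lock and test 12 GiB of RAM for 3 passes
    sudo memtester 12G 3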

Still another tactic is multiple concurrent GCC sessions (like compiling the kernel). Somehow GCC seems to be a pretty good stresser. But how it blows up when it hits a bitflip (?) I have no idea. It might vary depending on what’s affected.

There is or was an in-kernel memory tester, but I don’t know the status of that or how effective it is.

Haha, just saw this. For now, ignore everything I wrote previously.

OK, so maybe the hard shutdown itself resulted in some form of corruption, like the drive itself just not fully or correctly writing the last write commands from the kernel to stable media.

In that case, it should be true that merely mounting with -o ro,rescue=usebackuproot,nologreplay will revert to a previous file system state (at most 30 seconds prior to the shutdown) and it'll all appear to be OK. Give that a try and look around at important files and such.

If that mount seems to be working, I'm curious what btrfs scrub start -Br finds. This is a read-only scrub, and it will only scrub the older state of the file system, from seconds before the shutdown. If this comes up clean, I think you can consider reverting to this point, if you're not worried about any of the data lost in the last half minute or less prior to the shutdown.
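Put together, that might look like this (a sketch, assuming /mnt as the mount point):

    # Mount read-only from the backup tree roots, skipping log replay
    sudo mount -o ro,rescue=usebackuproot,nologreplay /dev/nvme0n1p3 /mnt
    # Read-only scrub: -B waits in the foreground, -r repairs nothing
    sudo btrfs scrub start -Br /mnt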

Sorry for the run-around. I can't really tell from the error message whether it's some older part of the file system that's just coincidentally showing up as a problem now, in which case the problem isn't related to the hard shutdown, or whether it's in the most recent commits to the drive, and thus probably related to the hard shutdown.

Hey Chris, thanks for your effort!

I tried to mount the SSD using the commands you told me, but I get the error message can't read superblock on /dev/nvme0n1p3

Which exact command resulted in this error?

Let’s see if the super is present and OK:

sudo btrfs insp dump-s -fa /dev/nvme0n1p3
sudo btrfs super-r /dev/nvme0n1p3

The command I used was mount -o ro,rescue=usebackuproot,nologreplay /dev/nvme0n1p3 /mnt

Both of the commands:

sudo btrfs insp dump-s -fa /dev/nvme0n1p3
sudo btrfs super-r /dev/nvme0n1p3

were working fine. Do you need a picture of the output?

OK, it can't already be mounted when doing this; this must be the first mount command. Anytime you get btrfs errors from a user space command, the actual details of the error are in dmesg, so I'll need that too. That's the only thing I can think of for why it's complaining about superblocks when you issue that mount command and yet the other commands suggest the supers are fine.
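Something like this should capture the kernel-side details right after the failing mount (a sketch, run from a fresh boot of the live session):

    # Attempt the mount, then pull the kernel's btrfs messages from that attempt
    sudo mount -o ro,rescue=usebackuproot,nologreplay /dev/nvme0n1p3 /mnt
    sudo dmesg | grep -i btrfs | tail -n 30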

Also, it’s easier to do a back and forth convo on matrix. I’m @cmurf:fedora.im and can be found in the Fedora room, conversing there is fine.

I’d like to write up an email to post to linux-btrfs@ which is the upstream development list. But I need additional information. Please post the output verbatim from these commands. You can paste them into this discussion.

    btrfs --version
    sudo btrfs inspect dump-super -fa /dev/nvme0n1p3
    sudo btrfs inspect dump-tree -d /dev/nvme0n1p3
    sudo btrfs inspect dump-tree -b 14291976192 --hide-names /dev/nvme0n1p3
    sudo btrfs inspect dump-tree -b 14287765504 --hide-names /dev/nvme0n1p3
    sudo btrfs inspect dump-tree -b 14287781888 --hide-names /dev/nvme0n1p3

--hide-names is optional; I don't think these blocks will have file names in them, but it's possible there will be an old/stale metadata block at these locations.

Also, can you open a shell and run journalctl -f -o short-monotonic, then open a second shell and try to mount the file system with mount -o ro,rescue=all, and paste the btrfs messages that appear in the first shell (the one set to follow recent journal messages)? That might show some clues about which rescue option is going to successfully mount the file system.
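Concretely (a sketch, assuming /mnt as the mount point):

    # Shell 1: follow the journal with monotonic timestamps
    journalctl -f -o short-monotonic
    # Shell 2: attempt the mount with every rescue option enabled
    sudo mount -o ro,rescue=all /dev/nvme0n1p3 /mnt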

Thanks


This happened to me after a hard stop today. Here is how I fixed it.
Log in as root, and at the prompt type:
systemctl rescue
If it works, you will be asked to log in again. Type your user account name, hit Enter, then enter the password, and you are in… that is, if it works for you and there is nothing else holding you back.
