Kernel 6.10.9 Causes System to Boot to Read-Only Mode for BTRFS

Also, it would be helpful to get a more complete dmesg. If you reboot with 6.10.9 or later you'll hit the problem and be read-only, but maybe you can still run:

journalctl -k -o short-monotonic --no-hostname | fpaste

And then post the URL here, or ping cmurf with it on https://matrix.to/#/#fedora:fedoraproject.org. The URL will expire in 24 hours.

That tells me it wasn't repaired. There's still a problem with the file system. The patch in 6.10.9 adds a check intended to catch an existing problem on disk and force the file system read-only to prevent further corruption. So it's not advisable to continue using an older kernel without the check.

So I need the same thing:

  1. boot 6.10.9 or newer, hit the problem
  2. run journalctl -k -o short-monotonic --no-hostname | fpaste
  3. post the URL
  4. boot from a Live USB (or drop to dracut) and run btrfs check --mode=lowmem $DEV on the drive partition in question, post results.
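If it helps, step 4 from the live session might look roughly like this (the device path here is just a placeholder; substitute whatever lsblk -f shows as your Btrfs partition):

lsblk -f
sudo btrfs check --mode=lowmem /dev/nvme0n1p6 2>&1 | tee check-lowmem.txt

The 2>&1 is there so that any error lines written to stderr end up in the captured output too.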

FWIW, using --repair is risky; the man page says so, and the command itself warns you when you run it. It's only OK to run if you're prepared for the problem to get worse.

Well, easier said than done! On 6.10.9, I can’t open Firefox, can’t open Telegram to send it to myself, nothing. That’s why I resorted to taking a photo and doing the text extraction thing.
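I suppose next time it happens I could try switching to a text console (Ctrl+Alt+F3) and dumping the log straight onto a USB stick, something roughly like this (the stick's device name is a guess, I'd check lsblk first):

sudo mount /dev/sda1 /mnt
journalctl -k -o short-monotonic --no-hostname | sudo tee /mnt/dmesg-6.10.9.txt
sudo umount /mnt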

Unfortunately not using the machine today just isn’t really an option, because work. But, considering I’ve been on 6.10.8 for a while before even taking the update to 6.10.9, if the FS is trashed the damage is probably already done.

Still, I find the implementation of this kernel patch to be mildly infuriating. It’s pretty much “We found an error that might mess some things up, so to prevent it from getting worse, we rendered your entire system completely unusable.” Not good design for the end-user, IMO, but that’s another topic.

I did quickly reboot off my live USB and ran btrfsck again. I basically got the same results as when I ran it from 6.10.8:

liveuser@localhost-live:~$ sudo btrfsck --readonly /dev/nvme0n1p6
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p6
UUID: 1ebf9e47-73a9-4138-b305-ee9f8f395cf3
[1/7] checking root items
[2/7] checking extents
data extent[7010725888, 4096] referencer count mismatch (root 256 owner 1664566 offset 1110016) wanted 0 have 1
data extent[7010725888, 4096] bytenr mimsmatch, extent item bytenr 7010725888 file item bytenr 0
data extent[7010725888, 4096] referencer count mismatch (root 45883135623424 owner 0 offset 1110016) wanted 1 have 0
backpointer mismatch on [7010725888 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
        unresolved ref dir 12447 index 130 namelen 40 name 7F8B43944C8834393A41C49938D21738EB36BFD0 filetype 1 errors 40, index mismatch
ERROR: errors found in fs roots
found 518496641024 bytes used, error(s) found
total csum bytes: 501844652
total tree bytes: 3709845504
total fs tree bytes: 2757312512
total extent tree bytes: 350257152
btree space waste bytes: 687576331
file data blocks allocated: 6814757773312
 referenced 599333552128

So at this point, it seems my next step would be to back stuff up, attempt a --repair from a live USB, and hope for the best. I have Timeshift and performed a manual backup yesterday, but fat lot of good that’ll do me if I have to restore back to a partially corrupt system anyway.
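For the backup itself I'm thinking of a plain file-level copy from the live USB rather than relying on the Timeshift snapshot, roughly like this (the subvol name assumes Fedora's default layout, and the destination is just wherever my external drive ends up mounted):

sudo mount -o ro,subvol=home /dev/nvme0n1p6 /mnt
sudo rsync -aAXH --info=progress2 /mnt/ /run/media/liveuser/external/home-backup/
sudo umount /mnt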

6.10.10-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Sep 12 18:26:09 UTC 2024 x86_64 GNU/Linux

It looks like I don't hit the problem while booted into this kernel, but I don't want to boot 6.10.9 just to check whether the problem appears again; I assume it would.

Anyway, the rest:


https://paste.centos.org/view/5c7cb28c

The LiveUSB I’m using is the latest Workstation image 40-1.14

btrfs check --mode=lowmem /dev/mapper/fedora-system

Opening filesystem to check...
Checking filesystem on /dev/mapper/fedora-system
UUID: 3c6589e6-fb08-4d17-9ded-eecf18407167
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs done with fs roots in lowmem mode, skipping
[7/7] checking quota groups skipped (not enabled on this FS)
found 543662604288 bytes used, no error found
total csum bytes: 377528088
total tree bytes: 1333379072
total fs tree bytes: 780369920
total extent tree bytes: 143884288
btree space waste bytes: 250607156
file data blocks allocated: 4775788670976
 referenced 570942398464

I just got the update for 6.10.10, but it’s exactly the same behavior for me. I haven’t tried to --repair yet, though. I’ll back up my important files tonight, run a repair from a live USB, and then see what happens from there.

Interestingly, I noticed from your journalctl output that you also have nvidia and vbox installed, which I do too. When I just tried to boot 6.10.10, the only verbose error displayed was that the vbox kmod couldn’t load, before I got to sddm and everything fell apart again.

I don’t think it’s actually related to this issue, but it is a bit of a coincidence to take note of.

Could you boot with 6.10.9 and report back? There is no risk. The read time tree checker is doing the right thing. The risk is using an older kernel without the check.

Also, could you retry btrfs check --readonly? lowmem and original do different checks, so maybe the lowmem one isn't catching anything while original still sees a problem.

Don’t use --repair unless you’re prepared for things to get a lot worse.

I just want to see the things I’ve asked for already.

  1. dmesg from a boot with 6.10.9 or higher, journalctl -k -o short-monotonic --no-hostname | fpaste
  2. btrfs check --readonly and btrfs check --readonly --mode=lowmem while unmounted.

We need to understand the problem before trying irreversible things. Sorry. But keep in mind that running an older kernel read-write means irreversible things are happening. I know you need to use this system. Keep backups of important things.

Maybe I misunderstood something: the checker is supposed to be present in 6.10.10 too, no? Then it should behave the same as 6.10.9, but for some reason, unlike 6.10.9, the 6.10.10 kernel doesn't detect any problem or switch the fs to read-only. Also, no errors in dmesg, nothing, while using 6.10.10. The reason I'm hesitant to try 6.10.9 again is that there's a chance it is itself the cause of a corruption that can only be fixed with btrfs check --repair, and I don't want to play with that too much.

I booted into a live image and ran btrfs check --readonly; no problems were reported.

Yep, I understand doing --repair may make things worse. I’ll only do that after backing up, and I fully understand it may immediately lead to requiring a total wipe/reinstall. Not my ideal scenario, of course.

Regarding your requests, here is the journalctl output on 6.10.9:
https://paste.centos.org/view/4464cda6

Then, from a live USB, I ran btrfsck --readonly (and took advantage of the fpaste convenience):
https://paste.centos.org/view/e74dcc71

Finally, I attempted lowmem mode, and this is as far as it got before it appeared to hang. I waited about 30 minutes, and both CPU and disk activity had basically ceased.

liveuser@localhost-live:~$ sudo btrfsck --readonly --mode=lowmem /dev/nvme0n1p6 | fpaste
[1/7] checking root items
[2/7] checking extents
ERROR: extent[7010725888, 4096] referencer count mismatch (root: 45883135623424, owner: 0, offset: 1110016) wanted: 1, have: 0

This shows no errors.

Huh, unfortunately it looks like it's trimming lines for some reason; the problems found aren't actually in the paste.

Finally, I attempted lowmem mode, and this is as far as it got before it appeared to hang.

lowmem is a lot slower
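As for the trimmed paste, one likely explanation is that btrfs check writes its error lines to stderr, and the pipe only carries stdout to fpaste. Folding stderr into the pipe should capture everything next time, e.g.:

sudo btrfsck --readonly /dev/nvme0n1p6 2>&1 | fpaste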

I'm a little surprised --repair did seem to fix @pusztaig's problem. Since you have a backup, I think it's worth the risk, because the alternative is to keep using the file system with an older kernel, not knowing exactly when it's going to get worse (but it will, one day).

But also, if you’re keeping important things backed up often, you can wait to do --repair when convenient for you (in case you lose the file system, and have to reinstall).

Sorry, I somehow missed your reply yesterday… So yeah, I also think that the problem might be something specific to my setup, since it is not very widely reported. For example, faulty hardware would also be an obvious suspect. I tried to check for that too and couldn't find anything…

The only relatively solid data point I have so far is that after a supposedly successful repair, the problem happened (or was detected) again when I booted 6.10.9, and it didn't happen on 6.10.10.

Edit: I remember that while messing around with the problem yesterday, I did a dnf list once and noticed that kmod-VirtualBox was missing for one of the installed kernel versions, while kmod-nvidia was installed properly for all kernels present on the system. Unfortunately, I don't remember which version it was.

Yeah, I'm inclined to agree with you; I think there's some impossibly specific thing causing the issue in use cases like ours. I actually don't believe I have a hardware issue, and I'm even hard-pressed to believe there's actual file system degradation/corruption happening, though I'm operating under the assumption that there is, because that's the only evidence I have at this point. Still, this is a fairly high-end laptop that's less than 2 years old, and at least as a user I've not seen any signs of filesystem corruption whatsoever. I find it a little troubling that we can go from a 100% working system to a completely unusable one with a simple update, with nothing in place to alert the user to what's going on or offer a pathway to repair.

So, hey, maybe it’s related to the vbox kernel module? I could try uninstalling it but I use it daily for work :confused:

Just to be super clear: the Btrfs read time tree checker is designed to catch inconsistency in the file system that is already on stable media (on disk). So if the read time tree checker sees a problem, it's really on disk, and its job is to force the file system read-only to prevent further corruption. The checker detects a problem; it does not cause one.

The Btrfs write time tree checker is designed to catch inconsistency in the in-memory copy of the file system before it's written to disk. Same thing: it forces the file system read-only to prevent further corruption. In this case it's possible it was successful and no problem ever made it to disk. But again, the tree checker prevents problems; it is not the cause.

We need the entire top-to-bottom dmesg in these cases to find out what might be going on. For any read time tree checker error we definitely also need a btrfs check; it might be that we don't need one for write time tree checker errors, but it doesn't hurt.

The file system must be unmounted for btrfs check; it's not reliable to check a read-write mounted file system.
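If in doubt, something like this confirms nothing from the target device is mounted before running the check (the device path is just the one from your earlier output; substitute yours):

findmnt -S /dev/nvme0n1p6
sudo btrfs check --readonly /dev/nvme0n1p6

No output from findmnt means the device isn't mounted anywhere.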

Finally back in town and had a chance to try btrfsck --repair while on a live USB. The repair seemed to go normally, and I repeated a readonly check right after with no errors displayed. Then I rebooted back into 6.10.8 and ran another readonly check, also fine. Then I booted into 6.10.10. It took a long time to boot, presumably from building kmods (nvidia etc.) that never got a chance to build while in read-only mode, but it booted and all appears to be normal. I did run btrfsck once more and it dumped a ton of errors from the log only, but nothing else, so I'm not sure if I should be concerned about that.

Anyway, btrfsck --repair seems to have done the trick with no data loss, so I’m grateful for that.

Thanks again to everyone who chimed in here.


[1/8] checking log
ERROR: csum missing in log

I would ignore these for now because btrfs check support for logs is new. But also, did you run this command on a read-write mounted file system by any chance?

The theory is that the log shouldn’t exist on Btrfs following a file system commit, and a file system commit should happen either upon unmount or remount read-only.

Yep I did, really the exact process I mentioned above. After repairing, I ran it from the live USB with the fs unmounted, no errors. Then I ran it on 6.10.8 with the fs mounted, no errors. Then I ran it on 6.10.10 with the fs mounted, got those csum errors. Maybe they’re related to the prior boot(s) I attempted in 6.10.10 before repairing? Though I’m not sure how we’d have logged errors when it was in read-only…

Read-only is enforced by the kernel’s VFS. Btrfs itself has no distinction between RW and RO, so could the file system be somehow in flux while mounted RO? It’s possible.

That’s why I’m not a fan of file system checks on a mounted file system or the --force option.

Yeah, that’s why I was sure to try it first while on a live USB and the fs unmounted, but wanted to at least do those readonly checks after. The “somehow in flux while mounted RO” is the only explanation I can think of otherwise /shrug

@stadsport @pusztaig

I’m tracking a possibly related bug, but we need more and complete (not filtered or trimmed) dmesg to see if either of your cases fit either of the ones we’re tracking. Btrfs is detecting some kind of in-memory corruption, but we don’t know if the bug is in btrfs or somewhere else in the kernel.

Do either of you have full dmesg anywhere for any boot in which the “write time tree” or “read time tree” Btrfs message happens?

Of course, the root file system goes read-only when this problem is triggered, so journald can't capture it normally. And user space won't be very cooperative either, so in this case it could be difficult to extract the information we need from the system.

You could mount a file system on /var/log/journal to redirect log files there. Or you can use a remote ssh session and journalctl -fk to continuously follow journal messages.
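For the ssh approach, roughly, started from a second machine before triggering the problem (user and hostname are placeholders):

ssh user@affected-host journalctl -fk | tee tree-checker-capture.txt

That way the kernel messages should keep streaming to the other machine even after the root file system on the affected one goes read-only.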

Unfortunately I couldn't find anything in the journal. IIRC the dmesg was something like corrupted leaf… very similar to what @stadsport shared in the OP. I still have the 6.10.9 kernel, but reproducing the problem to collect more info would be too risky. I use this computer for daily work, and it would take a lot of time/work to reinstall everything in the case of a failed repair.