Fedora 40 BTRFS filesystem corruption on 6.9-6.10 kernels

I had a similar issue, also on the same laptop and CPU. I don’t know if it’s the same issue you’re having, but dmesg showed BTRFS critical (device dm-0) corrupt leaf:... I don’t think my issue was a hardware issue (and I’m hoping it isn’t), since it was a new SSD, the NVMe self-test passed, Memtest passed, and HP’s onboard tools also passed.

Interestingly, both times I had this problem was with Fedora installed using the Everything iso, though that may be a coincidence.

Looks like I found someone on Reddit with the same problem:
https://www.reddit.com/r/btrfs/comments/1f7mrjm/btrfs_read_only_file_system_problem/

Looking less like a random HW failure…

@chrismurphy can you weigh in? It seems strange that someone with the same/very similar laptop is having the same issue.

I’m assuming the RAM is good after 15+ Memtest86 passes.

There are enough anecdotes about memtest86(+) taking days of testing to find memory defects that it’s hard to dismiss. I don’t know anything about why that would be the case, but it seems to be true sometimes.

There’s memtester in the Fedora repo. The portion of memory already in use before memtester starts can’t be tested, because it’s in use. It’s a user space memory tester, so you want to make the available user space memory as big as possible by booting into single-user mode, or at least a non-graphical target, and run it for as long as you can tolerate; then reboot and try again if it hasn’t found anything. Again anecdotally, it may find defects sooner than memtest86(+), with the caveat that it can only test what isn’t in use before it’s executed.
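A minimal sketch of that (the 12G figure is only an example; size it to leave some headroom below your installed RAM so the kernel and memtester itself still fit):

# drop to a non-graphical target so more RAM is free for testing
sudo systemctl isolate multi-user.target
# with no iteration count given, memtester keeps looping until interrupted
sudo memtester 12G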

There are other possible sources of memory corruption. It’s really rare for it to happen in the CPU, but if you’re overclocking, it could.

[ 36.670146] BTRFS error (device nvme0n1p6): block=37421056 write time tree block corruption detected

The write time tree checker finding this problem suggests the issue is CPU, memory, or a software bug. i.e. not a drive problem. But we need a detailed bug report with logs, and we can escalate to a Btrfs developer to see what’s going on.

It must not be a commonly hit set of code, if it’s a bug, or we’d be seeing a lot more of these. Include the workload(s) in the bug report.


Thanks for responding so quickly!

In general I would be inclined to agree with you, but since starting this thread we went from an N=1 problem to an N=3 problem if you include @hmitsuha and the Reddit post, all with the same or very similar laptop. I think it highly unlikely that all three of us had zero issues until F40, or as I’ve narrowed down, F40 after a kernel upgrade.

Additionally, I dual-boot Windows and have not had any issues with it, though I don’t know how quickly I would find out with NTFS…

As we’re running laptops, I think we can rule out overclocking. Also, in the Reddit post he upgraded his SSD and is no longer running the HP-provided Intel SSD, but I am.

I don’t have anything I care about on my laptop, as I’ve already had to reinstall twice, so how do I go about getting more debug info?

I hope a btrfs developer will provide some pointers. For now, it could be very helpful to cast a wide net and compile details of recent encounters with similar errors: workloads, system model, kernel version, Linux distro, drive model, and the BTRFS error records in journalctl or dmesg that would help pin down the conditions where the error may occur.

For workloads, the Phoronix 6.11 filesystem comparisons mention a number of filesystem benchmarks. You could use the network installer to create a small Fedora installation and include the Phoronix test suite.
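A sketch of that, assuming phoronix-test-suite is available in the Fedora repos, and using pts/fio purely as an example test name rather than one confirmed from the article:

sudo dnf install phoronix-test-suite
# see what’s available, then run whichever benchmarks match the article’s workloads
phoronix-test-suite list-available-tests | grep -i -e fio -e sqlite
phoronix-test-suite benchmark pts/fio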

I think the back-and-forth nature of this makes it better to discuss on https://matrix.to/#/#kernel:fedoraproject.org, and if you mention cmurf, I’ll get a ping. We need to get as much information into a bug report as possible.

Version of known good and bad kernels; version of btrfs-progs.
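For example, on each kernel you boot:

uname -r
rpm -q kernel btrfs-progs
btrfs --version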

These should each be output to a file and then attached to the bug report.

lspci -vvnn
dmesg
btrfs check --readonly
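A sketch of capturing these to files, assuming the affected filesystem is /dev/nvme0n1p6 (per the error above) and that btrfs check is run while that filesystem is unmounted, e.g. from live/rescue media:

sudo lspci -vvnn > lspci.txt
sudo dmesg > dmesg.txt
sudo btrfs check --readonly /dev/nvme0n1p6 > btrfs-check.txt 2>&1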

In effect I’d like to see the full dmesg from the boot that resulted in the partial dmesg posted above 3 days ago. Developers will want to see it top to bottom. We need a description for each dmesg. e.g. is this the first instance of the problem? Was there a crash or power failure during the previous boot before the btrfs messages indicating a problem?

We’re limited in what can probably be found in journalctl using the -b flag to go back to earlier boots, because once the file system goes read-only, nothing more gets written to the journal. The only record we have is on volatile media.

Something with only a scant chance of being useful would be journalctl -k --since=-1w | grep -i btrfs to see if anything suspicious comes up; if so, then look at those boots without filtering to see if there’s any evidence of something happening prior to the problem. Btrfs often reports the consequences of a problem, not the problem itself, so we have to look around for the actual problem.
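For example (the -2 boot offset is only a placeholder):

journalctl -k --since=-1w | grep -i btrfs
# list prior boots, then view the kernel log of a suspect boot without filtering
journalctl --list-boots
journalctl -k -b -2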

Make/model/CPU info would also be good to know; maybe there’s some other bug, not a btrfs bug, but in some component that btrfs depends on.

Maybe rebuild the system with one change.

Create a small ext4 file system and mount it on /var/log, leave everything else as before.

That way the journal should survive the btrfs bug.
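A rough sketch of that change, assuming a spare partition /dev/nvme0n1p7 (hypothetical) is used for the new ext4 volume:

sudo mkfs.ext4 -L varlog /dev/nvme0n1p7
# preserve the existing logs on the new volume before switching over
sudo mount /dev/nvme0n1p7 /mnt
sudo cp -a /var/log/. /mnt/
sudo umount /mnt
echo '/dev/nvme0n1p7 /var/log ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /var/log
sudo systemctl restart systemd-journald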


Possibly, but that assumes this is a BTRFS-specific problem; ext4 might get corrupted similarly, just without it being detected :slight_smile:

True, but for /var/log it’s not that big of a deal if there’s some file corruption. In this case it’s looking like file system metadata corruption. As I think about it, metadata is a much smaller target for corruption than data, so if there have been two cases on the same system (reproducible) and both hit metadata, I’m suspicious it’s a kernel bug somewhere: something btrfs leverages directly or indirectly that doesn’t affect the data encoding path.

I’m currently running kernel 6.8.5 with /var/log mapped to an external thumb drive while I collect the data for Matrix.

Nothing has happened yet, so I’m about to boot into 6.10.6, which has been known to fail.