Fedora 40 BTRFS filesystem corruption on 6.9-6.10 kernels

I had a similar issue, also on the same laptop and CPU. I don’t know if it’s the same issue you’re having, but dmesg showed BTRFS critical (device dm-0) corrupt leaf:... I don’t think my issue was a hardware issue (and I’m hoping it isn’t), since it was a new SSD, the NVMe self-test passed, Memtest passed, and HP’s onboard tools also passed.

Interestingly, both times I had this problem was with Fedora installed using the Everything iso, though that may be a coincidence.

Looks like I found someone on Reddit with the same problem:
https://www.reddit.com/r/btrfs/comments/1f7mrjm/btrfs_read_only_file_system_problem/

Looking less like a random HW failure…

@chrismurphy can you weigh in? It seems strange that someone with the same/very similar laptop is having the same issue.

I’m assuming the RAM is good after 15+ Memtest86 passes.

There are enough anecdotes about memtest86(+) taking days of testing to find memory defects that it’s hard to dismiss. I don’t know anything about why that would be the case, but it seems to be true sometimes.

There’s memtester in the Fedora repo. The portion of memory already in use before memtester starts can’t be tested, because it’s in use. It’s a user space memory tester, so you want to make the available user space memory as big as possible by booting into single-user mode, or at least a non-graphical target, and run it for as long as you can tolerate; then reboot and try again if it hasn’t found anything. Again anecdotally, it may find defects sooner than memtest86(+), with the caveat that it can only test what isn’t in use before it’s executed.
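A minimal sketch of that (the 12G figure is only an example; size it to leave some headroom below your installed RAM so the kernel and memtester itself still fit):

# drop to a non-graphical target so more RAM is free for testing
sudo systemctl isolate multi-user.target
# with no iteration count given, memtester keeps looping until interrupted
sudo memtester 12G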

There are other possible sources of memory corruption. It’s really rare for it to happen in the CPU, but if you’re overclocking, it could.

[ 36.670146] BTRFS error (device nvme0n1p6): block=37421056 write time tree block corruption detected

The write time tree checker finding this problem suggests the issue is CPU, memory, or a software bug. i.e. not a drive problem. But we need a detailed bug report with logs, and we can escalate to a Btrfs developer to see what’s going on.

It must not be a commonly hit set of code, if it’s a bug, or we’d be seeing a lot more of these. Include the workload(s) in the bug report.


Thanks for responding so quickly!

In general I would be inclined to agree with you, but since starting this thread we went from an N=1 problem to an N=3 problem if you include @hmitsuha and the Reddit post, all with the same or very similar laptop. I think it highly unlikely that all three of us had zero issues until F40, or as I’ve narrowed down, F40 after a kernel upgrade.

Additionally, I dual-boot Windows and have not had any issues with it, though I don’t know how quickly I would find out with NTFS…

As we’re running laptops, I think we can rule out overclocking. Also, in the Reddit post he upgraded his SSD and is no longer running the HP-provided Intel SSD, but I am.

I don’t have anything I care about on my laptop, as I’ve already had to reinstall twice, so how do I go about getting more debug info?

I hope a btrfs developer will provide some pointers. For now, it could be very helpful to cast a wide net and compile details of recent encounters with similar errors: workloads, system model, kernel version, Linux distro, drive model, and the BTRFS error records in journalctl or dmesg that would help pin down the conditions where the error may occur.

For workloads, the Phoronix 6.11 filesystem comparisons mention a number of filesystem benchmarks. You could use the network installer to create a small Fedora installation and include the Phoronix test suite.
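A sketch of that, assuming phoronix-test-suite is available in the Fedora repos, and using pts/fio purely as an example test name rather than one confirmed from the article:

sudo dnf install phoronix-test-suite
# see what’s available, then run whichever benchmarks match the article’s workloads
phoronix-test-suite list-available-tests | grep -i -e fio -e sqlite
phoronix-test-suite benchmark pts/fio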

I think the back-and-forth nature of this makes it better to discuss on https://matrix.to/#/#kernel:fedoraproject.org, and if you mention cmurf, I’ll get a ping. We need to get as much information into a bug report as possible.

Version of known good and bad kernels; version of btrfs-progs.
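For example, on each kernel you boot:

uname -r
rpm -q kernel btrfs-progs
btrfs --version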

These should each be output to a file and then attached to the bug report.

lspci -vvnn
dmesg
btrfs check --readonly
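A sketch of capturing these to files, assuming the affected filesystem is /dev/nvme0n1p6 (per the error above) and that btrfs check is run while that filesystem is unmounted, e.g. from live/rescue media:

sudo lspci -vvnn > lspci.txt
sudo dmesg > dmesg.txt
sudo btrfs check --readonly /dev/nvme0n1p6 > btrfs-check.txt 2>&1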

In effect I’d like to see the full dmesg from the boot that resulted in the partial dmesg posted above 3 days ago. Developers will want to see it top to bottom. We need a description for each dmesg. e.g. is this the first instance of the problem? Was there a crash or power failure during the previous boot before the btrfs messages indicating a problem?

We’re limited in what can probably be found in journalctl using the -b flag to go back to earlier boots, because once the file system goes read-only, nothing more gets written to the journal. The only record we have is on volatile media.

Something with only a scant chance of being useful would be journalctl -k --since=-1w | grep -i btrfs to see if anything suspicious comes up; if so, then look at those boots without filtering to see if there’s any evidence of something happening prior to the problem. Btrfs often reports the consequences of a problem, not the problem itself, so we have to look around for the actual problem.
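For example (the -2 boot offset is only a placeholder):

journalctl -k --since=-1w | grep -i btrfs
# list prior boots, then view the kernel log of a suspect boot without filtering
journalctl --list-boots
journalctl -k -b -2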

Make/model/CPU info would also be good to know; maybe there’s some other bug, not a btrfs bug, but in some component that btrfs depends on.

Maybe rebuild the system with one change.

Create a small ext4 file system and mount it on /var/log, leave everything else as before.

That way the journal should survive the btrfs bug.
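A rough sketch of that change, assuming a spare partition /dev/nvme0n1p7 (hypothetical) is used for the new ext4 volume:

sudo mkfs.ext4 -L varlog /dev/nvme0n1p7
# preserve the existing logs on the new volume before switching over
sudo mount /dev/nvme0n1p7 /mnt
sudo cp -a /var/log/. /mnt/
sudo umount /mnt
echo '/dev/nvme0n1p7 /var/log ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /var/log
sudo systemctl restart systemd-journald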


Possibly, but that assumes this is a BTRFS-specific problem; ext4 might get corrupted similarly, just without it being detected :slight_smile:

True, but for /var/log it’s not that big of a deal if there’s some file corruption. In this case it’s looking like file system metadata corruption. As I think about it, metadata is a much smaller target for corruption than data, so if there have been two cases on the same system (reproducible) and both hit metadata, I’m suspicious it’s a kernel bug somewhere: something btrfs leverages directly or indirectly that doesn’t affect the data encoding path.

I’m currently running kernel 6.8.5 with /var/log mapped to an external thumb drive while I collect the data for Matrix.

Nothing has happened yet, so I’m about to boot into 6.10.6, which has been known to fail.