I had a similar issue also on the same laptop and CPU. Don’t know if it’s the same issue you’re having, but dmesg showed BTRFS critical (device dm-0) corrupt leaf:... I don’t think my issue was a hardware issue (and I’m hoping it isn’t) since it was a new SSD, the NVMe self-test passed, Memtest passed, and HP’s onboard tools also passed.
Interestingly, both times I had this problem was with Fedora installed using the Everything iso, though that may be a coincidence.
I’m assuming the RAM is good after 15+ Memtest86 passes.
There are enough anecdotes about memtest86(+) taking days of testing to find memory defects that it’s hard to dismiss. I don’t know anything about why that would be the case, but it seems to be true sometimes.
There’s memtester in the Fedora repo. It’s a user-space memory tester, so any memory already in use before memtester starts can’t be tested. To make the available user space as big as possible, boot to single-user mode or at least a non-graphical target, run it as long as you can tolerate, then reboot and try again if it hasn’t found anything. Again, anecdotally, it may find defects sooner than memtest86(+), with the caveat that it can only test memory that isn’t in use when it’s executed.
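A sketch of running it, assuming a non-graphical boot; the 12G size and 5-loop count are examples for a 16 GiB machine and should be adjusted to leave headroom for the kernel and running services:

```shell
# Install the user-space memory tester from the Fedora repo
sudo dnf install memtester

# Switch to a non-graphical target to free as much RAM as possible
sudo systemctl isolate multi-user.target

# Test 12 GiB of memory for 5 passes; memtester locks the memory it
# tests with mlock(), so don't request all of physical RAM
sudo memtester 12G 5
```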
There are other sources of memory corruption possible. Really rare for it to happen in the CPU but if you’re overclocking, it could happen.
[ 36.670146] BTRFS error (device nvme0n1p6): block=37421056 write time tree block corruption detected
The write time tree checker finding this problem suggests the issue is CPU, memory, or a software bug. i.e. not a drive problem. But we need a detailed bug report with logs, and we can escalate to a Btrfs developer to see what’s going on.
It must not be a commonly hit set of code, if it’s a bug, or we’d be seeing a lot more of these. Include the workload(s) in the bug report.
In general I would be inclined to agree with you, but since starting this thread we went from an N=1 problem to an N=3 problem if you include @hmitsuha and the Reddit post, all with the same or very similar laptop. I think it highly unlikely that all 3 of us had zero issues until F40, or as I’ve narrowed down, F40 after a kernel upgrade.
Additionally I dual boot windows and have not had any issues with it, though I don’t know how quickly I would find out with NTFS…
As we’re running laptops I think we can rule out overclocking. Also in the reddit post, he upgraded his SSD and is no longer running the HP provided Intel SSD but I am.
I don’t have anything I care about on my laptop as I’ve already had to reinstall twice, so how do I go about getting more debug info?
I hope a btrfs developer will provide some pointers. For now, it could be very helpful to cast a wide net for details of recent encounters with similar errors: workloads, system model, kernel version, Linux distro, drive model, and BTRFS error records in journalctl or dmesg that would help pin down the conditions where the error may occur.
For workloads, the Phoronix 6.11 filesystem comparison mentions a number of filesystem benchmarks. You could use the network installer to create a small Fedora installation and include the Phoronix test suite.
I think the back and forth nature makes it better to discuss on https://matrix.to/#/#kernel:fedoraproject.org and if you mention cmurf, I’ll get a ping. We need to get as much information into a bug report as possible.
Version of known good and bad kernels; version of btrfs-progs.
Redirect the output of each of these to a file, then attach the files to the bug report.
lspci -vvnn
dmesg
btrfs check --readonly
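A sketch of collecting those into files for attachment; the device path is an example, find yours with `findmnt -no SOURCE /`:

```shell
# Collect hardware info and the kernel log for the bug report
sudo lspci -vvnn > lspci.txt
sudo dmesg > dmesg.txt

# btrfs check defaults to read-only mode; newer btrfs-progs may require
# --force to run against a mounted filesystem
sudo btrfs check --readonly /dev/nvme0n1p6 > btrfs-check.txt 2>&1
```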
In effect I’d like to see the full dmesg from the boot that resulted in the partial dmesg posted above 3 days ago. Developers will want to see it top to bottom. We need a description for each dmesg: e.g., is this the first instance of the problem? Was there a crash or power failure during the previous boot, before the btrfs messages indicating a problem?
We’re limited in what can be found in journalctl using the -b flag to go back to earlier boots, because once the file system goes read-only, nothing more gets written to the journal. The only record we have is on volatile media.
Scant chance of being useful, but you could try something like journalctl -k --since=-1w | grep -i btrfs and see if anything suspicious comes up; if so, then look at those boots without filtering to see if there’s any evidence of something happening prior to the problem. Btrfs often reports the consequences of a problem, not the problem itself, so we have to look around for the actual problem.
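One way to sketch that search; the boot index is an example, and `journalctl --list-boots` shows which boots the journal actually retains:

```shell
# Kernel messages from the past week, filtered for btrfs
journalctl -k --since=-1w | grep -i btrfs

# List recorded boots, then inspect a suspicious one without filtering
journalctl --list-boots
journalctl -k -b -2 | less
```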
Make/model/CPU info would also be good to know, maybe there’s some other bug, not a btrfs bug, but some component that btrfs depends on.
True, but for /var/log it’s not that big of a deal if there’s some file corruption. In this case it’s looking like file system metadata corruption. As I think about it: metadata is a much smaller target for corruption than data, so if there have been two cases on the same system (reproducible) and both hit metadata, it’s suspicious that there’s a kernel bug somewhere, in something btrfs leverages directly or indirectly that doesn’t affect the data encoding path.
I wonder if those errors are showing up with scrub since you appear to be doing that while the system is booted & active (which could mean there are changes occurring during the check/scrub). IME file system checks should always be performed while the file system is idle and not mounted (in other words, while booted from a live usb device).
btrfs check defaults to read-only so is safe to run on a mounted filesystem but to do anything else, yes, I would need to boot from a USB stick or similar.
btrfs scrub is intended to run on a mounted filesystem.
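For example, on a mounted filesystem (the mount point `/` is an assumption; substitute whichever btrfs mount you want to scrub):

```shell
# Run a scrub in the foreground; -B waits until it finishes
sudo btrfs scrub start -B /

# Or start it in the background and poll its progress
sudo btrfs scrub start /
sudo btrfs scrub status /
```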
What’s interesting is that they disagree on the health of the file system.
btrfs check makes sure the logical structure of a filesystem (metadata) makes sense. I would be very surprised if it checked the validity of file data; that could take a very long time on large filesystems.
That’s what btrfs scrub does.
This means the filesystem is healthy, but some files are damaged.
btrfs scrub logs the filenames and inodes in the journal.
If the files are part of an rpm package, then simply reinstall the package.
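A sketch of mapping a damaged file back to its package; `/usr/bin/example` is a hypothetical path, use the one scrub actually logged:

```shell
# Find the files scrub flagged (checksum error messages include the path)
journalctl -k | grep -i 'checksum error'

# Map a damaged file to its owning package, then reinstall it
# (/usr/bin/example is a hypothetical path from the journal output)
rpm -qf /usr/bin/example
sudo dnf reinstall "$(rpm -qf /usr/bin/example)"
```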
btrfs check defaults to read-only so is safe to run on a mounted filesystem but to do anything else, yes, I would need to boot from a USB stick or similar.
Safe but not reliable if it’s mounted rw at the time you run the check. In theory the check is valid if the fs is ro mounted.
What’s interesting is that they disagree on the health of the file system.
scrub computes csums for data blocks and compares to the csums stored for those blocks in the checksum tree to see if they’ve changed since they were originally written. scrub isn’t a file system consistency check.
btrfs check has --check-data-csum, which is a user-space implementation of data csum checking and thus a different code path from the in-kernel one that scrub uses. They should agree, or there’s a bug somewhere.
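To compare the two code paths, something like this; the device path is an example, and the offline check should be run against an unmounted filesystem, e.g. from a live USB:

```shell
# Kernel code path: scrub the mounted filesystem
sudo btrfs scrub start -B /

# User-space code path: verify data checksums offline
# (/dev/nvme0n1p6 is an example device; run this while unmounted)
sudo btrfs check --readonly --check-data-csum /dev/nvme0n1p6
```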