Btrfs has been in development for more than a decade and has never been seen as a stable filesystem. The majority of users never even needed its advanced features; ext4 provisioned on LVM was good enough. I really don’t understand why an average user, or even a professional, would want to use a bloated, feature-creep filesystem that no one considered stable-release ready.
Btrfs was unstable even back when they decided to push it by making it the default fs. Both the kernel and the userland tools had so many bugs that I had to drop the idea of trying out btrfs altogether at the time. I was hoping for better, but the reality is that not much has changed since then.
I can see that it started off with good intentions. Linux needed a competitor to ZFS: cut out the middleman (LVM), take the load off the block device. Subvolumes, snapshots, quotas … they are all wonderful ideas, but sadly for the devs who work on the fs, the bugs happened anyway. It was too good to be true. They lost credibility. They gave Fedora a bad name.
Btrfs should never have been the default filesystem, regardless of the edition (Workstation, Server, Cloud, or whatnot). They should roll it back and leave the choice up to the heavy users. Do the right thing. We want sane defaults.
The change should mostly be in Anaconda and image builder scripts.
Fedora serves as a testbed for new stuff, so it has provided testing for many use cases. I think Fedora users have contributed to improving btrfs.
There is, however, a disconnect in letting users know about differences in the way filesystems should be managed. With XFS or ext4, the eventual but inevitable failure (bitrot or hardware) requires restoring systems from backups. Before retiring I worked with satellite remote sensing data, where defending against bitrot on XFS and ext4 filesystems required maintaining and verifying file checksums. That added significant overhead to data processing pipelines and disrupted processing schedules whenever a corrupt file was encountered.
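For concreteness, here is a minimal sketch of the kind of checksum bookkeeping I mean (the manifest format and hash choice are my own illustration here, not the actual pipeline code we ran):

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Hash a file in chunks so large granules don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest: Path) -> list[Path]:
    """Re-hash every file listed as 'checksum  path' lines; return mismatches."""
    bad = []
    for line in manifest.read_text().splitlines():
        expected, name = line.split(maxsplit=1)
        target = manifest.parent / name
        if sha256sum(target) != expected:
            bad.append(target)
    return bad
```

Every verification pass re-reads all of the data, which is exactly where the overhead and the schedule disruption came from.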
Using btrfs should include regular maintenance in return for that reduced overhead in data processing pipelines. I’ve taught mini-courses for users new to Unix and Linux – learning to find and use trustworthy reference materials (man pages, etc.) and applying them to basic command-line exercises. Other instructors have found that students who attended those mini-courses had far fewer problems working with Linux.
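To be clear about what “regular maintenance” means on btrfs: primarily scheduled scrubs, which re-read everything and verify checksums. A minimal sketch of what a monthly job might wrap (the mount point is a placeholder; btrfs-progs must be installed and the script run as root):

```python
import subprocess

def scrub(mountpoint: str = "/") -> None:
    # Run the scrub in the foreground (-B); the command exits non-zero
    # if errors were found, so check=True makes the job fail loudly.
    subprocess.run(["btrfs", "scrub", "start", "-B", mountpoint], check=True)
    # Report per-device statistics from the completed scrub.
    subprocess.run(["btrfs", "scrub", "status", mountpoint], check=True)

if __name__ == "__main__":
    scrub("/")
```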
One colleague, a statistician, suggested that new users of statistical software should have to pass a test before being allowed to use the software. Maybe Fedora should offer new users some online training or a study guide.
I’d say that’s the fundamental problem: Fedora users aren’t going to do any of that. The most we can possibly expect is that users will create backups. But backups alone are worthless against bitrot. With ext4 or XFS, users will just back up corrupt data over their old good data, and not notice the problem until their old good backups are gone. With btrfs, writing corrupt backups is not possible, because you cannot read the corrupt data in the first place.
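To make that last point concrete: when a checksum fails on btrfs, the read returns an I/O error, so a backup tool sees a hard failure instead of silently copying garbage. A hypothetical sketch of what the backup tool experiences (the function name is mine):

```python
import errno

def read_for_backup(path: str) -> bytes:
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError as e:
        if e.errno == errno.EIO:
            # btrfs refused to hand back data whose checksum did not match,
            # so the backup fails loudly instead of propagating corruption.
            raise RuntimeError(f"{path}: corrupt on disk, not backed up") from e
        raise
```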
I think if somebody wants to propose an alternative to btrfs, it needs to be way smarter than ext4 + LVM, which is only safe if your hardware is perfect. (And we know Fedora users’ hardware is not perfect.)
My two cents: first, I am a Fedora user and I use whatever filesystem is the default in the installer. In all these years I have not found a real reason to opt for one or the other. Second, I would say that for the average user, corruption of data is way better than not being able to boot the system.
My guess is that for non-professional use cases, hardware failure is a bigger concern. Besides, local backups can fail just like regular drives. Ironically, you could have a healthy main drive and a faulty backup.
Indeed any weakness in the backup chain will lead to sadness.
I back up over the network to a file server with RAIDed drives.
The file server is backed up periodically to external drives that I store away from my home.
There was a USENIX paper from a few years ago suggesting that while btrfs’s checksums do successfully detect hardware faults, btrfs was less capable of recovering from them than ext4 or f2fs:
We observe in Table 5 that Btrfs is the only file system that consistently detects all I/O errors as well as corruption events, including those affecting data (rather than only metadata). It achieves this through the extensive use of checksums.

However, we find that Btrfs is much less successful in recovering from any issues than the other two file systems. It is the only file system where four of the six error modes can lead to a kernel crash or panic and subsequently a file system that cannot be mounted even after running btrfsck. It also has the largest number of scenarios that result in an unmountable file system after btrfsck (even if not preceded by a kernel crash).

Furthermore, we find that node level checksums, although good for detecting block corruption, they remove an entire node even if a single byte becomes corrupted. As a result, large chunks of data are removed, causing data loss.
This means that even if btrfs is actually working as advertised, it can give users the impression of being less reliable than ext4. When users can’t boot their system, they’re going to conclude that their OS is broken, not that their filesystem just prevented them from reading corrupt data.
Working as advertised? Nobody advertised anything to me; it just happens that btrfs is the default in Fedora. The day I cannot boot the system, I would not think “good, it works as advertised”. I would have to walk to the public library to connect to the Internet and look for some workaround (though I have learned that most of the time I cannot get out of these issues without losses), or I would ride my bike to my cousin’s to make a USB key with Debian. Praying the backups are OK.
I don’t know, like I said, in my opinion data corruption is way less concerning and critical than an unbootable system.
For many use cases (compressed data), 1-byte corruption leads to loss of a much larger block of data and potentially unusable data. With legacy filesystems the issue can go undetected until someone tries to work with the data.
An unbootable system is pretty bad, certainly, but at least it’s generally easy for a technical user to recover from. And if you’re not a technical user, you can hopefully get help from somebody who is. In contrast, if your backups are filled with corrupted data, recovery is probably impossible. I prefer to continue prioritizing data integrity over all other considerations.
Anyway, good news: this btrfs bug should hopefully be fixed by kernel-6.15.9-201.fc42 (via a downstream patch, because the fix is not present in the upstream kernel yet).
I would much rather have a filesystem that stops damage while it’s still minor, and thus very recoverable, than one that lets it grow and grow until you have not just an unbootable OS, but one that is far harder to actually recover.
People don’t really care about features. Stuff shouldn’t break; that must be the priority. That’s what Linus was talking about the whole time. Btrfs failed that so miserably.
I like how everyone here jumps to blaming users’ incompetence. I see where the sentiment comes from, given Microsoft’s recent self-sabotage of its own products and the influx of new people to the community. I get that; I find it annoying myself. Hopefully it will pass.
The original post was not about that, though. This time it’s the incompetence of the btrfs kernel devs, not the users. The fact that there was no test for this case is just mind-blowing, and the people defending it are even worse. Btrfs smelled off right from the beginning, when it was first introduced to Fedora, and not much has changed. Sixteen years is a long time, and they still can’t get the basics right.
This is exactly why the Rust kernel devs were so stressed out. Filesystem support is critical in an OS; there’s no messing around, especially when data loss is at stake.