Btrfs restore with ext4-formatted /boot

Long answer:

Any rollback strategy must also answer the question of what not to roll back. If /boot were on a subvolume that we snapshot, rolling back means we roll back the BLS snippets in /boot/loader/entries, and we'd need some way for GRUB's blscfg.mod to read and present multiple generations of /boot/loader/entries. There is almost no development interest at the moment in enhancing or expanding the bootloader menu: it already confuses users, and it's not a great environment for UI/UX work, including a11y and i18n support. Also, on BIOS, a significant number of GRUB modules live in /boot/grub/, which would mean many snapshots of the bootloader, with their versioning becoming disconnected from the core.img embedded in either the MBR gap or the BIOS Boot partition.
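For concreteness, a BLS snippet is one small file per kernel under /boot/loader/entries/, typically named <machine-id>-<kernel-version>.conf. A hypothetical entry (the UUID, kernel version, and title here are placeholders) looks roughly like:

```
title   Fedora Linux (6.8.9-300.fc40.x86_64) 40 (Workstation Edition)
version 6.8.9-300.fc40.x86_64
linux   /vmlinuz-6.8.9-300.fc40.x86_64
initrd  /initramfs-6.8.9-300.fc40.x86_64.img
options root=UUID=<uuid-placeholder> rootflags=subvol=root ro
```

Rolling back a snapshotted /boot means rolling back a whole set of files like this, and the menu code would then have to present every generation of them.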

If /boot is on Btrfs, then /boot/grub2/grubenv cannot be written to by GRUB in the preboot environment. There is a variable in grubenv, boot_success, which GRUB resets to 0; later, in user space, a service sets it to 1 if the boot gets to a certain point. The idea is: if boot fails, boot_success stays 0, and GRUB will then stop hiding the menu so the user can make a choice other than the default one, which would likely just fail again. If boot_success=1, the GRUB menu is hidden. The reason grubenv is not writeable on Btrfs is that GRUB writes to grubenv by writing directly to the block making up that file; it doesn't write through a file system driver. On Btrfs, file contents changing without the checksums being updated is indistinguishable from corruption. Therefore, GRUB knows to disallow writes to grubenv on Btrfs (and likewise on LUKS, mdadm RAID, and LVM). So that's a missing feature we'd have to figure out a workaround for, and we are, but we haven't settled on a solution with upstream yet.
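To make the constraint concrete: grubenv is a fixed-size 1 KiB file padded with '#' characters, which is what lets GRUB overwrite it in place without going through a filesystem driver. A rough Python sketch of that on-disk format (a simplified stand-in for illustration, not GRUB's actual code):

```python
# Sketch of the grubenv on-disk format: a single 1 KiB block whose size
# never changes, so it can be rewritten in place -- exactly the kind of
# out-of-band write that Btrfs checksumming cannot tolerate.

GRUBENV_SIZE = 1024
HEADER = "# GRUB Environment Block\n"

def write_grubenv(path, env):
    body = HEADER + "".join(f"{k}={v}\n" for k, v in env.items())
    assert len(body) <= GRUBENV_SIZE, "grubenv must fit in one 1 KiB block"
    # Pad with '#' so the file is always exactly 1024 bytes.
    with open(path, "w") as f:
        f.write(body + "#" * (GRUBENV_SIZE - len(body)))

def read_grubenv(path):
    with open(path) as f:
        data = f.read(GRUBENV_SIZE)
    env = {}
    for line in data.splitlines():
        if line.startswith("#") or "=" not in line:
            continue  # skip the header and the '#' padding
        key, _, value = line.partition("=")
        env[key] = value
    return env
```

In user space the real tooling for this is grub2-editenv; the point of the sketch is only that changing boot_success from 0 to 1 never changes the file's size or location, so GRUB can flip it back with a raw block write.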

There might be a dozen more examples. So it's really a tangled web just to lay it all out and explain the tradeoffs, what the time frames are, and what resources are available.

It’s a bit of a holding pattern because there’s so much interaction between so many other things that need work in the bootloading space.

These are unrelated projects, and we kinda have to figure out how some or all of them might work together, but we're still in the design phase. Like, what would this look like? And then there's a bunch of work in the installer, docs, you name it, to make it actually happen.

txnupd
https://pagure.io/libdnf-plugin-txnupd

bootupd

rpm-ostree

boom

Boot Loader Spec
http://systemd.io/BOOT_LOADER_SPECIFICATION/

Boot Loader Interface
http://systemd.io/BOOT_LOADER_INTERFACE/

Snapper is also a consideration. But one of the things we're really trying to focus on is simplicity, and just doing the right thing automatically when the wrong thing happens. We don't want to create such a complicated storage stack that users can't follow how it works. If an update fails, we should be able to automatically delete the snapshot containing the update attempt, rather than bother the user with having to fix things. We should be able to test updates before we make them active (maybe some combination of "booting" them in a container, or a small qemu machine, and seeing if they get to a certain milestone in the startup process). We should be able to do updates off to the side, so that users aren't left waiting on them or interrupted by them.
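As a sketch of that intended flow (purely illustrative: try_update, apply_update, and check_boot are hypothetical names, and a directory copy stands in for what would really be a Btrfs snapshot):

```python
import os
import shutil

def try_update(root, apply_update, check_boot):
    """Attempt an update off to the side; keep it only if the check passes.

    apply_update and check_boot are hypothetical callables standing in for
    the package transaction and the container/qemu boot test described
    above. The copytree() here stands in for a cheap Btrfs snapshot.
    """
    candidate = root + ".new"
    shutil.copytree(root, candidate)    # "snapshot" the current state
    try:
        apply_update(candidate)         # do the update in the side copy
        if not check_boot(candidate):   # e.g. boot-test it in a container/VM
            raise RuntimeError("update failed boot check")
    except Exception:
        shutil.rmtree(candidate)        # auto-delete the failed attempt
        return False                    # the running system is untouched
    os.rename(root, root + ".old")      # swap the tested result into place
    os.rename(candidate, root)
    shutil.rmtree(root + ".old")
    return True
```

The design point is that the user is never asked to clean anything up: a failed attempt is deleted automatically, and a successful one only becomes active after it has passed the check.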