F44 Change Proposal: BtrfsBootForCloud [SelfContained]

BtrfsBootForCloud

Wiki

Announced

This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.

Summary :open_book:

Eliminate the separate /boot partition on Fedora Cloud images

Owner :open_book:

Detailed Description :open_book:

The images produced by Fedora Cloud for Cloud platforms and Vagrant will drop the separate /boot in favor of a Btrfs subvolume. This will not apply to UEFI-UKI and s390x Cloud images due to limitations of those platforms.

Feedback :open_book:

Benefit to Fedora :open_book:

Fedora Cloud Edition is typically deployed as images of fixed sizes and grown on deployment, so it is attractive for us to minimize the footprint of the image up-front. Since Fedora Cloud images do not rely on grubenv features like the GRUB Hidden Menu feature (Changes/HiddenGrubMenu), which requires resolving rhbz#2372973 ("GRUB is unable to use bootloader header space for grubenv on btrfs", patch available) first, we can easily consolidate the bootloader data on the Btrfs volume. By using a Btrfs subvolume, the boot data can be trivially omitted from any snapshot mechanisms used on the deployment while avoiding space contention between boot data and the rest of the operating environment data.

Scope :open_book:

  • Proposal owners: Merge fedora-kiwi-descriptions#228

  • Other developers: N/A

  • Trademark approval: N/A (not needed for this Change)

  • Alignment with the Fedora Strategy: N/A (not needed for this Change)

Upgrade/compatibility impact :open_book:

There are no compatibility impacts, as this only affects new Cloud deployments with Fedora 44 or higher.

Early Testing (Optional) :open_book:

N/A

How To Test :open_book:

Once the kiwi-descriptions PR is merged, images should be available in the new configuration. Just boot them in the platform of your choice to test.
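As a rough aid for that test, the layout of a booted image could be checked with something like the sketch below. It classifies how /boot is provided from text in `/proc/mounts` format. The function name, the device paths, and the subvolume names in the sample data are hypothetical, and whether the final images mount the /boot subvolume separately or leave it as a path under the root mount is an assumption, so both cases are handled.

```python
def boot_layout(mounts_text):
    """Classify how /boot is provided, given text in /proc/mounts format."""
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        mounts[mountpoint] = (device, fstype, options.split(","))

    if "/" not in mounts:
        raise ValueError("no root filesystem entry found")

    root_device = mounts["/"][0]
    boot = mounts.get("/boot")
    if boot is None:
        # /boot is not a mount point, so it lives inside the root mount.
        return "directory-on-root"
    boot_device, boot_fstype, _ = boot
    if boot_fstype == "btrfs" and boot_device == root_device:
        # Same Btrfs volume as /, mounted from its own subvolume.
        return "btrfs-subvolume"
    return "separate-partition"


# Hypothetical sample data; real device and subvolume names may differ.
NEW_STYLE = (
    "/dev/vda3 / btrfs rw,subvol=/root 0 0\n"
    "/dev/vda3 /boot btrfs rw,subvol=/boot 0 0\n"
    "/dev/vda2 /boot/efi vfat rw 0 0\n"
)
OLD_STYLE = (
    "/dev/vda4 / btrfs rw,subvol=/root 0 0\n"
    "/dev/vda3 /boot ext4 rw 0 0\n"
)
```

On a running instance, `boot_layout(open("/proc/mounts").read())` would be expected to return something other than `"separate-partition"` for images built with the new configuration.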

User Experience :open_book:

This should be transparent to users.

Dependencies :open_book:

N/A

Contingency Plan :open_book:

  • Contingency mechanism: Revert the pull request to go back to separate /boot volume.
  • Contingency deadline: Final Freeze
  • Blocks release? Yes

Documentation :open_book:

N/A (not a System Wide Change)

Release Notes :open_book:

Fedora Cloud images (except the UEFI-UKI images) on all architectures except IBM Z systems no longer have a separate /boot partition, and instead now ship /boot as a subvolume in the main Btrfs operating system volume. This allows for much better space utilization and smaller images.

Last edited by @alking 2025-10-16T19:39:42Z

How do you feel about the proposal as written?

  • Strongly in favor
  • In favor, with reservations
  • Neutral
  • Opposed, but could be convinced
  • Strongly opposed

If you are in favor but have reservations, or are opposed but something could change your mind, please explain in a reply.

We want everyone to be heard, but many posts repeating the same thing actually makes that harder. If you have something new to say, please say it. If, instead, you find someone has already covered what you’d like to express, please simply give that post a :heart: instead of reiterating. You can even do this by email, by replying with the heart emoji or just “+1”. This will make long topics easier to follow.

Please note that this is an advisory “straw poll” meant to gauge sentiment. It isn’t a vote or a scientific survey. See About the Change Proposals category for more about the Change Process and moderation policy.

I’m leery of this for the reasons Simo mentioned on the devel list…

We are depending on btrfs handling code in grub to not mess up things on the main storage. :frowning:

Considering the GRUB Btrfs driver is a read-only driver, I think that’s fairly safe. It has also been in use in openSUSE for over a decade, so I consider it well-tested as well.

I’m not for this change.

I’m not sure there’s much point rehashing all the issues brought up on the mailing list, but I’ll just add that I’d really like to see the Cloud images move to UKIs.

I do think we could do away with the extended boot partition, though, so I guess I agree with the “let’s not do a separate boot partition” bit.

Can someone link to the discussion? I’m not finding it in either the devel or the cloud mailing list archives.

It crossed several subthreads of the “increasing the size of boot partition” thread, so it was kind of easy to miss.

I don’t think this is a good idea.

The proposal should be viewed in two scopes: what it means for those Cloud images and what it means for the whole Fedora project. To make a good decision, we need to see how it ties into the overall strategy for bootloaders. My understanding of where we should go with the boot loader stack is the following:

  • embrace the BLS.
  • on non-UEFI architectures, use the BLS layout and formats, even if the implementation is different.
  • UKIs (BLS type#2) are the only mechanism we have to verify initrd and kernel command line integrity, so plan for their use everywhere.
  • we are heavily invested in grub2, so continue to use it, but follow the spec and keep an eye out for using different boot loaders.
  • embrace boot counting and automatic fallback to older versions.
  • allow parts of the stack to be omitted, e.g. skip the boot loader.
  • keep the stack simple, to reduce the attack surface and maintenance burden.

In the smaller scope of Cloud itself, this proposal is an interesting idea, but it comes with significant downsides and doesn’t fit the overall strategy above. The BLS was recently amended to require the ESP and the XBOOTLDR partition to use the same file system type. We had a lively discussion about this change, but I find the arguments pretty convincing that this needed to be done. Obviously, this change proposal is incompatible with that. This proposal creates a very complex installation and also ties us very heavily into using grub2 as the boot loader. This proposal is also incompatible with boot counting, at least in the way described in the BLS, which requires writing to the partition with the boot entry. In theory we could try to use a different system where the boot count is stored on a different partition, but that requires coordination of two separate storage areas and is incompatible with the BLS and is inferior in various ways. Finally, in the bigger scope of all of Fedora, this would create a complicated booting scheme that is different than in other places.
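To illustrate why boot counting needs write access to the partition holding the boot entry: in the BLS scheme, the counters live in the entry's file name (`+LEFT-DONE` before the extension), and each boot attempt renames the file. The sketch below computes the next file name; the function name and the sample entry names are made up for illustration, and this is a simplified model rather than any boot loader's actual implementation.

```python
import re

# Entry name, "+<tries left>", optional "-<tries done>", then the extension.
_COUNTER = re.compile(
    r"^(?P<name>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?(?P<ext>\.conf|\.efi)$"
)


def next_attempt_name(filename):
    """Return the file name a boot entry would get after one more boot attempt."""
    m = _COUNTER.match(filename)
    if m is None:
        return filename  # no counter in the name: entry is not being counted
    left = int(m["left"])
    done = int(m["done"] or 0)  # a missing "tries done" counter means zero
    if left == 0:
        return filename  # no tries left: the entry is already considered bad
    return f"{m['name']}+{left - 1}-{done + 1}{m['ext']}"
```

For example, a hypothetical `fedora-44+3.conf` would become `fedora-44+2-1.conf` after the first attempt. Since the rename happens in the directory that holds the entry, a boot loader with only a read-only driver for that file system cannot do it in place.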

Instead of this change, I think we should take steps that are compatible with the overall strategy listed above and not take steps in incompatible directions.

This setup is not complex, it’s very simple. There’s not a lot changing. And for things like boot counting and whatnot, Btrfs has a bootloader space where this data is written, so it is still getting written to the partition, just not to a file in the main data space.

When this change was written, the capability to do so was not supported in GRUB, but it was merged upstream two weeks ago and there is a pending pull request to backport it. So in this respect, we can and will be able to do this once that’s merged.

You were told by both @chrismurphy and me that the consequence of you forcing VFAT in the UAPI spec was that we were going to ignore that part, since we explicitly indicate we are not following the full spec. Furthermore, there are EFI drivers for Btrfs that allow EFI boot managers to access the Btrfs boot data (both btrfs-efi and efifs are options in Fedora). Multiple boot managers support loading EFI filesystem drivers, even ones that support BLS-style configuration. For pure-EFI boot managers, boot counting data can be stored on the ESP alongside the boot manager binary (or if you like punishing EFI NVRAM, you can store it there instead).

For Fedora Cloud, we want fewer partitions, more storage flexibility, and we want to use Btrfs features more. @davdunc has been working on more things to take advantage of Btrfs in the cloud, and I’ve heard from people that want to use Btrfs features for full-system snapshotting and replication. This is a major step in that direction.

I think Neal is raising some important points about the storage flexibility and Btrfs features that we can use. He is also pointing out that the changes that are in flight are the reason that this makes sense today. With the backport in place, adherence to the principles mentioned above is in place as well. This isn’t in any way far from the points of the counterargument.

I also think that these are valuable most especially in the cloud environments. I understand that everyone is working hard on building a line of sight alignment with the boot loader specifications, but I want to remind you that the core idea here is simplifying the partitioning, and this is the part that is most important to my work, enabling more advanced storage capabilities for subvolume distribution and snapshotting. That is the part that I need carefully considered in this context.

The cloud environments present operational requirements that are unique and I hold a strong belief that we will benefit from this approach more than the traditional deployments. I think that everyone seems keenly focused on the break from the overall strategy and I appreciate that, but in this case, I am hoping that we can develop an approach that is better crafted to serve cloud use cases. On the other hand, I do want us to continue to maintain core project principles.

I need to complete additional validation before sharing details on the work that makes this so important for me, but for now I am in full support of the work that we are proposing here on the principles of the storage flexibility patterns that it provides. I do also believe that based on the work on grub in flight, we are continuing to keep the broader system architecture goals in our sights.

There’s already a separate Fedora Cloud deliverable that does this for people that want it. But it isn’t necessarily the only way to do integrity verification. And in fact this mechanism only works if you consider UEFI variables reliable and tamper-proof (which they are not) and that you have a working TPM2 device (which is common in clouds now, but not elsewhere).

The complexity is in bootloader code reimplementing one of the most complicated file systems. This is risky and problematic on many levels: every buffer overflow creates a risk of repeating the BootHole scenario. Here we have lots of complicated code accessing complicated data structures, so chances of a programming mistake are relatively high. The coding environment is very primitive, without many of the niceties like address space randomization that we use everywhere else. And even if we ignore the potential for errors, every change that the kernel makes, e.g. to compression details or storage optimization creates chances for the boot loader code to not understand something. We could be in a situation where we cannot update the kernel because old versions of the bootloader are deployed.

This is a terrible design. This means that the userspace cannot query or set the configuration in a normal fashion. One of the problems with grub2 usability is the fact that we cannot nicely introspect and interact with what the user will see after a reboot from inside of the system. E.g., we should be able to preview how the boot menu will look after a reboot, we should be able to select the item to boot into from a gnome shell menu dropdown, override the default boot entry, etc.

As I and multiple other people already mentioned, drop the second boot partition?

I think btrfs is great, and we should use modern file system features more. No disagreement with the general sentiment. But this doesn’t mean we should do this in every scenario, in particular doing this from non-kernel code carries significant risks.

I don’t think we have any realistic alternatives for that part of the stack. UEFI variables are quite secure. Nothing is foolproof, but in security we want to have multiple layers and mitigations. And the integrity of early-boot code is an important mitigation by making it much harder for an attacker to achieve persistency in the system.

And here we’re talking about boot for cloud, so we can generally assume availability of TPMs. Even if not everything has them, we should design with them in mind (and suitable fallbacks for systems which don’t).

I’m looking forward to hearing those. I think additional context about what you’re trying to achieve would be useful.

… which ignores the space contention issue of storing OS-private data on the ESP if we put it there. On the Btrfs volume, we don’t have that problem, since it shares space with the rest of the OS data.

This is the direction even Red Hat is going, though they did it with a big XFS volume instead.

What are you talking about? Whether it’s GRUB or something else, there has to be an abstraction of some kind for this because you cannot assume a particular boot platform design. GRUB provides programs for accessing this stuff, and other bootloaders / boot managers / boot platform tools do too.

Nobody seriously expects what you’re talking about. It’s not even true on EFI, where the information is split between the ESP and EFI NVRAM.

Here we are specifically talking about cloud images, where we fully control the layout of the disk. So “OS-private” is not very meaningful when there’s just one system on the disk ever. We can make the ESP as large as appropriate and use it for whatever purposes we want, as long as we keep the boot functionality intact.

This is another vague and weak argument. “Red Hat” is not doing anything, because it’s a huge company with >20k employees. At any given time, it is moving in five different directions in the boot loader space, because the considerations and requirements for traditional servers, cloud, CoCo, cars, and various other things that Red Hat does are just so different. So the fact that some people in Red Hat used a solution for some particular purpose is not helpful when deciding what we should do in Fedora cloud images.

What are you talking about? We’re literally talking about “boot platform design” here.

You are correct that an abstraction of some form is needed. The question is how many components and layers this abstraction needs. As you wrote, in UEFI the information is split between the ESP and EFI variables. We are not getting away from either of those. The question is whether we should add a third mechanism that is similar to the ESP but has a custom non-portable storage backend. I think that using the “boot storage area” is more like a 90’s design than something that we want to design towards in 2025.

Well, not really? We’re limited by what the platform supports and the various bugs different cloud and VPS platforms have. And also our images are hybrid BIOS+EFI images on x86_64, not singularly UEFI-only (unlike our experimental UEFI+UKI ones). And we offer images for non-EFI platforms too (POWER and IBM Z).

And while uncommon, our images are also used in bare metal provisioning too (so similar concerns apply there too).

But it is relevant for this context since they did it for cloud images because of customer issues with space contention. And we have had similar issues from Fedora community users too.

If that were the case, then we should also replace EFI, since it is also a 90s design. Please don’t use such arguments.