Easy-to-implement rollback option after upgrade

@jakfrost brought me to an idea that I would like to put up for discussion here before taking it to the mailing list or writing a feature request. I would be especially interested in the experience and opinions of other ask.fp people who regularly help users solve their issues.

Generally, there is no supported rollback capability for Fedora’s upgrades (in the discussed case, 35 to 36). In most cases I think it does not make sense to roll back anyway, because the old version will stop receiving support and updates at some point. Rolling back would therefore just postpone the issues.

However, in some cases it might help a user get past bugs that are likely to be solved over time externally (by developers) but that the user cannot solve immediately.

Therefore, I would like to know what you think of an adjustment of the package dnf-plugin-system-upgrade:

  1. before doing upgrades, check if the installation has a default btrfs configuration (only check for the os-related directories; data directories remain untouched by the upgrade anyway).
  2. if no, no further action → assume the user knows himself/herself what to do
  3. if yes: check if there is sufficient space for the snapshot plus the subsequent upgrade.
  4. if yes: automatically make a btrfs snapshot of the os-related directories before upgrading (ensure that /, /var, /etc, /usr, /opt, /boot are in; did I forget something?).
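The four steps above could be sketched roughly as follows in bash. This is only an illustration under assumptions: the function name `decide_action` and the snapshot location `/.snapshots/pre-upgrade` are made up for this post, and how to estimate the space the upgrade needs is left open. Note also that a btrfs snapshot covers one subvolume only; nested subvolumes are not included.

```shell
# Decide what to do from the filesystem type of / and two byte counts.
# Prints one of: skip (step 2), warn (step 3 failed), snapshot (step 4).
# decide_action is a hypothetical name, not part of any existing plugin.
decide_action() {
    local fstype=$1 avail_bytes=$2 needed_bytes=$3
    if [ "$fstype" != "btrfs" ]; then
        echo skip
    elif [ "$avail_bytes" -lt "$needed_bytes" ]; then
        echo warn
    else
        echo snapshot
    fi
}

# Possible wiring on a real system (not executed here):
#   fstype=$(findmnt -n -o FSTYPE --target /)
#   avail=$(df --output=avail -B1 / | tail -n 1 | tr -d ' ')
#   case $(decide_action "$fstype" "$avail" "$needed") in
#       snapshot) btrfs subvolume snapshot -r / /.snapshots/pre-upgrade ;;
#       warn)     echo "Not enough space: no rollback will be possible." ;;
#   esac
```

Keeping the decision separate from the btrfs call makes it easy to try the logic without touching the filesystem.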

In this case, if any user experiences issues that prevent using his/her system as intended and if we assume the issue will be solved soon externally (bug fix by developers or whatever), a rollback can be done easily. I assume the implementation ain’t a big issue. It would not be time critical anyway. What do you think?

@jakfrost I suggest keeping the related debate here and not in the original thread, because it seems to be hard for Perry to keep the different topics apart. We should help him focus and not mix up different things. In my opinion, a rollback is unlikely to be a solution for him, but his issue prompted the idea.

Sure you beat me to the “let’s split this topic”, but you’re right. TMI was something I was worried about in this case.

As for the suggestion about snapshot rollback capabilities at upgrade time, there is a topic about something similar here somewhere, but I haven’t tracked it recently. I think this would be ideal with btrfs as the default. I have a snapshot of my /home subvolume as a backup in case I need to re-install. There are two separate methods I have used, snapshots and send/receive of subvolumes; both work fine, with send/receive producing a read-only subvolume.
Of course, you could just run Silverblue since it already does this, atomic updates.

Sorry, my argument was not meant as a general rebuke about topic splitting or such. It was specific to the circumstances in the original thread and meant as a suggestion to help Perry focus and structure/sort things. It had already become complicated and error-prone when we tackled different solution approaches in the same thread at the same time. I wanted to avoid this happening again with rollback (which is not possible for him anyway).

/boot doesn’t necessarily have to be on the/a btrfs filesystem. I’m a fan of using /boot as both my ESP and for storing the kernel images. If the system is configured that way though, some post-rollback work needs to be done to make sure that /boot has the right content. It is not difficult though since /boot is relatively small and its contents are normally maintained by scripts anyway.

The other gotcha to be aware of is that there may be a small possibility of important “in-flight” data being cached in RAM at the time the snapshot is taken if the system is running. Well-written programs should try to keep any important data “flushed” to disk, but I don’t think there are any guarantees that that will always be the case and there may be a very small chance of data corruption after the rollback. I think the snapshot is almost like yanking the data cable off the hard drive of a running system or pulling the power plug on the whole PC. Modern software and journaled filesystems are very good at recovering from such occurrences, but I don’t think it is 100% certain that everything will be OK. There is a way to avoid any chance of losing in-flight data though. You just need to perform the snapshot on system boot before any programs have begun performing I/O on the btrfs volume. It should be possible to do that from the initramfs stage.

Just my 2¢.

Absolutely, the other directories as well. But 3. and 4. would only apply if the partition configuration is default. However, now that you say it, I just tested an installation using a live system with default partitioning. Indeed, default uses btrfs for / but it separates /boot in a traditional ext4, making /boot snapshots impossible with the default partitioning. But as you said, there are alternatives for /boot that might be easy for developers, especially given the small amount of data. Nevertheless, now we are already at more than just one snapshot cmd.

The upgrade process contains a reboot, which could be used for that. I do not know the upgrade process in detail (and I have not used it since 34 → 35), but as far as I remember, the actual upgrade did not start during/before the reboot but after it (feel free to correct me here!). So snapshots should be possible after the reboot and before the actual upgrade starts.

Yeah, I think so and I think that is probably the best way to do it for the reason that I tried to explain before.

There is one potential downside to taking the snapshot at that point in the upgrade process that I can think of. At that point, all of the RPM packages for the upgrade are present under /var/cache/dnf. That can be quite a lot of data during a system upgrade and it will be “trapped” in the snapshot for as long as the snapshot exists. These upgrade snapshots may end up being larger than the regular system installation. So you may want to think about a system for getting rid of them.

My recommendation would be to associate the snapshots with the kernels. There are a few correlations between the way Fedora Linux handles kernels and the way that such recovery snapshots should be handled IMO.

  1. A reboot is required to upgrade the system kernel. Likewise, a reboot should be required to get a “clean” snapshot of the root file system.
  2. A configurable number of previous/fallback kernels are kept around after new ones are installed. Likewise, it should be possible to configure some small number of previous/fallback snapshots of the root filesystem. (The number of kernels the system keeps is determined by the “installonly_limit” setting in /etc/dnf/dnf.conf.)

Also, system upgrades are not the only time that having a relatively recent snapshot to rollback to can be useful. It makes it a little bit of time-and-chance as to just how old the most recent snapshot will be, but if you make kernel installs and uninstalls the “trigger” for when snapshots are created and removed, that will automatically include all system upgrades since a system upgrade always comes with a new kernel (even if it is the same kernel version). It shouldn’t really matter if the system snapshot that you roll back to is a bit old. It just means that it will take a little longer to re-upgrade all the packages on the system. For the /home subvolume, you would probably want the snapshots scheduled with a regular frequency, but it doesn’t matter as much for the OS.

I’m not sure how snapshots are named with btrfs (I use zfs personally), but if you name the snapshots after the kernel version, then it should be easy to determine if a snapshot should be taken during the boot – if there isn’t a snapshot with a name equal to the current kernel version, then take one.
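A minimal bash sketch of that "one snapshot per kernel version" check, under assumptions: the function name `snapshot_for_kernel` and the snapshot directory passed to it are hypothetical, and the `BTRFS` variable exists only so the logic can be dry-run with `BTRFS=echo`.

```shell
# Take a read-only snapshot of $root named after the kernel version,
# unless one with that name already exists in $snapshot_dir.
snapshot_for_kernel() {
    local root=$1 snapshot_dir=$2 kver=$3
    local dest="$snapshot_dir/$kver"
    if [ -e "$dest" ]; then
        return 0    # a snapshot for this kernel already exists
    fi
    "${BTRFS:-btrfs}" subvolume snapshot -r "$root" "$dest"
}

# Possible call during boot (not executed here):
#   snapshot_for_kernel /sysroot /sysroot/.snapshots "$(uname -r)"
```

A read-only snapshot is used here because it cannot be modified by accident; rolling back would then mean taking a writable snapshot of it.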

For removing snapshots, I think that should be as simple as dropping a small shell script in /etc/kernel/prerm.d that removes the snapshot with the same name as the kernel that is being removed if one exists.
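Such a removal script could look roughly like this in bash. It is a sketch under assumptions: the function name and the snapshot directory are hypothetical, and it assumes the hook is called with the kernel version as its first argument. `BTRFS=echo` turns it into a dry run.

```shell
# Delete the snapshot named after the kernel being removed, if there is one.
remove_kernel_snapshot() {
    local snapshot_dir=$1 kernel_version=$2
    # Never act on an empty version: that would target the whole directory.
    [ -n "$kernel_version" ] || return 0
    local target="$snapshot_dir/$kernel_version"
    [ -d "$target" ] || return 0    # no snapshot for this kernel
    "${BTRFS:-btrfs}" subvolume delete "$target"
}

# Possible body of the hook script (not executed here):
#   remove_kernel_snapshot /.snapshots "$1"
```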

No worries, I didn’t feel rebuked.

Yes, with block-level snapshots there ain’t a way around needing space for all three things: the old os snap, the upgrade files that end up in the snapshot even after deletion, and the upgraded os. However, I think that ain’t a big issue. Most systems have sufficient space to manage this for a short time, and step 3 would check that. Nevertheless, the intention would be to keep that snapshot only for a short amount of time. The snap comes with the upgrade (where users have to ensure sufficient free space in advance anyway). The goal would be a rollback capability that is easy to implement/maintain. It can be up to the user to decide if a snap is wanted or not. If not enough space is free to also keep the snapshot, a warning might make sense and that’s it.

However, you raised a good point. How/when to delete the snapshot. Obviously, it should be automated. Maybe the pre-upgrade snapshot could be deleted by a short script after a week or so? If today > (Upgrade date + 7 days), then delete snap. The os-related directories should not change much in this period. By default, /home is separated anyway.
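The "today > upgrade date + 7 days" rule can be written as a small bash check. This is a sketch, assuming the upgrade time was recorded somewhere as seconds since the epoch (for example with `date +%s` when the snapshot was taken); the function name `snapshot_expired` is hypothetical.

```shell
# Succeeds (exit status 0) if more than max_days have passed since creation.
# Both timestamps are seconds since the epoch; max_days defaults to 7.
snapshot_expired() {
    local created_epoch=$1 now_epoch=$2 max_days=${3:-7}
    [ $(( now_epoch - created_epoch )) -gt $(( max_days * 86400 )) ]
}

# Possible use from a timer or cron job (not executed here):
#   if snapshot_expired "$(cat /var/lib/upgrade-snapshot.stamp)" "$(date +%s)"; then
#       btrfs subvolume delete /.snapshots/pre-upgrade
#   fi
```

The stamp file path and the snapshot path in the comment are placeholders, not existing Fedora conventions.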

imho, I agree that kernel updates would be a proper trigger for when regular os-snapshots should be created, if they are scheduled regularly. But I think it would be too invasive to implement regular snapshots automatically; this should be up to the user and should be part of his/her backup strategy. I think our responsibility here should be limited to the upgrade process and the related plugin. Within the upgrade, I think we are already on the same page in terms of when to do what: download upgrade - initialize upgrade - reboot - snapshot at initramfs stage or at the latest at runlevel 1 - start the upgrade.

Ask.fp people: in your experience, do we have sufficient cases where a temporary rollback makes sense so that a snap-rollback-capability is worth implementing?

I would be inclined to think more in terms of “mileage” than time. It is possible, though perhaps not likely, that the user upgraded their system and then left it on the shelf for a week. It is not necessarily the case that a user has found a significant problem after a fixed period of time. But the likelihood of a problem going unnoticed does diminish with use. I’m not sure what the best way to measure that is though. Maybe you could sum the last column of output from the last command?
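To illustrate what summing that column could look like: the last column of `last` output holds session durations such as (01:23) or (1+01:30), where the part before the plus sign is days. A rough bash/awk sketch follows; the function name is hypothetical, and whether to feed it all sessions or only `last reboot` is left open.

```shell
# Read `last`-style lines on stdin and print the summed duration in minutes.
# Lines without a "(HH:MM)" or "(D+HH:MM)" last field are ignored.
sum_session_minutes() {
    awk '
        $NF ~ /^\([0-9+:]+\)$/ {
            d = $NF
            gsub(/[()]/, "", d)              # strip the parentheses
            days = 0
            if (split(d, p, "+") == 2) {     # "D+HH:MM" form
                days = p[1]
                d = p[2]
            }
            split(d, hm, ":")
            total += days * 1440 + hm[1] * 60 + hm[2]
        }
        END { print total + 0 }
    '
}

# Possible use (not executed here):
#   last | sum_session_minutes
```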

I think this would be a challenge to implement in a universally workable way using btrfs.

There are a lot of things that can be painful to determine:

  • Ensuring the snapshot has all the required data for a rollback. Since subvolumes can be either nested or mounted, ensuring that you include all the right data isn’t the easiest task.
  • Ensuring that there isn’t too much data in the snapshot. Not everyone keeps all their data in the home directory. Most people wouldn’t expect an update rollback to delete their data.
  • Ensuring that there is a valid kernel with matching kernel modules to roll back to. Because of grub limitations, the kernel will often not be in a btrfs filesystem. However, the modules probably will be.
  • Dealing with the chaos of rolling back a DE update. Since Fedora updates bring new gnome versions, what are the implications of rolling back to an old gnome version with the config for the newer version?

When people build a system with Btrfs in mind, they often consider some of these things when laying out their subvolumes. However, if you are just using Btrfs because it was pre-installed, you probably aren’t considering which subvolume your data is ending up in.

In order to keep it simple, my goal was to implement this only for upgrades that are based upon default partitioning (see the first step of the rollback process in my initial post). If the partitioning is custom, it can be assumed that the user takes care of such things himself/herself.

Therefore, it would only apply if the /home dir is separated. Clearly, the focus has to be that only the os-directories that may be changed during an upgrade are in the snapshot. If a user has a lot of custom data in /var or so, then yes, it would fall back to step 3: if there is not sufficient space, inform the user that there will be no snap and that no rollback will be possible. There is another issue with /var that I had not considered: the target group also tends to have many gigabytes of logs, and I do not like the idea of deleting old logs without asking the user (nor do I like the idea of adding further questions to the upgrade process for the user to consider).

The kernel can be in a btrfs partition, but the default is indeed to keep /boot in ext4. Therefore, this has to be solved differently, as you indicated. Gregory made some suggestions about it before. Storing configs of /home that are changed during upgrades might be no big issue either imho. Generally, I was thinking mostly about the users who don’t care and don’t think about where their data is stored (simply using home), who tend not to customize the os-related partitioning. But you are right: this is already far beyond a simple “snapshot and that’s it”. Therefore, I admit that I am becoming more skeptical about it myself. Too many changes would be necessary for a rollback snapshot that is prepared for all types of upgrades Fedora might experience. The solution would either become more complicated than a few individual functions, or it would have to be tailored to each upgrade. Neither seems justified.

Generally, besides the rollback capability (and if we choose not to implement it), does it make sense to inform the user more clearly that there is no rollback possibility? Yes, it is stated in the Docs, but it might make sense to state once, before the upgrade process begins with dnf system-upgrade reboot, that no rollback is possible and that a backup is needed in case the upgrade doesn’t work. I remember several posts from users who asked whether a rollback is possible AFTER upgrading. On the one hand, some users do not see the Docs at all (the commands are available from many sources), and on the other hand, the Docs are a lot to read, which most people don’t do. Maybe a simple warning with a few seconds to wait before “y” can be chosen, or something like that?
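To make the idea concrete, the behaviour could look like the following bash sketch. It only illustrates the warning-then-wait flow; the function name, the wording and the default delay are made up, and it is not how the upgrade plugin itself is written.

```shell
# Print a warning, wait a few seconds, then ask for confirmation.
# Succeeds only if the user answers y or Y. The delay defaults to 5 seconds.
confirm_no_rollback() {
    local delay=${1:-5} answer
    echo "WARNING: a system upgrade cannot be rolled back." >&2
    echo "Make a backup of your data before you continue." >&2
    sleep "$delay"    # give the user time to actually read the warning
    read -r -p "Continue with the upgrade? [y/N] " answer
    [ "$answer" = "y" ] || [ "$answer" = "Y" ]
}

# Possible use (not executed here):
#   confirm_no_rollback && sudo dnf system-upgrade reboot
```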

Agreed. It’s not an elegant way to determine it. But in my experience, the issues that are so critical that a rollback might be the sole alternative to a re-install, or to not using the os until a bugfix arrives, appear early. So the expectation is just that the user does not upgrade the system and then shut it down for a week without even logging in or starting some basic applications. The idea was a simple solution that does not need noteworthy effort but that could tackle hard-case issues after an upgrade. Counting some shutdowns might be an alternative.

However, I am currently starting to think myself that it is not a good idea. Too many adjustments are necessary, so it will not be a KISS solution, which already begins with the default ext4 /boot partition. A bigger focus on informing users in advance maybe makes more sense.

The others I think can be dealt with (mainly by making the feature opt-in with a sufficient warning). But this one intrigues me. Are you referring to config files under /home? Are you saying that the DE might not be able to handle config files with unknown entries or unparsable config files? If so, is that a problem for Silverblue as well?

I think he is referring to, e.g., a major upgrade of GNOME, which might also lead the upgrade to adjust the user’s configuration files in /home. Nothing that happens on each upgrade, but it is possible and has to be considered. It is a solvable issue, but still another consideration.

How would you know? Let’s say I create a system using default partitioning. Then, later, I add a subvolume. Is this still default partitioning? If it is, what if I put that subvolume somewhere that is important to the system? That will cause breakage.

If it isn’t, and you are saying that you have to have both the default partitioning layout and the default subvolume layout, what about someone who uses snapper? Snapper creates a subvolume to store its snapshots. Would this invalidate the user’s ability to do the rollback? That would probably be unexpected. Snapper isn’t the only tool like that.
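To show how hard "how would you know?" is, here is the most a simple check can do: parse the output of `btrfs subvolume list` and report subvolumes whose parent is not the top level (ID 5), which is where something like a snapper subvolume inside the root subvolume would show up. The bash sketch below uses a hypothetical function name and assumes paths without spaces. It cannot see plain directories such as /data that simply live in the root subvolume.

```shell
# Read `btrfs subvolume list` output on stdin, in the form
#   ID <id> gen <gen> top level <parent> path <path>
# and print the path of every subvolume nested below another subvolume.
list_nested_subvolumes() {
    awk '$5 == "top" && $6 == "level" && $7 != 5 { print $9 }'
}

# Possible use (not executed here):
#   sudo btrfs subvolume list / | list_nested_subvolumes
```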

Again, the problem is “how would you know?”. For example, many people store their data in /data or similar. That could easily be in the subvolume mounted at / and be almost guaranteed to be in the snapshot.

However, if you don’t have a working understanding of how btrfs works, you might not understand that, even if a warning was presented to you.

I certainly have seen issues where that is the case. However, how often does it happen in practice? I suspect not very often. That point was more of a question for me.

Yeah, rolling back the OS isn’t something that should be taken lightly. However, I still think snapshotting the OS periodically could be done in a fairly KISS way. In fact, the core of it would potentially just be two files/scripts: a systemd service installed in the initramfs with an ExecStart=... line to create the snapshots and a corresponding script to delete the snapshots.

I would like to see an optional package like, e.g., dracut-config-snapshot (similar in vein to dracut-config-rescue) that would provide and install such scripts. The package should not be pulled in or installed automatically because of the significant disk space requirements. But it would be a good start to addressing the problem. That part of the process – creating and removing snapshots – should be quite passive and harmless (other than the disk space requirement which is easy to recover from). I would even write such a package myself except that I am not a guru with btrfs.

Just having the snapshot available is enough for the power users to, potentially, be able to recover their system quickly and easily. It might also be possible to talk non power users through the processes on sites such as ask.fp.o. Eventually, people could develop shell scripts to automate the rollback process (including fixing up the /boot partition et al.). But those would always come with a big caveat emptor …. The final fallback would be no worse than the current situation – tell the user to reinstall their system.

Anyway, just my 2¢.

Answering all at once: see step 2 :slight_smile: So no, it would no longer be default. You should not project your knowledge, and the possibilities that arise from it, onto users without a comparable background. As mentioned, if a user customizes the partitioning, I would have assumed he/she knows what he/she is doing. The approach would have been aimed at those who have no idea about it and who start to ask what to do after everything has gone wrong, when no backup is available and the question of whether there is a rollback possibility comes up. Someone who ends up at this point does not tend to have the kind of reasoning that makes him/her consider separate /data partitions.

If /data exists, again, I assume that person has started to take on the responsibility and the considerations himself/herself. /data partitions already imply a sense of data safety and such: not the target group of the “80-20 KISS rollback option” :slight_smile:

As you indicated, a lot of users would have been excluded from this rollback possibility anyway. But it is those who do not think about customizing partitioning or about making their own subvolumes for specific reasons (and who have the default separated /home but no /data) about whom I am worried here :slight_smile:

However, as I said, I already agree that this approach is not feasible: we have already identified too many variables that need to be considered so that the effort does not justify the benefit.

I do not think one needs deep understanding of btrfs to understand a warning that makes aware that a rollback will not be possible, or to understand what would be necessary to enable the rollback possibility: free more space, or you will not be able to roll back. If partitioning was customized, /data and such, we would have assumed anyway the user knows about these realms, and so the snapshot won’t be necessary.

So even if I agree not to implement the rollback option, a clear warning in the upgrade plugin about making backups in advance, and that a rollback will not be possible, is something I keep lobbying for :slight_smile: and related questions have arisen more than once on ask.fp.

A general snapshot function from time to time would indeed offer the user some “forward data safety” that might be useful in some circumstances and be KISS. However, I think it can be even easier: runlevel 1 is already able to make use of btrfs functions by default. Except a few logs in /var/log, I do not see anything that is running at runlevel 1 that might cause I/O-related problems. So here it could be reduced to adding a script and that’s it.

I assume the users who can work with that on their own already have their own solutions in place, tailored to their needs. Does Fedora need to ship an off-the-shelf solution for them?

The question is whether there are scenarios where these snapshots could be used by those users who do not prepare for problems in advance due to a lack of knowledge/experience. But I could imagine that here and there it could make a difference, without much effort. A related Docs page might facilitate the use. As far as I know, btrfs snapshots do not need dedicated space, unlike LVM? (on block-level, I only use LVM for snaps) If this is the case, it indeed would not need much consideration and would be easy to do. I assume Fedora has sufficiently short kernel update cycles so that snapshots are unlikely to become too big.

Well, I already have my own solution. But of course, the way I’ve done it is likely unique compared to the way others have done it. If Fedora were to ship an “off-the-shelf” solution, it would provide a more consistent starting point that others could then build on and improve.

If the intent is to have a rollback living on the system, Silverblue does that already, as do Kinoite, FCOS and IoT.
The problem that originally brought up (continues to bring up) the topic, is that most current Linux users already know some form of recovery that does work for them, while most of the new to Linux users that happen to choose Fedora will not. So for instance the response that the Live USB has everything needed to recover a system that is recoverable, is valid but doesn’t help those who aren’t already equipped with that knowledge. Try searching for something on a non booting system you are unfamiliar with, and can’t get networking going on. Now add an accessibility issue and matters worsen rapidly.

I think that it should not be a part of Anaconda but should be a tool available to Anaconda when installing to an existing system, like Blivet is. I believe it needs to be a standalone, purpose-focused tool, solely for system recovery. I think there is potential to use snapshots of the current WS taken at stable points to provide a rollback image. BTRFS snapshots can be moved, so let’s say we have an unmounted top-level volume that all other subvolumes are created under; these would be mounted as /, /var and /home. The system could then be automated to make a snapshot of itself periodically, and also upon a successful state change. This snapshot could be a btrfs send/receive that creates a subvolume named as applicable, deleting the previous one beforehand. Then, if you want to roll back, unmount the / and /var subvolumes and mv the snapshots to replace / and /var, and if you really wanted to be more harmonious, sync the home subvolume, which you can’t do via send/recv. I would need to look at how it was done, but there is an example of doing just that with a system in some of the advanced BTRFS docs.
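The snapshot-refresh part of that idea can be written down as a sequence of btrfs commands. The bash sketch below only prints the commands so they can be reviewed first; the function name and all paths are hypothetical. The first and last steps assume the previous copy and the staging snapshot exist, so a real tool would need to check for that.

```shell
# Print the commands that would refresh a rollback image of $source.
#   staging:  path for a temporary read-only snapshot (same filesystem as source)
#   dest_dir: directory on the receiving btrfs filesystem
rollback_image_plan() {
    local source=$1 staging=$2 dest_dir=$3
    local name=${staging##*/}    # receive creates a subvolume with this name
    echo "btrfs subvolume delete $dest_dir/$name"
    echo "btrfs subvolume snapshot -r $source $staging"
    echo "btrfs send $staging | btrfs receive $dest_dir"
    echo "btrfs subvolume delete $staging"
}

# Possible use (not executed here):
#   rollback_image_plan / /.staging/root-rollback /mnt/pool/rollback
```

The snapshot is taken read-only because `btrfs send` only accepts read-only subvolumes as its source.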

Systemd doesn’t really use runlevels anymore. But dracut has some “hooks” that work very much like the old sysv init runlevels did (they even prefix a two-digit number to order things within the hook).

Here is a simple way that you could write a script and embed it in the pre-pivot stage/hook during system startup.

First, create a /tmp/snapshot.sh file containing only the following line. It is a shell script that will be “sourced” during system startup.

uname -r > /sysroot/snapshot.log

Next, run the following command to embed the script in your current initramfs.

sudo dracut --force --include /tmp/snapshot.sh /usr/lib/dracut/hooks/pre-pivot/snapshot.sh /boot/$(</etc/machine-id)/$(uname -r)/initrd

And that’s it! When you reboot your PC, the script should be sourced during the pre-pivot stage of system startup and it should create a “snapshot.log” file at the root of your filesystem containing the current kernel version. If this works for you, then you should be able to substitute some btrfs commands that will create snapshots of the root filesystem under whatever conditions you decide to test.
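A small bash check for whether the hook actually ran after the reboot is sketched below. The function name `hook_ran` is hypothetical; the log path and the expected content come from the steps above. If no log appears, one possible cause (an assumption, not tested here) is that the root filesystem is still mounted read-only at the pre-pivot stage, in which case the redirect in the script would fail.

```shell
# Succeeds if the log written by the hook contains the given kernel version.
# Defaults: /snapshot.log and the version of the running kernel.
hook_ran() {
    local log=${1:-/snapshot.log} kver=${2:-$(uname -r)}
    [ -r "$log" ] && [ "$(cat "$log")" = "$kver" ]
}

# Possible use after rebooting (not executed here):
#   hook_ran && echo "pre-pivot hook ran for $(uname -r)"
# To check that the script made it into the image, something like
#   sudo lsinitrd <path to the initramfs image> | grep snapshot.sh
# should list it.
```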

Once you’ve developed the snapshot.sh script, the next step would be to write an actual dracut module that would embed the script in the initramfs and then package that as an RPM.

HTH,
gb
