Standard/simple way to repair a corrupted deployment?

Relatively new here, so I could be about to commit a noob-error. If this isn’t posted in the right place or is otherwise inappropriate, please point me to the right place.

Those worries aside, however -

Should one detect checksum errors via btrfs check or ostree fsck that can be traced back to files in the immutable part of the system, what is the standard way to resolve this?

In my noob-naivety, I imagined this would simply be a matter of pulling a good image from the repo, rebooting/switching to it, and leaving the bad one to wither and die in the wake of successive updates (and I feel that’s how it should work, even if just by exception when a repair needs doing).

However, I’ve discovered it doesn’t work that way at all. The issue, as I understand it, is that in practice each new image just links to files already on the system unless it contains an updated version of them, so if one of those files has become corrupted, the corruption persists across updates. (Is that broadly correct, those who actually understand this? Pretty sure my understanding is incomplete…)
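To illustrate the mechanism described above, here is a toy sketch using plain hard links in a temporary directory. It is not a real OSTree repo: the directory names (`repo`, `deploy-old`, `deploy-new`) and the function name `demo_shared_object` are made up for the illustration. The assumption it models is that deployments are hard-link checkouts of objects in a shared store, so an unchanged file in a "new" image points at the same stored object as the "old" one.

```shell
#!/bin/sh
# Toy model only: plain hard links in a temp dir, NOT a real OSTree repo.
# All paths and names here are hypothetical.
set -eu

demo_shared_object() {
    work=$(mktemp -d)
    mkdir -p "$work/repo" "$work/deploy-old" "$work/deploy-new"

    # One stored "object", checked out into two "deployments" as hard links.
    printf 'good library contents\n' > "$work/repo/object"
    ln "$work/repo/object" "$work/deploy-old/lib.so"
    ln "$work/repo/object" "$work/deploy-new/lib.so"

    # Damage the stored object in place (same inode, no new file is created).
    printf 'CORRUPTED\n' > "$work/repo/object"

    # Both deployments now read the damaged data.
    echo "old: $(cat "$work/deploy-old/lib.so")"
    echo "new: $(cat "$work/deploy-new/lib.so")"

    rm -rf "$work"
}

demo_shared_object
```

If the real system behaves like this model, an update that does not ship a new version of the damaged file has no reason to re-download it, which would match what I saw.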

There doesn’t seem to be an obvious/standard feature one could use to trigger an exception to this and get it to pull an entire image, regardless of what’s already present on the system (correct?). Thus, we are reduced to hacky/counter-intuitive/don’t-actually-work-in-practice (delete depending on specific circumstances) workarounds instead.

Some context: Initially drawn to the idea that the underlying OS was theoretically uncorruptible and that if something did go wrong I could just roll back, and having had a good experience with SteamOS (also immutable) beforehand, I’ve been daily driving Kinoite for a couple of months now to see what the implications of living with it while doing “normal things” are. Separately, I’ve been doing the same with a Bazzite system for a spot of gaming on a desktop (which is also atomic so, in theory, has the same advantages). On both systems, files in the supposedly immutable part of the system became corrupted (csum errors) and, worse, I didn’t spot the issue until the last good image I’d had had long since disappeared from the respective systems. From that point onwards, their inaccessibility seemed to just be a liability.

I don’t know why the images got corrupted, but it doesn’t really matter because, from my experience and what I’ve read on forums, Reddit, etc. (see below), it’s not an uncommon occurrence - certainly common enough that there should be a straightforward way to “reset-to-clean”.

I managed to “clean” a Fedora Kinoite 43 system with corrupted files in the liblpcnetfreedv.so library (part of the OS) by rebasing to the updates stream, rebooting, rebasing back to the regular kinoite stream (so I was back on the stream I was on previously) and rebooting again. The rebasing seems to trigger a fresh image repaving, but a regular upgrade doesn’t, hence having to resort to this two-step. This seems like quite a palaver when all I was trying to do was just get a good image of the deployment I was already on (and it took me an entire day to figure out/stumble on this as a “solution” for my particular case after going down many dead end rabbitholes first).
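For anyone wanting to try the same two-step, here is a minimal sketch of what I did. The ref names are assumptions based on the usual Fedora naming pattern, so check `rpm-ostree status` for the refspec your system is actually on before using them. The helper names (`run`, `phase_ref`, `rebase_step`) and the `DRY_RUN` and `PHASE` variables are made up for this sketch. By default it only prints the commands; it is not guaranteed to trigger a full re-download on every system.

```shell
#!/bin/sh
# Sketch of the "rebase away, reboot, rebase back, reboot" two-step.
# ASSUMPTION: both ref names below are illustrative guesses; verify yours
# with `rpm-ostree status` before running anything for real.
set -eu

CURRENT_REF="fedora:fedora/43/x86_64/kinoite"          # assumed: the stream you are on
DETOUR_REF="fedora:fedora/43/x86_64/updates/kinoite"   # assumed: the temporary detour

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Map a phase name to the ref to rebase onto.
phase_ref() {
    case "$1" in
        away) echo "$DETOUR_REF" ;;
        back) echo "$CURRENT_REF" ;;
        *) echo "unknown phase: $1" >&2; return 2 ;;
    esac
}

# One phase = one rebase plus one reboot, so the script is run once per phase.
rebase_step() {
    ref=$(phase_ref "$1")
    run rpm-ostree rebase "$ref"
    run systemctl reboot
}

rebase_step "${PHASE:-away}"
```

Run it once with `PHASE=away`, let the machine reboot, then run it again with `PHASE=back`. Splitting it into phases is deliberate, since a single script cannot carry on past the reboot.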

Using its rebasing guide, I eventually figured out how to do the same with Bazzite, but sadly the two-step didn’t have the same effect there (ostree chunk layers already present: 123, ostree chunk layers needed: 5… hmm). So, it seems it’s set on hanging on to its borkage and I’ll probably just nuke and repave with a fresh install of the OS. While this is fine in one sense - I have backups and so on - I feel it shouldn’t have to be this way. There should be a way out. The whole point of the atomic/immutable thing was to be more reliable, I thought, not the all-or-nothing affair that has been my experience so far.

At this point I’d - believe it or not - quite like the “well acktually…” crowd to jump in and correct my mental model. Am I missing something really obvious? If not and I’m - somehow - on to something, how to take it forward?

Things I’d digested before posting:

In the troubleshooting section of the docs page you’ve mentioned there is a link to the workaround.

It is basically a 10-year-old post with the steps to be carried out. Assuming the steps are still valid (haven’t tested them), you would need to replace a few things (the RPM-OSTree status command, the name of the remote when running the ostree pull command, and of course the commit ids).

@tqcharm, thanks for responding; however, that’s the same link as my second bullet.

I couldn’t get it to work - at least not yet - either by following it closely (just got a bunch of “… read-only filesystem” type messages, which, to be honest, is what I was expecting by that point), or by “riffing” off of it by correctly subbing in commands (I don’t have the knowledge and it seems it’s beyond my google/chatbot skills).

I suspect it doesn’t work at all, but that’s hardly definitive - after two days of slowly building up knowledge (on just this one issue) through trial-and-error, I probably now only know enough to be dangerous. That said, if I’ve understood it correctly, the workaround is predicated on the underlying object store being mutable for the root user, but I don’t think it is, on Kinoite? Or I missed a step to make it mutable for root when it normally isn’t? Or something has happened to make it not mutable for the root user when it normally is? No idea…

…which is kind of my point. How is any “normal person” supposed to know how to get out of this (afaik) not uncommon mess (rhetorical)? I don’t think they would be able to parse the 10-year-old post we both found (pertaining to a different distro with different commands), even if they did, somehow, understand the theory behind it is “the underlying object store is mutable for the root user” (not exactly the sort of phrase I’d expect them to have in the back of their mind while they puzzled this out, let alone know what any of the words meant).

I think the broader problem is:

  • The theoretical resilience of these atomic systems is to be able to just rollback when something goes wrong (correct?). However, that’s of no use if your current deployment is wrong because it’s referencing corrupted files in the one you’re rolling back to.
  • Something like rpm-ostree repair (which, say, downloaded a fresh image with no references back to the old one - or through some other means achieved a repave/reset) isn’t a thing.
  • rpm-ostree rebase <tree>, to force a resync to a known-good ostree reference, seems to be the closest thing there is, but (1) the one tree you can’t use this with is the one you’re already on, (2) it requires knowledge of a sensible alternative ostree reference to involve in the process and (3) setting up the conditions for it to achieve the desired effect isn’t obvious or guaranteed (cf. my admittedly slightly off-topic Bazzite example above, where even the rebase didn’t do a full repave, although that world is further complicated by having containerised OS images).

I think I’m not the only one? e.g. RFE: filesystem errors: detect them, make recovery easier · Issue #3994 · coreos/rpm-ostree · GitHub & How to deal with corrupted file objects in ostree? - General - Universal Blue, etc.

So (widening this out to the whole audience again - I’m not ragging on you @tqcharm - I appreciate any reply):

  • is there a simpler way to deal with this that I’ve missed?
  • if there isn’t a simpler way but you agree there should be one, how do I/we push that thought?

(sorry, another long one, I’m aware)

For anyone following this with a similar issue (ostree fsck returns i/o errors and/or btrfs check returns csum errors that can be traced back to the immutable part of the filesystem), this is a reasonable summary of what I understand so far (but beware - this is almost certainly still incomplete/flawed. Hilariously, that AI summary is including this very discussion in its sources. Take the whole thing with an iceberg of salt).

I have been using Silverblue on my systems for several years already, but have never had to deal with file system checks/errors.

Looking at the docs again, unless I’m missing something, the read-only nature of the file system makes the proposed workaround indeed obsolete.

A not-so-nice but possibly functional workaround I am thinking of right now would be to rpm-ostree rebase, which would (presumably) download the complete objects, not only deltas. The rebase needs to be performed twice (away from the current ref and then back to it), given that rebasing to the same remote ref would result in an error.

I guess you could start a discussion in Fedora Atomic Desktop SIG’s issue tracker. If this issue similarly affects bootc as well, it might get the expected attention.

I’d already clocked the rpm-ostree rebase idea. Curious to know where you found it though?

For me, it worked in one scenario but not in another. For whatever reason, it doesn’t always trigger a full repave and sometimes still tries to reference chunks in the OSTree repo (even if some of those chunks have been deleted by ostree fsck --delete, believe it or not - you would have thought it would have spotted that…).

What’s bootc? Something to do with using containerised images like (e.g.) Bazzite does?

Thanks for the pointer to the issue tracker. I’ll have a go…

Didn’t find it. I was browsing the man page of rpm-ostree hoping to find a flag to the upgrade command which would pull the full commit, and since I didn’t find any, I just thought of the rebase option, considering that in a rebase the whole image should be pulled (or so I thought).

Yes, Universal Blue uses bootc in its offerings for image deployments, and there is a plan for Fedora to switch to it as well (replacing OSTree) once it is considered ready, though unofficial bootable container images already exist. See the below thread for details: