Relatively new here, so I could be about to commit a noob-error. If this isn’t posted in the right place or is otherwise inappropriate, please point me to the right place.
Those worries aside, however -
Should one detect checksum errors via btrfs check or ostree fsck that can be traced back to files in the immutable part of the system, what is the standard way to resolve this?
In my noob-naivety, I imagined this would simply be a matter of pulling a good image from the repo, rebooting/switching to it, and leaving the bad one to wither and die in the wake of successive updates (and I feel that’s how it should work, even if just by exception when a repair needs doing).
However, I’ve discovered it doesn’t work that way at all. The issue, as I understand it, is that in practice each new image just links to files already on the system unless it contains an updated version of them, so if one of those files has become corrupted, the corruption persists across updates. (Is that broadly correct, those who actually understand this? Pretty sure my understanding is incomplete…)
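To make that mental model concrete, here is a toy sketch in Python. It is not ostree's code and none of the names (`ToyRepo`, `pull`, `fsck`) correspond to a real API; it only assumes that objects are stored under the checksum they had when first fetched, and that a pull skips anything already present under that name. The repair step at the end is likewise just what works in the toy, not a claim about the real tooling.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyRepo:
    """Hypothetical content-addressed store: objects are keyed by the
    checksum they had when they were first fetched."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def pull(self, remote_files: dict[str, bytes]) -> dict[str, str]:
        """Return a 'commit' (path -> object key), fetching only objects
        whose key is not already present locally."""
        commit = {}
        for path, data in remote_files.items():
            key = checksum(data)
            if key not in self.objects:  # present by name => skipped, content not re-read
                self.objects[key] = data
            commit[path] = key
        return commit

    def fsck(self) -> list[str]:
        """Keys whose stored content no longer matches the key."""
        return [k for k, data in self.objects.items() if checksum(data) != k]


repo = ToyRepo()
commit1 = repo.pull({"/usr/lib/libfoo.so": b"library v1", "/usr/bin/tool": b"tool v1"})

# Simulate on-disk corruption of one stored object.
bad_key = commit1["/usr/lib/libfoo.so"]
repo.objects[bad_key] = b"library vX"

# An "upgrade" that changes only the tool: the library is already present
# by name, so the corrupted copy is reused as-is.
image_v2 = {"/usr/lib/libfoo.so": b"library v1", "/usr/bin/tool": b"tool v2"}
commit2 = repo.pull(image_v2)
corrupt_after_upgrade = repo.fsck()
print("corrupt after upgrade:", corrupt_after_upgrade == [bad_key])

# In the toy, the way out is to drop the bad object so the next pull
# sees it as missing and fetches it again.
for key in corrupt_after_upgrade:
    del repo.objects[key]
repo.pull(image_v2)
print("corrupt after re-pull:", repo.fsck())
```

If the real thing behaves anything like this, it would explain why a plain upgrade never replaces the bad file: nothing ever asks for it again.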
There doesn’t seem to be an obvious/standard feature one could use to trigger an exception to this and get it to pull an entire image, regardless of what’s already present on the system (correct?). Thus, we are reduced to hacky/counter-intuitive/don’t-actually-work-in-practice (delete depending on specific circumstances) workarounds instead.
Some context: Initially drawn to the idea that the underlying OS was theoretically uncorruptible and that if something did go wrong I could just roll back, and having had a good experience with SteamOS (also immutable) beforehand, I’ve been daily driving Kinoite for a couple of months now to see what the implications of living with it while doing “normal things” are, and separately, doing the same with a Bazzite system for a spot of gaming on a desktop (which is also atomic so, in theory, has the same advantages). On both systems, files in the supposedly immutable part of the system became corrupted (csum errors) and, worse, I didn’t spot the issue until the last good image had long since disappeared from the respective systems. From that point onwards, the inaccessibility of those files seemed to just be a liability.
I don’t know why the images got corrupted, but it doesn’t really matter because, from my experience and what I’ve read on forums, Reddit, etc. (see below) it’s not an uncommon occurrence - certainly common enough that there should be a straightforward way to “reset-to-clean”.
I managed to “clean” a Fedora Kinoite 43 system with corrupted files in the liblpcnetfreedv.so library (part of the OS) by rebasing to the updates stream, rebooting, rebasing back to the regular kinoite stream (so I was back on the stream I was on previously) and rebooting again. The rebasing seems to trigger a fresh image repaving, but a regular upgrade doesn’t, hence having to resort to this two-step. This seems like quite a palaver when all I was trying to do was just get a good image of the deployment I was already on (and it took me an entire day to figure out/stumble on this as a “solution” for my particular case after going down many dead-end rabbit holes first).
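For anyone wanting to see the two-step spelled out, here is a small Python sketch that only prints the sequence of commands and runs nothing. `rpm-ostree rebase` and `systemctl reboot` are real commands, but the two ref strings are placeholders assumed for illustration; the actual names of the current stream and the updates stream should be taken from `rpm-ostree status` on the system in question.

```python
import shlex


def two_step_rebase(current_ref: str, detour_ref: str) -> list[list[str]]:
    """Commands for: rebase to a detour stream, reboot, rebase back, reboot."""
    return [
        ["rpm-ostree", "rebase", detour_ref],
        ["systemctl", "reboot"],
        ["rpm-ostree", "rebase", current_ref],
        ["systemctl", "reboot"],
    ]


# ASSUMPTION: placeholder ref names, not confirmed by the post above.
CURRENT_REF = "fedora:fedora/43/x86_64/kinoite"
DETOUR_REF = "fedora:fedora/43/x86_64/updates/kinoite"

steps = two_step_rebase(CURRENT_REF, DETOUR_REF)
for step in steps:
    # Dry run: print each command instead of executing it.
    print(shlex.join(step))
```

Printing rather than executing is deliberate: each rebase needs a reboot in between, so this is a checklist to follow by hand, not something to run in one go.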
Using its rebasing guide, I eventually figured out how to do the same with Bazzite, but sadly the two-step didn’t have the same effect there (ostree chunk layers already present: 123, ostree chunk layers needed: 5… hmm). So, it seems it’s set on hanging on to its borkage and I’ll probably just nuke and repave with a fresh install of the OS. While this is fine in one sense - I have backups and so on - I feel it shouldn’t have to be this way. There should be a way out. The whole point of the atomic/immutable thing was to be more reliable, I thought, not the all-or-nothing affair that has been my experience so far.
At this point I’d - believe it or not - quite like the “well acktually…” crowd to jump in and correct my mental model. Am I missing something really obvious? If not and I’m - somehow - on to something, how to take it forward?
Things I’d digested before posting:
- Fedora Atomic Desktops Troubleshooting Page, “ostree fsck reports file corruption”.
It says “The workaround in this case is to mark the entire OSTree commit as partially retrieved and then re-pull the commit.” How? What are the commands for this?
- a process outlined in a github conversation, linked to from the troubleshooting page. Can’t follow the template here because it involves operations that get “… Read-only Filesystem” responses i.e. doesn’t seem to apply to kinoite et al.
- a redditor’s comment that there “might be a gap in the current ostree tooling”, albeit 3 years ago (related: ostreedev/ostree issue #3092, “solving `ostree fsck` errors when using container images?”, and ostreedev/ostree issue #2499, “support ostree-ext tar mode”)
- someone else with the same issue, with silverblue and another one
- lots of nonsense (and dangerous or non-existent commands) spouted by various AI chatbots on the subject before deciding it was their special way of telling me they had no idea.