Trouble with live reinstallation

Hi, we are doing “complete reinstalls” on some systems. This has worked pretty well so far, but I think we somehow broke a dependency chain or the like when our manifest locks became invalid. Let me explain:

How we are updating
We are using kexec to boot into a new kernel, initramfs and rootfs image to do essentially what coreos install also does. We provide some kargs to the next boot that specify the ignition file to provision. Below are the commands, which I hope are pretty self-explanatory:

sudo curl -o /var/reinstall/kernel
sudo curl -o /var/reinstall/initrd.img
sudo curl | sudo dd oflag=append conv=notrunc of=/var/reinstall/initrd.img
sudo kexec -l /var/reinstall/kernel --initrd="/var/reinstall/initrd.img" --append="coreos.inst.install_dev=/dev/mmcblk0 console=ttyS0 coreos.inst.ignition_url= coreos.inst.insecure ip="
sudo systemctl kexec
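
A small pre-flight check before the `kexec -l` step can rule out one simple failure mode: a download that silently produced a missing or empty file. The sketch below is only an illustration; `check_artifacts` is a hypothetical helper name, and it only tests that each file exists and is non-empty, not that its contents are valid. Separately, curl's `--fail` option makes curl exit non-zero on an HTTP error instead of saving the server's error page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any CoreOS tooling): succeed only if
# every file passed as an argument exists and is non-empty.
check_artifacts() {
    local f status=0
    for f in "$@"; do
        # -s is true only for an existing file with size greater than zero.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

It could be called as `check_artifacts /var/reinstall/kernel /var/reinstall/initrd.img` right after the downloads, aborting the reinstall if it returns non-zero.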

Some links where basically the same is done:
Reinstall POC, entrypoint
Relevant PR
Manual equivalent

Our image
We are building our image, i.e. the “initramfs”, “rootfs” and “kernel” files you see above, with CoreOS Assembler. We use pretty much the same base as official CoreOS, but have added some extra packages and an “overlay.d” layer.

What is failing
Everything worked flawlessly until recently, when I added some updates to our overlay.d layer and we tried rebuilding the image. The build failed, reporting that some packages could not be found; I assumed some package repo no longer exists. Some googling led us to this. That in turn led me to “clean things up” and remove the “lock” files (e.g. src/config/manifest-lock.x86_64.json), the cache folder and the overrides. This restored the ability to build an image.

It seems however that this new image is incompatible with the “old” one, old meaning the image built before the “cleanup”.

Previously we could rebuild images and reinstall them just fine. On the HTTP server hosting all of these files we could see all requests passing by, including the retrieval of the ignition file, config.ign. That last request, for the ignition file, we no longer see. This leads me to believe that this new image, whichever part of it that may be, is incompatible with the old one, causing the “initramfs initialization process” to fail before it makes that request (am I correct that the initramfs is indeed the “part” making this request?). The question is why…

What could help me forward

  • First and foremost, the only things I have to debug this with are the request logs of the HTTP server and the timing of a screen getting power/backlight. I’d very much like some pointers as to which component could be at fault here and how to even start properly debugging these, possibly early, stages.
  • Some clarification on what stages are being executed in my workflow. Where does the ignition request come from? Has the machine rebooted fully into the new kernel at this time? Etc…
  • Some clarification on what these lock files are, along with the cache and overrides directories. What have I actually done by “cleaning up”? And how could this “Package not found” error appear so suddenly? Is a package repo missing? Are there alternative ways of fixing this besides “cleaning up”?
  • Any and all thoughts on what could be causing this “incompatibility”.

Your help is greatly appreciated!

Ideally you should update the lock files to match the versions of the RPMs that you want to include in the image and not remove them, to keep all the locked packages that we have specified.

Otherwise you’re getting the latest from Fedora, and there might be issues from time to time.

I think what might be happening is that you were using really really old lockfiles (and thus RPMs) because you weren’t updating your config repo at all. When you deleted the lockfiles you probably got all latest content but with super out of date config.
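
One way to check how stale a lockfile is would be to ask git when it was last changed in the config repo. The sketch below is only an illustration; `lockfile_last_updated` is a hypothetical helper name, and it assumes the lockfile is tracked in git.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the date (YYYY-MM-DD) of the last commit that
# touched a given file in a git repository.
#   $1 = path to the repository checkout
#   $2 = file path relative to the repository root
lockfile_last_updated() {
    local repo="$1" path="$2"
    git -C "$repo" log -1 --date=short --format=%cd -- "$path"
}
```

For example, `lockfile_last_updated src/config manifest-lock.x86_64.json` would print the date of the last commit that modified that lockfile; empty output means git has no history for that path.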

Can you try to rebase your fork on top of the latest testing-devel from the coreos/fedora-coreos-config repository on GitHub?
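
A rebase like that could look roughly like the sketch below. This is only an illustration: the function name `rebase_on_upstream` and the remote name `upstream` are assumptions, and it has to be run from inside the checkout of your fork with the branch you want to rebase checked out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rebase the currently checked-out branch onto a branch
# of the upstream repository.
#   $1 = upstream URL (defaults to the repository named in the reply)
#   $2 = upstream branch (defaults to testing-devel)
rebase_on_upstream() {
    local upstream_url="${1:-https://github.com/coreos/fedora-coreos-config}"
    local branch="${2:-testing-devel}"
    # Add the remote only if it is not configured yet.
    git remote get-url upstream >/dev/null 2>&1 \
        || git remote add upstream "$upstream_url"
    git fetch upstream "$branch"
    git rebase "upstream/$branch"
}
```

Calling `rebase_on_upstream` with no arguments uses the defaults. If the rebase stops on conflicts (the lockfiles are a likely candidate), they have to be resolved by hand before continuing with `git rebase --continue`.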