Failing to boot after atomic desktop rebase from 39 to 40

I am having issues rebasing a Fedora Atomic Desktop from 39 to 40. The deployment stages, but during the reboot I just get a black screen and it reverts to the previous deployment.

I get no errors or any other information; it’s just a black screen for a bit until the previous deployment starts. I have tried getting to the GRUB menu multiple times, but I haven’t been able to while the rebase was staged. Holding Shift does nothing; there’s just a black screen, eventually followed by the previous deployment booting. I have been able to get to the GRUB menu reliably when there was no rebase staged.

I’m using Intel integrated graphics. I’ve found some posts about NVIDIA issues, but as far as I’ve seen those also display some errors instead of just showing an entirely black screen.

My latest attempt was to 40.20240521.0.

I can’t find a boot that seems related to the failed 40 boot using journalctl --list-boots, and I’m getting some unexpected output. The computer was off for about a day. It’s a laptop and stayed charged the whole time, so I don’t think there’s any BIOS clock power issue or significant drift. It was on for at most a few hours while I rebased to 40; this would have been during the -1 boot. The first and last entry times in journalctl --list-boots look correct for boot 0, but it shows boot -1 as spanning over 5 hours, which it definitely did not. journalctl -b -1 gives logs with timestamps that span the range shown for the -1 boot in --list-boots, but journalctl -b -0 gives logs with timestamps starting about an hour after the supposed first entry of boot -1 and running to the current time, about 4 hours later.

To summarize the journal issue: --list-boots shows boot -1 as being around 5 hours long, starting significantly earlier than it actually did, while boot -0 looks correct. -b shows boot -1 as it appears in --list-boots, but shows boot -0 as starting about an hour after boot -1 started and continuing to now. In reality, boot -1 lasted around an hour, maybe less, and boot -0 has only lasted 10-20 minutes.
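For reference, this is roughly how I’m comparing the two boots (the head/tail pipes just pull out the first and last timestamps; the boot indices are as reported on my machine):

$ journalctl --list-boots
$ journalctl -b -1 -o short-precise | head -n 1
$ journalctl -b -1 -o short-precise | tail -n 1
$ journalctl -b -0 -o short-precise | head -n 1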

Does anyone have any idea what could be wrong? Is there anything that I can do to get more info?

Also, is there any way to prevent a staged deployment from being removed if it fails to boot? I tried to pin it, but you can’t pin staged deployments. It has wasted a lot of time having to redo the entire rebase every time I tried to get into GRUB.
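(For context, the pin attempt was just the standard command below, which works fine for non-staged deployments; the index here is a placeholder for whatever the staged entry would be, and it’s rejected while the deployment is only staged.)

$ sudo ostree admin pin <deployment-index>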


Hello @talan0,
Welcome to :fedora: !

I would start by looking at the available space in /boot; it could simply be that there isn’t enough free space to write and it silently fails. Usually, though, in that case you do get an error at the end about not completing, with no reason other than “could not write”. Some time ago, the ESP on my Lenovo E530 was full of stuff unrelated to boot (Lenovo BIOS diagnostic info), and that prevented the bootloader from updating due to lack of space, even though it appeared I had ample. In that case I had to delete the diagnostic file created by the Lenovo diagnostic tool.
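If you want to double-check, something like this should show it (paths assume the usual Fedora mount points; adjust if your ESP is mounted elsewhere):

$ df -h /boot /boot/efi
$ sudo du -sh /boot/efi/*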

Thank you for the suggestion. Unfortunately it looks like that’s not the issue.

Before rebasing, df -h shows only about 30% usage on /boot and 2% on /boot/efi. The amount of used space doesn’t change after the new deployment is staged, or at least not significantly enough to be seen with the -h flag, which is giving me sizes in megabytes, so maybe I’m misunderstanding something or it doesn’t write there until the boot itself.

Just to try to rule out other storage space issues, the only filesystem shown by df with a higher percent usage than /boot is /sys/firmware/efi/efivars at around 40%.

Adding info to the journal issues:
The current -1 boot, which was boot 0 in my original post, still shows what look to be correct times for the first and last entry, but the new boot 0 shows a first entry that’s about half an hour after the last entry of the current -1 boot, and a correct last entry time. In reality, this boot started many hours after the current -1 boot.

I’d be wary about attributing too much to an RTC issue, if that’s what you’re thinking. But it is interesting that the timestamps seem so out of sync with what you expect. Can you set the default deployment to the desired upgrade commit using rpm-ostree or ostree? Does the upgrade actually finish? If it didn’t, before doing the upgrade did you first update the current F39 Atomic (which one?)?

I found that ostree admin set-default allows you to set the default deployment index, but I can’t find a way to use either rpm-ostree or ostree to specify a deployment commit. Could you explain how I can do this?

It does appear to finish staging with no issues. After running the rebase I get Changes queued for next boot. Run "systemctl reboot" to start a reboot, and I’m able to verify that the 40 deployment is staged with both rpm-ostree status and ostree admin status.

Yes, I’m currently on Kinoite 39.20240520.0, which appears to be the latest one.

So your thinking is that it fails during boot and then falls back to the previous deployment? I wonder if it actually boots the commit, or even attempts to. You can use rpm-ostree status to list your deployments. Once you know which one you want, you can deploy it directly to make sure it functions as expected using rpm-ostree deploy <commit>. If you have no success, you can also use ostree with sudo to do the same thing: sudo ostree admin status to see the commits it’s aware of, then sudo ostree checkout <commit> to immediately change into it as your running system.
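Roughly something like this (the checksum is a placeholder; use the BaseCommit value from the status output):

$ rpm-ostree status
$ sudo rpm-ostree deploy <base-commit-checksum>
$ sudo ostree admin status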


Hello everyone!

On my old T530, with UEFI Secure Boot enabled, Fedora 40 Silverblue doesn’t load GRUB. Changing the boot order to UEFI first, legacy second results in GRUB loading. I know that isn’t a solution… By the way, what’s the default password for the T530’s BIOS?

Regards

Yes, that’s my best guess right now. With it staged, the screen goes black for about 10 seconds between the BIOS splash and the LUKS prompt, and it looks like it flickers off then back on right before LUKS comes up.

Using rpm-ostree deploy <commit> I get similar output to when I do the rebase; the only difference I see is slightly different wording in the reboot instruction. Using sudo ostree checkout <commit> I get no output, but also no errors. After each, rpm-ostree status and ostree admin status both still show 40 as staged, the same as they do after a rebase, and the issue persists.

Hello @talan0,
What was the rebase command used? Also, I’m not sure if it’s possible, but there is rpm-ostree apply-live, which is supposed to apply the pending deployment changes to the running commit. I’m just not certain what it would do in this case for you (i.e., would it actually update the kernel or drivers?). You could redo the rebase as well.
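If you want to try it, it’s just the one command (a sketch; as said, I’m not sure whether it would cover the kernel in this case):

$ sudo rpm-ostree apply-live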

rpm-ostree rebase fedora:fedora/40/x86_64/kinoite

For the earlier rpm-ostree deploy, I ran that rebase command and then ran rpm-ostree deploy <commit> with <commit> replaced with the BaseCommit provided by rpm-ostree status for the staged 40 deployment.

I haven’t tried apply-live yet. I may be wrong or missing something, but my understanding is that it doesn’t actually make the staged deployment live; it just writes a duplicate over the running system. As long as I’m not overlooking anything, this wouldn’t affect the boot of the staged deployment.

I might be misunderstanding what you mean here, but I have been running rpm-ostree rebase fedora:fedora/40/x86_64/kinoite each time I make an attempt. Once it reboots into 39, the staged deployment is no longer in the list of deployments; only the deployments that existed prior to staging 40 remain.

So if you just do an update of your F39 install, does that work? The reason I asked to re-rebase was to make sure there was no chance it was just a bad upgrade. Before making another rebase attempt, make sure to cancel any pending rpm-ostree operations and flush out its cached metadata to get rid of any bits from failed rebase attempts: rpm-ostree cancel, then rpm-ostree refresh-md -m. Also, if you are using an NVIDIA card you may have to remove the overlaid NVIDIA drivers (if using them), then re-install them during the transaction.
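In other words, something along these lines before retrying (the ref is the one from your earlier post; drop or add flags as you prefer):

$ rpm-ostree cancel
$ rpm-ostree refresh-md
$ rpm-ostree rebase fedora:fedora/40/x86_64/kinoite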

Yes, I don’t think I’ve ever had an issue with normal updates. I’ve just updated to 39.20240522.0 using rpm-ostree update and didn’t run into any issues.

Something I noticed during this update is that the black-screen behavior I described earlier, between the BIOS splash and LUKS, happened after that reboot too. After testing the rebase again using the steps below, I restarted to see if the black-screen behavior happens without any staged deployment, and it does. So it seems this is unrelated to the problem; it was just something I noticed while looking for issues that I had never noticed when I wasn’t focused on finding something.

Given that I was wrong about the startup behavior being different, and it actually seems to be almost identical to any other boot, I’m wondering if there’s anything I can do to verify that it is actually attempting to boot into the staged 40 deployment. Normally, I can get into the GRUB menu reliably and see the list of available deployments, but I have never been able to get into it during a restart with 40 staged, and I have no clue why that would be the case.

Here is the procedure I followed today:

  1. rpm-ostree update, resulting in a staged update to 39.20240522.0
  2. systemctl reboot to deploy the update
  3. rpm-ostree status to verify that the update was applied
  4. rpm-ostree update to verify that there are no more updates
  5. rpm-ostree cancel, which output No active transactions.
  6. rpm-ostree refresh-md which completed without errors
  7. rpm-ostree rebase fedora:fedora/40/x86_64/kinoite to stage the update to 40.20240522.0
  8. rpm-ostree status to verify that the update is staged
  9. systemctl reboot

After the reboot, rpm-ostree status gives the same output as in step 3: the update to 39.20240522.0 is deployed, and it is as if I never attempted the rebase.

I don’t have an NVIDIA card or any layered NVIDIA drivers.

Can you look for the logs of the ostree-finalize-staged.service unit to see if there is an error there?

$ sudo journalctl -u ostree-finalize-staged.service

Thank you for the suggestion; there is indeed an error. It appears to be failing to finalize the deployment due to an SELinux policy issue:

systemd[1]: Finished ostree-finalize-staged.service - OSTree Finalize Staged Deployment.
systemd[1]: Stopping ostree-finalize-staged.service - OSTree Finalize Staged Deployment...
ostree[6897]: Finalizing staged deployment
ostree[6897]: Copying /etc changes: 1362 modified, 1 removed, 125 added
ostree[6897]: Copying /etc changes: 1362 modified, 1 removed, 125 added
ostree[6897]: Refreshing SELinux policy
ostree[6909]: Re-declaration of type virt_bridgehelper_t
ostree[6909]: Previous declaration of type at /etc/selinux/targeted/tmp/modules/100/virt_supplementary/cil:5
ostree[6909]: Bad type declaration at /etc/selinux/targeted/tmp/modules/100/virt_supplementary/cil:5
ostree[6909]: Failed to build AST
ostree[6909]: semodule:  Failed!
ostree[6897]: Refreshed SELinux policy in 8451 ms
ostree[6897]: error: Finalizing deployment: Finalizing SELinux policy: Child process exited with code 1
systemd[1]: ostree-finalize-staged.service: Control process exited, code=exited, status=1/FAILURE
systemd[1]: ostree-finalize-staged.service: Failed with result 'exit-code'.
systemd[1]: Stopped ostree-finalize-staged.service - OSTree Finalize Staged Deployment.

Here are the changes that I can think of that I’ve made related to SELinux on my install:

To try to rule out any of the packages causing issues, I tried to do a rebase using the --uninstall= flag to stage a deployment of 40 without any additional packages layered. I then used rpm-ostree status to verify that the staged deployment did not list any LayeredPackages or LocalPackages. After rebooting, I still experienced the issue. I checked the ostree-finalize-staged.service logs and it seems the same error still occurred.
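For reference, this is the shape of the command I used (the package names are placeholders for whatever rpm-ostree status listed as layered on my system):

$ rpm-ostree rebase fedora:fedora/40/x86_64/kinoite --uninstall=<layered-pkg-1> --uninstall=<layered-pkg-2>
$ rpm-ostree status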

Can you try resetting your SELinux policy to the default one using Troubleshooting :: Fedora Docs and then re-applying your customizations?

Yes, after following these steps:

  1. rpm-ostree update to ensure that 39 is up to date
  2. sudo rsync -rlv /usr/etc/selinux/ /etc/selinux/
  3. sudo ostree admin config-diff | grep policy to verify there are no differences in the policy
  4. rpm-ostree rebase fedora:fedora/40/x86_64/kinoite to stage 40
  5. sudo ostree admin config-diff | grep policy just to make sure nothing strange happened
  6. systemctl reboot

The issue persists with the same error in finalizing the deployment.

In all runs, including after reboot, sudo ostree admin config-diff | grep policy outputs:

M    selinux/targeted/active/policy.kern
M    selinux/targeted/active/policy.linked
M    selinux/targeted/active/modules/100/policykit/lang_ext
M    selinux/targeted/active/modules/100/policykit/hll
M    selinux/targeted/active/modules/100/policykit/cil

I assume this wouldn’t cause issues because these are changes to the active policy, not the persistent one, but I wanted to try the rebase without any changes at all in case it was somehow the issue. Unfortunately, I’m having trouble doing that. I thought the changes to the active policy were due to the service I use to run setsebool container_manage_cgroup true, but disabling it with systemctl disable and rebooting doesn’t change anything in the config-diff output. I verified that the service was disabled with systemctl status.

sudo rsync -rlv /usr/etc/selinux/ /etc/selinux/, sudo rsync -rclv /usr/etc/selinux/ /etc/selinux/, and sudo semodule -R all had no effect. As a sanity check, I used sudo cmp /etc/selinux/targeted/active/policy.kern /usr/etc/selinux/targeted/active/policy.kern to compare one of the files reported as modified. This returned no output, indicating the files are identical, so I think ostree admin config-diff is not actually comparing file contents but looking at something like size and timestamps. If I’m right about this, I’m unsure why the documentation at Troubleshooting :: Fedora Docs says that sudo ostree admin config-diff | grep policy should not return any changes.
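To spell out the sanity check, this is the comparison I ran (cmp prints nothing when the files are identical); the other flagged files can be checked the same way:

$ sudo cmp /etc/selinux/targeted/active/policy.kern /usr/etc/selinux/targeted/active/policy.kern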

Performing the steps above again, after disabling the service that runs setsebool container_manage_cgroup true and rebooting, still failed to deploy 40.

I managed to get it to work. Thank you everyone for your help.

I realized that I had missed the last step on the troubleshooting page. After performing sudo rm -rf /etc/selinux && sudo cp -aT /usr/etc/selinux /etc/selinux I was able to rebase to 40. Unfortunately, I did the rebase in combination with uninstalling all layered packages, so I’m not entirely sure the difference in selinux/targeted/active/ was solely responsible. Also, despite the setsebool container_manage_cgroup true service being active, ostree admin config-diff currently shows no change in selinux/targeted/active/, so I think I misunderstood what that directory is used for, which means I also have no idea why it had changes before.
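For anyone hitting the same finalize error, this is the sequence that ended up working for me (in my case combined with --uninstall for all layered packages, so your mileage may vary):

$ sudo rm -rf /etc/selinux && sudo cp -aT /usr/etc/selinux /etc/selinux
$ rpm-ostree rebase fedora:fedora/40/x86_64/kinoite
$ systemctl reboot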

I guess the main question left is whether this is some sort of bug or poor handling on ostree’s part, or whether I caused it by doing something I shouldn’t have been doing, or doing it in a way I shouldn’t have.
