Unable to boot after upgrade to fc37

Hello,

I’ve got problems upgrading my fc36 to fc37 that I cannot solve on my own. I used DNF System Upgrade :: Fedora Docs as a guide, but after rebooting the system hangs at the boot screen. By pressing Esc I can see the error messages, but they don’t help me because so many services failed. I took screenshots of the remote KVM/IPMI but can only share one of those below because I am new here. After sending a power reset I pressed Esc early and can see that the system is timing out waiting on systemd-udev.service, iscsid.service and dev-zram0.device. Only a message for the zram timeout is printed after that, I guess because the other two form a dependency chain and are not the root problem. A few seconds later the whole screen is full of errors from failed services.
I already tried the same about a month ago with the same result. Fortunately the system boots via iscsi from a NAS so I can create snapshots and reset the system.
What can I do to dig deeper into the problem? I did not find anything that helped me about problems with zram after upgrading to fc37.

Thanks in advance!

systemctl status systemd-zram-setup@zram0.service 

should give you more info

journalctl -u systemd-zram-setup@zram0.service --since yesterday

will give you log entries related to the service

Thanks for your suggestions! Problem is, I don’t get to a usable command prompt. Not even when I select the rescue mode in grub. Strangely this changes next to nothing. The system hangs exactly the same:

Are you able to boot from a live USB device and troubleshoot from there? Screenshots of the hang like that are helpful but not enough.

Did you upgrade by the recommended ‘dnf’ procedure or did you use the software control center to upgrade?

How long did you wait after the upgrade completed before rebooting?

If one waits long enough for the background tasks to complete after the upgrade there are few problems, but those who reboot too soon may interrupt the upgrade and cause system corruption.

If you can reboot the system then we can troubleshoot the logs and hopefully find the source of the errors.

I strongly recommend that you use the dnf system-upgrade process to do version upgrades. I also recommend that you follow carefully and do not skip steps in that process you linked to.

There seems to be no reason for zram to be an issue unless your system has a RAM problem or has very little RAM. How much RAM is installed?

I will try the live system to read the logs later today.

I upgraded by:

sudo dnf upgrade --refresh
sudo systemctl reboot && logout
sudo dnf system-upgrade download --releasever=37
sudo dnf system-upgrade reboot

The upgrading reboot (and every later reboot) hangs with the screen above.

I guess by waiting before rebooting you meant waiting after sudo dnf system-upgrade reboot, not before this command, right? If I understood you correctly, there was no room for rebooting too early, because of the system hang.

The system is an EPYC with 8x 32GiB → 256GiB RAM.

So I understand the hang is at some point during the reboot and actual install of the upgrades being done.

Is this the reboot after the upgrade completes? Or is this the initial reboot at the beginning of the upgrade? We need to identify at exactly what stage of the upgrade this is happening.

The upgrade with the dnf system-upgrade reboot command displays a graphical progress bar, but pressing esc gives the text messages as the upgrade progresses.

I just did a clean install of F36 in a VM, then did the update to latest with a following reboot so F36 was running at the latest package versions. I followed that with the download step for upgrade to F37.

I then did the reboot step to complete the upgrade so I could follow the steps you showed above, and watched the text messages as the upgrade continued. When the upgrade completed the system once again automatically rebooted into the F37 version with all packages fully upgraded.

Also, we need to know if this is with a remote terminal or if this is on a local terminal on the machine being upgraded?

Boot a Fedora 37 live session, check the filesystem and verify the package database integrity:
https://discussion.fedoraproject.org/t/fedora-35-to-36-and-37-via-dnf-system-upgrade-selinux-breaks-system/67316/5
You can as well proceed with chroot following the same link if DNF still works.

The problem occurs at the initial reboot directly after the downloading of the upgrade.

The progress bar did not appear. I only saw the three little rectangles that are shown on every boot. I only pressed Esc after some minutes, when I realized that nothing was happening anymore.

The first time, in January, I tried to upgrade from an SSH connection. This time (after resetting the file system to a snapshot) I started the upgrade via remote KVM on TTY1.

I have the feeling that the problem is hard to reproduce because I boot directly from iSCSI configured via UEFI. The system is completely diskless and does not use PXE.

This is totally beyond my experience. You are upgrading a diskless workstation that is remote and using iscsi boot. Thus the upgrade and reboot would seem to break your remote connection for the boot and I have no clue how the diskless part would function.

No, it isn’t any different from any “normal” Fedora installation without an X server, used via a locally connected keyboard and monitor.

I used the standard installer. It has all the functionality to install a system with iSCSI boot. But you also need a mainboard with this feature implemented in the UEFI. I use a Supermicro H11SSL-i.

I’m not sure I should do this because the upgrade most probably never really started.

All the more reason to force the upgrade.
Be sure to mount the root filesystem and the nested ones to /mnt and its subdirectories respectively according to the linked documentation, and then proceed like this:

sudo dnf distro-sync --installroot=/mnt \
    --releasever=37 --setopt=deltarpm=false
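Before running the command above, the installed system's file systems have to be mounted under /mnt with parents before children (/ first, then /boot, then /boot/efi). As a rough sketch, that order can be derived from the installed system's own fstab once its root file system is mounted. The helper name mount_plan is made up for this example, and it only prints mount commands for review rather than running them:

```shell
# mount_plan FSTAB [TARGET]
# Prints one "mount" command per path-mounted entry in FSTAB, parents first.
# TARGET defaults to /mnt. Nothing is mounted; the output is a plan to review.
mount_plan() {
    target=${2:-/mnt}
    awk '
        NF == 0 || $1 ~ /^#/ { next }   # blank lines and comments
        $2 !~ /^\//          { next }   # swap and other non-path entries
        { print $2, $1, $3, $4 }        # mount point first, so it drives the sort
    ' "$1" | LC_ALL=C sort | while read -r mp spec type opts; do
        [ "$mp" = / ] && mp=            # root maps to TARGET itself
        echo "mount -t $type -o $opts $spec $target$mp"
    done
}
```

With the root file system already mounted at /mnt, `mount_plan /mnt/etc/fstab` prints the remaining commands; the line for / itself can be skipped. The plan is only as good as the fstab it reads, so it is worth reading the output before running any of it as root.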

I managed to mount the iscsi target from a live system and chrooted into the system. Unfortunately the logs did not help. Maybe because they are corrupt?

Anything I could look into?
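
One way to test the corruption theory from the live session is to point journalctl at the journal directory of the mounted system. A minimal sketch, assuming the root file system is mounted at /mnt and the journal is persistent (stored under /var/log/journal); offline_journal is a hypothetical wrapper name:

```shell
# Journal directory of the installed system as seen from the live session.
# Assumption: root file system mounted at /mnt, persistent journal enabled.
JOURNAL_DIR=${JOURNAL_DIR:-/mnt/var/log/journal}

# offline_journal ARGS...: run journalctl against the offline journal files.
offline_journal() {
    journalctl --directory="$JOURNAL_DIR" --no-pager "$@"
}

# Typical calls (commented out because they need the mounted system):
# offline_journal --verify             # check the journal files for corruption
# offline_journal --list-boots         # which boots left entries on disk
# offline_journal -b BOOT_ID -p err    # errors of one boot; BOOT_ID from --list-boots
```

If --list-boots shows no entry for the failed upgrade boot, the hang probably happened before the journal was flushed to disk. In that case the files are not necessarily corrupt, they just never received anything from that boot.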

Thanks for telling me how I can do that! I will definitely try that later. First I really would like to know what happened with my system and maybe even help a little in fixing a bug. Would you maybe help with this as well?

Unfortunately, my experience with debugging boot issues is mostly superficial, so I can only suggest isolating the problem:

  • Check the related file systems and fix errors, if any.
  • Check integrity of the package database and repair missing and corrupted files, if any.
  • Restore SELinux labels and regenerate initramfs.
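
The three steps above could look roughly like this from a live session. This is a sketch under assumptions, not a tested recipe: the device name is a placeholder, the helper names are invented for the example, and with DRY_RUN left at 1 the commands are only printed.

```shell
# Print every command; execute it only when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# check_fs DEVICE: read-only check of an UNMOUNTED file system (live session).
# fsck hands -n to the type-specific checker; XFS needs "xfs_repair -n" instead.
check_fs() {
    run fsck -n "$1"
}

# repair_in_chroot: meant to be run inside the chroot of the installed system.
repair_in_chroot() {
    run rpmdb --verifydb                  # integrity of the package database
    run rpm -Va                           # installed files vs. package metadata
    run touch /.autorelabel               # request an SELinux relabel on next boot
    run dracut --force --regenerate-all   # rebuild the initramfs for all kernels
}
```

rpm -Va only reports differences. Files it flags as missing or changed would still have to be restored, for example by reinstalling the owning package with dnf reinstall.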

Follow the official documentation if you want to proceed with dracut debugging.

Thanks for the suggestions! It’s a bit much and I don’t really know where to start, so I’m a little lost. I started somewhere but found nothing too suspicious via rpm -qaV, so I started looking around a little and found something which seems strange to me. I realized that on partition 3 (mounted to /mnt) I have a /boot folder containing older stuff, including a rather old fc36 kernel. I then mounted partition 2 to /mnt2 and found the newer kernels. Why is it that I have a /boot folder on partition 3 when I also have a separate boot partition 2? Did something go wrong in a previous upgrade? Did something similar hit me again this time?

Your boot partition must be properly mounted according to /etc/fstab, otherwise installing and upgrading packages such as GRUB and kernels will write files to an empty mount point, which may have happened in the past.
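
Whether that happened here can be checked from the live session by comparing the kernels in the /boot directory on the root file system with those on the real boot partition. A small sketch, assuming the layout described above (partition 3 mounted at /mnt, partition 2 at /mnt2); the function names are made up for illustration:

```shell
# list_kernel_versions DIR
# Prints the version part of every vmlinuz-* file in DIR, sorted.
list_kernel_versions() {
    for f in "$1"/vmlinuz-*; do
        [ -e "$f" ] || continue            # the glob matched nothing
        basename "$f" | sed 's/^vmlinuz-//'
    done | LC_ALL=C sort
}

# stray_kernels HIDDEN_DIR REAL_DIR
# Prints kernel versions that exist only in HIDDEN_DIR.
stray_kernels() {
    a=$(mktemp)
    b=$(mktemp)
    list_kernel_versions "$1" > "$a"
    list_kernel_versions "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"            # lines unique to the first file
    rm -f "$a" "$b"
}
```

`stray_kernels /mnt/boot /mnt2` would then list kernel versions that exist only underneath the mount point. Their timestamps (ls -l) may hint at which update ran while /boot was not mounted.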

To add more detail to the above.
The root file system must have a mount point where another (partition) file system may be attached. This is /boot.
Then that newly attached file system must have another mount point (/boot/efi) where the next file system may be attached.

Failure to have each and all necessary file systems mounted may result in apps writing to the root file system at the mount point, and then those files written in that manner “disappear” from the system when the file system is actually mounted. It also may result in failure to properly write data if the directory structure needed does not exist when a partition is not mounted.

This means that should a user have even one of the normal file systems not properly mounted and perform an update the system will now be unable to access the updated files at some point and it may cause unanticipated errors due to file system and file mismatches.
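
A guard along these lines, run before an update, would catch that situation. It is only a sketch: is_mounted and require_boot_mounts are hypothetical helper names, and the list of mount points (/boot and /boot/efi) is taken from the layout described above. On a live system, mountpoint -q from util-linux does the same job as is_mounted.

```shell
# is_mounted PATH [MOUNTS_FILE]
# Succeeds if PATH appears as a mount point (second field) in MOUNTS_FILE.
# MOUNTS_FILE defaults to /proc/self/mounts; the parameter exists only so the
# function can be tried against a sample file.
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mounts}"
}

# require_boot_mounts [MOUNTS_FILE]
# Prints every required mount point that is missing; fails if any is missing.
require_boot_mounts() {
    missing=0
    for mp in /boot /boot/efi; do
        if ! is_mounted "$mp" "$@"; then
            echo "not mounted: $mp"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be chained in front of the update, as in `require_boot_mounts && sudo dnf upgrade --refresh`, so that the update does not start while a mount point is empty.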

The /boot folder on partition 3 is the mount point for the partition 2 file system. When mounted it appears as a single device to the user.

I’m very sorry that my unclear messages led you to invest your time in explaining things that are perfectly clear to me. I will try to choose different words this time.

I am wondering why some partitions were not mounted correctly in the past, such that files were written to /boot on partition 3. I did not configure anything myself on that system. The iSCSI booting was configured perfectly well by the installer and nothing was changed (I am not even 100% sure the problem is related to iSCSI booting). There must be some kind of bug and I wonder where to start digging or whom to ask.