Initially this started as a normal Fedora 34 upgrade for four systems. The first issue noticed was the recurring hardware watchdog problem, which has shown up over many Fedora versions but usually only affected one system and then went away whenever we tried to diagnose it. Yes, bug reports were filed. This time all four systems had the issue, which can be worked around by editing the /etc/systemd/system.conf file. That got me thinking: these systems have been upgraded in place for 10 or more years, and maybe the issue would go away with a fresh install. Well, on a reboot one system hit a corruption and failed to boot.
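The workaround is along these lines; the watchdog timeouts live in the [Manager] section of that file, and the values below are only illustrative (0/off disables the watchdog entirely):

```
# /etc/systemd/system.conf (excerpt)
[Manager]
# Stop systemd from arming the hardware watchdog at boot,
# or give it a much longer timeout if you want to keep it.
RuntimeWatchdogSec=0
RebootWatchdogSec=10min
```

A `systemctl daemon-reexec` or a reboot is needed for the change to take effect.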
The configuration is an Asus H971Plus motherboard with two 1G SSDs in hardware RAID and a MyPassport USB portable external drive for disaster recovery. Both the internal SSD array and the external MyPassport drive have Fedora 34 installed. The purpose of the portable drive is to store dd images of the running systems for quick recovery, so it has a live Fedora install plus a large partition for storing image files for the four systems.
First, it appears that Fedora now knows about the hardware RAID configuration and has created /dev/md device entries, which causes duplicate-path issues for LVM. A filter had to be added to /etc/lvm/lvm.conf to keep the /dev/sda device from being scanned, which gets rid of the multiple paths for LVM, and then a vgck --updatemetadata fedora_redwood was needed to fix the volume group metadata, since the preferred device is now the /dev/md device.
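Roughly what that looks like on this box (the regex matches my device names, so adjust to suit):

```
# /etc/lvm/lvm.conf (excerpt) -- reject the raw RAID member disk so LVM
# only sees the /dev/md device and stops reporting duplicate PVs.
devices {
    global_filter = [ "r|^/dev/sda|", "a|.*|" ]
}
```

followed by rewriting the volume group metadata so it records the preferred path:

```
vgck --updatemetadata fedora_redwood
```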
The real issue now, after at least 10 installs of Fedora 34 over two weeks, is that the system looks fine until a reboot, and then it is back to exactly what prompted the new install: the boot fails and you end up in rescue mode, being asked for the root password. Digging around, I found that Fedora 34 has made changes to the UEFI boot setup, which appears to be the issue.
If I boot the system without the portable drive attached, the boot fails, and logged into the rescue system the efibootmgr command shows no EFI data. I suspect that information was put on the portable drive, because with it attached efibootmgr does return EFI data. Yet it must be wrong, as the system still fails to boot.
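For anyone following along, this is the kind of check involved (no output shown, since it differs per machine):

```
# List the firmware boot entries; -v adds the partition GUID and loader
# path for each entry, which shows whether an entry points at the
# internal array or at the USB drive.
efibootmgr -v
```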
So the question is, how do I get out of this pickle? Is it possible to fix the missing EFI data on the internal SSD storage without doing another install? If it can be done without another install, where do you get the information needed to fix it? And how do I prevent this issue in the future?
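I assume any fix would look something like re-pointing the firmware at the EFI system partition on the internal array, along these lines (the disk, partition number, and loader path below are placeholders, not my actual layout):

```
# Recreate a firmware boot entry aimed at the internal drive's ESP.
efibootmgr --create --disk /dev/sdX --part 1 \
    --label "Fedora" --loader '\EFI\fedora\shimx64.efi'

# If the ESP contents themselves are missing, reinstalling the boot
# packages should repopulate /boot/efi/EFI/fedora:
dnf reinstall shim-x64 grub2-efi-x64 grub2-common
```

but I do not know whether that is the right recipe for the new Fedora 34 boot layout, hence the question.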
Another issue to dig into further is that the motherboard setup's boot order seems to get changed when bootable devices are plugged in and discovered.
I am not sure how to recover the boot info, but I wonder why you are regularly booting with the external drive connected, and especially why you are installing with the external drive connected.
It seems from your description that the external drive is not always connected, so it really should not be attached when doing an install. I think that since it has an EFI partition, the install is picking the external drive as the boot device, which seems to not be what is wanted.
This could also be due to having the SSDs in hardware RAID and setting them up as a /dev/md? device, which cannot be accessed until the kernel loads. You may be at a dead end unless the installed SSDs have space that is not allocated to the RAID array for the EFI partition and the /boot data.
The portable drive is normally plugged in because scheduled backups are written to it. It would be new behavior if the Fedora install touched a drive that was not selected for the install. This might be new behavior with the UEFI changes; maybe one of the developers who knows how the install works can respond.
The EFI data that is returned by the efibootmgr command is not stored on the SSD. It is stored in the system’s NVRAM (flash memory on the motherboard).
You said that these systems are 10+ years old. At that age, I think it is possible that the flash memory has degraded to a point that it is no longer retaining the stored values across reboots. I’ve encountered several decade-old devices (printers, routers, etc.) that will fail in various ways (if they have an LCD, it might show random garbled characters for example). Sometimes these devices can be revived for a few years by re-flashing the flash memory (this would be a “BIOS update” on a computer). Ultimately though, flash memory has a limited lifetime.
An article from CMU in 2015 states “Today’s flash devices, which do not require flash refresh, have a typical retention age of 1 year at room temperature.” And that retention time decreases exponentially with increasing temperature. The phenomenon can be modeled by the Arrhenius equation.
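For what it’s worth, the Arrhenius model puts that relationship roughly as t_retention ∝ exp(E_a / (k_B·T)), where E_a is the activation energy, k_B is Boltzmann’s constant, and T is the absolute temperature, so retention drops off exponentially as the chip runs hotter.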
It is difficult to say for certain because the technology has changed a lot over time and environmental factors (e.g. temperature) have a huge impact. But the problem you describe sounds like bad flash memory to me.
I knew the flash memory wasn’t the issue as the thumb drive and the portable passport drive did not have boot issues.
I have a little egg on my face, as the saying goes. Remember I said that there were many UEFI devices listed at boot time. I thought I had tried all of the choices, but apparently not. I unplugged all the other media and was left with one choice, OS UEFI, which turned out to be the right device. So the install did not update an old entry.
When I feel I can deal with digging in again, I will reopen the hardware watchdog timeout issue. Maybe we will get lucky and run that issue to ground. I don’t know what would be special about these systems, but I do not see others running into it.