Major failure with recent update

I have been using raid 5 with mdadm on a server for many years.

Somehow with a recent update the system now refuses to boot, and the failure appears to be caused by the raid array not being activated.

I can boot with the f41 live media, either the original with the 6.11.4 kernel or today's respin (both workstation), and it sees the raid array and assembles it for access.

I have 7 different kernels installed and none of them will activate the raid array during boot. I have no idea what happened or where to look for the errors. The /run/initramfs/rdsosreport.txt file does not give many hints either.

Perusing that file, I see this first, so it is apparent that the raid devices are recognized:

/dev/sdd: UUID="c66f241a-545c-e2fa-d200-68419423bfe0" UUID_SUB="e55d4d3b-088c-9983-68d6-14d55c55c089" LABEL="eagle.home.domain:fedora_raid" TYPE="linux_raid_member"
/dev/sdb: UUID="c66f241a-545c-e2fa-d200-68419423bfe0" UUID_SUB="f8808c9e-e414-d28d-6665-af003b9cb8cd" LABEL="eagle.home.domain:fedora_raid" TYPE="linux_raid_member"
/dev/sdc: UUID="c66f241a-545c-e2fa-d200-68419423bfe0" UUID_SUB="070cc792-735b-441d-6df1-b863cc8023ff" LABEL="eagle.home.domain:fedora_raid" TYPE="linux_raid_member"
/dev/sda: UUID="c66f241a-545c-e2fa-d200-68419423bfe0" UUID_SUB="be5831bc-f771-e94b-bce4-1ddcde32ac4b" LABEL="eagle.home.domain:fedora_raid" TYPE="linux_raid_member"

The following stanza is repeated for each of those 4 drives.

P: /devices/pci0000:00/0000:00:01.2/0000:02:00.1/ata1/host0/target0:0:0/0:0:0:0/block/sda
M: sda
U: block
T: disk
D: b 8:0
N: sda
L: 0
S: disk/by-id/ata-WDC_WD30EZRZ-00GXCB0_WD-WCC7K6EU0Y70
S: disk/by-diskseq/1
S: disk/by-path/pci-0000:02:00.1-ata-1.0
S: disk/by-id/wwn-0x50014ee20e58c2bc
S: disk/by-path/pci-0000:02:00.1-ata-1
Q: 1
E: DEVPATH=/devices/pci0000:00/0000:00:01.2/0000:02:00.1/ata1/host0/target0:0:0/0:0:0:0/block/sda
E: SUBSYSTEM=block
E: DEVNAME=/dev/sda
E: DEVTYPE=disk
E: DISKSEQ=1
E: MAJOR=8
E: MINOR=0
E: USEC_INITIALIZED=3728515
E: ID_ATA=1
E: ID_TYPE=disk
E: ID_BUS=ata
E: ID_MODEL=WDC_WD30EZRZ-00GXCB0
E: ID_MODEL_ENC=WDC\x20WD30EZRZ-00GXCB0\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_REVISION=80.00A80
E: ID_SERIAL=WDC_WD30EZRZ-00GXCB0_WD-WCC7K6EU0Y70
E: ID_SERIAL_SHORT=WD-WCC7K6EU0Y70
E: ID_ATA_WRITE_CACHE=1
E: ID_ATA_WRITE_CACHE_ENABLED=1
E: ID_ATA_FEATURE_SET_HPA=1
E: ID_ATA_FEATURE_SET_HPA_ENABLED=1
E: ID_ATA_FEATURE_SET_PM=1
E: ID_ATA_FEATURE_SET_PM_ENABLED=1
E: ID_ATA_FEATURE_SET_SECURITY=1
E: ID_ATA_FEATURE_SET_SECURITY_ENABLED=0
E: ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=65906
E: ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=65906
E: ID_ATA_FEATURE_SET_SECURITY_FROZEN=1
E: ID_ATA_FEATURE_SET_SMART=1
E: ID_ATA_FEATURE_SET_SMART_ENABLED=1
E: ID_ATA_FEATURE_SET_PUIS=1
E: ID_ATA_FEATURE_SET_PUIS_ENABLED=0
E: ID_ATA_DOWNLOAD_MICROCODE=1
E: ID_ATA_SATA=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN2=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN1=1
E: ID_ATA_ROTATION_RATE_RPM=5400
E: ID_WWN=0x50014ee20e58c2bc
E: ID_WWN_WITH_EXTENSION=0x50014ee20e58c2bc
E: ID_ATA_PERIPHERAL_DEVICE_TYPE=0
E: ID_PATH=pci-0000:02:00.1-ata-1.0
E: ID_PATH_TAG=pci-0000_02_00_1-ata-1_0
E: ID_PATH_ATA_COMPAT=pci-0000:02:00.1-ata-1
E: ID_FS_UUID=c66f241a-545c-e2fa-d200-68419423bfe0
E: ID_FS_UUID_ENC=c66f241a-545c-e2fa-d200-68419423bfe0
E: ID_FS_UUID_SUB=be5831bc-f771-e94b-bce4-1ddcde32ac4b
E: ID_FS_UUID_SUB_ENC=be5831bc-f771-e94b-bce4-1ddcde32ac4b
E: ID_FS_LABEL=eagle.home.domain:fedora_raid
E: ID_FS_LABEL_ENC=eagle.home.domain:fedora_raid
E: ID_FS_VERSION=1.2
E: ID_FS_TYPE=linux_raid_member
E: ID_FS_USAGE=raid
E: DEVLINKS=/dev/disk/by-id/ata-WDC_WD30EZRZ-00GXCB0_WD-WCC7K6EU0Y70 /dev/disk/by-diskseq/1 /dev/disk/by-path/pci-0000:02:00.1-ata-1.0 /dev/disk/by-id/wwn-0x50014ee20e58c2bc /dev/disk/by-path/pci-0000:02:00.1-ata-1
E: TAGS=:systemd:
E: CURRENT_TAGS=:systemd:
+ ls -l /dev/disk/by-diskseq /dev/disk/by-id /dev/disk/by-label /dev/disk/by-partlabel /dev/disk/by-partuuid /dev/disk/by-path /dev/disk/by-uuid
/dev/disk/by-diskseq:
total 0
lrwxrwxrwx 1 root root  9 Feb 14 19:17 1 -> ../../sda
lrwxrwxrwx 1 root root  9 Feb 14 19:17 2 -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 14 19:17 3 -> ../../sdc
lrwxrwxrwx 1 root root  9 Feb 14 19:17 5 -> ../../sr0
lrwxrwxrwx 1 root root  9 Feb 14 19:17 6 -> ../../sdd
lrwxrwxrwx 1 root root 13 Feb 14 19:17 7 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb 14 19:17 7-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Feb 14 19:17 7-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 15 Feb 14 19:17 7-part3 -> ../../nvme0n1p3
lrwxrwxrwx 1 root root  9 Feb 14 19:17 9 -> ../../sde
lrwxrwxrwx 1 root root 10 Feb 14 19:17 9-part1 -> ../../sde1
lrwxrwxrwx 1 root root 10 Feb 14 19:17 9-part2 -> ../../sde2
lrwxrwxrwx 1 root root 10 Feb 14 19:17 9-part3 -> ../../sde3

/dev/disk/by-id:
total 0
lrwxrwxrwx 1 root root  9 Feb 14 19:17 ata-HL-DT-ST_DVD-RAM_GHC0N_K9SG1AI1403 -> ../../sr0
lrwxrwxrwx 1 root root  9 Feb 14 19:17 ata-WDC_WD30EZRZ-00GXCB0_WD-WCC7K1NJRX1H -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 14 19:17 ata-WDC_WD30EZRZ-00GXCB0_WD-WCC7K5PF586Y -> ../../sdd
lrwxrwxrwx 1 root root  9 Feb 14 19:17 ata-WDC_WD30EZRZ-00GXCB0_WD-WCC7K6EU0Y70 -> ../../sda
lrwxrwxrwx 1 root root  9 Feb 14 19:17 ata-WDC_WD30EZRZ-00Z5HB0_WD-WCC4N0EFT5A4 -> ../../sdc

/dev/disk/by-path:
total 0
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-1 -> ../../sda
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-1.0 -> ../../sda
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-2 -> ../../sdb
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-2.0 -> ../../sdb
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-3 -> ../../sdc
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-3.0 -> ../../sdc
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-6 -> ../../sdd
lrwxrwxrwx 1 root root   9 Feb 14 19:17 pci-0000:02:00.1-ata-6.0 -> ../../sdd

and the same continues for the rest of those ls listings.

It is quite apparent that all 4 raid devices are seen and recognized as raid members.

Yet the listings fail to reveal that the raid array is assembled and activated. It should appear as /dev/md127 but does not.

As a result the LVM on that array is not found and cannot be used.
At about 3.5 seconds into the boot, each of those 4 devices is set up like this:

[    3.474117] raptor.home.domain kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    3.474121] raptor.home.domain kernel: ata1.00: ATA-10: WDC WD30EZRZ-00GXCB0, 80.00A80, max UDMA/133
[    3.474125] raptor.home.domain kernel: ata1.00: 5860533168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    3.474129] raptor.home.domain kernel: ata1.00: configured for UDMA/133
[    3.474236] raptor.home.domain kernel: scsi 0:0:0:0: Direct-Access     ATA      WDC WD30EZRZ-00G 0A80 PQ: 0 ANSI: 5
[    3.474332] raptor.home.domain kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
[    3.474427] raptor.home.domain kernel: sd 0:0:0:0: [sda] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB)
[    3.474521] raptor.home.domain kernel: sd 0:0:0:0: [sda] 4096-byte physical blocks
[    3.474615] raptor.home.domain kernel: sd 0:0:0:0: [sda] Write Protect is off
[    3.474709] raptor.home.domain kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    3.474807] raptor.home.domain kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    3.474901] raptor.home.domain kernel: sd 0:0:0:0: [sda] Preferred minimum I/O size 4096 bytes
[    3.474996] raptor.home.domain kernel: sd 0:0:0:0: [sda] Attached SCSI disk

Finally, the errors start like this:

[    4.452427] raptor.home.domain systemd[1]: Mounted sys-kernel-config.mount - Kernel Configuration File System.
[    4.452726] raptor.home.domain systemd[1]: Reached target sysinit.target - System Initialization.
[    4.452773] raptor.home.domain systemd[1]: Reached target basic.target - Basic System.
[    4.475546] raptor.home.domain dracut-initqueue[703]:   fedora_raptor/root linear
[    4.492842] raptor.home.domain systemd[1]: Found device dev-mapper-fedora_raptor\x2droot.device - /dev/mapper/fedora_raptor-root.
[    4.492881] raptor.home.domain systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.
...
[  130.432583] raptor.home.domain dracut-initqueue[578]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  130.433561] raptor.home.domain dracut-initqueue[578]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2ffedora_raid\x2fhome.sh: "[ -e "/dev/fedora_raid/home" ]"

which continues for an additional 60 seconds

[  191.787267] raptor.home.domain dracut-initqueue[578]: Warning: dracut-initqueue: starting timeout scripts
[  192.301399] raptor.home.domain dracut-initqueue[578]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  192.302408] raptor.home.domain dracut-initqueue[578]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2ffedora_raid\x2fhome.sh: "[ -e "/dev/fedora_raid/home" ]"
[  192.303565] raptor.home.domain dracut-initqueue[578]: Warning: dracut-initqueue: starting timeout scripts
[  192.303644] raptor.home.domain dracut-initqueue[578]: Warning: Could not boot.
[  192.312126] raptor.home.domain systemd[1]: Starting dracut-emergency.service - Dracut Emergency Shell...
[  192.338357] raptor.home.domain systemd[1]: Received SIGRTMIN+21 from PID 586 (plymouthd).
[  192.348366] raptor.home.domain systemd[1]: Received SIGRTMIN+21 from PID 586 (plymouthd).

and dumps me to an emergency shell.

It seems quite obvious that something on the installed system has been damaged so that it will not properly activate the raid array and the LV it contains, yet booting from the live media does activate the array.

I don’t want to do a complete new installation, since I have many apps running on this server that would be a pain to reinstall, but I am lost as to what actions might solve the problem.

I have used the latest respin dated 20250214 and tried recreating the initramfs in a chroot to see if that would work, but no progress. Dracut did create the new initramfs for kernel 6.12.11, and lsinitrd on that image shows that the raid kernel modules are included, yet the array is still not activated during boot. The messages shown above are the result with every attempt, regardless of which installed kernel is used for boot.
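For reference, the rough sequence I used from the live session looked something like this (a sketch rather than an exact transcript; the boot partition and the exact kernel release string are guesses, adjust to suit):

sudo mount /dev/mapper/fedora_raptor-root /mnt
sudo mount /dev/nvme0n1p2 /mnt/boot        # guessing which partition holds /boot
for d in dev proc sys run; do sudo mount --bind /$d /mnt/$d; done
sudo chroot /mnt

# inside the chroot: find the target kernel, then rebuild its initramfs
ls /lib/modules
dracut --force --kver 6.12.11-200.fc41.x86_64   # substitute the exact version shown by ls

# confirm the md kernel modules are present in the rebuilt image
lsinitrd /boot/initramfs-6.12.11-*.img | grep -E 'raid456|md-mod'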

Are there any suggestions on what to try next before I use the nuclear option and reinstall fresh?

Are you able to fsck it from the live image? I’m wondering if there’s some recent update to mdraid itself that might have introduced this situation.

Yes, and I did that right after the first time it failed to boot.
The file system is ext4 and there was an error with inode count that was fixed.

If it were mdadm, then theoretically it should also have been seen on the latest respin image, but that works.

I’m assuming that you didn’t run anything akin to dracut --regenerate-all, so if those older kernels+initramfs no longer work, I think something must have changed on the disk or maybe in the firmware. You might double-check the timestamps on your initramfs images just to be sure they are unchanged.

Could your kernel parameters have changed across all those kernels somehow? Again, you might check the timestamps on the drop-in *.conf files to be sure they are unmodified.
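For example, something along these lines shows both sets of timestamps, plus the live kernel command line for comparison (assuming the standard Fedora BLS layout):

# initramfs images and their modification times
ls -l /boot/initramfs-*.img
# per-kernel BLS boot entries that carry the kernel command line
ls -l /boot/loader/entries/*.conf
# what the currently booted kernel was actually given
cat /proc/cmdline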

Looking at the output above, it looks like the array might have been created on a machine called eagle.home.domain but is being used on a machine called raptor.home.domain - has the machine hostname been changed recently?

I think mdadm only assembles arrays that belong to the local system by default. If that is the cause of the problem then you might be able to fix the issue by editing mdadm.conf to specifically list the array details.

That is correct, but it has been running this way for at least 2 years with no problems.
At one point I tried to change the metadata showing the host but decided that it was not necessary since everything worked as it was.

This just suddenly happened. I do not have, and have never had, an mdadm.conf file.
This is the content of /proc/mdstat when booted from the live media and chrooted into the installed system.

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid5 sdc[4] sdd[5] sda[1] sdb[3]
      8649600000 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/22 pages [0KB], 65536KB chunk

unused devices: <none>

It certainly won’t hurt to try creating an mdadm.conf file and see what happens that way.
The strange part is that it was working but now does not.
Also that the live media activates the array but the installed system does not.

The initramfs files have the expected timestamps.
I had made a change to one kernel command-line parameter, which had changed the .conf files, but that has since been removed.

I am now getting a different result.
There are lots of messages from systemd such as
systemd[1]: unable to fix SELinux security context of /dev/usbmon3: Permission denied
and many others, all of which seem to apply to the contents of /dev (and maybe more, but they scroll off screen before they can be noted).

It ends with “Welcome to Fedora 41” and then the final line now is
[!!!!!!] Failed to allocate manager object.

Secure Boot is disabled.

I always use any for the hostname in /etc/mdadm.conf to avoid having to worry about the hostname changing (any is a reserved word and matches any host). For example:

MAILADDR root
AUTO +all
DEVICE partitions

ARRAY /dev/md/boot metadata=1.0 UUID=7edf1fbc:50102b02:0713b773:ec0095d0 name=any:boot

For the SELinux errors, I would try setting enforcing=0 on the kernel command line and then maybe downgrading selinux-policy*.
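If it helps, a rough sketch of both steps (check first which older selinux-policy builds dnf still offers; the package names below are just the usual pair):

# add enforcing=0 to the command line of every installed kernel
sudo grubby --update-kernel=ALL --args="enforcing=0"

# see which earlier selinux-policy builds are still available, then roll back
dnf list --showduplicates selinux-policy
sudo dnf downgrade selinux-policy selinux-policy-targeted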

Edit: Don’t forget that changes to /etc/mdadm.conf will require a rebuild of your initramfs to get included. Another gotcha is that setting hostonly=no in a conf file under /etc/dracut.conf.d will prevent /etc/mdadm.conf from being included in your initramfs.
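A quick way to check both of those after rebuilding might be something like:

# confirm /etc/mdadm.conf actually landed inside the rebuilt image
lsinitrd /boot/initramfs-$(uname -r).img | grep etc/mdadm.conf

# look for any hostonly=no override that would keep it out
grep -rs hostonly /etc/dracut.conf /etc/dracut.conf.d/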

That is not a stable name as I discovered when upgrading to f41.

I think you need mdadm.conf to force reliable md raid.

You can use this command to create the /etc/mdadm.conf contents.

mdadm --examine --scan

Then update your /etc/fstab to use the name used in the mdadm.conf.

And lastly, rebuild your initramfs with dracut.
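Putting those three steps together, the sequence looks roughly like this (run from the installed system or a chroot into it; the grep is only a quick sanity check and the names will differ per system):

# 1. append the detected ARRAY line(s) to /etc/mdadm.conf
sudo mdadm --examine --scan | sudo tee -a /etc/mdadm.conf

# 2. make sure /etc/fstab (and /etc/crypttab if present) use the same name,
#    e.g. /dev/md/fedora_raid rather than the auto-assigned /dev/md127
grep -ns md /etc/fstab /etc/crypttab

# 3. rebuild the initramfs for every installed kernel so the new config is included
sudo dracut --force --regenerate-all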

That is what fixed md raid for me.

I wonder if allowing the array to be automatically defined during boot, combined with the use of LVM and a single LV on that array, might have become the issue.

My array is automatically named the same as the VG it contains.

Unfortunately I will have to revisit this tonight.

I generally use UUID (via blkid) in fstab instead of /dev/XXX for this reason.
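As an example, the pattern is simply this (the UUID and mount point are made up for illustration; blkid on the real filesystem gives the value to use):

# find the filesystem UUID (using the home LV from this thread as the example device)
blkid /dev/fedora_raid/home

# /etc/fstab then references that UUID instead of a device path
UUID=0a1b2c3d-4e5f-6789-abcd-ef0123456789  /home  ext4  defaults  1 2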

In my specific config it was /etc/crypttab in which I had to fix the /dev/md/name. I also use UUID in /etc/fstab.
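For what it's worth, the crypttab entry ends up looking something like this (mapping name and UUID invented for illustration; the UUID is the one blkid reports for the LUKS container, not for the md device):

# /etc/crypttab: <name>  <device by LUKS UUID>  <keyfile>  <options>
luks-home  UUID=11111111-2222-3333-4444-555555555555  none  discard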

Interesting, my /etc/crypttab is also by UUID.

I have a raid that is encrypted with lvm inside it.
I do not think there is a UUID for the raid I can use in /etc/crypttab.

What happens when you type “mdadm --assemble --scan --verbose”?