i have a couple of full-size racks containing old-ish hp proliant (gen 7, 8, 9 and 10) servers. most have bays for 8 or so physical drives. two have bays for 12 (3.5") 3tb drives, and one has bays for 25 (2.5") 900gb drives. i also have a pair of external hp rack-mount drive enclosures that each hold 12 (3.5") 3tb sas disks. every server and enclosure has a spinning sas disk in every available drive slot, and each uses the default hardware raid configuration that the bios raid setup recommends after a factory reset on a clean server, usually raid 50.
i always install the latest fedora server release on each system i rebuild. when i do, i tell the installer to use all of the available storage capacity with automatic partitioning. for at least the last several releases, this has produced an xfs root partition of about 15gb on the lvm. on first boot i usually extend the root logical volume to use all of the remaining space, like so (the device is sometimes /dev/mapper/fedora_${HOSTNAME}-root instead; i haven't worked out what makes anaconda alternate between including the hostname in that path and leaving it out):
sudo lvextend /dev/mapper/fedora-root -l +100%FREE -r
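a quick way to confirm that both the logical volume and the mounted filesystem actually grew (assuming the default volume group that anaconda creates):
sudo lvs        # logical volume should now span the volume group
df -h /         # root filesystem should show the new size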
on most of these servers, this results in a robust filesystem that i can hot-swap drives out of whenever they light up amber and everything just works peachy, always and forever. happy days.
my three larger-capacity storage servers, and any server i attach the external drive enclosures to, will live for a few days or maybe weeks before dying and requiring a full rebuild to repeat the cycle. by dead i mean the machine drops to a post-boot message which reads:
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.
Press Enter for maintenance
(or press Control-D to continue): _
this message always greets me on the large storage servers after power is restored following an outage, sometimes after an ordinary reboot for a dnf system update, and most recently on all of them after the first reboot of the fedora server 40 to 41 upgrade.
bottom line is that these servers will never survive more than a few reboots before i see the dreaded Control-D prompt.
the journal has orange and red lines that contain these clues:
systemd[1]: sysroot.mount: Mounting timed out. Terminating.
systemd[1]: sysroot.mount: Mount process timed out. Killing.
systemd[1]: sysroot.mount: Killing process 881 (mount) with signal SIGKILL.
kernel: XFS (dm-0): Ending recovery (logdev: internal)
kernel: XFS (dm-0): Unmounting Filesystem d455dec8-2e63-4467-9a59-fd9fe75fab6d
systemd[1]: sysroot.mount: Mount process exited, code=killed, status=15/TERM
systemd[1]: sysroot.mount: Failed with result 'timeout'.
systemd[1]: Failed to mount sysroot.mount - /sysroot.
systemd[1]: Dependency failed for initrd-root-fs.target - Initrd Root File System.
systemd[1]: Dependency failed for initrd-parse-etc.target - Mountpoints Configured in the Real Root.
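for what it's worth, my reading of the first two lines is that systemd is killing the mount command at its default start timeout rather than xfs itself reporting corruption. the default limit on a booted system is 90 seconds and can be checked with the command below; i'm assuming the same default applies inside the initramfs:
systemctl show --property=DefaultTimeoutStartUSec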
searching for these errors turned up two leads:
- the author of xfs gave some repair guidance back in 2007.
- someone on the internet fixed theirs in 2016 (whilst booted from a rescue disc) with:
lvscan
xfs_repair /dev/sdx1
vgscan -v --mknodes
vgchange -a y
lvscan
xfs_repair /dev/centos/root
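adjusted for the device paths on my systems, that second recipe comes out roughly as follows (the -n pass is a read-only preview of what xfs_repair would change, before letting it write anything):
vgchange -a y                          # activate the volume groups
xfs_repair -n /dev/mapper/fedora-root  # dry run, report only
xfs_repair /dev/mapper/fedora-root     # real repair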
i tried both of these possible solutions several times on different systems (after adjusting the device paths for my layouts). both always reported triumphant success, and both always failed miserably by returning me to the Control-D screen after the next hopeful reboot. even if they had succeeded, i have zero faith that a subsequent reboot or power cycle would not land me back at square one.
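one check i haven't done yet, which would separate "the filesystem is broken" from "the mount just needs longer than the timeout", is booting a rescue image and timing the mount by hand (paths again assumed from the default layout):
vgchange -a y                               # activate the volume groups
time mount /dev/mapper/fedora-root /mnt     # how long does xfs log recovery really take?
umount /mnt
if that mount takes much longer than a minute and a half on the big arrays, it would explain why the repairs report success and the next boot still ends at the Control-D prompt.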
so, is there something i can change in my routine that will get these large storage servers working reliably? is this problem specifically attributable to proliants, xfs, lvm, large drives or raid? is there a different filesystem i can choose at install time with manual partitioning that will result in machines that can be rebooted without triggering a multi-hour rebuild cycle? how can i get these servers working for me instead of me for them?