Is xfs on lvm on raid for large filesystems just totally and irreparably broken?

i have a couple of full size racks containing old-ish hp proliant (gen 7, 8, 9, 10) servers. most have capacity for 8 or so physical drives. two have capacity for 12 (3.5") 3tb drives and one has capacity for 25 (2.5") 900gb drives. i also have a pair of external hp rack mount drive enclosures that hold 12 (3.5") 3tb sas disks each. all servers and drive enclosures have a spinning sas disk in every available drive slot and use the default hardware raid configuration that the bios raid setup recommends when you do a factory reset and start on a clean server, usually raid 50.

i always install the latest fedora server release on each system i rebuild. when i do, i tell the installer to use all of the available storage capacity with automatic partitioning. for the last several releases at least, this results in an xfs root partition of about 15gb on the lvm. on first boot i usually extend the root partition to use all of the remaining space, like so (the path is sometimes /dev/mapper/fedora_${HOSTNAME}-root; i haven’t worked out what causes anaconda to alternate between including the hostname in that path and leaving it out):

sudo lvextend /dev/mapper/fedora-root -l +100%FREE -r
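
a quick sanity check first helps confirm the right mapper path and that the vg actually has free extents; a minimal sketch using standard lvm2 tools:

lsblk -f              # see which device-mapper path actually holds the xfs root
sudo vgs              # the VFree column shows whether the vg has free extents
sudo lvs -o +devices  # confirm the lv name and which pvs back it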

on most of these servers, this results in a robust filesystem that i can hot-swap drives out of whenever they light up amber and everything just works peachy, always and forever. happy days.

on my three larger storage capacity servers or any server that i attach the external drive enclosures to, the server will live for a few days or maybe weeks before dying and requiring a full rebuild in order to repeat the cycle. i define deadness by observing a post-boot message which reads:

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

Press Enter for maintenance
(or press Control-D to continue): _

this message will always greet me on large storage servers after power is restored following a power failure, sometimes after an ordinary reboot following a dnf system update, and most recently on all of my large storage servers following the first reboot in the fedora server 40 to 41 upgrade cycle.

bottom line is that these servers will never survive more than a few reboots before i see the dreaded Control-D prompt.

the journal has orange and red lines that contain these clues:

systemd[1]: sysroot.mount: Mounting timed out. Terminating.
systemd[1]: sysroot.mount: Mount process timed out. Killing.
systemd[1]: sysroot.mount: Killing process 881 (mount) with signal SIGKILL.
kernel: XFS (dm-0): Ending recovery (logdev: internal)
kernel: XFS (dm-0): Unmounting Filesystem d455dec8-2e63-4467-9a59-fd9fe75fab6d
systemd[1]: sysroot.mount: Mount process exited, code=killed, status=15/TERM
systemd[1]: sysroot.mount: Failed with result 'timeout'.
systemd[1]: Failed to mount sysroot.mount - /sysroot.
systemd[1]: Dependency failed for initrd-root-fs.target - Initrd Root File System.
systemd[1]: Dependency failed for initrd-parse-etc.target - Mountpoints Configured in the Real Root.
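
those timeout lines make me suspect that xfs log recovery on a big array is simply outrunning systemd's 90-second default start timeout, so systemd kills the mount before it returns. one thing worth trying before a full rebuild is raising that timeout from the kernel command line; a minimal sketch, assuming rootflags passes x-systemd mount options through to the generated sysroot.mount (see systemd.mount(5)), with an arbitrary 10-minute value:

# give the root mount up to 10 minutes to complete xfs log recovery
sudo grubby --update-kernel=ALL --args='rootflags=x-systemd.mount-timeout=10min'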
searching on these errors turned up two possible solutions:
  • the author of xfs gave some repair guidance back in 2007.
  • someone on the internet fixed his in 2016 with the following (whilst booted from a rescue disc):
    lvscan                        # list logical volumes and their current state
    xfs_repair /dev/sdx1          # repair the raw partition (x left as a placeholder)
    vgscan -v --mknodes           # rescan for volume groups and recreate missing /dev nodes
    vgchange -a y                 # activate all volume groups
    lvscan                        # confirm the logical volumes are now ACTIVE
    xfs_repair /dev/centos/root   # repair the root logical volume itself
    

i tried both of these possible solutions several times on different systems (after altering paths appropriately for my systems). both always reported triumphant success and both always failed miserably by returning me to the Control-D screen after the next hopeful reboot. even if they had succeeded, i have zero faith that a subsequent reboot or power cycle would not land me back at square one.

so, is there something i can change in my routine that will get these large storage servers working reliably? is this problem specifically attributable to proliants, xfs, lvm, large drives or raid? is there a different filesystem i can choose at install time with manual partitioning that will result in machines that can be rebooted without triggering a multi-hour rebuild cycle? how can i get these servers working for me instead of me for them?

Info from 2007 and 2016 may be out of date and may not help.

Can you log in at the ctrl-d prompt?
Have you investigated what is wrong?
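
For example, pressing Enter at that prompt gives you a shell where you can look at why the root mount failed; a minimal sketch (standard dracut/systemd tools available in the initramfs):

journalctl -xb                       # full log of the failed boot
systemctl status sysroot.mount       # shows why /sysroot failed to mount
cat /run/initramfs/rdsosreport.txt   # dracut's own debugging summary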

My server had an issue that came down to depending on undefined behaviour; once that was explained to me, I fixed it.
But I was only able to do this because I investigated what was wrong.

One tool I have to help debug my systems is a rescue fedora system on an external USB-3 SSD. I can boot into the rescue system and mount the disks one by one to find where any issue is.
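
From the rescue system, checking one volume at a time without writing anything looks roughly like this; a sketch with hypothetical device names:

sudo vgscan                              # find volume groups on the attached disks
sudo vgchange -ay                        # activate their logical volumes
sudo xfs_repair -n /dev/fedora/root      # dry run: report problems, change nothing
sudo mount -o ro /dev/fedora/root /mnt   # then inspect the contents read-only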

I hope you have UPS and generators. In my experience, much of it with XFS on SGI hardware, improper shutdown of servers can require days of recovery work cleaning up after failed processing jobs, partial file transfers, etc. At one time our servers were in a relatively small room (with UPS’s and generator). After power failures, including the monthly generator load test, systems were shutting down due to high temperatures. We discovered that the cooling unit was installed without a time-delay relay. This should always be installed because the compressor can’t restart immediately after a power failure; it needs time for the pressure to bleed down before it can start again.

Another thing we learned was to configure shutdown after 5 minutes on UPS power – a) to give time to do a clean shutdown, and b) because the batteries (this was 20 years ago) could only take a few cycles of deep discharge.
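
For anyone setting this up today, that policy is one line in apcupsd’s /etc/apcupsd/apcupsd.conf (a sketch, assuming an APC unit; NUT has equivalent settings):

# shut the machine down after 300 seconds on battery, regardless of remaining charge
TIMEOUT 300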

@grenade,

What a fun problem. At least for me where I have no skin in the game.

Old hardware has been a realm I’ve had a lot of trouble with, no matter how many times I produce studies showing we need to plan for 5-year replacement cycles. I am always overruled and end up with 15-year-old hardware that just has to be made to work.

What kind of RAID are you using? Do these servers have hardware RAID controllers? Can they be configured from within Linux or do you need to configure them in bios or BMC? What kind of drive health reporting is available from within Linux? Does every disk in a single raid have an identical byte count?
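
On ProLiants, most of those questions can be answered from within Linux; a sketch assuming HPE’s ssacli package and smartmontools are installed (slot and disk numbers are hypothetical):

sudo ssacli ctrl all show status             # overall Smart Array controller health
sudo ssacli ctrl slot=0 pd all show status   # per-physical-drive status
sudo smartctl -a -d cciss,0 /dev/sda         # SMART data for disk 0 behind the controller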

Because hardware RAID controllers were often not well supported in Linux, using JBODs and software RAID is often a better choice. It may even turn out that software RAID boosts performance over hw RAID.
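
For example, with the controller in HBA/JBOD mode, a five-drive software raid5 is a one-liner (a sketch; device names are hypothetical):

sudo mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]
cat /proc/mdstat                             # watch the initial resync progress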

I had a blast working as a contractor at NCR Peripheral Products Division on the very first hardware RAID controller. Over the years a lot of RAID setups have been the cause of problems rather than a solution. For instance, one hardware RAID product I tested would only sustain 6 MB/s if all drives were in a single raid6 with 2 hot spares. The same device broken up into 5 drives per raid5 ran at 160 MB/s, which was about the same as a single drive on its own. With more raid5s of smaller drive counts, the MTBF was similar to the larger number of drives in a single raid6. Choosing a 5-drive raid5 gave stripes of 4 data sectors plus 1 parity sector, which aligned easily with filesystem block size multiples. Putting 6 drives in a raid5 is not optimal for any of the access patterns I’ve been around.
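
To make that alignment concrete: for a 5-drive raid5 (4 data disks) with a 256 KiB chunk, the XFS stripe hints would look like this (a sketch; mkfs.xfs detects the geometry automatically on md devices but usually needs it spelled out on hardware RAID):

sudo mkfs.xfs -d su=256k,sw=4 /dev/md0       # su = chunk size, sw = number of data disks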

Planning and testing before choosing a layout is worthwhile.

Even with the best layout I would separate data from os and apps. The os/apps storage should be easily reproducible. No drive sharing between os/app storage and data storage at all.
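
In lvm terms that just means a second volume group that never shares physical volumes with the one holding root, e.g. (a sketch with hypothetical names):

sudo vgcreate vg_data /dev/md0               # data array only, no os partitions
sudo lvcreate -n archive -l 100%FREE vg_data
sudo mkfs.xfs /dev/vg_data/archive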

This topic is quite large so I will stop here.

We had a commercial RAID box with 6 SCSI drives connected to an SGI Octane and using XFS, but #6 was a spare (remove failed drive, add spare and rebuild, vendor sends a new spare), and we didn’t have LVM. Our RAID drives were not used for OS, only data.