Boot Drive Dead

Hi forum,

I’ve got a little home server with the following RAID setup. Any help would be amazing, as it’s fallen over.

A. 500GB HDD - /boot and /
B. 500GB SSD - /

C. 3TB HDD /family
D. 3TB HDD /family
E. 3TB HDD /family

The RAID is software RAID with mdadm; it’s been running a good while now, and the LVM stuff wasn’t around back then.

A few of these have died over the years, and I’ve replaced them. B was an HDD, but I went for an SSD last time.

A has just died, so I have two questions:

  1. When I buy a new drive and set it up for RAID, how do I go about setting up a new /boot partition and GRUB? I’ve never had to do this before.

  2. 500GB HDDs seem to be hard to come by. Is there anything to be said for sticking with an HDD for reliability on /, buying a 1TB drive but only using 500GB of it? I went for the other SSD for speed, and thought pairing it with an HDD would add reliability. There was some mdadm parameter I set to allow for the different write speeds.
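For what it’s worth, the parameter in question is probably mdadm’s write-mostly flag, which tells md to prefer the other member for reads. A sketch, assuming /dev/md0 is the RAID1 for / and /dev/sdX1 is a hypothetical HDD member:

```shell
# Add the slower HDD half with the write-mostly flag set,
# so reads are served from the SSD where possible:
mdadm /dev/md0 --add --write-mostly /dev/sdX1

# Members carrying the flag show "(W)" in the status output:
cat /proc/mdstat
```

These commands need root and a real array, so treat them as a sketch rather than something to paste verbatim.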

Thanks for any help or comments.

Pairing an HDD with an SSD in RAID is asking for trouble. The write speeds are tremendously different, and that causes issues.

I had read that. I found my notes from when I set this up, and it was the “writemostly” flag I saw here that suggested to me it was okay to do this - Hybrid HDD + SSD RAID1 on Linux

If that page is wrong, then maybe I’m better getting a 2nd SSD.

In which case, it’s just a matter of working out how to sort out /boot and GRUB.

I like the info on that page, but I’ve never tried hybrid RAID.
YMMV, but it may be worth a try.

Smaller HDDs are difficult to find any more, and you must be careful about the way the drive writes data.
HDDs that use SMR recording are worthless in most cases: once the drive has a little data written, it begins ‘shingling’ the data (overlapping tracks), which really slows down writes and can introduce read errors. Drives with CMR tech are the way to go for most people, and the only type I will accept.

Yeah, maybe it’s safer in some ways to go SSD then and drop the writemostly flag.

That old HDD lasted 13-15 years, I reckon, which is amazing. That’s the appeal of HDDs. But maybe SSD tech is better now, and they might last as long.

I have a couple of USB HDDs that are that old and still working. 15 years ago may have been “peak quality” for HDDs. A few years later, at work we bought a case of 2TB drives with a 5-year warranty. One failed in under a year and was replaced under warranty; the others started failing just after 5 years – a testament to how well manufacturers have dialled in their processes.

The failure modes are very different. SSDs and spinning drives share (small) risks of some electrical component failing, but SSD wear is based on write cycles, so SSDs used mostly read-only (e.g., web servers) should last longer than drives used in a data pipeline where new data are written, processed to write more new products, then moved to archival storage (tape robot). With SSDs you can see when spare cells are allocated and plan for replacement.
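As an example, smartmontools exposes this (device names below are hypothetical; attribute names vary by vendor, so look for things like Reallocated_Sector_Ct on SATA or “Available Spare” / “Percentage Used” on NVMe):

```shell
# Full SMART report for a SATA drive:
sudo smartctl -a /dev/sda

# NVMe drives report "Available Spare" and "Percentage Used" directly:
sudo smartctl -a /dev/nvme0
```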

So far so good: I failed the old drive with mdadm, removed it, added the new SSD, and the RAID array rebuilt fine.
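For anyone finding this thread later, that sequence looks roughly like the following (array and device names are made up; substitute your own):

```shell
# Mark the dying member failed, detach it, then add the replacement:
mdadm /dev/md0 --fail /dev/sda1
mdadm /dev/md0 --remove /dev/sda1
mdadm /dev/md0 --add /dev/sdb1

# The rebuild starts automatically; watch the resync progress:
cat /proc/mdstat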

I am now onto trying to reinstall GRUB and the EFI partition, with a chroot.

Are these instructions anything like what I need?

I managed to chroot and reinstall GRUB, but it didn’t give me a bootable system (“no bootable disk”, it said), so I must be missing something. I suspect it’s to do with the EFI partition.
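For reference, the chroot setup from a live environment typically looks something like this (mount points and device names here are assumptions; adjust for your actual layout):

```shell
# Mount the root filesystem, then /boot and the EFI system partition:
mount /dev/md0 /mnt
mount /dev/sdb2 /mnt/boot           # hypothetical /boot partition
mount /dev/sdb1 /mnt/boot/efi       # hypothetical EFI system partition

# Bind-mount the virtual filesystems GRUB and dnf expect, then enter:
for d in dev proc sys run; do mount --bind /$d /mnt/$d; done
chroot /mnt
```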

I have: a 500MB EFI partition, 5GB /boot, and 250GB /.

As I understand it, the EFI partition may cause problems if it is in RAID. The same goes for /boot.
It appears those two partitions should be on a single drive outside of the RAID array.

When using software RAID, the array does not become active until after the kernel is loaded.
This means the firmware must first read the EFI partition to launch GRUB; then GRUB must reach the grub.cfg in /boot in order to load the kernel. All of this happens before the RAID is activated.

I’ve managed to get it booting again having largely followed instructions from here - Chroot / rebuild Grub / Fix Damaged EFI Partition / rebuild Kernel (initramfs, vmlinuz) [Fedora 39] · GitHub

I also had to update fstab, as it was a new drive, so the UUIDs had changed.
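The new UUIDs come from blkid. A self-contained sketch of the swap (the UUIDs and the scratch file below are made up; on a real system you would back up and edit /etc/fstab, taking the new UUID from blkid):

```shell
# Demonstrate the edit on a scratch copy instead of the real fstab:
printf 'UUID=1111-OLD /boot ext4 defaults 1 2\n' > /tmp/fstab.demo

# On a real system: NEW_UUID=$(blkid -s UUID -o value /dev/sdb2)
NEW_UUID="2222-NEW"

# Replace the stale UUID in place:
sed -i "s/UUID=1111-OLD/UUID=${NEW_UUID}/" /tmp/fstab.demo
cat /tmp/fstab.demo
```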

It’s failing to boot, though, and going to rescue mode. The logs say:

sysroot.mount: Job sysroot.mount/start failed with result 'dependency'
x2droot.device failed on boot

Not sure what I’m missing.

I’d seen this about RAID and /boot and /boot/efi, so I had only done the RAID for /.

This is why I’m in this mess: I had no RAID copy of /boot and /boot/efi, and they were on the drive that died.

I got it working. I hadn’t mounted /home and another partition from the original install in the chroot. I did that, and ran these commands:

dnf reinstall grub2-efi shim -y
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
dnf reinstall kernel-core -y

Amazingly, when the HDD died, it had been trying to boot to install updates. It picked up right where it left off and installed the updates.

Phew.

That grub2-mkconfig command will permanently break the ability to update kernels unless the user makes repairs, or manually repeats the command with every kernel update.

The file you overwrote is created by GRUB as a pointer to the real grub.cfg at /boot/grub2/grub.cfg. The location of the grub.cfg file was changed around the release of F32 (5 years ago), and most users seem aware of the change.

Once changed as you did with that command, the system will no longer automatically update it, since the file under /boot/efi is a static file and only a pointer.
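For context, on Fedora that pointer file is just a few lines that hand off to the real config; it looks roughly like this (the UUID is of course system-specific):

```
search --no-floppy --fs-uuid --set=dev <uuid-of-/boot-filesystem>
set prefix=($dev)/grub2
export $prefix
configfile $prefix/grub.cfg
```

Overwriting it with a full generated config is what breaks the handoff.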

The only way I am aware of to repair it at present is to follow these steps. (I was told recently that a new command is being provided with the release of F43 to do the repairs differently, but for now these steps are required.)

  1. Remove both grub.cfg files:
    sudo rm /boot/efi/EFI/fedora/grub.cfg /boot/grub2/grub.cfg
  2. Rebuild both files properly:
    sudo dnf reinstall grub2-common

Once this is done then the system should perform updates properly.

The grub2-mkconfig command should only be used with the output file as one of
/etc/grub2.cfg, /etc/grub2-efi.cfg (both symlinks) or /boot/grub2/grub.cfg (the real file).

I suspect that if a new kernel was installed as the update was performed it may not be bootable.


Did this actually do anything? On recent Fedora releases it should refuse to write to /boot/efi and give you an error message explaining the correct place to write the config (as @computersavvy said).

I don’t remember which, but on either F41 or F42 an error message was added to indicate the command should not be run with that output file. The user I responded to seemed to indicate that the command completed properly, which is why I gave the proper fix for when that happens.

It did seem to let me run the command, yeah.

Thank you for this. I ran steps 1 and 2. Do I need to run grub2-mkconfig at all, or is that run for me behind the scenes?

I did see the following output, which sounds promising.

Scriptlet output:
Generating grub configuration file …
Adding boot menu entry for UEFI Firmware Settings …
done

No need.
The reinstall (step 2) does it for you and as stated, rebuilds both files properly.

That’s great, thank you so much for warning me about that!

Phew! So glad that’s fixed … until next time anyway …

