Boot Drive Dead

Hi forum,

I’ve got a little home server with the following RAID setup. Any help would be amazing, as it’s fallen over.

A. 500GB HDD - /boot and /
B. 500GB SSD - /

C. 3TB HDD - /family
D. 3TB HDD - /family
E. 3TB HDD - /family

The RAID is software, using mdadm. It’s been running a good while now, and the LVM stuff wasn’t around back then.
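For context, this is roughly how I check what the arrays currently look like (the md device name is just an example, yours may differ):

# overall state of every md array and its members
cat /proc/mdstat
# detail for one array, e.g. the mirror holding /
mdadm --detail /dev/md0
# which physical drives back which partitions
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT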

A few of these have died over the years, and I’ve replaced them. B was originally an HDD, but I went for an SSD the last time I replaced it.

A has just died, so I have two questions …

  1. When I buy a new drive and set it up for RAID, how do I go about sorting out a new /boot partition and GRUB? I’ve never had to do this before.

  2. 500GB HDDs seem to be hard to come by. Is there anything to be said for sticking with an HDD for reliability on /, buying a 1TB drive but only using 500GB of it? I went for the other SSD for speed, and thought I’d keep pairing it with an HDD for reliability. There was some mdadm parameter I set to allow for the different write speeds.

Thanks for any help or comments.

Pairing an HDD with an SSD in RAID is asking for trouble. The write speeds are tremendously different, which can cause issues.

I had read that. I found my notes from when I set this up, and it was the “writemostly” flag I saw here that suggested to me it was okay to do this - Hybrid HDD + SSD RAID1 on Linux
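For reference, this is the sort of thing that page describes; the flag marks the HDD so reads are served from the SSD where possible (device names here are just examples, not my actual layout):

# create a RAID1 with the SSD first and the HDD flagged write-mostly
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 --write-mostly /dev/sdb2
# or toggle the flag on a member of a running array via sysfs
echo writemostly > /sys/block/md0/md/dev-sdb2/state
# and clear it again
echo -writemostly > /sys/block/md0/md/dev-sdb2/state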

If that page is wrong, then maybe I’m better getting a 2nd SSD.

In which case, it’s just a matter of working out how to sort out /boot and GRUB.

I like the info on that page, but I have never tried hybrid RAID myself.
YMMV, but it may be worth a try.

Smaller HDDs are difficult to find any more, and you must be careful about how the drive records data.
HDDs that use SMR (shingled magnetic recording) are worthless in most cases: once a little data has been written, the drive begins ‘shingling’ the data (overlapping tracks), which really slows down writes and can potentially introduce read errors. Drives with CMR (conventional magnetic recording) are the way to go for most people, and the only type I will accept.
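As far as I know there’s no flag that reliably reports SMR vs CMR from software, so the practical check is to grab the model string and look it up against the manufacturer’s CMR/SMR lists:

# model and size for each disk
lsblk -d -o NAME,MODEL,SIZE
# SMART identity info also shows model and firmware
smartctl -i /dev/sda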

Yeah, maybe it’s safer in some ways to go SSD then and turn off the writemostly flag.

That old HDD lasted 13-15 years I reckon, which is amazing. That’s the appeal of HDDs. But maybe SSD tech is better now and they might last as long.

I have a couple of USB HDDs that are that old and still working. 15 years ago may have been “peak quality” for HDDs. A few years later, at work, we bought a case of 2TB drives with 5-year warranties. One failed in under a year and was replaced under warranty; the others started failing just after 5 years, a testament to how well manufacturers have dialled in their processes.

The failure modes are very different. SSDs and spinning drives share (small) risks of some electrical component failing, but SSD wear is based on write cycles, so SSDs used mostly read-only (e.g., web servers) should last longer than drives used in a data pipeline where new data are written, processed to write more new products, and then moved to archival storage (tape robot). With SSDs you can see when spare cells are allocated and plan for replacement.
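This is roughly how to keep an eye on that (the device names are examples, and the exact attribute names vary by vendor):

# SATA SSD: look for attributes like Wear_Leveling_Count or Media_Wearout_Indicator
smartctl -A /dev/sda
# NVMe SSD: check "Percentage Used" and "Available Spare" in the health log
smartctl -a /dev/nvme0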

So far so good: I failed the old drive with mdadm, removed it, added the new SSD, and the RAID array rebuilt fine.
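For anyone finding this later, the swap itself was basically the following (array and partition names are examples, yours will differ):

# mark the dying member failed and pull it out of the array
mdadm /dev/md0 --fail /dev/sda3 --remove /dev/sda3
# partition the new disk to match, then add it
mdadm /dev/md0 --add /dev/sdb3
# watch the rebuild progress
watch cat /proc/mdstat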

I am now on to trying to reinstall GRUB and the EFI partition, using a chroot.

Are these instructions anything like?

I managed to chroot and reinstall GRUB, but it didn’t give me a bootable system (“no bootable disk”, it said), so I must be missing something. I suspect it’s to do with the EFI partition.
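The rough chroot sequence I used was something like this (mount points and device names are examples from my notes, adjust for your layout):

# from a live/rescue environment, mount the root array and the new boot partitions
mount /dev/md0 /mnt
mount /dev/sdb2 /mnt/boot
mount /dev/sdb1 /mnt/boot/efi
# bind the virtual filesystems the tools expect, then enter the chroot
for d in /proc /sys /dev /run; do mount --bind "$d" "/mnt$d"; done
chroot /mnt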

I have a 500MB EFI partition, a 5GB /boot, and a 250GB /.

As I understand it, the EFI partition may cause problems if it is in RAID; the same goes for /boot.
It appears those two partitions should be on a single drive outside of the RAID array.

When using software RAID, the array does not become active until after the kernel is loaded.
This means the firmware must first read the EFI partition to launch GRUB; then GRUB must read grub.cfg in /boot in order to load the kernel. All of this happens before the RAID is activated.
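If you were laying out the replacement drive along those lines, it would look something like this (sizes and the device name are just examples):

# GPT layout: ESP and /boot outside the array, the rest joins the RAID
sgdisk -Z /dev/sdb                               # wipe the old partition table
sgdisk -n1:0:+500M -t1:ef00 -c1:"EFI"  /dev/sdb  # EFI system partition
sgdisk -n2:0:+5G   -t2:8300 -c2:"boot" /dev/sdb  # plain /boot filesystem
sgdisk -n3:0:0     -t3:fd00 -c3:"md"   /dev/sdb  # remainder for the md member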

I’ve managed to get it booting again, having largely followed the instructions from here - Chroot / rebuild Grub / Fix Damaged EFI Partition / rebuild Kernel (initramfs, vmlinuz) [Fedora 39] · GitHub

I also had to update /etc/fstab, since it was a new drive and the UUIDs had changed.
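The fstab fix was just a matter of grabbing the new UUIDs and editing them in:

# list the UUIDs of all filesystems on the new drive
blkid
# or a tidier view
lsblk -o NAME,FSTYPE,UUID,MOUNTPOINT
# then update the UUID= entries for /boot and /boot/efi in /etc/fstab to match
vi /etc/fstab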

It’s failing to boot though, and dropping to rescue mode. The logs say …

sysroot.mount: Job sysroot.mount/start failed with result ‘dependency’.
x2droot.device failed on boot

Not sure what I’m missing.
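From the rescue shell, these are the sort of things worth checking (the unit name comes from my logs, so treat it as an example):

# full journal for this boot, to see which dependency actually failed
journalctl -xb
# status of the failed mount unit
systemctl status sysroot.mount
# confirm the root= (and any rd.md.*) arguments still match the real devices
cat /proc/cmdline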

I’d seen this advice about RAID and /boot and /boot/efi, so I had only set up RAID for /.

This is why I’m in this mess: I had no RAID backup of /boot and /boot/efi; they were on the drive that died.

I got it working. I hadn’t mounted /home and another partition from the original install inside the chroot. I did that, then ran these commands …

# reinstall the GRUB EFI binary and shim onto the new EFI partition
dnf reinstall grub2-efi shim -y
# regenerate the GRUB config
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
# reinstall the kernel so vmlinuz and the initramfs are rebuilt on the new /boot
dnf reinstall kernel-core -y
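Afterwards, one quick sanity check that the firmware entry points at the new EFI partition:

# list the UEFI boot entries and confirm the Fedora entry references the new disk
efibootmgr -v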

Amazingly, when the HDD died, it had been trying to boot to install updates. It picked up right where it left off and installed the updates.

Phew.