Successfully Replaced RAID10 Drives, Degraded after Reboot

Yesterday, I got some NVMe drives and replaced two of the four devices in a RAID10 (NEAR) md device that is encrypted and contains LVroot, the LV for the ‘/’ filesystem.

Procedure:

  • add /dev/nvme#n1 as spare to md
  • fail /dev/sdX
  • autoresync started, wait till finished
  • repeat for the other drive (rough commands sketched below)
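
Roughly, each pass looked like this (device names below are placeholders/examples, not necessarily the exact ones I used; md126 is the array shown further down):

  # add the new NVMe device to the array as a spare
  mdadm /dev/md126 --add /dev/nvme0n1

  # fail the outgoing member so the spare takes over and the resync starts
  mdadm /dev/md126 --fail /dev/sdX

  # watch the rebuild until it finishes
  watch cat /proc/mdstat

  # once the array is clean again, drop the old member
  mdadm /dev/md126 --remove /dev/sdX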

Everything looked fine, and the state was Clean. So I rebooted, but the md device was assembled missing the two new drives!

I re-added them (with --add, not --re-add, but that is supposed to be equivalent), and they immediately started to resync, which completed successfully.
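
For reference, the two forms I mean (example device name; --re-add only works if the old metadata still matches, while plain --add will re-add when it can and otherwise does a fresh add with a full resync):

  # what I actually ran after the reboot
  mdadm /dev/md126 --add /dev/nvme0n1

  # the explicit variant
  mdadm /dev/md126 --re-add /dev/nvme0n1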

I don’t understand this. They had been added to the md, but they don’t even show up in mdadm --detail; it’s as if someone --remove-ed them. I also noticed that, after POST, the system took a long time to ask me for the LUKS password to decrypt the md, as if it was having trouble finding all the disks or figuring out what was going on. But I checked the kernel parameters, and the correct UUIDs for this md are listed there, and the UUID didn’t change when I replaced the two disks, so I don’t understand why mdadm would have any trouble assembling/running the md at boot. All the information it needed was there.
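
For the curious, this is roughly how I compared them (md126 is my array; treat the grep patterns as a sketch):

  # dracut hints for md and LUKS on the kernel command line
  tr ' ' '\n' < /proc/cmdline | grep -E 'rd\.md|rd\.luks'

  # UUID of the running array, to compare against rd.md.uuid=
  mdadm --detail /dev/md126 | grep -i uuid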

I’m certain of this, in fact, because I actually updated /etc/mdadm.conf after each stage of the process with mdadm --detail --brief. After I added the two new drives, the ARRAY line had an extra parameter: “num-spares=2”, or something like that. Then, after the first drive finished re-syncing and I removed the old disk from the array, it went down to num-spares=1. And when the migration was fully complete, I updated /etc/mdadm.conf again, and it was exactly as it was before I started, matching the backup file I saved. So I don’t even think there is any point in updating the initrd image. I didn’t do that, but I could. That would make some sense, but I don’t know what files would need to be updated, if not mdadm.conf.
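
The config update each time was basically this (a sketch; I edited the ARRAY line by hand rather than blindly appending):

  # keep a backup of the current config
  cp /etc/mdadm.conf /etc/mdadm.conf.bak

  # print the one-line ARRAY definition for the array
  mdadm --detail --brief /dev/md126

  # then replace the corresponding ARRAY line in /etc/mdadm.conf with that output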

I checked the logs and there aren’t any disk errors or anything like that. I’m sure it wasn’t a real drive failure of any kind.
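
For what it’s worth, this is roughly how I checked (kernel log from the previous boot plus a quick SMART check; smartctl comes from smartmontools, and the device name is just an example):

  # kernel messages from the previous boot, filtered for storage-related noise
  journalctl -k -b -1 | grep -iE 'nvme|ata|md126|error'

  # SMART health summary for one of the members
  smartctl -H /dev/sdc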

I’ve done this exact thing several times before: migrating my RAID10 from HDD to SSD, then to faster SSD, and so on. But it’s been a while, and I forget the details. I think the exact same thing happened last time, too, but I don’t remember what I did to fix it.

Now I see this:

  Personalities : [raid10]
  md126 : active raid10 nvme1n1[9] nvme0n1[4] sdc[7] sdd[6]
      1953262592 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 11/15 pages [44KB], 65536KB chunk

  unused devices: <none>

My only thought about this issue is that, because I (manually) failed over to spare disks, at least one of the new NVMe disks had an (S) mark by it when I rebooted. It looked something like this:

  Personalities : [raid10] # fake
  md126 : active raid10 nvme1n1[9](S) nvme0n1[4](S) sdc[7] sdd[6] # fake
      1953262592 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU] # fake
      bitmap: 10/15 pages [40KB], 65536KB chunk # fake

Could this be why these two drives got kicked out? It doesn’t make any sense, since, if I hadn’t failed those disks, those two spares wouldn’t have been kicked out. (Right? I mean, I guess I can’t say for sure; I didn’t try.) Why would they get kicked out while in use? They were full members as spares. What process could remove them?

In any case, they are not spares, now. But, other than that, the situation looks exactly the same as the last time I rebooted. I’m going to try again and report back here.

Okay, no joy. And this is even weirder:

  • I tested the idea that spare disks wouldn’t get thrown out on reboot, and that held up. I added one of the old drives back to the array as a spare and, on boot, the two new drives were lost again, but the spare wasn’t, and it immediately started re-syncing. So I don’t think it has anything to do with the spare status of any array member. That makes the most sense, but it doesn’t help me understand the situation.
  • Again, the two new drives were missing on boot.
  • Boot messages said stuff about finding the UUIDs for the md array and LUKS, which were exactly as expected; nothing has changed.

Okay, what about NVMe? Is it possible the initrd image doesn’t have the nvme modules? lsmod actually shows several modules having to do with nvme, so the driver is modular rather than built in; if those modules aren’t in the initrd, the NVMe devices wouldn’t even exist when the array gets assembled at boot. So maybe that’s the problem. That makes the most sense. Guess I should have thought of that. I remember when not everything was a module. I wouldn’t expect any block device that could be a boot device to be built as a module, but I guess that’s probably a large group these days.
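
A quick way to check that theory (lsinitrd ships with dracut; the image path is the default naming on Fedora, so adjust if yours differs):

  # nvme support in the running kernel is modular
  lsmod | grep nvme

  # is the driver actually packed into the initramfs for this kernel?
  lsinitrd /boot/initramfs-$(uname -r).img | grep -i nvme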

I’ll try dracut --regenerate-all after it re-syncs.

The issue I see is mixing SSD and NVMe drives.
That may not be an issue if the mirrored pairs were NVMe → NVMe and SSD → SSD, since the devices would have approximately the same access speeds, but it could be an issue if a pair were NVMe → SSD.

Striping across NVMe and SSD could, however, also present an issue with access speeds for both reads and writes.

My recommendation would always be to build a RAID 10 array (or any RAID array, for that matter) from matched devices, to avoid potential problems for the system as a result of differing access speeds.

As far as managing an mdadm array, usually you should fail out one device and replace it, wait for the sync to complete, then fail out the next device and replace it.

Doing two devices at the same time is a lot more complex, since you have to make certain you pick the proper pair (one from each mirrored pair, not both from the same mirrored pair). Syncing is also more complex, since it requires building both the striping and the mirroring at the same time.

Re-adding a drive that was previously part of an array and has not been cleanly wiped of all previous RAID ID info presents a problem, as you have already seen.
Wiping that data is as simple as dd if=/dev/zero of=/dev/sdX bs=1M count=1, where you would replace sdX with the actual device name you are wiping the RAID data from. This assumes you are using the bare device for the array; it would not be the same if you were using a partition on the device.

After wiping the device data as noted then mdadm would see the drive as brand new and not know it had been part of the array previously.
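
For example (sdX is a placeholder; double-check the device name before running anything destructive with dd). mdadm also has an option that targets just its own metadata:

  # zero the first MiB of the bare device
  dd if=/dev/zero of=/dev/sdX bs=1M count=1

  # or remove only the md superblock
  mdadm --zero-superblock /dev/sdX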

This is what you showed above. I am not sure what the (S) means, but the next line with the [UUUU] shows that all devices are up and usable.

So far, researching the documentation has not given me anything to indicate the meaning of the (S), but if the device is up and functional then it may be insignificant and inconsequential (or it may not). It may also be related to the difference in speeds between the NVMe and SSD devices.

Thanks @computersavvy . Interesting. I realized that mixing speeds wouldn’t be ideal, but I’m not worried that there is any harm to it. I doubt my issue is related to it, either. But, I guess we’ll see.

As far as managing an mdadm array, usually you should fail out one device and replace it, wait for the sync to complete, then fail out the next device and replace it.

Yes, that’s exactly what I did. Sorry if it wasn’t clear.

Although, I will say that the array reported re-sync on both drives when I added them back after the boot and found they were missing. It’s not clear if it was doing it in series or parallel.

(S) means spare. (F) means failed, and so on. That’s what I’ve found out, but I no longer believe it has anything to do with the disks getting silently kicked out of the array on boot. Actually, it’s not silent; I see “DegradedArray” and “SparesMissing” events in the journal early in the boot cycle. But that’s after the devices have already been “lost”.

The thing I’m trying to get across in this thread is that the array was UUUU before I shut down, both times: State: clean, all disks active and in sync. So there isn’t anything wrong with the drives themselves or with the array config. It’s something else. It has to be either the kernel modules or the boot parameters, and it doesn’t appear to be the params.

I just rebuilt all my initrds after restructuring the array a third time. /proc/mdstat shows this, which is exactly what it looked like when I rebooted the first time.

  md126 : active raid10 nvme1n1[9] nvme0n1[8] sdc[7] sdd[6]
      1953262592 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 8/15 pages [32KB], 65536KB chunk

Bingo! Booted fine. It must have been that the nvme kernel module was missing from the initrd imgs.

SOLUTION: rebuild initrd images (dracut --force --regenerate-all) when adding new types of block devices to the boot chain.
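
Concretely, what fixed it for me (the lsinitrd line is just to confirm the driver made it into the image; the path assumes the default Fedora naming):

  # rebuild the initramfs for every installed kernel
  dracut --force --regenerate-all

  # verify the nvme driver is now inside the image for the running kernel
  lsinitrd /boot/initramfs-$(uname -r).img | grep -i nvme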

Oh, also, I learned a better way to replace drives in an array!

The process for replacing active devices in an array, as opposed to bad or failed devices, is to mark the outgoing disk for replacement; you don’t manually fail the drive to start the process. The recovery will start onto an existing spare, or as soon as a new spare becomes available.

mdadm <array> --replace <old drive>

And this way is better because the disk to be replaced remains active until the spare is fully synced, and then the array automatically fails the outgoing drive for you when it’s time. Very cool. I did this the third time I rebuilt the array, and it worked great. The array is never without redundancy.
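
Roughly, the full sequence looks like this (sdX and nvmeXn1 are placeholders; --with is optional and just pins which spare gets used):

  # add the new disk as a spare first
  mdadm /dev/md126 --add /dev/nvmeXn1

  # mark the outgoing member for replacement; it stays active while the new disk syncs
  mdadm /dev/md126 --replace /dev/sdX --with /dev/nvmeXn1

  # when the rebuild completes, the outgoing member is failed automatically and can be removed
  mdadm /dev/md126 --remove /dev/sdX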
