Yesterday, I got some NVMe drives and replaced two of the four devices in a RAID10 (near layout) md device that is LUKS-encrypted and contains LVroot, the LV for the ‘/’ filesystem.
Procedure (commands sketched below):
- add /dev/nvme#n1 as a spare to the md device
- fail /dev/sdX; the spare takes over and a resync starts automatically
- wait until the resync finishes
- repeat for the other drive
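From memory, each pass looked roughly like this (my array is /dev/md126; the device names are examples):

    mdadm /dev/md126 --add /dev/nvme0n1    # new NVMe joins as a spare
    mdadm /dev/md126 --fail /dev/sdX       # spare takes over, resync starts
    watch cat /proc/mdstat                 # wait for the resync to finish
    mdadm /dev/md126 --remove /dev/sdX     # drop the failed old drive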
Everything looked fine, and the array state was clean. So I rebooted, but the md device was assembled without the two new drives!
I re-added them (with --add, not --re-add, but that is supposed to amount to the same thing) and they immediately started to resync, which completed successfully.
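Roughly:

    mdadm /dev/md126 --add /dev/nvme0n1
    mdadm /dev/md126 --add /dev/nvme1n1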
I don’t understand this. They were added to the md device, but they don’t even show up in mdadm --detail; it’s as if someone --remove-ed them. I also noticed that, after POST, the system took a long time to ask me for the LUKS password to decrypt the md device, as if it was having trouble finding all the disks or figuring out what was going on. But I checked the kernel parameters: the correct UUID for this md device is listed there, and the UUID didn’t change when I replaced the two disks. So I don’t understand why mdadm would have any trouble assembling/running the md device on boot. All the information it needed was there.
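This is how I checked (on my system the array UUID appears as rd.md.uuid=… because the initramfs is dracut-built; the parameter name may differ with other tooling):

    cat /proc/cmdline                          # look for rd.md.uuid=<array uuid>
    mdadm --detail /dev/md126 | grep UUID      # UUID of the running array
    mdadm --examine /dev/nvme0n1 | grep UUID   # UUID in the member’s superblock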
I’m certain of this, in fact, because I actually updated /etc/mdadm.conf after each stage of the process with mdadm --detail --brief. After I added the two new drives, the ARRAY line had an extra parameter, spares=2 (or something like that). Then, after the one drive finished resyncing and I removed the old disk from the array, it went down to spares=1. And then, when the migration was fully complete, I updated /etc/mdadm.conf again, and it was exactly as it was before I started, matching the backup file I saved. So I don’t even think there is any point in updating the initrd image. I didn’t do that, but I could; that would make some sense, except I don’t know what files inside it would need updating, if not mdadm.conf.
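For reference, the conf update was just the first line below; the initrd rebuild (which I haven’t done) would presumably be one of the others, depending on distro:

    mdadm --detail --brief /dev/md126 >> /etc/mdadm.conf   # then delete the stale ARRAY line
    update-initramfs -u    # Debian/Ubuntu
    dracut -f              # Fedora/RHEL and friends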
I checked the logs and there aren’t any disk errors or anything else suspicious. I’m sure it wasn’t a real drive failure of any kind.
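What I checked, roughly (assuming a persistent systemd journal; smartctl is from smartmontools):

    journalctl -b -1 -k | grep -iE 'md126|nvme|raid'   # kernel messages from the previous boot
    smartctl -a /dev/nvme0n1                           # drive health/error log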
I’ve done this exact thing several times before, migrating my RAID10 from HDDs to SSDs, then to faster SSDs, and so on. But it’s been a while, and I forget the details. I think the same exact thing happened last time, too, but I don’t remember what I did to fix it.
Now I see this:
    Personalities : [raid10]
    md126 : active raid10 nvme1n1[9] nvme0n1[4] sdc[7] sdd[6]
          1953262592 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
          bitmap: 11/15 pages [44KB], 65536KB chunk

    unused devices: <none>
My only thought about this issue is that, because I (manually) failed over to the spare disks, when I rebooted, at least one of the new NVMe disks had an (S) mark next to it. It looked something like this:
    Personalities : [raid10]                                                    # fake
    md126 : active raid10 nvme1n1[9](S) nvme0n1[4](S) sdc[7] sdd[6]             # fake
          1953262592 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]    # fake
          bitmap: 10/15 pages [40KB], 65536KB chunk                             # fake
Could this be why these two drives got kicked out? It doesn’t make any sense, since, if I hadn’t failed those disks, those two spares wouldn’t have been kicked out. (Right? I mean, I guess I can’t say for sure; I didn’t try.) Why would they get kicked out while in use? They were full members, even if flagged as spares. What process could remove them?
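Next time, before rebooting, I’ll check what role each member’s superblock actually records, something like:

    mdadm --examine /dev/nvme0n1 | grep -i 'device role'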
In any case, they are not spares now. But other than that, the situation looks exactly the same as the last time I rebooted. I’m going to try again and report back here.