RAID 5 disk failed; can't get it restarted

You should be able to start the ssh service in the live environment with systemctl start sshd.service. You may also need to set a password for liveuser (or root) with the passwd command before you can remote into the running live environment. Use ip address list to see what the IP address of your running live system is. Then, from another system on the same network, you should be able to remote in with, e.g., ssh liveuser@<IP-Address>.
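Put together, the live-session setup might look something like the sketch below. The commands are echoed here rather than executed so nothing changes by accident; drop the leading echo to run each step for real (as root in the live session).

```shell
# Sketch of the live-session ssh setup (run as root in the live environment).
# Each command is echoed, not executed; remove 'echo' to run it for real.
echo systemctl start sshd.service   # start the ssh daemon
echo passwd liveuser                # set a password so remote login is allowed
echo ip address list                # find the live system's IP address
```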

We certainly don’t expect you to hand type all that output. :slight_smile:

Why not enable ssh on the workstation?
sudo systemctl enable sshd would enable it so it starts on the next boot. Add the --now option to that command (or run sudo systemctl start sshd after enabling it) and it would also be available immediately.

If you have copied the data correctly then you now have a partition UUID for sdb1, sdc1, and sdd1.
What is shown with ls -al /dev/disk/by-uuid and ls /dev/?
I am trying to track down why you see a partition UUID, and I also need to see what the UUID for the raid array itself shows up as.
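For background, blkid is one way to see the distinction: on a raid member partition it reports the array UUID as UUID and a per-member UUID_SUB, plus TYPE="linux_raid_member". Here is a small illustration against sample text (the UUID shown is the array UUID from this thread rendered in blkid's dashed form, and the UUID_SUB value is made up; on the real system just run blkid /dev/sdb1 /dev/sdc1 /dev/sdd1):

```shell
# Illustrative blkid output for one raid member (UUID_SUB is hypothetical;
# run 'blkid /dev/sdc1' on the real system to get the actual line).
sample='/dev/sdc1: UUID="623b22cb-827b-d418-c714-8f3bd3165770" UUID_SUB="deadbeef-0000-0000-0000-000000000000" TYPE="linux_raid_member"'
# The TYPE field is what tells you the partition is an md member rather
# than a plain filesystem:
printf '%s\n' "$sample" | grep -o 'TYPE="[^"]*"'
```

All members of the same array should report the same UUID; only UUID_SUB differs per member.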

I don’t have quite the same situation since I build my raid array on the raw device, not in partitions.

As far as halting the rebuild, the simplest would be just a reboot, and I would do that from the desktop menu (upper right corner) rather than a terminal window.

After the reboot one may be able to do an incremental assembly of the array and activate it using the device names shown with cat /proc/mdstat.
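In commands, that reassembly might look like the sketch below. The member names are the ones seen earlier in this thread and may differ after a reboot, so check /proc/mdstat first. The commands are echoed rather than executed; drop the echo to run them for real.

```shell
# Sketch: stop a half-assembled array, then reassemble it from the
# surviving members listed by 'cat /proc/mdstat'.  Device names here
# are from this thread and may change across reboots.
echo cat /proc/mdstat
echo mdadm --stop /dev/md127
echo mdadm --assemble /dev/md127 /dev/sdc1 /dev/sdd1
```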

If you have enough power cables and SATA cables, it might be easier to connect both the old and the new drives before booting up, to see what the system may tell you about the array after boot.

OK, thanks fellows. I’ll do that in the morning. The workstation is in pieces on the bench at the moment, so I may be able to connect both disks if I can find a spare SATA power cable; I have a data cable. At the moment it is running from the command line where it failed, so I’ll have to do a systemctl command to stop it.

ls -al /dev/disk/by-uuid

drwxr-xr-x 2 root root 120 Aug 7 14:44 .
drwxr-xr-x 7 root root 140 Aug 7 14:44 ..
lrwxrwxrwx 1 root root 10 Aug 7 14:44 138d815a-fa76-4c5a-a55d-4e5ea2071dd2 → ../../sda1
lrwxrwxrwx 1 root root 10 Aug 7 14:44 82cf6fd1-0fe3-43ae-9652-184752b826c1 → ../../sda4
lrwxrwxrwx 1 root root 10 Aug 7 14:44 CA17-88B3 → ../../sda3
lrwxrwxrwx 1 root root 10 Aug 7 14:44 d4f4e3fe-5ffb-44ec-ab71-bb2857fad1eb → ../../sda2

ls /dev/
I’ll send you this when I get ssh running; it is a 7 by 26 array and the odds of typing it correctly are about zero!

I have everything running under the Live DVD and the OLD sdb disk is plugged into SATA6 with the replacement on SATA2. All disks came up properly and the array started rebuilding immediately, but onto /dev/sde, which is likely the defective disk. I’m going to let it run and as soon as possible add /dev/sdb as a spare, so if the rebuild fails it should jump to the new disk automatically (right??)
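The usual command to add a partition as a spare is presumably something like the sketch below (echoed here rather than run, so nothing happens by accident; drop the echo to execute). While a rebuild is already in progress the new device should just sit as a spare.

```shell
# Sketch: add the new disk's partition to the array as a spare, then
# confirm it shows up with state 'spare' in the detail output.
# Commands are echoed, not executed; remove 'echo' to run for real.
echo mdadm /dev/md127 --add /dev/sdb1
echo mdadm --detail /dev/md127
```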

FYI here is the current mdadm --detail /dev/md127

/dev/md127:
        Version : 1.2
  Creation Time : Fri Apr  6 03:07:32 2018
     Raid Level : raid5
  Used Dev Size : 244192256 (232.88 GiB 250.05 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sat Jan 22 11:49:52 2022
          State : active, degraded, Not Started
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : wkstn04.iliffe.ca:root
           UUID : 623b22cb:827bd418:c7148f3b:d3165770
         Events : 141421

    Number   Major   Minor   RaidDevice State
       0       8       65        0      spare rebuilding   /dev/sde1
       1       8       33        1      active sync   /dev/sdc1
       3       8       49        2      active sync   /dev/sdd1

and here is the listing for ls -al /dev that Jeff requested last night:

total 4
drwxr-xr-x. 19 root root 4260 Aug 9 12:38 .
dr-xr-xr-x. 18 root root 4096 Aug 9 2023 ..
crw-------. 1 root root 10, 235 Aug 9 2023 autofs
drwxr-xr-x. 2 root root 440 Aug 9 12:38 block
drwxr-xr-x. 2 root root 160 Aug 9 12:07 bsg
crw-------. 1 root root 10, 234 Aug 9 12:07 btrfs-control
drwxr-xr-x. 3 root root 60 Aug 9 12:07 bus
lrwxrwxrwx. 1 root root 3 Aug 9 2023 cdrom → sr0
drwxr-xr-x. 2 root root 4300 Aug 9 12:29 char
crw-------. 1 root root 5, 1 Aug 9 12:29 console
lrwxrwxrwx. 1 root root 11 Aug 9 12:07 core → /proc/k
drwxr-xr-x. 6 root root 140 Aug 9 12:07 cpu
crw-------. 1 root root 10, 62 Aug 9 2023 cpu_dma_latency
crw-------. 1 root root 10, 203 Aug 9 12:07 cuse
drwxr-xr-x. 8 root root 160 Aug 9 12:07 disk
brw-rw----. 1 root disk 253, 0 Aug 9 2023 dm-0
brw-rw----. 1 root disk 253, 1 Aug 9 2023 dm-1
crw-rw----. 1 root video 29, 0 Aug 9 2023 fb0
lrwxrwxrwx. 1 root root 13 Aug 9 12:07 fd → /proc/sel
crw-rw-rw-. 1 root root 1, 7 Aug 9 2023 full
crw-rw-rw-. 1 root root 10, 229 Aug 9 12:29 fuse
crw-------. 1 root root 246, 0 Aug 9 2023 hidraw0
crw-------. 1 root root 246, 1 Aug 9 2023 hidraw1
crw-------. 1 root root 10, 228 Aug 9 2023 hpet
drwxr-xr-x. 3 root root 0 Aug 9 12:29 hugepages
crw-------. 1 root root 10, 183 Aug 9 2023 hwrng
lrwxrwxrwx. 1 root root 25 Aug 9 2023 initctl → /run
itctl/fifo
drwxr-xr-x. 4 root root 460 Aug 9 2023 input
crw-r--r--. 1 root root 1, 11 Aug 9 2023 kmsg
crw-rw-rw-+ 1 root kvm 10, 232 Aug 9 12:29 kvm
lrwxrwxrwx. 1 root root 28 Aug 9 12:07 log → /run/sys
l/dev-log
brw-rw----. 1 root disk 7, 0 Aug 9 2023 loop0
brw-rw----. 1 root disk 7, 1 Aug 9 2023 loop1
brw-rw----. 1 root disk 7, 2 Aug 9 2023 loop2
crw-rw----. 1 root disk 10, 237 Aug 9 2023 loop-control
crw-rw----. 1 root lp 6, 0 Aug 9 2023 lp0
crw-rw----. 1 root lp 6, 1 Aug 9 2023 lp1
crw-rw----. 1 root lp 6, 2 Aug 9 2023 lp2
crw-rw----. 1 root lp 6, 3 Aug 9 2023 lp3
drwxr-xr-x. 2 root root 100 Aug 9 12:07 mapper
crw-------. 1 root root 10, 227 Aug 9 2023 mcelog
brw-rw----. 1 root disk 9, 127 Aug 9 2023 md127
crw-r-----. 1 root kmem 1, 1 Aug 9 2023 mem
crw-------. 1 root root 10, 59 Aug 9 2023 memory_bandwidt
drwxrwxrwt. 2 root root 40 Aug 9 12:07 mqueue
drwxr-xr-x. 2 root root 60 Aug 9 12:07 net
crw-------. 1 root root 10, 61 Aug 9 2023 network_latency
crw-------. 1 root root 10, 60 Aug 9 2023 network_through
crw-rw-rw-. 1 root root 1, 3 Aug 9 2023 null
crw-------. 1 root root 10, 144 Aug 9 2023 nvram
crw-r-----. 1 root kmem 1, 4 Aug 9 2023 port
crw-------. 1 root root 108, 0 Aug 9 2023 ppp
crw-rw-rw-. 1 root tty 5, 2 Aug 9 13:02 ptmx
drwxr-xr-x. 2 root root 0 Aug 9 12:07 pts
crw-rw-rw-. 1 root root 1, 8 Aug 9 2023 random
drwxr-xr-x. 2 root root 60 Aug 9 12:07 raw
crw-rw-r--+ 1 root root 10, 58 Aug 9 2023 rfkill
lrwxrwxrwx. 1 root root 4 Aug 9 2023 rtc → rtc0
crw-------. 1 root root 250, 0 Aug 9 2023 rtc0
brw-rw----. 1 root disk 8, 0 Aug 9 12:35 sda
brw-rw----. 1 root disk 8, 1 Aug 9 12:35 sda1
brw-rw----. 1 root disk 8, 2 Aug 9 12:35 sda2
brw-rw----. 1 root disk 8, 3 Aug 9 12:35 sda3
brw-rw----. 1 root disk 8, 4 Aug 9 12:35 sda4
brw-rw----. 1 root disk 8, 16 Aug 9 12:38 sdb
brw-rw----. 1 root disk 8, 17 Aug 9 12:38 sdb1
brw-rw----. 1 root disk 8, 32 Aug 9 12:39 sdc
brw-rw----. 1 root disk 8, 33 Aug 9 12:39 sdc1
brw-rw----. 1 root disk 8, 48 Aug 9 12:43 sdd
brw-rw----. 1 root disk 8, 49 Aug 9 12:43 sdd1
brw-rw----. 1 root disk 8, 64 Aug 9 12:41 sde
brw-rw----. 1 root disk 8, 65 Aug 9 12:41 sde1
crw-rw----. 1 root disk 21, 0 Aug 9 2023 sg0
crw-rw----. 1 root disk 21, 1 Aug 9 2023 sg1
crw-rw----. 1 root disk 21, 2 Aug 9 2023 sg2
crw-rw----. 1 root disk 21, 3 Aug 9 2023 sg3
crw-rw----+ 1 root cdrom 21, 4 Aug 9 2023 sg4
crw-rw----. 1 root disk 21, 5 Aug 9 2023 sg5
drwxrwxrwt. 2 root root 60 Aug 9 12:07 shm
crw-------. 1 root root 10, 231 Aug 9 2023 snapshot
drwxr-xr-x. 3 root root 380 Aug 9 2023 snd
brw-rw----+ 1 root cdrom 11, 0 Aug 9 2023 sr0
lrwxrwxrwx. 1 root root 15 Aug 9 12:07 stderr → /proc
lrwxrwxrwx. 1 root root 15 Aug 9 12:07 stdin → /proc/
lrwxrwxrwx. 1 root root 15 Aug 9 12:07 stdout → /proc
crw-rw-rw-. 1 root tty 5, 0 Aug 9 12:53 tty
crw--w----. 1 root tty 4, 0 Aug 9 2023 tty0
crw--w----. 1 gdm tty 4, 1 Aug 9 2023 tty1
crw--w----. 1 root tty 4, 10 Aug 9 2023 tty10
crw--w----. 1 root tty 4, 11 Aug 9 2023 tty11
crw--w----. 1 root tty 4, 12 Aug 9 2023 tty12
crw--w----. 1 root tty 4, 13 Aug 9 2023 tty13
crw--w----. 1 root tty 4, 14 Aug 9 2023 tty14
crw--w----. 1 root tty 4, 15 Aug 9 2023 tty15
crw--w----. 1 root tty 4, 16 Aug 9 2023 tty16
crw--w----. 1 root tty 4, 17 Aug 9 2023 tty17
crw--w----. 1 root tty 4, 18 Aug 9 2023 tty18
crw--w----. 1 root tty 4, 19 Aug 9 2023 tty19
crw--w----. 1 liveuser tty 4, 2 Aug 9 2023 tty2
crw--w----. 1 root tty 4, 20 Aug 9 2023 tty20
crw--w----. 1 root tty 4, 21 Aug 9 2023 tty21
crw--w----. 1 root tty 4, 22 Aug 9 2023 tty22
crw--w----. 1 root tty 4, 23 Aug 9 2023 tty23
crw--w----. 1 root tty 4, 24 Aug 9 2023 tty24
crw--w----. 1 root tty 4, 25 Aug 9 2023 tty25
crw--w----. 1 root tty 4, 26 Aug 9 2023 tty26
crw--w----. 1 root tty 4, 27 Aug 9 2023 tty27
crw--w----. 1 root tty 4, 28 Aug 9 2023 tty28
crw--w----. 1 root tty 4, 29 Aug 9 2023 tty29
crw--w----. 1 root tty 4, 3 Aug 9 2023 tty3
crw--w----. 1 root tty 4, 30 Aug 9 2023 tty30
crw--w----. 1 root tty 4, 31 Aug 9 2023 tty31
crw--w----. 1 root tty 4, 32 Aug 9 2023 tty32
crw--w----. 1 root tty 4, 33 Aug 9 2023 tty33
crw--w----. 1 root tty 4, 34 Aug 9 2023 tty34
crw--w----. 1 root tty 4, 35 Aug 9 2023 tty35
crw--w----. 1 root tty 4, 36 Aug 9 2023 tty36
crw--w----. 1 root tty 4, 37 Aug 9 2023 tty37
crw--w----. 1 root tty 4, 38 Aug 9 2023 tty38
crw--w----. 1 root tty 4, 39 Aug 9 2023 tty39
crw--w----. 1 root tty 4, 4 Aug 9 2023 tty4
crw--w----. 1 root tty 4, 40 Aug 9 2023 tty40
crw--w----. 1 root tty 4, 41 Aug 9 2023 tty41
crw--w----. 1 root tty 4, 42 Aug 9 2023 tty42
crw--w----. 1 root tty 4, 43 Aug 9 2023 tty43
crw--w----. 1 root tty 4, 44 Aug 9 2023 tty44
crw--w----. 1 root tty 4, 45 Aug 9 2023 tty45
crw--w----. 1 root tty 4, 46 Aug 9 2023 tty46
crw--w----. 1 root tty 4, 47 Aug 9 2023 tty47
crw--w----. 1 root tty 4, 48 Aug 9 2023 tty48
crw--w----. 1 root tty 4, 49 Aug 9 2023 tty49
crw--w----. 1 root tty 4, 5 Aug 9 2023 tty5
crw--w----. 1 root tty 4, 50 Aug 9 2023 tty50
crw--w----. 1 root tty 4, 51 Aug 9 2023 tty51
crw--w----. 1 root tty 4, 52 Aug 9 2023 tty52
crw--w----. 1 root tty 4, 53 Aug 9 2023 tty53
crw--w----. 1 root tty 4, 54 Aug 9 2023 tty54
crw--w----. 1 root tty 4, 55 Aug 9 2023 tty55
crw--w----. 1 root tty 4, 56 Aug 9 2023 tty56
crw--w----. 1 root tty 4, 57 Aug 9 2023 tty57
crw--w----. 1 root tty 4, 58 Aug 9 2023 tty58
crw--w----. 1 root tty 4, 59 Aug 9 2023 tty59
crw--w----. 1 root tty 4, 6 Aug 9 2023 tty6
crw--w----. 1 root tty 4, 60 Aug 9 2023 tty60
crw--w----. 1 root tty 4, 61 Aug 9 2023 tty61
crw--w----. 1 root tty 4, 62 Aug 9 2023 tty62
crw--w----. 1 root tty 4, 63 Aug 9 2023 tty63
crw--w----. 1 root tty 4, 7 Aug 9 2023 tty7
crw--w----. 1 root tty 4, 8 Aug 9 2023 tty8
crw--w----. 1 root tty 4, 9 Aug 9 2023 tty9
crw-rw----. 1 root dialout 4, 64 Aug 9 2023 ttyS0
crw-rw----. 1 root dialout 4, 65 Aug 9 2023 ttyS1
crw-rw----. 1 root dialout 4, 74 Aug 9 2023 ttyS10
crw-rw----. 1 root dialout 4, 75 Aug 9 2023 ttyS11
crw-rw----. 1 root dialout 4, 76 Aug 9 2023 ttyS12
crw-rw----. 1 root dialout 4, 77 Aug 9 2023 ttyS13
crw-rw----. 1 root dialout 4, 78 Aug 9 2023 ttyS14
crw-rw----. 1 root dialout 4, 79 Aug 9 2023 ttyS15
crw-rw----. 1 root dialout 4, 80 Aug 9 2023 ttyS16
crw-rw----. 1 root dialout 4, 81 Aug 9 2023 ttyS17
crw-rw----. 1 root dialout 4, 82 Aug 9 2023 ttyS18
crw-rw----. 1 root dialout 4, 83 Aug 9 2023 ttyS19
crw-rw----. 1 root dialout 4, 66 Aug 9 2023 ttyS2
crw-rw----. 1 root dialout 4, 84 Aug 9 2023 ttyS20
crw-rw----. 1 root dialout 4, 85 Aug 9 2023 ttyS21
crw-rw----. 1 root dialout 4, 86 Aug 9 2023 ttyS22
crw-rw----. 1 root dialout 4, 87 Aug 9 2023 ttyS23
crw-rw----. 1 root dialout 4, 88 Aug 9 2023 ttyS24
crw-rw----. 1 root dialout 4, 89 Aug 9 2023 ttyS25
crw-rw----. 1 root dialout 4, 90 Aug 9 2023 ttyS26
crw-rw----. 1 root dialout 4, 91 Aug 9 2023 ttyS27
crw-rw----. 1 root dialout 4, 92 Aug 9 2023 ttyS28
crw-rw----. 1 root dialout 4, 93 Aug 9 2023 ttyS29
crw-rw----. 1 root dialout 4, 67 Aug 9 2023 ttyS3
crw-rw----. 1 root dialout 4, 94 Aug 9 2023 ttyS30
crw-rw----. 1 root dialout 4, 95 Aug 9 2023 ttyS31
crw-rw----. 1 root dialout 4, 68 Aug 9 2023 ttyS4
crw-rw----. 1 root dialout 4, 69 Aug 9 2023 ttyS5
crw-rw----. 1 root dialout 4, 70 Aug 9 2023 ttyS6
crw-rw----. 1 root dialout 4, 71 Aug 9 2023 ttyS7
crw-rw----. 1 root dialout 4, 72 Aug 9 2023 ttyS8
crw-rw----. 1 root dialout 4, 73 Aug 9 2023 ttyS9
crw-------. 1 root root 10, 239 Aug 9 12:07 uhid
crw-------. 1 root root 10, 223 Aug 9 2023 uinput
crw-rw-rw-. 1 root root 1, 9 Aug 9 2023 urandom
drwxr-xr-x. 2 root root 60 Aug 9 12:07 usb
crw-------. 1 root root 247, 0 Aug 9 2023 usbmon0
crw-------. 1 root root 247, 1 Aug 9 2023 usbmon1
crw-------. 1 root root 247, 2 Aug 9 2023 usbmon2
crw-------. 1 root root 247, 3 Aug 9 2023 usbmon3
crw-------. 1 root root 247, 4 Aug 9 2023 usbmon4
crw-------. 1 root root 247, 5 Aug 9 2023 usbmon5
crw-------. 1 root root 247, 6 Aug 9 2023 usbmon6
crw-rw----. 1 root tty 7, 0 Aug 9 2023 vcs
crw-rw----. 1 root tty 7, 1 Aug 9 2023 vcs1
crw-rw----. 1 root tty 7, 2 Aug 9 2023 vcs2
crw-rw----. 1 root tty 7, 3 Aug 9 2023 vcs3
crw-rw----. 1 root tty 7, 4 Aug 9 2023 vcs4
crw-rw----. 1 root tty 7, 5 Aug 9 2023 vcs5
crw-rw----. 1 root tty 7, 6 Aug 9 2023 vcs6
crw-rw----. 1 root tty 7, 128 Aug 9 2023 vcsa
crw-rw----. 1 root tty 7, 129 Aug 9 2023 vcsa1
crw-rw----. 1 root tty 7, 130 Aug 9 2023 vcsa2
crw-rw----. 1 root tty 7, 131 Aug 9 2023 vcsa3
crw-rw----. 1 root tty 7, 132 Aug 9 2023 vcsa4
crw-rw----. 1 root tty 7, 133 Aug 9 2023 vcsa5
crw-rw----. 1 root tty 7, 134 Aug 9 2023 vcsa6
drwxr-xr-x. 2 root root 60 Aug 9 2023 vfio
crw-------. 1 root root 10, 63 Aug 9 2023 vga_arbiter
crw-------. 1 root root 10, 137 Aug 9 2023 vhci
crw-------. 1 root root 10, 238 Aug 9 2023 vhost-net
crw-------. 1 root root 10, 241 Aug 9 2023 vhost-vsock
crw-rw-rw-. 1 root root 1, 5 Aug 9 2023 zero

Not so nicely formatted under Fedora Live as in the rescue shell!

Great!
Progress is being made.

If the rebuild completes successfully, then adding the extra drive you already have on hand as a spare would be extra insurance against future failure. If it does not complete successfully, then the steps mentioned above (first use mdadm to fail and remove the failing member, then add the new device) should succeed.

:+1:

7:00 PM, build is still running (8 hours now) and I’m just letting it go. I’ll let you know whether it finishes, and then add the spare disk and see what it does. I think that even if it does complete, I’ll fail/remove the disk and replace it, as this is a very old disk.

Doesn’t cat /proc/mdstat show a progress meter when it is rebuilding? If not, maybe you could use iotop to see if there is much disk I/O happening. (You may need to install iotop with dnf install iotop.)
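When a rebuild is active, /proc/mdstat normally includes a recovery line with a percentage and an ETA. A quick way to pull just that line out is shown below against sample text (the numbers are illustrative, not from this machine; the blocks figure is just 2 × the per-device size from the --detail output above). On the real system, cat /proc/mdstat or watch -n5 cat /proc/mdstat does the job.

```shell
# Sample of what /proc/mdstat looks like mid-rebuild (illustrative text).
sample='md127 : active raid5 sde1[0] sdc1[1] sdd1[3]
      488384512 blocks super 1.2 level 5, 512k chunk [3/2] [_UU]
      [==>..................]  recovery = 12.6% (30771072/244192256) finish=95.2min speed=37342K/sec'
# Grab just the recovery/resync progress line:
printf '%s\n' "$sample" | grep -E 'recovery|resync'
```

If no recovery line appears at all, the array is not actually rebuilding.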

Still running, and mdadm --detail says it is still rebuilding on /dev/sde (the defective disk). I installed iotop; most of the time there is no I/O activity, but about every 30 seconds or so the activity goes to 99%, and I can’t catch it because it is on screen so briefly. It appears that the pid doing the I/O is 513, which has no command line (‘[loop2]’), user root, writing 15.32 KB/s. Since nothing else is running, I have to assume this is the rebuild. There doesn’t seem to be any way to isolate active processes.
cat /proc/mdstat doesn’t have a progress indicator on it.
Following Jeff’s instructions I have NOT tried to issue any commands to mdadm to re-add the new disk.

If the rebuild does not complete in another 12 hours or so, then I would use mdadm to fail and remove sde1. Once that is done, add sdb1 and allow it to rebuild to completion.

Having sdd1 and sdc1 showing as active but the array not started, one could probably start the array as is and use it while the rebuild is in progress.

One could possibly use mdadm /dev/md127 --replace /dev/sde1 to mark it for replacement, then mdadm --add /dev/md127 /dev/sdb1 to insert the replacement into the array as a spare and rebuild onto the new device. I am not certain this would work, since the array only has 3 member devices defined and adding another while 3 are still connected may not work.
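Concretely, that sequence might look like the sketch below; --replace can also be paired with --with to name the replacement explicitly. The commands are echoed rather than executed (drop the echo to run for real), and whether this works while the array is in its current half-started state is exactly the open question.

```shell
# Sketch: add the new partition first, then mark the failing member for
# replacement and point md at the newcomer.  Echoed, not executed;
# remove 'echo' to run each command for real.
echo mdadm /dev/md127 --add /dev/sdb1
echo mdadm /dev/md127 --replace /dev/sde1 --with /dev/sdb1
```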

I gave it another night and still running so I tried to fail /dev/sde:

mdadm --manage --fail /dev/sde
mdadm: /dev/sde does not appear to be an md device

mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Fri Apr  6 03:07:32 2018
     Raid Level : raid5
  Used Dev Size : 244192256 (232.88 GiB 250.05 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sat Jan 22 11:49:52 2022
          State : active, degraded, Not Started
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : wkstn04.iliffe.ca:root
           UUID : 623b22cb:827bd418:c7148f3b:d3165770
         Events : 141421

    Number   Major   Minor   RaidDevice State
       0       8       65        0      spare rebuilding   /dev/sde1
       1       8       33        1      active sync   /dev/sdc1
       3       8       49        2      active sync   /dev/sdd1

(Just in case mdadm wants only the short name I also tried ‘/sde’ with the same result).

Reading the man page I don’t think --force is a choice here, so do you have any other suggestions?

I also tried to assemble the array as suggested above:
mdadm --assemble --force /dev/md127 /dev/sdc1 /dev/sdd1

with the result:

mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping

Obviously they are busy because the rebuild is in progress, but I have no idea how to halt that, and it will immediately restart if I reboot. iotop shows activity over a period of time on about 15 processes and it isn’t obvious which one is the rebuild. I need a pid to issue a ‘kill’ command.

Uh, yeah.
Use the proper device name.
mdadm --fail /dev/md127 /dev/sde1
mdadm --remove /dev/md127 /dev/sde1
mdadm --add /dev/md127 /dev/sdb1

The array is assembled already as is shown by --detail so the assemble command cannot work.
Right now care and patience are the best options.

This is from the man page for mdadm

       If a device is given before any options, or if the first option is one of --add, --re-add, --add-spare, --fail, --remove, or --replace, then the  MANAGE
       mode is assumed.  Anything other than these will cause the Misc mode to be assumed.

I agree, patience is a virtue! That said, I have run out of ideas, even after a number of careful readings of the man page for mdadm. Mostly, the error messages don’t seem to provide useful information. I suspect the errors are because the rebuild is running, or because I am logged in from the Live DVD, but I have no way to find out. Anyhow, using your suggestion:

mdadm --fail /dev/md127 /dev/sde1
mdadm: set device faulty failed for /dev/sde1: No such device

/dev/md127:
        Version : 1.2
  Creation Time : Fri Apr  6 03:07:32 2018
     Raid Level : raid5
  Used Dev Size : 244192256 (232.88 GiB 250.05 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sat Jan 22 11:49:52 2022
          State : active, degraded, Not Started
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : wkstn04.iliffe.ca:root
           UUID : 623b22cb:827bd418:c7148f3b:d3165770
         Events : 141421

    Number   Major   Minor   RaidDevice State
       0       8       65        0      spare rebuilding   /dev/sde1
       1       8       33        1      active sync   /dev/sdc1
       3       8       49        2      active sync   /dev/sdd1

I’ve never had to use it before, but from skimming the man page, maybe you need to run mdadm /dev/md127 --action=frozen /dev/sde1 first?

I have never used that either but it does seem possible from the man page. It should halt the rebuild.

I tried this, with the result “mdadm: /dev/sde1 is no an md array”. (Their spelling; ‘no’ should be ‘not’.)

So I tried using the target as /dev/md127 and got “mdadm: Count not set action for /dev/md127 to frozen: No such file or directory”.

Most of the errors I get seem to be related to not being able to find the array; is there a config file that has gotten lost here?

Oops, I typed the command wrong, added in the first /dev/md127, and get a different error: “no action given for /dev/md127 in --misc mode. Action options must come before device names”. So I tried:
mdadm --action=frozen /dev/md127 /dev/sde1 and get:
“Count not set action for /dev/md127 to frozen: no such file or array” and “mdadm: /dev/sde1 is no an md array”.

If I delete the /dev/sde1 from the end I get back to the no such file or directory error.

I rebooted into the rescue shell instead of the Live DVD and things changed (a bit). There is still no mdadm.conf file where there used to be one, BUT I did mdadm --replace /dev/md127 /dev/sde1 with the result “Marked /dev/sde1 (device 0 in /dev/md127) for replacement”. Then I ran mdadm --detail /dev/md127 with the result “active devices 2, working devices 4, spare devices 2”. (Sorry, I’m typing again; ssh says “network unreachable”.) At the moment /dev/sde1 is being rebuilt again. Can I safely do a --remove on it?

I think it is safe to say that /dev/sde1 can be given up on at this point.

Just make sure /dev/sde1 is still /dev/sde1 since you rebooted. (i.e., run cat /proc/mdstat and verify that it shows /dev/sde1 as defunct/rebuilding before running any stop/fail commands against it.)

Well that certainly blew out a lot of print! I looked at cat /proc/mdstat and sde was still being rebuilt. So I issued mdadm --remove /dev/sde1 and got several screens full of notes, with the top visible line “BUG: unable to handle kernel NULL pointer dereference at 00…0058 IP: remove_and_add_spares+0x145/0x3b0”, and the screen ends with “end trace … killed”. I can still run mdadm, as the system seems to have recovered, and it still shows /dev/sde1 as being rebuilt.

Next idea would be to physically remove the defective drive, reboot, and see if it takes the spare (/dev/sdb1). Or would this be a worse idea?

To assist you with this answer, currently mdadm --detail shows that there are 4 devices as follows:

   0       8       65        0      spare rebuilding   /dev/sde1
   1       8       33        1      active sync        /dev/sdc1
   3       8       49        2      active sync        /dev/sdd1

   4       8       17        -      spare              /dev/sdb1

and cat /proc/mdstat says:
Personalities : [raid6] [raid5] [raid4]
md127 : sde1[0] sdc1[1] sdd1[3] sdb1[4]
      2441645455 blocks super 1.2
unused devices: <none>

As a last resort, yes.

But I’d still try to get the software to recognize it as a failed drive first with mdadm /dev/md127 --action=frozen /dev/sde1 and mdadm /dev/md127 --fail /dev/sde1.

Most importantly, make sure you have the correct drive when you remove it by running smartctl -i /dev/sde and making note of the serial number before you shut the system down.
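To illustrate pulling the serial number out of the smartctl -i header (the model and serial below are made-up sample values; run smartctl -i /dev/sde on the real box to get the actual ones):

```shell
# Sample 'smartctl -i' header (hypothetical values, for illustration only):
sample='Device Model:     WDC WD2500AAKX-001CA0
Serial Number:    WD-WCAYV1234567
Firmware Version: 15.01H15'
# Extract just the serial number field:
printf '%s\n' "$sample" | awk -F': *' '/Serial Number/ {print $2}'
```

Matching that serial against the label printed on the physical drive (or against ls -l /dev/disk/by-id) is the safest way to be sure you pull the right disk.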