CoreOS - Raid1 /var/lib/data won't boot after reboot

Hello,
I am trying to get CoreOS on bare metal with a 10TB raid1 as per this

It seems to install fine, and I can reboot successfully after installation.
My issue is that when I do another restart, the machine does not come back up. :anger:

I thought perhaps the raid1 was the problem (hard to get error messages and they go off the screen) so I have waited for resync to complete (cat /proc/mdstat) … but waiting makes no difference.

The boot process hangs and eventually asks to “press Control+D or Enter for maintenance”

When in maintenance, /dev/md/data is not present… the disks seem OK though.

I am a bit lost and not sure what/how to check… I wonder if anyone has any pointers for me?

Post the output of cat /proc/mdstat

With the result of ‘cat /proc/mdstat’ you can see the device name of that raid array, so also post the output of dmesg | grep mdXXX, where you replace ‘mdXXX’ with the actual name of that device.

Then also add the output of mdadm --detail --scan -vv
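
For example, if /proc/mdstat shows the array as md127 (md127 is just a placeholder here, use whatever name actually appears), the sequence would look roughly like:

cat /proc/mdstat
dmesg | grep md127
sudo mdadm --detail --scan -vv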

thanks for the response.

The CoreOS install I am trying to attain includes a 1TB HDD, 2x2TB (raid0) and the troublesome 2x10TB (raid1).

Using the commands you mention, I can not see the raid1 array at all… but the disks are present when I do lsblk



FYI: this is my butane yaml file: coreos-home-server/home-server.yml at main · bugthing/coreos-home-server · GitHub

When I tried to assemble the missing raid1… it fails:

So I then tried to create the missing raid1:
mdadm --create --verbose /dev/md126 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
… which works :party:

Now the raid1 exists:

Is there a way to get the boot process to now “continue”… because when I reboot, it still fails in the same way.

Hi, I read your GitHub YAML file. In the filesystems → device part, some entries look like they are using your raid config. But in your systemd section I can’t see a *.mount unit to mount those devices, as in the example from CoreOS: Create a RAID-enabled Data Volume, below (copy-pasted from the example):

[Mount]
What=/dev/md/data
Where=/var/lib/data
Type=ext4

[Install]
WantedBy=local-fs.target

Thanks a lot for taking the time to have a look.
The Butane YAML here means that it will generate the mount unit for me… which it appears to do, as the Ignition file contains:

# Generated by Butane
[Unit]
Before=local-fs.target
Requires=systemd-fsck@dev-md-data.service
After=systemd-fsck@dev-md-data.service

[Mount]
Where=/var/lib/data
What=/dev/md/data
Type=ext4

[Install]
RequiredBy=local-fs.target
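
If it helps, the unit can also be inspected on the host from the emergency shell (systemd derives the unit name var-lib-data.mount from the Where= path), e.g.:

systemctl cat var-lib-data.mount
systemctl status var-lib-data.mount dev-md-data.device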

I see a bit of confusion in what you posted.
The very first command cat /proc/mdstat shows the device as md127.
At least 2 of your commands use md126 instead: both the assemble and the create commands, plus the ‘dmesg | grep md126’. Thus the device name you are using does not match what the system assigned as the actual device.

Additionally, the output of mdadm --detail --scan -vv shows md127 to consist of /dev/sde1 and /dev/sdf1, yet your assemble and create commands use /dev/sdc1 and /dev/sdd1

When you did the create step it warned you that /dev/sdd1 seemed part of an array but that /dev/sdc1 was not.
The final command you showed with ‘cat /proc/mdstat’ shows 2 arrays: md126, which you newly created, and md127, which previously existed. The new array (md126) shows it needs to wait 134 minutes (more than 2 hours) to complete the resync process.

Lastly, just creating the array does not make it usable. Once the array is created it needs to, at a minimum, be formatted to a file system before it can be mounted or used. It is, after all, a raw device just as a new drive or partition.
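
As a rough sketch, assuming the array comes up as /dev/md127 and you want ext4 mounted at /var/lib/data (adjust the names to whatever /proc/mdstat actually reports):

sudo mkfs.ext4 -L DATA /dev/md127
sudo mkdir -p /var/lib/data
sudo mount /dev/md127 /var/lib/data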

thanks for taking a deeper look.
I have 2 raid arrays, 1 raid0 and 1 raid1… the raid1 is the one that is causing problems (sorry I should have mentioned that up front)
raid0 is made of /dev/sde1 /dev/sdf1
raid1 is made of /dev/sdc1 /dev/sdd1

After your pointers to those commands I can diagnose the issue a bit more… I am seeing the raid1 array just disappear every time I reboot.

In the maintenance terminal I am forced into, I only ever see details about my raid0 array… the raid1 just disappears each boot… even when I create and format the raid1 myself like so

mdadm --create --verbose /dev/md/data --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mkfs.ext4 -L DATA /dev/md/data
reboot

when I arrive back in maintenance terminal, there is only a raid0

I can see in systemd that the dev-md-data.device is timing out… which I guess makes sense if the raid1 array is just disappearing.

Any clues how I would diagnose the disappearing raid1 issue?.. where/how does CoreOS store its data about raid arrays?
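
As far as I understand, mdadm keeps the array metadata in superblocks on the member partitions themselves, so I have been inspecting those directly with something like:

sudo mdadm --examine /dev/sdc1 /dev/sdd1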

my issue sounds the same as this

If you give it a name such as you did (/dev/md/data), then as you can see in the pic you posted, the actual raid device created was md126.

It appears you are using the entire disk, so my approach to this would be slightly different.

I would first make certain there is no partition table or raid array data on those drives: sudo dd if=/dev/zero of=/dev/sdc bs=1M count=1, and repeat the same for /dev/sdd. (You can do the same for the partitions /dev/sdc1 & /dev/sdd1 if you choose to continue to use the partitions.)
Then I would not create a partition for the array, but would instead use the entire disk such as mdadm --create --verbose /dev/md/data --level=1 --raid-devices=2 /dev/sdc /dev/sdd
I then would partition the device created, sudo gdisk /dev/md126 and create an appropriate size partition for my needs. Here you can use the entire array size, or less as you choose.
Finally I would format the partition and it should work.
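
Putting those steps together, the whole sequence would look roughly like this (double-check the device names against lsblk before wiping anything; md126 is just taken from your earlier output, the kernel may assign a different number):

sudo dd if=/dev/zero of=/dev/sdc bs=1M count=1
sudo dd if=/dev/zero of=/dev/sdd bs=1M count=1
sudo mdadm --create --verbose /dev/md/data --level=1 --raid-devices=2 /dev/sdc /dev/sdd
sudo gdisk /dev/md126                  # create one partition covering the array
sudo mkfs.ext4 -L DATA /dev/md126p1    # partitions on an md device show up as md126p1, md126p2, ...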

What I did on mine (I am using raid 5 with LVM) was exactly the above, but instead of directly creating the partition on the array, I added md127 as the physical volume to a new VG (which created the LVM partition), then created my /home LV on that VG, formatted the LV (partition) and was good to go.

Since you are doing all this in the emergency shell, I am wondering if the array actually has its structure saved anywhere. What about booting to the live USB image, creating the array there, then rebooting?

Lastly, it seems you may be relying on udev to mount the device. I would instead use an entry in /etc/fstab to mount it during boot rather than relying on udev to always do what I want (it sometimes does something different than expected).
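
A matching /etc/fstab entry might look something like this (using the filesystem label as an example; a UUID from blkid works just as well):

LABEL=DATA  /var/lib/data  ext4  defaults  0  2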


Hi again, I’m sorry, I forgot to mention that I am just giving my thoughts without any particular experience with what you are doing.

Btw, since it generates the *.mount with RequiredBy instead of WantedBy, I believe it depends on which required services are defined in the unit. Maybe you also need to check the systemctl status of systemd-fsck@dev-md-data.service after the system fails to boot.

My guess is that systemd-fsck@dev-md-data.service failed.

I checked on my regular workstation installation: the systemd-fsck@<instance>.service units are only used for checking the EFI and /boot partitions. If you need to run fsck, maybe you can create the service elsewhere without making it a dependency of mounting your disk.
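
For example, from the emergency shell after a failed boot, something like:

systemctl status systemd-fsck@dev-md-data.service
journalctl -b -u systemd-fsck@dev-md-data.service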

Thanks for pointing that out. I have switched to using a custom .mount unit instead, so a failed raid mount does not put me into the maintenance terminal… much better :grin:
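
Roughly the shape of the unit I mean (a simplified sketch, not my exact config; the point is installing it with WantedBy= rather than RequiredBy=, so a failed mount no longer fails local-fs.target):

# /etc/systemd/system/var-lib-data.mount (sketch)
[Unit]
After=dev-md-data.device

[Mount]
What=/dev/md/data
Where=/var/lib/data
Type=ext4

[Install]
WantedBy=local-fs.target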

… I continue to battle the raid1 array issue…

Thanks for the tip on how to clear the partition table. I have cleared and repartitioned the disks.
… no matter what I try (manual creation of the raid1 setup or using the ignition file), the raid1 disappears after the 2nd boot.

This feels SO close…

What seems strange is that when I try to mount the partitions that the coreos installer creates (as per the ignition file), I get the following:

mount /dev/sdc1 /mnt/
mount: /var/mnt: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error.
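
To see what is actually on those partitions at that point, something like this helps (wipefs with no options only lists signatures, it does not erase anything):

lsblk -f
sudo blkid /dev/sdc1 /dev/sdd1
sudo wipefs /dev/sdc1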

Hi again. In the past I saw a colleague setting up raid 1 on (if I recall correctly) a Dell Optiplex with 2 WD drives from the BIOS. After that he created a partition (creating a filesystem, not sure if NTFS or FAT or something else) to install Windows XP SP2.

Maybe it’s different with Linux, where we can create the filesystem with mkfs before creating the raid 1. But I just want to make sure you already did that, since the message is “wrong fs type”.

Update:

I read your yaml file again and it already mentions ext4. My bad, I’m just too curious about something that is new to me :sweat_smile:.

The machine’s raid setup is perfect after the 1st boot after installation…

$ sudo mdadm --detail --scan -vv
/dev/md/media:
           Version : 1.2
     Creation Time : Wed Feb  2 12:41:45 2022
        Raid Level : raid0
        Array Size : 2930008064 (2.73 TiB 3.00 TB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Wed Feb  2 12:41:45 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : any:media
              UUID : fec53b15:57d05e3f:7ae6f370:22193e39
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       81        1      active sync   /dev/sdf1
/dev/md/backup:
           Version : 1.2
     Creation Time : Wed Feb  2 12:41:45 2022
        Raid Level : raid1
        Array Size : 9766303680 (9.10 TiB 10.00 TB)
     Used Dev Size : 9766303680 (9.10 TiB 10.00 TB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Feb  2 12:46:49 2022
             State : clean, resyncing
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

     Resync Status : 1% complete

              Name : any:backup
              UUID : 16a14f3e:db682bc8:4dd60bd9:0c6c93c6
            Events : 244

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1

… but after the 2nd reboot, the raid1 is gone and the partition table on the disk seems to have changed…
not sure what to try next TBH…
I might try to set up another OS in a similar configuration to double-check it’s not the disks or something else preventing this config.
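
One thing I can do in the meantime is capture the disk state before and after the reboot so I can diff it (sgdisk ships with gdisk; the file paths here are just examples):

sudo sgdisk -p /dev/sdc > /var/tmp/sdc-table-before.txt
sudo mdadm --examine /dev/sdc1 > /var/tmp/sdc1-super-before.txt
# ... reboot, re-run with different file names, then diff the before/after files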

Are the arrays above already formatted with filesystems?

Would you like to read this first: CoreOS: Partition Reuse Semantics? I’m not sure, but I think it’s related.

Update:

The above is not relevant, since Ignition succeeds and the raid arrays are also created successfully.

and …

Seems to indicate the raid1 array was in a resync operation; I assume it finished before the reboot was attempted.
I’ll take a look at your ignition file and see if I see anything obvious. I’ve done a couple of bare metal installs of FCOS, just not recently.

- path: /var/lib/backup
  device: /dev/md/backup
  format: ext4
  label: BACKUP
  wipe_filesystem: false
  with_mount_unit: false

Wouldn’t with_mount_unit: false mean don’t mount the device?

thanks for taking a look

with_mount_unit: false means “don’t create the mount unit”… I do this myself later on in the config.

I gave the server install a go with Arch Linux… I installed it and manually set up the disks in the same way I have been doing in CoreOS…

mdadm --create --verbose /dev/md/backup --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mkfs.ext4 -L BACKUP /dev/md/backup
mount /dev/md/backup /backup # added in fstab

To my surprise, when I rebooted, the raid1 array had also disappeared!! :astonished:

So I tried creating the array using the whole device (not a partition)… i.e. /dev/sdc instead of /dev/sdc1:

mdadm --create --verbose /dev/md/backup --level=1 --raid-devices=2 /dev/sdc /dev/sdd
gdisk /dev/md/backup # create 1 partition
mkfs.ext4 -L BACKUP /dev/md/backup1
mount /dev/md/backup1 /backup # added in fstab

When I rebooted, the raid array was still present!… so I took it apart:

mdadm --stop /dev/md/backup
mdadm --zero-superblock /dev/sdc
mdadm --zero-superblock /dev/sdd

I think the above zeroing must be what solved my problem… I think there was some stale data in the wrong place on the disks that was confusing the software raid about how to build the array…

… built it again using partitions (as CoreOS wants to do)… and after a reboot the raid remained!

I straight away gave the CoreOS install another try… and this time both raid arrays are present after multiple reboots! :partying_face:
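
For anyone who hits the same thing, a quick check after each reboot is something like:

cat /proc/mdstat
sudo mdadm --detail /dev/md/media /dev/md/backup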
