Hello,
I am trying to get CoreOS on bare metal with a 10TB raid1 as per this
It seems to install, and I can reboot successfully after installation.
My issue is that when I do another restart the machine does not come back up.
I thought perhaps the raid1 was the problem (hard to get error messages and they go off the screen) so I have waited for resync to complete (cat /proc/mdstat) … but waiting makes no difference.
The boot process hangs and eventually asks to “press Control+D or Enter for maintenance”
When in maintenance mode, /dev/md/data is not present… the disks seem OK though
I am a bit lost, not sure what/how to check … wondering if anyone has any pointers for me?
With the result of ‘cat /proc/mdstat’ you can see the device name of that raid array, so also post the output of dmesg | grep mdXXX, where you replace ‘mdXXX’ with the actual name of that device.
Then also add the output of mdadm --detail --scan -vv
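In other words, run something like this (a sketch; replace md127 with whatever name your system actually assigned):

```bash
cat /proc/mdstat                  # note the mdXXX name the kernel assigned to the array
dmesg | grep md127                # kernel messages for that array (substitute the real name)
sudo mdadm --detail --scan -vv    # full details for every assembled array
```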
Hi, I read your GitHub YAML file; in the filesystems → device part, some entries look like they are using your raid config. But in your systemd section, I can't see the service to mount those devices (a *.mount unit), as in the example given by CoreOS: Create a RAID-enabled Data Volume, as below (copy-pasted from the example):
thanks a lot for taking the time to have a look.
The butane yaml here means that it will generate the mount unit for me … which it appears to do, as the ignition file contains
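For context, here is a minimal sketch of the relevant Butane part (simplified; my real file also sets up the partitions, but with_mount_unit: true is what makes Butane emit the mount unit into the ignition output):

```bash
# Sketch only -- simplified from my config, device names/paths may differ on your machine
cat <<'EOF' > raid.bu
variant: fcos
version: 1.4.0
storage:
  raid:
    - name: data
      level: raid1
      devices:
        - /dev/sdc1
        - /dev/sdd1
  filesystems:
    - device: /dev/md/data
      path: /var/mnt/data
      format: ext4
      with_mount_unit: true
EOF
butane --pretty --strict < raid.bu > raid.ign
```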
I see a bit of confusion in what you posted.
The very first command cat /proc/mdstat shows the device as md127.
At least 2 of your commands use md126 instead: both the assemble and the create commands, plus the `dmesg | grep md126`. Thus the device name you are using does not match what the system assigned as the actual device.
Additionally, the output of mdadm --detail --scan -vv shows md127 to consist of /dev/sde1 and /dev/sdf1, yet your assemble and create commands use /dev/sdc1 and /dev/sdd1.
When you did the create step it warned you that /dev/sdd1 seemed part of an array but that /dev/sdc1 was not.
The final command you showed with ‘cat /proc/mdstat’ shows 2 arrays: md126, which you newly created, and md127, which previously existed. The new array (md126) shows it needs to wait 134 minutes (more than 2 hours) to complete the resync process.
Lastly, just creating the array does not make it usable. Once the array is created it needs to, at a minimum, be formatted to a file system before it can be mounted or used. It is, after all, a raw device just as a new drive or partition.
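As a rough sketch of the order of operations, using the names from your own output (double-check them on your system first):

```bash
cat /proc/mdstat                # confirm the name the kernel actually assigned, e.g. md127
sudo mdadm --detail /dev/md127  # confirm which member partitions it contains
sudo mkfs.ext4 /dev/md127       # the raw array still needs a filesystem...
sudo mount /dev/md127 /mnt      # ...before it can be mounted
```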
thanks for taking a deeper look.
I have 2 raid arrays, 1 raid0 and 1 raid1… the raid1 is the one that is causing problems (sorry I should have mentioned that up front)
raid0 is made of /dev/sde1 /dev/sdf1
raid1 is made of /dev/sdc1 /dev/sdd1
After your pointers to commands I can diagnose the issue a bit more… I am seeing the raid1 array just disappear every time I reboot.
In the maintenance terminal I am forced into, I only ever see details about my raid0 array… the raid1 just disappears each boot… even when I create and format the raid1 myself like so
If you give it a name as you did (/dev/md/data), then, as you can see in the pic you posted, the actual raid device created was md126.
It appears you are using the entire disk, so my approach to this would be slightly different.
I would first make certain there is no partition table or raid array data on those drives: sudo dd if=/dev/zero of=/dev/sdc bs=1M count=1, and repeat the same for /dev/sdd. (You can do the same for the partitions /dev/sdc1 & /dev/sdd1 if you choose to continue using the partitions.)
Then I would not create a partition for the array, but would instead use the entire disk such as mdadm --create --verbose /dev/md/data --level=1 --raid-devices=2 /dev/sdc /dev/sdd
I would then partition the device created, sudo gdisk /dev/md126, and create an appropriately sized partition for my needs. Here you can use the entire array size, or less as you choose.
Finally I would format the partition and it should work.
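Put together, that sequence would look roughly like this (a sketch; verify the device letters with lsblk before running anything destructive):

```bash
sudo dd if=/dev/zero of=/dev/sdc bs=1M count=1   # wipe any stale partition table / raid metadata
sudo dd if=/dev/zero of=/dev/sdd bs=1M count=1
sudo mdadm --create --verbose /dev/md/data --level=1 --raid-devices=2 /dev/sdc /dev/sdd
sudo gdisk /dev/md126            # use whatever mdXXX the kernel assigned; create one partition
sudo mkfs.ext4 /dev/md126p1      # format the new partition on the array
```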
What I did on mine (I am using raid 5 with LVM) was exactly the above, but instead of directly creating the partition on the array, I added md127 as the physical volume to a new VG (which created the LVM partition), then created my /home LV on that VG, formatted the LV (partition) and was good to go.
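In command form that was roughly the following (the VG/LV names here are placeholders, not my actual ones):

```bash
sudo pvcreate /dev/md127                  # the array becomes an LVM physical volume
sudo vgcreate vg_data /dev/md127          # new volume group on top of it
sudo lvcreate -l 100%FREE -n home vg_data # logical volume spanning the whole VG
sudo mkfs.ext4 /dev/vg_data/home          # format the LV, then mount it as /home
```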
Since you are doing all this in the emergency shell, I am wondering if the array actually has its structure saved anywhere. What about booting to the live USB image, creating the array there, then rebooting?
Lastly, it seems you may be relying on udev to mount the device. I would instead use an entry in /etc/fstab to mount it during boot, rather than relying on udev to always do what I want (it sometimes does something different than expected).
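For example (a sketch; the UUID is a placeholder you would get from blkid, and nofail keeps a failed mount from dropping you into maintenance):

```bash
sudo blkid /dev/md/data    # find the filesystem UUID
# then add a line like this to /etc/fstab:
# UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /var/mnt/data  ext4  defaults,nofail  0 2
```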
Hi again, I’m sorry, I forgot to mention that I am just giving my thoughts without any particular experience with what you are doing.
Btw, since it generates the *.mount with RequiredBy instead of WantedBy, I believe it depends on which required service is defined in the unit. Maybe you also need to check the systemctl status of systemd-fsck@dev-md-data.service after the system fails to boot.
My guess is that systemd-fsck@dev-md-data.service failed.
I checked on my regular workstation installation; the systemd-fsck@<any-slice>.service instances are only used for checking the EFI and /boot partitions. If you need to run fsck, maybe you can create the service elsewhere, without making it a dependency that has to run before mounting your disk.
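Something along these lines from the emergency shell should show what actually failed (just a sketch):

```bash
systemctl status systemd-fsck@dev-md-data.service
journalctl -b -u systemd-fsck@dev-md-data.service          # full log for this boot
systemctl list-dependencies --reverse dev-md-data.device   # what pulls the device in
```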
thanks for pointing that out. I have switched to using a custom .mount unit instead, so a failed raid mount does not put me into the maintenance terminal… much better
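Roughly what the unit looks like (a sketch; the unit file name has to match the Where= path, and the exact names are from my setup):

```bash
sudo tee /etc/systemd/system/var-mnt-data.mount <<'EOF'
[Unit]
Description=Mount raid1 data array
After=dev-md-data.device

[Mount]
What=/dev/md/data
Where=/var/mnt/data
Type=ext4

[Install]
# WantedBy instead of RequiredBy, so a failure doesn't force emergency mode
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now var-mnt-data.mount
```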
thanks for the tip on how to clear the partition table. I have cleared and repartitioned the disks.
… no matter what I try (manual creation of the raid1 setup or using the ignition file), the raid1 disappears after the 2nd boot
This feels SO close…
What seems strange is that when I try to mount the partitions that the coreos installer creates (as per the ignition file) I get the following
```
mount /dev/sdc1 /mnt/
mount: /var/mnt: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error.
```
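If it helps anyone diagnose, these are the kinds of non-destructive checks that show what is actually on the partition (a sketch):

```bash
sudo blkid /dev/sdc1            # what filesystem/raid signature does the partition report?
sudo wipefs /dev/sdc1           # with no flags this only lists signatures, it erases nothing
sudo mdadm --examine /dev/sdc1  # is the md superblock present on the member?
```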
Hi again. In the past I saw a colleague setting up raid 1 on (if I recall correctly) a Dell Optiplex with 2 WD drives from the BIOS. After that he created a partition (creating a filesystem, not sure whether NTFS or FAT or something else) to install Windows XP SP2.
Maybe it's different with Linux, where we can create the filesystem with mkfs before creating the raid 1. But I just want to make sure you already did that, since the message is about a wrong fs type.
Update:
I read your YAML file again and it already mentions ext4. My bad, I'm just too curious about something new to me.
the machine's raid setup is perfect on the 1st boot after installation…
```
$ sudo mdadm --detail --scan -vv
/dev/md/media:
           Version : 1.2
     Creation Time : Wed Feb 2 12:41:45 2022
        Raid Level : raid0
        Array Size : 2930008064 (2.73 TiB 3.00 TB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Wed Feb 2 12:41:45 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : any:media
              UUID : fec53b15:57d05e3f:7ae6f370:22193e39
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       81        1      active sync   /dev/sdf1
/dev/md/backup:
           Version : 1.2
     Creation Time : Wed Feb 2 12:41:45 2022
        Raid Level : raid1
        Array Size : 9766303680 (9.10 TiB 10.00 TB)
     Used Dev Size : 9766303680 (9.10 TiB 10.00 TB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Feb 2 12:46:49 2022
             State : clean, resyncing
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

     Resync Status : 1% complete

              Name : any:backup
              UUID : 16a14f3e:db682bc8:4dd60bd9:0c6c93c6
            Events : 244

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
```
… but after the 2nd reboot, the raid1 is gone and the partition table on the disk seems to have changed…
not sure what to try next TBH…
I might try setting up another OS in a similar configuration to double-check it's not the disks or something else preventing this config
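Commands along these lines should show whether the partition table and md superblocks actually survive the reboot (a sketch; adjust the device letters):

```bash
sudo sgdisk --backup=/root/sdc-before.gpt /dev/sdc   # save the GPT right after first boot
sudo sgdisk --print /dev/sdc                         # print it for comparison
# ...reboot, then run --print again and compare against the saved state
```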
Seems to indicate the raid1 array was in a resync operation; I assume it finished before the reboot was attempted.
I’ll take a look at your ignition file and see if I see anything obvious. I’ve done a couple of bare metal installs of FCOS, just not recently.
I think the zeroing described above must be what solved my problem… I think there was some data in the wrong place on the disk that was confusing the software raid about how to build the array of disks…
… I built it again using partitions (as CoreOS wants to do)… and after a reboot the raid remained!
I straight away gave the CoreOS install another try… and this time both RAID arrays are present after multiple reboots!