CoreOS pxeboot: coreos-installer can't format /dev/vda or /dev/sda

After much experimentation, I have CoreOS mostly working via tftpboot. I added a path to the initrd file and the root filesystem image. I’m able to boot into an emergency environment.

When I try installing CoreOS to a libvirt VM using the below pxelinux entry, I get an error message from the coreos-installer utility.

LABEL coreos-default-vda
MENU LABEL Install Fedora CoreOS Default (vda)
KERNEL images/fedora/coreos/fedora-coreos-33.20201201.3.0-live-kernel-x86_64
APPEND initrd=images/fedora/coreos/fedora-coreos-33.20201201.3.0-live-initramfs.x86_64.img,images/fedora/coreos/fedora-coreos-33.20201201.3.0-live-rootfs.x86_64.img coreos.inst=yes coreos.inst.install_dev=/dev/vda coreos.inst.ignition_url=http://192.168.4.24/pub/ignition/coreos_default.ign
IPAPPEND 2

The coreos-installer cannot read /dev/vda. It reports “Error saving partitions from /dev/vda”, caused by a “generic I/O error” with error code 22.
I’ve booted a RHEL 8 rescue image using the same pxeboot system. /dev/vda is still a raw partition.

Any idea what is going on? I’ve thought that maybe the target platform “metal” was to blame but that isn’t the case.

Yes I know I can use the qcow image. I used it on my Fedora 33 workstation to take the training.

I’ve added this section to my ignition file:
storage:
  disks:
    - device: /dev/vda
      wipe_table: true

I’m out of ideas. I’m hoping to use what I learn here to build an OKD4 cluster on bare metal.

Can you copy the exact error message from coreos-installer? Also, do you get the same error if you run it manually from the console yourself?

I think that I’ve tracked it down to the use of libvirt. I can’t get CoreOS to install on a /dev/vda or a /dev/sda.
The error message is:

I’ve also tried booting from an ISO image and entering the coreos.inst.image_url, coreos.inst.ignition_url, and coreos.inst.install_dev entries.
I had tried installing OKD with the fedora-coreos-32.20200715.3.0-metal.x86_64.raw.xz image as directed in the instructions, but there were so many error messages that I couldn’t tell whether the CoreOS part of the installation was successful.

I’m hoping to have the CoreOS installation piece down before trying OKD or OCP again.

I should also add that I can boot into emergency mode and partition /dev/sda or /dev/vda via fdisk. I assume this means that CoreOS has the necessary tooling, and that the problem is in the coreos-installer utility.

What are everyone’s thoughts?

Hmm, OK, so it looks like the GPT library we use is failing to query the disk to see if there are partitions it should back up before starting the install. Not sure why it’s getting EINVAL though.

How are you preparing the virtual block device and connecting it to the VM? Is it blank?
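One way to answer the "is it blank?" question from inside the live system is to compare the start of the device against zeros. This is only a rough sketch: `is_blank` is a hypothetical helper name, it assumes GNU `cmp`, and it inspects only the beginning of the disk (a GPT also keeps a backup header at the end of the disk, which this does not look at).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of coreos-installer): succeeds when the first
# N MiB of a device or file contain only zero bytes.
is_blank() {
    local target=$1 mib=${2:-1}
    local bytes=$((mib * 1024 * 1024))
    # -n limits the comparison to the first $bytes bytes; /dev/zero supplies
    # an endless stream of zeros to compare against. -s suppresses output.
    cmp -s -n "$bytes" "$target" /dev/zero
}

# Example (run as root so the device is readable):
#   is_blank /dev/vda 1 && echo "start of disk is empty" || echo "data found"
```

If this reports data on a disk that is supposed to be brand new, that is worth mentioning alongside the installer error.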

One thing we could do is have a switch like --skip-part-save or something which skips trying to read a partition table and just directly writes the image to disk. Though ideally we’d know why it’s failing to (not) find a GPT table so we could handle that more gracefully (either in the GPT library itself, or directly in coreos-installer).

A workaround might be to run wipefs -a -f $dev before running coreos-installer (this requires installing interactively, or using an Ignition config which runs this command before the install starts).
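For the interactive case, the workaround could be scripted roughly as follows. This is a sketch, not a tested procedure: `run` and `wipe_and_install` are made-up helper names, the `DRY_RUN` switch is an addition for previewing the commands, and the script is assumed to run as root in the live environment.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: with DRY_RUN=1 it prints each command instead of
# executing it, so the sequence can be reviewed before touching the disk.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Wipe all filesystem/partition-table signatures, then install to the disk.
wipe_and_install() {
    local dev=$1
    run wipefs -a -f "$dev" &&
    run coreos-installer install "$dev"
}

# Preview only:
#   DRY_RUN=1 wipe_and_install /dev/vda
# Actually run (destroys data on the device):
#   wipe_and_install /dev/vda
```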

Note BTW that you don’t need coreos.inst.image_url anymore when using kargs.

The partition /dev/vda (or /dev/sda) is a raw device. This is a brand new libvirt VM. There isn’t a GPT or MBR to wipe.
I tried an older v32 ISO and disk image with the same result. I also tried the wipefs command, then ran coreos-installer manually. Same result.
If it helps, the hypervisor OS is RHEL 8.3, but this is libvirt. Not exactly a new platform.

I wouldn’t expect wipefs to make a difference. That EINVAL shouldn’t be happening, and I think the key question is where it’s coming from. Could you get an strace?

sudo rpm-ostree install strace
sudo rpm-ostree ex livefs --i-like-danger
sudo strace coreos-installer install /dev/vda

(This is completely tangential but until the newer apply-live makes it out, for strace I’d recommend instead rpm-ostree usroverlay && rpm -Uvh https://kojipkgs.fedoraproject.org//packages/strace/5.10/1.fc33/x86_64/strace-5.10-1.fc33.x86_64.rpm or so)

Sadly, I can’t do any of that on a live file system. I get a “remounting /sysroot permission denied” error message.
I also tried creating /dev/sda1 and installing coreos that way, which of course also fails.
“Error checking for exclusive access to /dev/sda1
Caused by: couldn’t find /sys/block directory for partition /dev/sda1 of /dev/sda1”

I rebooted after creating the partition and tried it again with the same result.

Note: My libvirt partitions are coming from a local LVM logical volume, but that shouldn’t make any difference, should it? If I’m able to install every other flavor of Linux, what makes CoreOS special?

coreos-installer (and Fedora CoreOS) can’t install to a partition; it needs an entire disk. It should be possible to install, though; it seems like this might be a bug in the installer.
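The "/sys/block directory" error above hints at how to tell the two apart: whole disks have an entry directly under /sys/block, while partitions only appear nested under their parent disk. A quick pre-check along those lines might look like this. `is_whole_disk` and the `SYSFS_ROOT` override are illustrative names, not anything coreos-installer provides, and the check assumes a plain device path such as /dev/vda rather than a symlink or a /dev/mapper name.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: succeeds when the device has its own directory under
# /sys/block, which is the case for whole disks but not for partitions.
# SYSFS_ROOT is only an override hook so the function can be tried elsewhere.
is_whole_disk() {
    local dev=$1 sysfs=${SYSFS_ROOT:-/sys}
    [ -d "$sysfs/block/$(basename "$dev")" ]
}

# Example:
#   is_whole_disk /dev/sda1 || echo "/dev/sda1 is not a whole disk"
```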

It might be relevant that the partitions are backed by LVM. I’ve just tried that locally and was able to install successfully, but my setup might be different from yours.

Okay, fallback plan for strace. Within the live system, you can manually extract the strace rpm:

curl https://kojipkgs.fedoraproject.org/packages/strace/5.10/1.fc33/x86_64/strace-5.10-1.fc33.x86_64.rpm > strace.rpm
rpm2cpio strace.rpm | cpio -i --make-directories
# note no leading slash
sudo usr/bin/strace coreos-installer install /dev/vda
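Since the question is where the EINVAL comes from, it may be easier to write the trace to a file and pull out only the failing calls. A sketch, assuming the strace binary was extracted into the current directory as above; `trace_install`, `einval_calls`, and the log path are made-up names for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for narrowing down the failing syscall.

# Run the installer under the extracted strace, following child processes (-f)
# and writing the trace to a log file (-o) instead of the terminal.
trace_install() {
    local dev=$1 log=${2:-/tmp/coreos-installer.strace}
    sudo usr/bin/strace -f -o "$log" coreos-installer install "$dev"
}

# Print, with line numbers, only the calls that returned -1 EINVAL.
einval_calls() {
    grep -n -- '= -1 EINVAL' "$1"
}

# Example:
#   trace_install /dev/vda
#   einval_calls /tmp/coreos-installer.strace
```

The line numbers make it easy to open the full log and read the few calls just before the failure, which usually show which file descriptor and device were involved.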

(I filed ostreedev/ostree issue #2262, "Don't try to rewrite /sysroot rw for `ostree admin unlock`", for the rpm-ostree usroverlay issue.)

Nice. That will help with running an strace. Normally, isn’t a live OS running in RAM?

It is running in RAM, but the FCOS live system mounts the root filesystem read-only (from a squashfs) and unlocking it requires applying a writable overlay.

That’s what I was thinking.
Still, I’m wondering if coreos-installer is incorrectly identifying the raw device type. I’ve confirmed that this works under vSphere with someone else.

I’m hoping the strace will shed some light on that, if you get a chance to run it.

Meanwhile, this bug showed up, and a broad fix for that issue might also fix this one. But it’d still be good to understand the details of what’s happening here.