Fedora CoreOS installation works, "Preparing /etc: Invalid byte sequence in conversion input" on rpm-ostree install afterwards

I have one server where something after installation is really wrong:

rpm-ostree install wget
Checking out tree 83cb950... done
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates fedora updates-archive
Importing rpm-md... done
rpm-md repo 'fedora-cisco-openh264' (cached); generated: 2021-09-21T18:07:30Z solvables: 4
rpm-md repo 'fedora-modular' (cached); generated: 2021-10-26T05:08:36Z solvables: 1283
rpm-md repo 'updates-modular' (cached); generated: 2022-04-29T06:02:15Z solvables: 1403
rpm-md repo 'updates' (cached); generated: 2022-05-02T07:00:22Z solvables: 30040
rpm-md repo 'fedora' (cached); generated: 2021-10-26T05:31:27Z solvables: 65732
rpm-md repo 'updates-archive' (cached); generated: 2022-05-02T07:36:08Z solvables: 40545
Resolving dependencies... done
Checking out packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
error: Preparing /etc: Invalid byte sequence in conversion input

the strange thing is, installing the same server without

variant: fcos
version: 1.4.0
boot_device:
  mirror:
    devices:
      - /dev/nvme0n1
      - /dev/nvme1n1

works normally.
It’s really only adding the raid, the breaks “rpm-ostree install”

Any ideas where to look at? Any Idea how to find the broken file (if there is one)

This is puzzling. It sounds like possibly some kind of data corruption, but I don’t have any good suggestions about where to start looking.

Alternatively, this may be the symptom of some configuration files with a not UTF-8 name.
If that’s the case, perhaps ostree admin config-diff may be able to catch it with more details.

I just tried this on latest testing-devel and wasn’t able to reproduce the issue. What FCOS version are you on?

As @lucab said maybe sudo ostree admin config-diff can show us a file that shouldn’t be there.

Can you reproduce this reliably on other hardware? Maybe an issue with the disks?

This is reliably reproducible on that host ( a Dedicated Root Server Hosting - Hetzner Online GmbH ). All other instance types I tried so far, did not have that issue.
I wipefs -a /dev/nvme0n1 /dev/nvme1n1 between tries. Maybe I should order another host of that type to rule out hardware defects.
As said, single disc install on that host works as expected.

Yeah. That might be an easy way to further the investigation.

Installing on new instance:
Even more broken:

Fedora CoreOS 35.20220410.3.1
(...)
[koelle@xxx ~]$ vim
-bash: vim: command not found
[koelle@xx ~]$ sudo su
root@xx /var/home/koelle # htop
Error opening terminal: xterm-256color.
root@xxx /var/home/koelle # vim
vim: error while loading shared libraries: /lib64/libtdb.so.1: invalid ELF header

a bit later

sudo su
sudo: error while loading shared libraries: /lib64/libcap-ng.so.0: invalid ELF header

sudo dmesg
sudo: error while loading shared libraries: /lib64/libcap-ng.so.0: invalid ELF header

force reboot.
ping available but no ssh, let’s wait a bit…

later:

Have booted the fully broken install in qemu:

There are three very strange things:

  • the “?” in “disk-by”
  • LVM “direct activation… FAILED”

There’s also messages in there about a degraded disk array. I’m not sure exactly what’s happening but definitely sounds like the raid isn’t happy. Are these messages all from the first boot of the machine?

From your link above to the type of instance it says:

2 x 3.84 TB NVMe SSD Datacenter Edition
(software RAID 1) 

So we are essentially having two 3.84TB disks with the boot and root filesystems in a raid1?

Yes, it’s raid 1 on two 3.84 TB NVMes. There are no other drives.
The strange thing is, that first boot works, even ssh login. Then something strange happens and everything is defects. No further login possible.
Maybe something with the auto partition resize on first boot?

what does cat /proc/mdstat say about the integrity of the software RAID?

Re-installed the server.
I could catch the couple of seconds where ssh login is possible and opened journalctl -f

This might help. We can see the systemd rpm service doing its thing successfully, then something happens I don’t understand, then rpm-ostree is logging again (invalid ELF header), after that there is no other ssh login possible.

May 09 07:45:31 rs116 rpm-ostree[3021]: sanitycheck(/usr/bin/true) successful
May 09 07:45:32 rs116 rpm-ostree[3021]: Regenerating rpmdb for target
May 09 07:45:32 rs116 rpm-ostree[2956]: Writing rpmdb...done
May 09 07:45:37 rs116 promtail[4345]: level=warn ts=2022-05-09T07:45:37.920920446Z caller=client.go:349 component=client host=mon1.REMOVED.de:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://mon1.REMOVED.de:3100/loki/api/v1/push\": context deadline exceeded"
May 09 07:45:42 rs116 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
May 09 07:45:42 rs116 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'                                                                                                                                                                                                                                      
May 09 07:45:42 rs116 kernel: kauditd_printk_skb: 16 callbacks suppressed
May 09 07:45:42 rs116 kernel: audit: type=1131 audit(1652082342.710:376): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 09 07:45:42 rs116 audit: BPF prog-id=0 op=UNLOAD
May 09 07:45:42 rs116 audit: BPF prog-id=0 op=UNLOAD
May 09 07:45:42 rs116 kernel: audit: type=1334 audit(1652082342.762:377): prog-id=0 op=UNLOAD
May 09 07:45:42 rs116 kernel: audit: type=1334 audit(1652082342.762:378): prog-id=0 op=UNLOAD
May 09 07:45:43 rs116 rpm-ostree[3021]: Wrote commit: e6b98acfc8fdccfdf08ec673f299e1c629c15b19da90a80ed2cac77ce8658fc9; New objects: meta:1284 content:16 totaling 20.9 MB)
May 09 07:45:43 rs116 rpm-ostree[2956]: Writing OSTree commit...done
May 09 07:45:44 rs116 rpm-ostree[3021]: note: Deploying commit e6b98acfc8fdccfdf08ec673f299e1c629c15b19da90a80ed2cac77ce8658fc9 which contains content in /var/lib that will be ignored.
May 09 07:45:44 rs116 systemd[1]: Finished OSTree Finalize Staged Deployment.
May 09 07:45:44 rs116 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=ostree-finalize-staged comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'                                                                                                                                                                                                                                
May 09 07:45:44 rs116 kernel: audit: type=1130 audit(1652082344.153:379): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=ostree-finalize-staged comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 09 07:45:44 rs116 rpm-ostree[2956]: Staging deployment...done
May 09 07:45:44 rs116 rpm-ostree[3021]: Created new deployment /ostree/deploy/fedora-coreos/deploy/e6b98acfc8fdccfdf08ec673f299e1c629c15b19da90a80ed2cac77ce8658fc9.0
May 09 07:45:44 rs116 systemd[1]: Started /usr/bin/ostree admin unlock --transient.
May 09 07:45:44 rs116 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpm-ostree-unlock comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'                                                                                                                                                                                                                                     
May 09 07:45:44 rs116 kernel: audit: type=1130 audit(1652082344.801:380): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpm-ostree-unlock comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 09 07:45:44 rs116 kernel: xfs filesystem being remounted at /sysroot supports timestamps until 2038 (0x7fffffff)
May 09 07:45:44 rs116 kernel: EXT4-fs (md127): re-mounted. Opts: . Quota mode: none.
May 09 07:45:44 rs116 kernel: ext4 filesystem being remounted at /boot supports timestamps until 2038 (0x7fffffff)
May 09 07:45:44 rs116 rpm-ostree[6232]: A writable overlayfs is prepared for /usr, but is mounted read-only by default.
May 09 07:45:44 rs116 rpm-ostree[6232]: All changes there will be discarded on reboot.
May 09 07:45:44 rs116 systemd[1]: rpm-ostree-unlock.service: Deactivated successfully.
May 09 07:45:44 rs116 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpm-ostree-unlock comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'                                                                                                                                                                                                                                      
May 09 07:45:44 rs116 kernel: audit: type=1131 audit(1652082344.883:381): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpm-ostree-unlock comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 09 07:45:45 rs116 rpm-ostree[3021]: Computed /usr diff: files(added:524 removed:0 changed:2939) dirs(added:107 removed:0 changed:0)
May 09 07:45:45 rs116 rpm-ostree[3021]: Computed pkg diff: 110 added, 0 changed, 0 removed
May 09 07:45:45 rs116 rpm-ostree[3021]: Computed /etc diff: files(added:45 removed:0 changed:13) dirs(added:4 removed:0 changed:0)
May 09 07:45:45 rs116 rpm-ostree[2956]: Computing /etc diff to preserve...done
May 09 07:45:47 rs116 rpm-ostree[2956]: Updating /usr...done
May 09 07:45:47 rs116 dbus-broker-launch[2878]: Noticed file-system modification, trigger reload.
May 09 07:45:47 rs116 dbus-broker-launch[2878]: Looking up NSS user entry for 'dbus'...
May 09 07:45:47 rs116 dbus-broker-launch[2878]: NSS returned NAME 'dbus' and UID '81'
May 09 07:45:47 rs116 dbus-broker-launch[2878]: Looking up NSS user entry for 'systemd-resolve'...
May 09 07:45:47 rs116 dbus-broker-launch[2878]: NSS returned NAME 'systemd-resolve' and UID '990'
May 09 07:45:47 rs116 dbus-broker-launch[2878]: Looking up NSS user entry for 'systemd-timesync'...
May 09 07:45:47 rs116 dbus-broker-launch[2878]: NSS returned NAME 'systemd-timesync' and UID '993'
May 09 07:45:47 rs116 audit[2878]: AVC avc:  denied  { open } for  pid=2878 comm="dbus-broker-lau" path="/usr/share/dbus-1/system.d/org.freedesktop.home1.conf" dev="overlay" ino=6444844622 scontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:modules_object_t:s0 tclass=file permissive=0                                                                                                                                                     
May 09 07:45:47 rs116 dbus-broker-launch[2878]: Access denied in /usr/share/dbus-1/system.d/org.freedesktop.home1.conf +1: /usr/share/dbus-1/system.d/org.freedesktop.home1.conf
May 09 07:45:47 rs116 audit[2878]: SYSCALL arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=556f57a92371 a2=80100 a3=0 items=0 ppid=1 pid=2878 auid=4294967295 uid=81 gid=81 euid=81 suid=81 fsuid=81 egid=81 sgid=81 fsgid=81 tty=(none) ses=4294967295 comm="dbus-broker-lau" exe="/usr/bin/dbus-broker-launch" subj=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 key=(null)                                                                                     
May 09 07:45:47 rs116 kernel: audit: type=1400 audit(1652082347.726:382): avc:  denied  { open } for  pid=2878 comm="dbus-broker-lau" path="/usr/share/dbus-1/system.d/org.freedesktop.home1.conf" dev="overlay" ino=6444844622 scontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:modules_object_t:s0 tclass=file permissive=0
May 09 07:45:47 rs116 kernel: audit: type=1300 audit(1652082347.726:382): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=556f57a92371 a2=80100 a3=0 items=0 ppid=1 pid=2878 auid=4294967295 uid=81 gid=81 euid=81 suid=81 fsuid=81 egid=81 sgid=81 fsgid=81 tty=(none) ses=4294967295 comm="dbus-broker-lau" exe="/usr/bin/dbus-broker-launch" subj=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 key=(null)
May 09 07:45:47 rs116 kernel: audit: type=1327 audit(1652082347.726:382): proctitle=2F7573722F62696E2F646275732D62726F6B65722D6C61756E6368002D2D73636F70650073797374656D002D2D6175646974
May 09 07:45:47 rs116 audit: PROCTITLE proctitle=2F7573722F62696E2F646275732D62726F6B65722D6C61756E6368002D2D73636F70650073797374656D002D2D6175646974
May 09 07:45:47 rs116 dbus-broker-launch[2878]: Invalid configuration, ignored.
May 09 07:45:47 rs116 rpm-ostree[2956]: Updating /etc...done
May 09 07:45:47 rs116 rpm-ostree[6276]: systemd-run: error while loading shared libraries: /lib64/libgcc_s.so.1: invalid ELF header

for completeness

cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 nvme1n1p4[1] nvme0n1p4[0]
3750080832 blocks super 1.2 [2/2] [UU]
[>....................]  resync =  1.7% (65739136/3750080832) finish=298.1min speed=205940K/sec
bitmap: 28/28 pages [112KB], 65536KB chunk

md127 : active raid1 nvme1n1p3[1] nvme0n1p3[0]
393152 blocks super 1.0 [2/2] [UU]

unused devices: <none>

btw. those drives are capable of doing a full resync in like 7 minutes with more aggressive configuration.

journal entry for failed ssh login

May 09 08:12:09 rs116 sshd[6590]: /usr/sbin/sshd: error while loading shared libraries: libsystemd.so.0: cannot dynamically load position-independent executable

Solution:
Reduce size of boot partition to 2TB

variant: fcos
version: 1.4.0
boot_device:
  mirror:
    devices:
      - /dev/sda
      - /dev/sdb
storage:
  disks:
    - device: /dev/sda
      partitions:
        - label: root-1
          #defect 2621000 => error: Running tmpfiles for /run and /var: Running systemd worker: exit status: 127
          #ok 2097000
          size_mib: 2097000
    - device: /dev/sdb
      partitions:
        - label: root-2
          size_mib: 2097000


2TB (2097000 MIB) OK
2.5TB (2621000 MIB) Failure

@st.koelle - thanks for the information. I don’t know if I’d exactly call it a solution because there’s still a problem. You’ve just found a good workaround.

Hopefully you can open and issue and we can find the real root cause.