Fedora Silverblue 29 does not boot after upgrade from 28


#1

Today I upgraded from Silverblue 28 (28.20181027.0) to Silverblue 29 (29.20181102.0) using the commands:

$ ostree remote gpg-import fedora-workstation -k /etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-29-primary

$ rpm-ostree rebase fedora/29/x86_64/silverblue

as described here https://docs.fedoraproject.org/en-US/fedora-silverblue/upgrades/ .

The upgrade seemed to work fine.
When I try to boot into 29, the boot process does not finish and hangs after the input of the disk encryption password.
Booting in the 28 installation still works fine.
I did a photo of the output during the boot process.


#2

Can you try booting into 28 and running journalctl -b-1 -n200, then posting the output? That should show the logs from the previous boot, so you can see what’s going on.


#3

I posted the output here: https://paste.fedoraproject.org/paste/S0jCwiK0utgt~4vhkzDVQg


#4

Well, this is surely odd…

Looks like a bunch of device- and network-related systems time out when starting, then NetworkManager crashes due to memory corruption, followed by systemd itself completely crashing in asprintf, again due to memory corruption…

Maybe there’s some weird interaction between devices and LUKS? Not exactly sure…


#5

Could this be caused by a layered package ( I only layered snapd ) ?
Ok it seems not. I just used rpm-ostree ex reset to ensure, that everything is, as it should be.
However the boot process still hangs.


#6

Out of curiosity, does rawhide work? Maybe it’s been a bug that’s since been fixed?


#7

With Rawhide the boot process hangs at the same point.


#8

Are there any logs which I could provide to further debug this issue ?


#9

Okay, so this is admittedly a really…interesting issue, largely because systemd itself is crashing, which really should never happen (and is also incredibly rare). Since that’s kind of the elephant in the room, let’s start there.

Make sure the 29 installation is still available, then run rpm-ostree install systemd-debuginfo. (If not, re-run rebase with --install=systemd-debuginfo.) This will install the systemd debugging symbols, which will probably be useful. Try this:

When you reach the GRUB menu, select Fedora 29 and press e. Use the arrow keys to move to the line starting with linux (sometimes linux16 or linuxefi or similar), then add a space followed by:

systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on

Let it fail, then boot back into 28, run journalctl -b -1, and save and upload the output.


#10

Here is the output: https://paste.fedoraproject.org/paste/gEoSKTwShmigPei6jaqrvw


#11

Hmm, there are actually a few interesting things here. Most notably:

Nov 06 09:32:55 localhost.localdomain systemd-rfkill[906]: All events read and idle, exiting.
Nov 06 09:32:55 localhost.localdomain systemd-rfkill[906]: Saved state '0' to /var/lib/systemd/rfkill/platform-thinkpad_acpi:bluetooth.
Nov 06 09:34:24 localhost.localdomain systemd[1]: Failed to get initial list of names: Connection timed out
Nov 06 09:34:24 localhost.localdomain systemd[1]: Event source manager-sync-bus-names (type defer) returned error, disabling: Connection timed out
Nov 06 09:34:24 localhost.localdomain systemd[1]: bluetooth.service: Got notification message from PID 1017 (STATUS=Starting up)
Nov 06 09:34:24 localhost.localdomain systemd[1]: systemd-journald.service: Got notification message from PID 770 (WATCHDOG=1)
Nov 06 09:34:24 localhost.localdomain systemd[1]: systemd-udevd.service: Got notification message from PID 800 (WATCHDOG=1)
Nov 06 09:34:24 localhost.localdomain systemd[1]: Received SIGCHLD from PID 1032 (chronyd).
Nov 06 09:34:24 localhost.localdomain systemd[1]: Child 906 (systemd-rfkill) died (code=exited, status=0/SUCCESS)
Nov 06 09:34:24 localhost.localdomain systemd[1]: systemd-rfkill.service: Child 906 belongs to systemd-rfkill.service.
Nov 06 09:34:24 localhost.localdomain systemd[1]: systemd-rfkill.service: Main process exited, code=exited, status=0/SUCCESS
Nov 06 09:34:24 localhost.localdomain systemd[1]: systemd-rfkill.service: Changed running -> dead
Nov 06 09:34:24 localhost.localdomain audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Nov 06 09:34:24 localhost.localdomain systemd[1]: systemd-rfkill.socket: Changed running -> listening
Nov 06 09:34:24 localhost.localdomain systemd[1]: Child 1018 (chronyd) died (code=exited, status=0/SUCCESS)
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Child 1018 belongs to chronyd.service.
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Control process exited, code=exited status=0
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Got final SIGCHLD for state start.
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Permission denied while opening PID file or potentially unsafe symlink chain, will now retry with relaxed checks: /var/run/chrony/chronyd.pid
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: New main PID 1033 belongs to service, we are happy.
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Main PID loaded: 1033
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: About to execute: /usr/libexec/chrony-helper update-daemon
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Forked /usr/libexec/chrony-helper as 1040
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Changed start -> start-post
Nov 06 09:34:24 localhost.localdomain systemd[1]: Child 1032 (chronyd) died (code=exited, status=0/SUCCESS)
Nov 06 09:34:24 localhost.localdomain systemd[1]: chronyd.service: Child 1032 belongs to chronyd.service.
Nov 06 09:34:24 localhost.localdomain systemd[1]: firewalld.service: Start operation timed out. Terminating.
Nov 06 09:34:24 localhost.localdomain systemd[1]: firewalld.service: Changed start -> stop-sigterm
Nov 06 09:34:24 localhost.localdomain systemd[1]: avahi-daemon.service: Start operation timed out. Terminating.
Nov 06 09:34:24 localhost.localdomain systemd[1]: avahi-daemon.service: Changed start -> stop-sigterm
Nov 06 09:34:24 localhost.localdomain systemd[1]: bluetooth.service: Start operation timed out. Terminating.
Nov 06 09:34:24 localhost.localdomain systemd[1]: bluetooth.service: Changed start -> stop-sigterm

Basically, everything’s all fine and dandy, but then systemd gets stuck waiting for a minute and a half for D-Bus. Once that times out, everything else goes downhill and systemd eventually crashes. This seems to maybe imply an issue with D-Bus…

Try booting back into 28, then run coredumpctl -r dump 1048 -o systemd.core and save the systemd.core file for future reference. If this can’t be solved, I’ll attach that to a systemd bug report (it shouldn’t be crashing due to D-Bus issues anyway).

Now, rebase back to Fedora 28 while installing dbus-broker (the command should be something like rpm-ostree rebase --install=dbus-broker -b fedora/28/x86_64/workstation), then reboot. Once rebooted, run:

systemctl enable dbus-broker.service
systemctl --global dbus-broker.service

Now reboot again. Once rebooted, ensure the system is fully functional, then redo the rebase to Fedora 29 (rpm-ostree rebase -b fedora/29/x86_64/silverblue). Reboot and see if it works, or at least gets a little further along than before. If it works, then the issue was definitely with D-Bus, and you can probably leave the system as-is (Fedora 30 is moving to dbus-broker anyway). Otherwise, I’ll file a systemd bug report with all this information.

There might be a shorter way to do this (via rescue.target), but I’ll have to try myself to see how it interacts with OSTree, which I won’t be able to do until later today or tomorrow.


#12

I’ve been experiencing this, I believe it’s selinux-related. I’ve had to tack on enforcing=0 in grub to boot. One specific denial I was seeing was w/ boltd (maybe https://bugzilla.redhat.com/show_bug.cgi?id=1632230). There’s an selinux policy update that wasn’t yet in the silverblue 29 compose last week – I’m going to reboot shortly and see if the latest update is working.


#13

After adding enforcing=0 to grub silverblue 29 boots indeed. But this is probably not a good fix, as this will deactivate SELinux, right?

I don’t have the time to check, if the boot problems could also be related to D-Bus right now. I will try to check this within the next days.


#14

There’s a new selinux policy in the latest commit, but I still can’t boot w/o enforcing=0.


#15

I got my install working w/ selinux enforcing. I did a couple of things. I had a few selinux policy modules that I’d created at some point w/ audit2allow, like “my-pulseaudio,” and I removed those. I sought to remove all module object local customizations with semanage modules -D. I also tried restorecon -Rv /var/.

Anyway, I’m back and in enforcing mode, running 29.20181107.0.


#16

Ohh, this might be https://github.com/ostreedev/ostree/issues/1026 - sorry about that - really old issue.
Your customizations would probably work if you booted into the new deployment and got a shell and rebuilt policy with semodule -B. But dropping your customizations is also a good idea if they’re no longer necessary.


#17

Thanks. This seemed to solve the issue for me, too.


#18

Out of curiosity, would there still be a bug here with systemd? I don’t see why it should be segfaulting either way.


#19

Just a short question: I ran this command as root. Can this lead to undesired results ?


#20

It probably shouldn’t SEGV I agree, although I suspect the F28 policy is denying something fundamental that the F29 systemd really needs and has no reason to expect would fail.