Podman systemd services shutdown order

While following the examples of running podman containers as systemd services, I discovered I’m having issues cleanly rebooting. I went down a rabbit hole and now have something that works, but I’m wondering if this is really how it’s supposed to work and why nobody else seems to be having these issues :sweat_smile:

So I’m currently following https://docs.fedoraproject.org/en-US/fedora-coreos/running-containers/ which works fine until you try to do an ordered, clean shutdown/reboot. This is a different approach than what’s in the podman docs, which seem to be more oriented toward turning existing containers into units. I would love to stay away from PID files though… I’m having the shutdown order issues with both methods.

For example let’s take the etcd.service running a podman container from the FCOS documentation:

[Unit]
Description=Run single node etcd
After=network-online.target
Wants=network-online.target

[Service]
ExecStartPre=mkdir -p /var/lib/etcd
ExecStartPre=-/bin/podman kill etcd
ExecStartPre=-/bin/podman rm etcd 
ExecStartPre=-/bin/podman pull quay.io/coreos/etcd
ExecStart=/bin/podman run --name etcd --volume /var/lib/etcd:/etcd-data:z --net=host quay.io/coreos/etcd:latest /usr/local/bin/etcd --data-dir /etcd-data --name node1 \
        --initial-advertise-peer-urls http://127.0.0.1:2380 --listen-peer-urls http://127.0.0.1:2380 \
        --advertise-client-urls http://127.0.0.1:2379 \
        --listen-client-urls http://127.0.0.1:2379 \
        --initial-cluster node1=http://127.0.0.1:2380

ExecStop=/bin/podman stop etcd

[Install]
WantedBy=multi-user.target

Then add another unit that depends on it:

[Unit]
Description=Etcd consumer
After=etcd.service
Wants=etcd.service

A systemctl reboot should stop the consumer before stopping the etcd container. When reviewing the systemd debug logs, at first sight this seems to be what happens, except that the etcd.service cgroup is already empty by the time etcd.service is stopped. All processes in the podman container have already been killed well before the service unit executes its stop command.

What I’m seeing is that podman creates a transient unit libpod-<id>.scope under machine.slice for the etcd container. This unit has DefaultDependencies=yes and therefore conflicts with shutdown.target. So when the reboot is initiated, a stop job for the scope unit is scheduled and executed almost immediately. Since the scope unit has no dependencies other than machine.slice, it’s stopped very early. And if I understood the systemd docs correctly, all processes in a scope get a SIGTERM when the scope unit is stopped. :worried:
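
For anyone who wants to check this on their own machine, here is a rough shell sketch. The helper name has_default_dependencies is made up for illustration; it only reads the Key=Value lines that systemctl show prints. The commented usage assumes at least one running podman container whose scope unit matches the libpod-*.scope naming described above.

```shell
# Hypothetical helper: reads "Key=Value" lines (as printed by `systemctl show`)
# on stdin and succeeds if the unit still has default dependencies enabled,
# which is what makes it conflict with shutdown.target.
has_default_dependencies() {
    grep -qx 'DefaultDependencies=yes'
}

# Usage on a live system (assumes running podman containers):
#   for unit in $(systemctl list-units --type=scope --plain --no-legend 'libpod-*.scope' | awk '{print $1}'); do
#       if systemctl show "$unit" --property=DefaultDependencies | has_default_dependencies; then
#           echo "$unit will get a stop job as soon as shutdown starts"
#       fi
#   done
```

Adding Slice and Conflicts to the --property list shows which slice the scope lives in and whether shutdown.target is listed as a conflict.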

In addition to the stop job from the default dependencies, the machine.slice unit is also stopped because of shutdown.target. This causes systemd to pull in stop jobs for all the podman .scope units under it. Again, since there is no ordering relation between the scope units and the services, these scope units are stopped immediately.

I’m able to work around these issues by creating a new slice without default dependencies, running the containers in it, and ordering the service units after it.

/etc/systemd/system/system-containers.slice:

[Unit]
Description=System Containers Slice
# Do not conflict with shutdown.target, so the slice is not stopped on shutdown.
DefaultDependencies=no

/etc/systemd/system/etcd.service:

[Unit]
Description=Run single node etcd
Wants=system-containers.slice
After=system-containers.slice

[Service]
ExecStart=/bin/podman run --name etcd \
                          --cgroup-parent system-containers.slice \
...

This solves the issue where a shutdown causes stop jobs for the container’s scope units to be pulled in by the stop of machine.slice. Note that just adding After=machine.slice to the service units is definitely not enough here: since machine.slice is stopped explicitly, it pulls in the additional jobs like this:

systemd[1]: Pulling in machine.slice/stop from shutdown.target/start
systemd[1]: Added job machine.slice/stop to transaction.
systemd[1]: Pulling in libpod-f558a3cffaacef61b4846aba7c98da0a8a4bcf95af6227df70a722586ac92d33.scope/stop from machine.slice/stop
systemd[1]: Added job libpod-f558a3cffaacef61b4846aba7c98da0a8a4bcf95af6227df70a722586ac92d33.scope/stop to transaction.
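
If you want to pull the same information out of your own journal, a small filter along these lines may help. This is only a sketch: the function name pulled_in_stop_jobs is hypothetical, and it assumes PID 1 was logging at debug level during the shutdown (for example by booting with systemd.log_level=debug) and that the journal is persistent, so the previous boot is still available.

```shell
# Hypothetical filter: reads systemd debug log lines on stdin and prints
# "<unit> <- <job that pulled in its stop job>" for every
# "Pulling in X/stop from Y" line. All other lines are dropped.
pulled_in_stop_jobs() {
    sed -n 's/.*Pulling in \([^ ]*\)\/stop from \([^ ]*\)$/\1 <- \2/p'
}

# Usage after a reboot (assumes debug logging and a persistent journal):
#   journalctl -b -1 _PID=1 | pulled_in_stop_jobs
```

Any libpod-*.scope entry that is pulled in from machine.slice/stop or from shutdown.target/start is a container that will be killed without waiting for its service unit.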

In addition, these scope units have default dependencies that conflict with shutdown.target and also cause a stop job to be scheduled. So I’m also adding --annotation=org.systemd.property.DefaultDependencies=false to the run command. This annotation is a rather obscure runc feature that allows me to pass extra systemd properties to the transient scope units.

So I end up with podman units that look like this:

[Unit]
Description=Run single node etcd
After=system-containers.slice
Wants=system-containers.slice
After=network-online.target
Wants=network-online.target

[Service]
ExecStartPre=mkdir -p /var/lib/etcd
ExecStartPre=-/bin/podman kill etcd
ExecStartPre=-/bin/podman rm etcd 
ExecStartPre=-/bin/podman pull quay.io/coreos/etcd
ExecStart=/bin/podman run --name etcd \
        --cgroup-parent system-containers.slice \
        --annotation=org.systemd.property.DefaultDependencies=false \
        --volume /var/lib/etcd:/etcd-data:z --net=host quay.io/coreos/etcd:latest /usr/local/bin/etcd --data-dir /etcd-data --name node1 \
        --initial-advertise-peer-urls http://127.0.0.1:2380 --listen-peer-urls http://127.0.0.1:2380 \
        --advertise-client-urls http://127.0.0.1:2379 \
        --listen-client-urls http://127.0.0.1:2379 \
        --initial-cluster node1=http://127.0.0.1:2380

ExecStop=/bin/podman stop etcd

[Install]
WantedBy=multi-user.target

But this all feels way more convoluted than it ought to be. Is the lack of DefaultDependencies=no in podman’s scope unit a bug? Do we need to change the documentation to recommend using your own slice, or should the machine slice not be stopped? Or is there some way of adding a dependency between the transient scope unit and the service?

I would love some insight :slight_smile:

Copy/pasting some IRC response:

mheon> Matt Heon walters: looking
13:36 ah, this is known
13:37 we've had an action item about this for quite some time, but never had the time and resources to devote to fixing it because, as the post points out, it's easy to work around with some creative dependency structure in the units
13:37 well, not easy, but at least possible

Hehe, definitely not easy :sweat_smile:

I do think this is especially relevant on Fedora CoreOS, since running containers is the primary use-case (encouraging the use of podman and systemd) and we automatically reboot often for updates.

FWIW I also tried adding an explicit dependency by injecting a Before= into the transient scope unit. Sadly this always results in an OCI error. I’m not surprised, since being able to pass these options to the container runtime feels like stepping on the container manager’s toes :slight_smile:

sudo podman run --annotation="org.systemd.property.Before='etcd.service'" nginx:latest
ERRO[0000] sd-bus call: No such device or address: OCI runtime error

If we were able to do something like this, it would probably fix the shutdown order. But it would have to be done in the transient unit, since adding an After=libpod-x.scope to the service unit is just not possible. Unless there is a way to get predictable scope unit names :thinking:
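
To illustrate why the name is not predictable up front: going by the log excerpt earlier in this post, the scope name appears to be derived from the full container ID, which only exists once the container has been created. A minimal sketch, with scope_unit_for being a made-up helper name and the naming pattern being an assumption taken from those logs:

```shell
# Hypothetical helper: build the transient scope unit name from a full
# container ID, assuming the libpod-<id>.scope pattern seen in the logs.
scope_unit_for() {
    printf 'libpod-%s.scope\n' "$1"
}

# Usage (assumes a container named "etcd" already exists):
#   scope_unit_for "$(podman inspect --format '{{.Id}}' etcd)"
```

Since the ID changes every time the container is recreated, this can only be computed at runtime and cannot be written into a static After= line in the service unit.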

Heh. It’s easily worked around, in much the same way that building a second nuclear bomb is easy once you’ve got one under your belt.

That first one might give you a bit of trouble, though…

The only working solution for us was the following:

        [Unit]
        Description=Hello World Container
        After=network-online.target
        Wants=network.target

        [Service]
        Environment=PODMAN_SYSTEMD_UNIT=%n
        TimeoutStartSec=0
        Restart=on-failure
        ExecStartPre=/bin/podman rm -i -f %N
        ExecStartPre=/usr/bin/rm -f %t/%N.pid
        ExecStart=/bin/podman run --name %N --conmon-pidfile %t/%N.pid --cgroups disabled --log-driver journald --pull always -d -q hello-world
        ExecStop=/bin/podman stop -i -t 10 %N
        ExecStopPost=-/bin/podman image prune -f
        PIDFile=%t/%N.pid
        KillMode=none
        Type=forking

        [Install]
        WantedBy=multi-user.target

This was based on the finding that systemd first shuts down the container process in its dynamic cgroup, which has no dependencies, so the conmon process fails to stop the container when triggered from the unit file.