While following the examples for running podman containers as systemd services, I discovered I’m having issues with clean reboots. I went down a rabbit hole and now have something that works, but I’m wondering if this is really how it’s supposed to work, and why nobody else seems to be having these issues.
I’m currently following Running Containers :: Fedora Docs, which works fine, that is, until you try to do an ordered, clean shutdown/reboot. This is a different approach than the one in the podman docs, which seems more oriented towards turning existing containers into units. I would love to stay away from PID files though… In any case, I’m seeing the shutdown ordering issues with both methods.
For example, let’s take an etcd.service running a podman container, as in the FCOS documentation:
[Unit]
Description=Run single node etcd
After=network-online.target
Wants=network-online.target
[Service]
ExecStartPre=mkdir -p /var/lib/etcd
ExecStartPre=-/bin/podman kill etcd
ExecStartPre=-/bin/podman rm etcd
ExecStartPre=-/bin/podman pull quay.io/coreos/etcd
ExecStart=/bin/podman run --name etcd --volume /var/lib/etcd:/etcd-data:z --net=host quay.io/coreos/etcd:latest /usr/local/bin/etcd --data-dir /etcd-data --name node1 \
--initial-advertise-peer-urls http://127.0.0.1:2380 --listen-peer-urls http://127.0.0.1:2380 \
--advertise-client-urls http://127.0.0.1:2379 \
--listen-client-urls http://127.0.0.1:2379 \
--initial-cluster node1=http://127.0.0.1:2380
ExecStop=/bin/podman stop etcd
[Install]
WantedBy=multi-user.target
Then add another unit that depends on it:
[Unit]
Description=Etcd consumer
After=etcd.service
Wants=etcd.service
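Not part of my actual setup, but for anyone who wants to reproduce this: a minimal, complete version of that consumer could look like the hypothetical etcd-consumer.service below, with /bin/sleep standing in for whatever really talks to etcd.
/etc/systemd/system/etcd-consumer.service:
[Unit]
Description=Etcd consumer
After=etcd.service
Wants=etcd.service
[Service]
# Placeholder workload; any long-running process ordered after etcd.service
# is enough to observe the shutdown ordering.
ExecStart=/bin/sleep infinity
[Install]
WantedBy=multi-user.target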
A systemctl reboot should stop the consumer before stopping the etcd pod. When reviewing the systemd debug logs, at first sight this seems to be happening, except that the etcd.service cgroup is already empty by the time etcd.service is stopped. All processes in the podman container have already been killed well before the service unit executes its stop command.
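For anyone wanting to see the same thing: one way to capture that ordering is to raise systemd’s log level before rebooting and read the previous boot’s journal afterwards, e.g.:
# Raise the manager's log level (resets on the next boot).
sudo systemd-analyze log-level debug
sudo systemctl reboot
# After the reboot, inspect the shutdown sequence of the previous boot.
sudo journalctl -b -1 -o short-monotonic | grep -E 'etcd|libpod-.*\.scope|machine\.slice'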
What I’m seeing is that podman creates a transient unit libpod-<id>.scope under machine.slice for the etcd pod. This unit has DefaultDependencies=yes and therefore conflicts with shutdown.target. So when the reboot is initiated, a stop job for the scope unit is scheduled and executed almost immediately. Since the scope unit has no dependencies other than machine.slice, it is stopped very early. And if I understood the systemd docs correctly, all processes in a scope get a SIGTERM when the scope unit is stopped.
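This is easy to check while the container is running; the scope name depends on the container ID, so something along these lines (using podman inspect is just one way to get the ID) shows the shutdown ordering that the default dependencies add:
# Dump the dependencies of the transient scope that matter for shutdown.
systemctl show "libpod-$(podman inspect etcd --format '{{.Id}}').scope" \
    -p Slice -p Before -p Conflicts -p DefaultDependencies
# Expect Before=shutdown.target and Conflicts=shutdown.target in the output.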
In addition to the stop job from the default dependencies, the machine.slice unit is also stopped because of shutdown.target. This causes systemd to pull in stop jobs for all the podman .scope units under it. Again, since there is no ordering relation between the scope units and the services, these scope units are stopped immediately.
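The same propagation can be seen from the slice side:
# The container scopes show up as children of machine.slice ...
systemd-cgls /machine.slice
# ... and the slice itself carries the default shutdown dependencies.
systemctl show machine.slice -p Before -p Conflicts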
I’m able to work around these issues by creating a new slice without default dependencies, running the pods in it, and ordering the service units after it.
/etc/systemd/system/system-containers.slice:
[Unit]
Description=System Containers Slice
# Do not conflict on the shutdown.target, so it will not get stopped on shutdown.
DefaultDependencies=no
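A quick sanity check that the new slice really has no default dependencies:
sudo systemctl daemon-reload
systemctl show system-containers.slice -p DefaultDependencies -p Conflicts
# Expect DefaultDependencies=no and an empty Conflicts= line.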
/etc/systemd/system/etcd.service:
[Unit]
Description=Run single node etcd
Wants=system-containers.slice
After=system-containers.slice
[Service]
ExecStart=/bin/podman run --name etcd \
--cgroup-parent system-containers.slice \
...
This solves the issue of a shutdown pulling in stop jobs for the containers’ scope units via the stop of machine.slice. Note that just adding After=machine.slice to the service units is definitely not enough here: since machine.slice is stopped explicitly, it pulls in the additional jobs like this:
systemd[1]: Pulling in machine.slice/stop from shutdown.target/start
systemd[1]: Added job machine.slice/stop to transaction.
systemd[1]: Pulling in libpod-f558a3cffaacef61b4846aba7c98da0a8a4bcf95af6227df70a722586ac92d33.scope/stop from machine.slice/stop
systemd[1]: Added job libpod-f558a3cffaacef61b4846aba7c98da0a8a4bcf95af6227df70a722586ac92d33.scope/stop to transaction.
In addition, these scope units have default dependencies that conflict with shutdown.target and also cause a stop job to be scheduled. So I’m also adding --annotation=org.systemd.property.DefaultDependencies=false to the run command. This annotation is a rather obscure runc feature that allows me to pass extra systemd properties into the transient scope unit.
So I end up with podman units that look like this:
[Unit]
Description=Run single node etcd
After=system-containers.slice
Wants=system-containers.slice
After=network-online.target
Wants=network-online.target
[Service]
ExecStartPre=mkdir -p /var/lib/etcd
ExecStartPre=-/bin/podman kill etcd
ExecStartPre=-/bin/podman rm etcd
ExecStartPre=-/bin/podman pull quay.io/coreos/etcd
ExecStart=/bin/podman run --name etcd \
--cgroup-parent system-containers.slice \
--annotation=org.systemd.property.DefaultDependencies=false \
--volume /var/lib/etcd:/etcd-data:z --net=host quay.io/coreos/etcd:latest /usr/local/bin/etcd --data-dir /etcd-data --name node1 \
--initial-advertise-peer-urls http://127.0.0.1:2380 --listen-peer-urls http://127.0.0.1:2380 \
--advertise-client-urls http://127.0.0.1:2379 \
--listen-client-urls http://127.0.0.1:2379 \
--initial-cluster node1=http://127.0.0.1:2380
ExecStop=/bin/podman stop etcd
[Install]
WantedBy=multi-user.target
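Assuming the runtime honours the annotation, the effect is visible on the container’s scope once the service is up: it should sit in the custom slice and have its default shutdown dependencies dropped.
sudo systemctl daemon-reload
sudo systemctl enable --now etcd.service
# The transient scope should now report the custom slice and no default deps.
systemctl show "libpod-$(podman inspect etcd --format '{{.Id}}').scope" \
    -p Slice -p DefaultDependencies -p Conflicts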
But this all feels way more convoluted than it ought to be. Is not having DefaultDependencies=no in podman’s scope units a bug? Should the documentation recommend using your own slice, or should the machine slice simply not be stopped? Or is there some way of adding a dependency between the transient scope unit and the service unit?
I would love some insight.