Hi all,
My question is about an inconsistency I’m seeing with the /proc mount, specifically with the /proc/sys/kernel/ns_last_pid file.
First, some background: I’m an Eclipse OpenJ9 developer, and we’re working on using CRIU to provide an “instant on” feature. Essentially, the JVM exposes a Java API that an application can use to self-checkpoint. On the restore run, an external script invokes CRIU to restore the process.
Our test is set up as follows:
- Build CRIU as part of the container image build.
- As part of the container image build, run:
    setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu
- Start container, which launches OpenLiberty, which invokes the API to self-checkpoint; this ends the process.
- Commit this container (which contains all the checkpointed state) to create a new container image that, when run, will invoke the restore script.
- Run the restore image elsewhere.
I’ve been running a local K8s cluster using the script found in the repo, built from CRI-O v1.23, Kubernetes v1.23, and CNI Plugins v1.1.1. The deployment’s securityContext field is:
securityContext:
  capabilities:
    add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]
Now, on RHEL 8.5, if I don’t use a privileged container, the restore step fails because /proc is mounted read-only; CRIU needs to write to /proc/sys/kernel/ns_last_pid as part of the restore step (in fact, the ability to write to ns_last_pid is one of the privileges granted by the CAP_CHECKPOINT_RESTORE capability).
This happens because the CRI-O code explicitly hardcodes certain paths to be Masked or Read-Only unless the container is privileged.
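One way to compare the two hosts is to look at which mount actually covers ns_last_pid inside the container and whether it carries the ro option. The following is a minimal sketch, not something taken from CRI-O or CRIU: the helper name covering_mount is hypothetical, and it assumes the standard /proc/self/mountinfo layout (mount point in the fifth field, per-mount options in the sixth) and ignores escaped characters in mount points.

```python
import os


def covering_mount(mountinfo_text, path):
    """Return (mount_point, options) for the innermost mount containing path.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    best = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # not a well-formed mountinfo line
        mount_point, options = fields[4], fields[5].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on a tie the later line wins,
            # because later mounts are stacked on top of earlier ones.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, options)
    return best


if __name__ == "__main__" and os.path.exists("/proc/self/mountinfo"):
    with open("/proc/self/mountinfo") as f:
        print(covering_mount(f.read(), "/proc/sys/kernel/ns_last_pid"))
```

Running this inside the container on both RHEL 8.5 and Fedora 35 and comparing the reported mount point and options may show whether the read-only mount is really applied the same way on both systems.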
The reason I’m posting here is that on my Fedora 35 VirtualBox VM, I’m able to successfully run an unprivileged container that restores the Java process. The config.json generated by runc shows that the paths should be Masked/Read-Only:
"maskedPaths": [
    "/proc/acpi",
    "/proc/kcore",
    "/proc/keys",
    "/proc/latency_stats",
    "/proc/timer_list",
    "/proc/timer_stats",
    "/proc/sched_debug",
    "/proc/scsi",
    "/sys/firmware"
],
"readonlyPaths": [
    "/proc/asound",
    "/proc/bus",
    "/proc/fs",
    "/proc/irq",
    "/proc/sys",
    "/proc/sysrq-trigger"
],
and when I exec into the container and open the ns_last_pid file, vim does say it’s readonly.
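To double-check which config.json entry is expected to cover a given file, the lists can be queried programmatically. This is a small sketch under the assumption that the file follows the OCI runtime-spec layout, where maskedPaths and readonlyPaths sit under the top-level "linux" object; path_restrictions is a hypothetical helper name, and the sample dictionary is a shortened stand-in for the real file.

```python
import json


def path_restrictions(config, path):
    """List (key, entry) pairs from linux.maskedPaths / linux.readonlyPaths
    that equal path or are a parent directory of it."""
    linux = config.get("linux", {})
    hits = []
    for key in ("maskedPaths", "readonlyPaths"):
        for entry in linux.get(key, []):
            if path == entry or path.startswith(entry.rstrip("/") + "/"):
                hits.append((key, entry))
    return hits


if __name__ == "__main__":
    # Shortened stand-in for a real config.json; load yours with json.load().
    sample = json.loads(
        '{"linux": {"maskedPaths": ["/proc/kcore"],'
        ' "readonlyPaths": ["/proc/sys"]}}'
    )
    print(path_restrictions(sample, "/proc/sys/kernel/ns_last_pid"))
```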
Everything looks consistent with what I see on RHEL 8.5, except that for some reason CRIU is able to open ns_last_pid and update it.
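Since an editor’s “readonly” notice is only indirect evidence, it may help to attempt the open directly and record the errno on each host. The sketch below is my own illustration rather than how CRIU does it: probe_write is a hypothetical helper that opens the file write-only and closes it again without writing anything, so the value in ns_last_pid is not changed.

```python
import errno
import os


def probe_write(path):
    """Try to open path write-only; return "ok" or the symbolic errno name
    (for example EROFS for a read-only mount, EACCES for a permission check)."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    os.close(fd)  # nothing was written, so the file content is untouched
    return "ok"


if __name__ == "__main__":
    print(probe_write("/proc/sys/kernel/ns_last_pid"))
```

If this reports EROFS on RHEL 8.5 but "ok" on Fedora 35 from inside the same kind of container, the difference is in the mount itself; if both report an error, CRIU on Fedora may be succeeding by some other route.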
Either this is a bug, or there’s something about my local Fedora environment that allows this behaviour. I’d appreciate any suggestions or insights into why this inconsistency exists.