The question I have is about an inconsistency I’m seeing with regard to the /proc mount, specifically dealing with the
However, to give some background, I’m an Eclipse OpenJ9 developer; we’re working on using CRIU to provide an “instant on” feature. Essentially, the JVM exposes a Java API that an application can use to checkpoint itself. On the restore run, an external script then invokes CRIU to restore the process.
Our test is set up as follows:
- Build CRIU as part of container image build
- As part of container image build, run:
  setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu
- Start container, which launches OpenLiberty, which invokes the API to self-checkpoint; this ends the process.
- Commit this container (which contains all the checkpointed state) to create a new container image that, when run, will invoke the restore script.
- Run the restore image elsewhere.
I’ve been running a local K8s cluster using the script found in the repo, with CRI-O v1.23, Kubernetes v1.23, and CNI Plugins v1.1.1, all built from source. The deployment’s securityContext field is:
securityContext:
  capabilities:
    add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]
Now, on RHEL 8.5, if I don’t use a privileged container, the restore step fails because /proc is mounted as read-only; CRIU needs to write to /proc/sys/kernel/ns_last_pid as part of the restore step (and in fact, the ability to write to ns_last_pid is one of the privileges of the
This happens because the CRI-O code explicitly hardcodes certain paths to be Masked or Read-Only unless the container is privileged.
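To see which parts of /proc actually end up read-only inside a given container, one option is to read /proc/self/mountinfo there. The following is only a minimal sketch: the helper name readonly_mounts is hypothetical, and it assumes the standard mountinfo layout in which the fifth field is the mount point and the sixth is the per-mount options.

```python
def readonly_mounts(mountinfo_text, prefix="/proc"):
    """Return mount points at or under `prefix` whose per-mount options include 'ro'.

    Hypothetical helper; expects the text of /proc/self/mountinfo.
    """
    result = []
    base = prefix.rstrip("/")
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        mount_point = fields[4]
        options = fields[5].split(",")
        under_prefix = mount_point == prefix or mount_point.startswith(base + "/")
        if under_prefix and "ro" in options:
            result.append(mount_point)
    return result
```

Inside the container this could be called as readonly_mounts(open("/proc/self/mountinfo").read()). Comparing the result on the RHEL 8.5 host and the Fedora 35 VM would show whether /proc/sys is really remounted read-only in both cases.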
The reason I’m posting here is that on my Fedora 35 VirtualBox VM, I’m able to successfully run an unprivileged container that restores the Java process. The config.json generated by runc shows that the paths should be Masked/Read-Only:
"maskedPaths": [
  "/proc/acpi",
  "/proc/kcore",
  "/proc/keys",
  "/proc/latency_stats",
  "/proc/timer_list",
  "/proc/timer_stats",
  "/proc/sched_debug",
  "/proc/scsi",
  "/sys/firmware"
],
"readonlyPaths": [
  "/proc/asound",
  "/proc/bus",
  "/proc/fs",
  "/proc/irq",
  "/proc/sys",
  "/proc/sysrq-trigger"
],
and when I exec into the container and open the ns_last_pid file, vim does say it’s readonly.
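For comparing the two hosts, a small check of which config.json entries cover a given path may help. This is a sketch only: path_restrictions is a hypothetical helper, and it assumes the maskedPaths and readonlyPaths arrays sit under the top-level "linux" object, as they do in the OCI runtime spec.

```python
import json


def path_restrictions(config, path):
    """List (key, entry) pairs from maskedPaths/readonlyPaths that cover `path`.

    `config` is the parsed config.json; hypothetical helper for illustration.
    """
    linux = config.get("linux", {})
    found = []
    for key in ("maskedPaths", "readonlyPaths"):
        for entry in linux.get(key) or []:
            # An entry covers the path itself and everything beneath it.
            if path == entry or path.startswith(entry.rstrip("/") + "/"):
                found.append((key, entry))
    return found


def load_config(filename):
    """Parse a config.json file from disk."""
    with open(filename) as f:
        return json.load(f)
```

For example, path_restrictions(load_config("config.json"), "/proc/sys/kernel/ns_last_pid") should report the "/proc/sys" entry of readonlyPaths for the configuration shown above.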
Everything looks consistent with how things look on RHEL 8.5, except that for some reason, CRIU is able to open ns_last_pid and update it.
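One way to narrow this down might be to attempt the same open that fails on RHEL and look at the errno. The sketch below uses a hypothetical helper, probe_write, which opens a path write-only without truncating or writing anything. On a read-only mount I would expect EROFS, whereas a plain permission problem would show up as EACCES or EPERM. It is an assumption that running this as the same user and with the same capabilities as CRIU reflects what CRIU itself sees.

```python
import errno
import os


def probe_write(path):
    """Try to open `path` write-only; return 'ok' or the symbolic errno name.

    Nothing is written and the file is not truncated.
    """
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"
```

Calling probe_write("/proc/sys/kernel/ns_last_pid") inside the unprivileged container on both hosts would show whether the Fedora container really allows the open, or whether something else differs.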
Either this is a bug, or there’s something about my local Fedora environment that allows this behaviour. I’d appreciate any suggestions or insights into why this inconsistency exists.