Inconsistency with /proc mount in container

Hi all,

I’m seeing inconsistent behaviour with the /proc mount, specifically with the /proc/sys/kernel/ns_last_pid file.

For background: I’m an Eclipse OpenJ9 developer, and we’re working on using CRIU to provide an “instant on” feature. Essentially, the JVM exposes a Java API that an application can use to self-checkpoint. On the restore run, an external script invokes CRIU to restore the process.

Our test is set up as follows:

  1. Build CRIU as part of container image build
  2. As part of container image build, run:
    setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu
  3. Start container, which launches OpenLiberty, which invokes the API to self-checkpoint; this ends the process.
  4. Commit this container (which contains all the checkpointed state) to create a new container image that, when run, will invoke the restore script.
  5. Run the restore image elsewhere.

I’ve been running a local K8s cluster using the script found in the repo, built from CRI-O v1.23, Kubernetes v1.23, and CNI Plugins v1.1.1. The deployment’s securityContext field is:

securityContext:
  capabilities:
    add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]

Now, on RHEL 8.5, if I don’t use a privileged container, the restore step fails because /proc/sys is mounted read-only; CRIU needs to write to /proc/sys/kernel/ns_last_pid as part of the restore step (and in fact, the ability to write to ns_last_pid is one of the privileges granted by the CAP_CHECKPOINT_RESTORE capability).

This happens because the CRI-O code explicitly hardcodes certain paths to be Masked or Read-Only unless the container is privileged.

I’m posting here because on my Fedora 35 VirtualBox VM, I’m able to successfully run an unprivileged container that restores the Java process. The config.json generated by runc shows that the paths should be Masked/Read-Only:

"maskedPaths": [
  "/proc/acpi",
  "/proc/kcore",
  "/proc/keys",
  "/proc/latency_stats",
  "/proc/timer_list",
  "/proc/timer_stats",
  "/proc/sched_debug",
  "/proc/scsi",
  "/sys/firmware"
],
"readonlyPaths": [
  "/proc/asound",
  "/proc/bus",
  "/proc/fs",
  "/proc/irq",
  "/proc/sys",
  "/proc/sysrq-trigger"
],

and when I exec into the container and open the ns_last_pid file, vim does report it as read-only.
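To compare the two hosts without relying on vim, one option is to look at the mount table from inside the container. The following is only a minimal sketch: it parses text in the /proc/self/mountinfo format and reports whether the mount covering a given path carries the `ro` option. The function names and the sample mount table are made up for illustration and are not taken from either of my machines.

```python
# Sketch: decide whether a path sits under a read-only mount by parsing
# text in the /proc/self/mountinfo format (field 5 is the mount point,
# field 6 holds the per-mount options such as "ro" or "rw").

# Hypothetical sample, shaped like what a container with a read-only
# /proc/sys bind mount might show. Not captured from a real system.
SAMPLE_MOUNTINFO = """\
100 90 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
101 100 0:50 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
"""


def mount_options_for(path, mountinfo_text):
    """Return (mount_point, options) for the deepest mount containing path."""
    best_point, best_opts = "", None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        mount_point, opts = fields[4], fields[5].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same mount point win,
            # since it shadows the earlier one.
            if len(mount_point) >= len(best_point):
                best_point, best_opts = mount_point, opts
    return best_point, best_opts


def is_read_only(path, mountinfo_text):
    """True if the mount covering path has the 'ro' option."""
    _, opts = mount_options_for(path, mountinfo_text)
    return opts is not None and "ro" in opts


def current_mountinfo():
    """Read the live mount table (Linux only)."""
    with open("/proc/self/mountinfo") as f:
        return f.read()
```

Inside the container, `is_read_only("/proc/sys/kernel/ns_last_pid", current_mountinfo())` would then show whether the read-only mount is really in place on both hosts. This only inspects mount options; it says nothing about whether CRIU actually attempts the write.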

Everything looks the same as on RHEL 8.5, except that, for some reason, CRIU is able to open ns_last_pid and update it.

Either this is a bug, or there’s something about my local Fedora environment that allows this behaviour. I’d appreciate any suggestions or insights into why this inconsistency exists.

FWIW, the flagged replies above from me were various GitHub links that I had to separate out because Discourse only lets me post two links per post. They were relevant to the post, but in the interest of reducing noise I deleted the replies; I can provide those links if needed.

I think the reason is that on newer kernels that have clone3(), CRIU does not need to write to ns_last_pid (see POC for CAP_CHECKPOINT_RESTORE by dsouzai · Pull Request #5776 · cri-o/cri-o · GitHub).
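A rough way to check which side of that line a host falls on is to compare its kernel release against the version where clone3() gained the set_tid field, which as far as I know is Linux 5.5. The sketch below is a heuristic only: the function names are made up, the release strings in the comments are illustrative, and a distribution kernel may carry backports that a version comparison cannot see.

```python
import platform
import re

# Assumption: clone3() accepts set_tid from Linux 5.5 onwards, which is
# what would let a restore pick PIDs without writing ns_last_pid.
SET_TID_MIN_VERSION = (5, 5)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.16.5-200.fc35.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def likely_has_clone3_set_tid(release=None):
    """Heuristic: True if the kernel version is at least 5.5.

    Defaults to the running kernel. Backported features in vendor
    kernels are not detected by this check.
    """
    if release is None:
        release = platform.release()
    return parse_kernel_release(release) >= SET_TID_MIN_VERSION
```

For example, a 4.18-based release string would return False and a 5.16-based one would return True, which would line up with the RHEL versus Fedora difference described in the question if the hosts run kernels in those ranges.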