Create toolbox with GPU via nvidia container toolkit

Hi all (and happy new year!),

I’ve been trying to get my feet wet with containers in Fedora Silverblue 39.

I am trying to create a toolbox with GPU support to run some CUDA ML workflows.

Here’s what I have done:

  1. Install nvidia drivers (as per Howto/NVIDIA - RPM Fusion)
  2. Install nvidia container toolkit (as per Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.14.3 documentation)
  3. Configure nvidia container toolkit for podman via CDI (as per Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.14.3 documentation and therefore Support for Container Device Interface — NVIDIA Container Toolkit 1.14.3 documentation)
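
For anyone following along, step 3 boils down to roughly the following. This is only a sketch based on the CDI documentation linked above: the spec path /etc/cdi/nvidia.yaml is the one used in those docs, and generate_cdi_spec is a made-up helper name, not something the toolkit ships.

```shell
# Sketch of the CDI setup from step 3. The spec path follows NVIDIA's CDI
# documentation; generate_cdi_spec is a hypothetical helper name.
CDI_SPEC=/etc/cdi/nvidia.yaml

generate_cdi_spec() {
    if ! command -v nvidia-ctk >/dev/null 2>&1; then
        echo "nvidia-ctk not found; install the NVIDIA Container Toolkit first" >&2
        return 1
    fi
    # Write a CDI spec describing the GPUs on this host (needs root).
    sudo nvidia-ctk cdi generate --output="$CDI_SPEC" || return 1
    # List the device names podman can use, such as nvidia.com/gpu=all.
    nvidia-ctk cdi list
}
```

Calling generate_cdi_spec once after installing the toolkit should be enough; as far as I understand the docs, the spec also needs regenerating after a driver update.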

All of this seems to have gone well.

I can run their little test container:

$ podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L

And I see

GPU 0: NVIDIA GeForce RTX 3080 (UUID: GPU-f6c738c7-1a57-93c4-2eb5-e25cb4c39379)

Seems all good.

Where I am running into issues is creating a toolbox with the --device nvidia.com/gpu=all parameter.

Using podman inspect (as per this thread: Creating toolbx with custom settings) I get a CLI like:

podman \
 --log-level error \
   run \
 --rm \
 --replace \
 --device nvidia.com/gpu=all \
 --cgroupns host \
 --dns none \
 --env TOOLBOX_PATH=/usr/bin/toolbox \
 --env XDG_RUNTIME_DIR=/run/user/1000 \
 --hostname toolbox \
 --ipc host \
 --label com.github.containers.toolbox=true \
 --mount type=devpts,destination=/dev/pts \
 --name my_toolbox \
 --network host \
 --no-hosts \
 --pid host \
 --privileged \
 --security-opt label=disable \
 --ulimit host \
 --userns keep-id \
 --user root:root \
 --volume /:/run/host:rslave \
 --volume /dev:/dev:rslave \
 --volume /run/dbus/system_bus_socket:/run/dbus/system_bus_socket \
 --volume /var/home/romanp:/var/home/romanp:rslave \
 --volume /usr/bin/toolbox:/usr/bin/toolbox:ro \
 --volume /run/user/1000:/run/user/1000 \
 --volume /run/avahi-daemon/socket:/run/avahi-daemon/socket \
 --volume /run/.heim_org.h5l.kcm-socket:/run/.heim_org.h5l.kcm-socket \
 --volume /run/pcscd/pcscd.comm:/run/pcscd/pcscd.comm \
 --volume /run/media:/run/media:rslave \
 --volume /etc/profile.d/toolbox.sh:/etc/profile.d/toolbox.sh:ro \
   ubuntu toolbox \
 --log-level debug init-container \
 --gid 1000 \
 --home /var/home/romanp \
 --shell /bin/bash \
 --uid 1000 \
 --user romanp \
 --home-link \
 --media-link \
 --mnt-link

(replaced create with run --rm --replace, and added --device)
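
For reference, a rough sketch of how the original create command can be pulled out of an existing toolbox with podman inspect. show_create_command is a made-up helper name; it relies on the Config.CreateCommand field that podman records for each container.

```shell
# Print the command line a container was created with, one argument per line.
# show_create_command is a hypothetical helper; $1 is the container name.
show_create_command() {
    podman inspect \
        --format '{{range .Config.CreateCommand}}{{println .}}{{end}}' \
        "$1"
}
```

Running show_create_command my_toolbox prints the arguments; swapping create for run --rm --replace and adding --device is then a manual edit.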

However this returns

Error: OCI runtime error: crun: {"msg":"error executing hook `/usr/bin/nvidia-ctk` (exit code: 1)","level":"error","time":"2024-01-01T17:03:53.796571Z"}

I have tried:

  1. Editing /etc/nvidia-container-runtime/config.toml and setting no-cgroups = true (also false), as I saw this in a few older threads.
  2. Uncommenting both the debug= lines in that file (as per Troubleshooting — NVIDIA Container Toolkit 1.14.3 documentation)
    • Nothing showed up in /var/log/ (that wasn’t already there) after running the failing podman run again.

My question is twofold:

  1. Where are the logs that might indicate what the actual error was, and
  2. What in the toolbox arguments could be causing this?

Thank you for any help!

I’m trying to follow what you did, so bear with me here, but did you do any of these?

  1. Create the toolbox: toolbox create my-nvidia-toolbox
  2. Enter the toolbox: toolbox enter my-nvidia-toolbox
  3. Install the NVIDIA GPU drivers inside it: sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda
  4. Run nvidia-smi inside the container

The toolbox I am creating is Ubuntu-based rather than Fedora-based, so I am not sure whether installing the drivers applies in quite the same way.

To do this through toolbox, I first build an Ubuntu image from toolbox/images/ubuntu/22.04 in the containers/toolbox repository on GitHub:

$ podman build . -t localhost/toolbox/ubuntu:22.04
...
$ toolbox create --image localhost/toolbox/ubuntu:22.04 my-nvidia-toolbox

$ toolbox enter

but in this case nvidia-smi is not recognized as a command (presumably because the NVIDIA container toolkit mounts it somehow, and toolbox is not triggering the toolkit).

I am not sure how to install the nvidia drivers in an ubuntu toolbox (though given the podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable localhost/toolbox/ubuntu:22.04 nvidia-smi -L works just fine, I am not sure that it is necessary at all)

I’m not able to test this for you; I’m literally in a coffee shop on a laptop, although I do have an AMD laptop with an NVIDIA GPU.
From memory, when I did this in the past I needed to add the drivers inside the container.
On Ubuntu, you need to run sudo apt update and then sudo apt install nvidia-driver-Xxx

nvidia-smi is a command that comes with the driver, I believe.

Sorry I can’t test this for you. Worst-case scenario, you rebuild your container, right?!

nvidia-driver-545 (which would match the one I have on the Fedora host) does not seem to be available.

Is having an Ubuntu container mission-critical here? I’m just asking for the sake of compatibility, although I think 545.29.06 should be available for Ubuntu?

Just to close this out -

After much tinkering I decided to give up on toolbox in favour of using NVIDIA container toolkit.

The biggest problem with this driver installation approach is that the driver version has to exactly match the host, and with enough containers it’s gonna be impossible to keep juggling these versions and keep them in sync.

NVIDIA container toolkit seems to allow me to bypass these shenanigans and only cares about the driver on the host.

I don’t think toolbox currently leverages the NVIDIA container toolkit (NVCT), but maybe I’ll reach out and see if they’d be interested in me adding it in.

Yep that’s exactly the doc I followed and everything is working. Sorry, should have linked it myself, thanks for dropping it in the thread.

Hope this discussion helps others looking into it!

Hey Roman P,

Thank you for sharing your attempt at setting up the NVIDIA Container Toolkit (NVCT) for toolbox. I followed your steps and found that one can use the NVCT via toolbox by commenting out the following line in your podman command:

# --volume /dev:/dev:rslave \

So far I haven’t found any problems using my toolbox without the option above, but I don’t know what risks removing this mount might carry.
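
A small sketch of that change, assuming the long podman command from earlier in the thread has been saved to a file (toolbox-cmd.sh is a made-up name, and strip_dev_mount is a hypothetical helper). It deletes the line instead of commenting it in place, because a # in the middle of a backslash-continued command ends the command at that line and the remaining options would be run as separate commands.

```shell
# Drop the /dev bind mount from a saved podman command. Reads the command on
# stdin and writes the edited command to stdout; other lines pass through.
strip_dev_mount() {
    grep -vF -e '--volume /dev:/dev:rslave'
}
```

Usage would be something like strip_dev_mount < toolbox-cmd.sh > toolbox-cmd-gpu.sh, after which the edited file still contains --device nvidia.com/gpu=all and every other option unchanged.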