I have one container, nominally derived from registry.fedoraproject.org/fedora-toolbox:41, that works fine and one that doesn’t. Even more annoyingly, if I use the working one as a base for new containers, those also work. If, however, I base a container on fedora-toolbox:41 directly and then install the container toolkit, vulkaninfo --summary fails.
Irritatingly, CUDA works fine in both containers, as does “glxinfo -B”. I’m really mystified.
I have literally trimmed the installed packages until there is no difference at all between the two containers when I compare them with “dnf list --installed”.
I’m really at a loss. There has to be a difference somewhere. Any suggestions for how to run it down?
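Since the package lists already match, one thing worth comparing is what the two containers actually see at the library level. A minimal sketch, assuming the working and failing toolboxes are named works and broken (substitute your own container names):

for c in works broken; do
    toolbox run -c "$c" sh -c 'ls /usr/lib64/ | grep -i nvidia' | sort > "$c-libs.txt"
done
diff works-libs.txt broken-libs.txt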
Can you run “ls -l /usr/lib64/ | grep nvidia” on the host to see what the drivers point to? I’m no expert on Toolbx, but it could be using the CDI spec. You can generate it with “sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml” (a short sketch follows the listing below).
Your container really doesn’t need anything installed in it. You can also refer to my previous post on how to verify the driver installed in the container.
For reference:
lrwxrwxrwx. 1 root root 26 Nov 25 16:56 libcudadebugger.so.1 -> libcudadebugger.so.550.135
-rwxr-xr-x. 3 nobody nobody 10524136 Dec 31 1969 libcudadebugger.so.550.135
lrwxrwxrwx. 1 root root 18 Nov 25 16:56 libcuda.so.1 -> libcuda.so.550.135
-rwxr-xr-x. 3 nobody nobody 28712096 Dec 31 1969 libcuda.so.550.135
lrwxrwxrwx. 1 root root 24 Nov 25 16:56 libEGL_nvidia.so.0 -> libEGL_nvidia.so.550.135
-rwxr-xr-x. 3 nobody nobody 1345696 Dec 31 1969 libEGL_nvidia.so.550.135
lrwxrwxrwx. 1 root root 30 Nov 25 16:56 libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.550.135
-rwxr-xr-x. 3 nobody nobody 68000 Dec 31 1969 libGLESv1_CM_nvidia.so.550.135
lrwxrwxrwx. 1 root root 27 Nov 25 16:56 libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.550.135
-rwxr-xr-x. 3 nobody nobody 117144 Dec 31 1969 libGLESv2_nvidia.so.550.135
lrwxrwxrwx. 1 root root 24 Nov 25 16:56 libGLX_nvidia.so.0 -> libGLX_nvidia.so.550.135
-rwxr-xr-x. 3 nobody nobody 1203776 Dec 31 1969 libGLX_nvidia.so.550.135
lrwxrwxrwx. 1 root root 21 Nov 25 16:56 libnvcuvid.so.1 -> libnvcuvid.so.550.135
-rwxr-xr-x. 3 nobody nobody 10566992 Dec 31 1969 libnvcuvid.so.550.135
lrwxrwxrwx. 1 root root 30 Nov 25 16:56 libnvidia-allocator.so.1 -> libnvidia-allocator.so.550.135
-rwxr-xr-x. 3 nobody nobody 168808 Dec 31 1969 libnvidia-allocator.so.550.135
lrwxrwxrwx. 1 root root 24 Nov 25 16:56 libnvidia-cfg.so.1 -> libnvidia-cfg.so.550.135
-rwxr-xr-x. 3 nobody nobody 398968 Dec 31 1969 libnvidia-cfg.so.550.135
-rwxr-xr-x. 3 nobody nobody 30352200 Dec 31 1969 libnvidia-eglcore.so.550.135
lrwxrwxrwx. 1 root root 27 Nov 25 16:56 libnvidia-encode.so.1 -> libnvidia-encode.so.550.135
-rwxr-xr-x. 3 nobody nobody 277152 Dec 31 1969 libnvidia-encode.so.550.135
lrwxrwxrwx. 1 root root 24 Nov 25 16:56 libnvidia-fbc.so.1 -> libnvidia-fbc.so.550.135
-rwxr-xr-x. 3 nobody nobody 137824 Dec 31 1969 libnvidia-fbc.so.550.135
-rwxr-xr-x. 3 nobody nobody 32464992 Dec 31 1969 libnvidia-glcore.so.550.135
-rwxr-xr-x. 3 nobody nobody 582808 Dec 31 1969 libnvidia-glsi.so.550.135
-rwxr-xr-x. 3 nobody nobody 9062480 Dec 31 1969 libnvidia-glvkspirv.so.550.135
-rwxr-xr-x. 3 nobody nobody 43659040 Dec 31 1969 libnvidia-gpucomp.so.550.135
-rwxr-xr-x. 3 nobody nobody 1379720 Dec 31 1969 libnvidia-gtk2.so.550.135
-rwxr-xr-x. 3 nobody nobody 1388424 Dec 31 1969 libnvidia-gtk3.so.550.135
lrwxrwxrwx. 1 root root 23 Nov 25 16:56 libnvidia-ml.so.1 -> libnvidia-ml.so.550.135
-rwxr-xr-x. 3 nobody nobody 2082456 Dec 31 1969 libnvidia-ml.so.550.135
lrwxrwxrwx. 1 root root 24 Nov 25 16:56 libnvidia-ngx.so.1 -> libnvidia-ngx.so.550.135
-rwxr-xr-x. 3 nobody nobody 4562136 Dec 31 1969 libnvidia-ngx.so.550.135
lrwxrwxrwx. 1 root root 25 Nov 25 16:56 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.550.135
-rwxr-xr-x. 3 nobody nobody 86842616 Dec 31 1969 libnvidia-nvvm.so.550.135
lrwxrwxrwx. 1 root root 27 Nov 25 16:56 libnvidia-opencl.so.1 -> libnvidia-opencl.so.550.135
-rwxr-xr-x. 3 nobody nobody 23613128 Dec 31 1969 libnvidia-opencl.so.550.135
lrwxrwxrwx. 1 root root 32 Nov 25 16:56 libnvidia-opticalflow.so.1 -> libnvidia-opticalflow.so.550.135
-rwxr-xr-x. 3 nobody nobody 67704 Dec 31 1969 libnvidia-opticalflow.so.550.135
-rwxr-xr-x. 3 nobody nobody 10176 Dec 31 1969 libnvidia-pkcs11-openssl3.so.550.135
-rwxr-xr-x. 3 nobody nobody 10168 Dec 31 1969 libnvidia-pkcs11.so.550.135
lrwxrwxrwx. 1 root root 35 Nov 25 16:56 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.550.135
-rwxr-xr-x. 3 nobody nobody 28674464 Dec 31 1969 libnvidia-ptxjitcompiler.so.550.135
-rwxr-xr-x. 3 nobody nobody 76336528 Dec 31 1969 libnvidia-rtcore.so.550.135
-rwxr-xr-x. 3 nobody nobody 18632 Dec 31 1969 libnvidia-tls.so.550.135
-rwxr-xr-x. 3 nobody nobody 10088 Dec 31 1969 libnvidia-wayland-client.so.550.135
lrwxrwxrwx. 1 root root 21 Nov 25 16:56 libnvoptix.so.1 -> libnvoptix.so.550.135
-rwxr-xr-x. 3 nobody nobody 59927784 Dec 31 1969 libnvoptix.so.550.135
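To spell that out, a minimal sketch of generating and sanity-checking the CDI spec on the host (the output path is the default one; re-run this after every driver update):

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list    # should list device names such as nvidia.com/gpu=0 and nvidia.com/gpu=all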
The toolkit is to be installed on the host, following the instructions for Podman; the driver files then get injected into the container. The setup needs to be re-run each time you install a new driver version.
For Podman, NVIDIA recommends using CDI for accessing NVIDIA devices in containers.
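For completeness, the host-side install on Fedora looks roughly like this (taken from NVIDIA’s Container Toolkit install guide for dnf-based distributions; double-check the repo URL and package name against the current docs):

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit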
If you could examine my installation transcript from the bug report and point out the step that I missed, I would be grateful.
The CDI specification was correct, and the docs also note:
Podman configuration
Podman does not require any specific configuration to enable CDI support and processes specified --device flags directly.
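As a quick end-to-end check that CDI injection works, something along the lines of the example in the toolkit docs (nvidia-smi gets mounted into the container by the generated spec; --security-opt=label=disable may be needed on SELinux hosts, and the image choice is arbitrary):

podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable \
  registry.fedoraproject.org/fedora-toolbox:41 nvidia-smi -L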
Do any of these procedures generate an nvidia_icd.json? It’s certainly possible I skipped an essential step somewhere.
nvidia-smi found the card, as did glxinfo -B. The container and drivers were fine; CUDA was even fine. It was only vulkaninfo that failed, because of the missing nvidia_icd.json.
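For anyone else chasing this: a quick way to confirm inside the failing container that only the Vulkan manifest is missing (VK_DRIVER_FILES is the loader’s standard override variable; the path in the second command is just an example of pointing it at a manifest under a non-standard name):

ls -l /usr/share/vulkan/icd.d/ /etc/vulkan/icd.d/ 2>/dev/null
VK_DRIVER_FILES=/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json vulkaninfo --summary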
The json file is provided by the NVIDIA drivers you installed on the host. The container toolkit only layers it into containers created afterwards, to make things easier.
This needs to be re-run each time you change the NVIDIA driver.
This is just one of the ways you can run the NVIDIA drivers in a container. This option is great because you don’t end up duplicating files in each container. You can also download and install the drivers manually inside your container, or mount the files yourself with an elaborate command line. Distrobox finds and mounts the files itself when you use the --nvidia argument, achieving the same result without the NVIDIA Container Toolkit.
Again, you really don’t need to install anything in your container to make this work.
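For comparison, the Distrobox route is roughly this (the container name here is made up; --nvidia is Distrobox’s flag for mounting the host driver files into the container):

distrobox create --name fedora41-nv --image registry.fedoraproject.org/fedora-toolbox:41 --nvidia
distrobox enter fedora41-nv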
Edit:
That bug you linked in your own bug report actually makes sense. I don’t use RPM Fusion, but I do know that they rename the json files to end in .x86_64.json. The correct names are:
nvidia_icd.json
nvidia_layers.json
Not sure why they do that. But good catch. @leigh123linux
WARN[0000] Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found
WARN[0000] Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found
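If the RPM Fusion naming is indeed the culprit, one possible workaround (my assumption, not an official fix; the *.x86_64.json paths are guesses based on the rename described above, so check where your packages actually put them) is to add symlinks under the expected names and regenerate the spec:

sudo ln -s /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json /usr/share/vulkan/icd.d/nvidia_icd.json
sudo ln -s /usr/share/vulkan/implicit_layer.d/nvidia_layers.x86_64.json /usr/share/vulkan/implicit_layer.d/nvidia_layers.json
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml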