Nvidia library mismatch with nvidia-ctk cdi generate

Hi,

I’m trying to run some machine learning pods using podman on F42. After a recent upgrade I can no longer generate the GPU helper config (the CDI spec):


╭─ikke at [gpubox.ikenet] in ~ 25-06-06 - 18:03:16
╰─$ dnf list '*nvidia*' --installed
Installed packages
akmod-nvidia.x86_64                       3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
kmod-nvidia-6.14.4-300.fc42.x86_64.x86_64 3:570.153.02-1.fc42              @commandline
kmod-nvidia-6.14.9-300.fc42.x86_64.x86_64 3:570.153.02-1.fc42              @commandline
libnvidia-container-tools.x86_64          1.17.8-1                         cuda-fedora41-x86_64
libnvidia-container1.x86_64               1.17.8-1                         cuda-fedora41-x86_64
libva-nvidia-driver.x86_64                0.0.13^20250419gitc2860cc-1.fc42 updates
nvidia-container-toolkit.x86_64           1.17.8-1                         cuda-fedora41-x86_64
nvidia-container-toolkit-base.x86_64      1.17.8-1                         cuda-fedora41-x86_64
nvidia-gpu-firmware.noarch                20250509-1.fc42                  updates
nvidia-modprobe.x86_64                    3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
nvidia-persistenced.x86_64                3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
nvidia-settings.x86_64                    3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia.x86_64                3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda.x86_64           3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.i686        3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.x86_64      3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-kmodsrc.x86_64        3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.i686             3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.x86_64           3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-power.x86_64          3:570.153.02-1.fc42              rpmfusion-nonfree-nvidia-driver
╭─ikke at [gpubox.ikenet] in ~ 25-06-06 - 18:03:17
╰─$ nvidia-ctk cdi list            
INFO[0000] Found 3 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=GPU-ddf9d362-ba80-d466-455b-b662f9ba5596
nvidia.com/gpu=all
╭─ikke at [gpubox.ikenet] in ~ 25-06-06 - 18:03:46
╰─$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
INFO[0000] Using /usr/lib64/libnvidia-ml.so.570.153.02  
INFO[0000] Auto-detected mode as 'nvml'                 
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: Driver/library version mismatch

And the kernel driver is the same version as most of the other NVIDIA software:

$ modinfo nvidia
filename:       /lib/modules/6.14.9-300.fc42.x86_64/extra/nvidia/nvidia.ko
alias:          char-major-195-*
version:        570.153.02
supported:      external
license:        NVIDIA
firmware:       nvidia/570.153.02/gsp_tu10x.bin
firmware:       nvidia/570.153.02/gsp_ga10x.bin
srcversion:     82F23DA6F1A39DF1BF2EC42

I have followed these guides:

So I have the 570 version of libs and kernel module. What gives?

BTW, everything was still working a while ago. Then I had some trouble updating, there were conflicts, and I had to remove some of the packages. But I’ve repeated the install steps just to make sure.
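One thing that may be worth checking here (an assumption, since nothing in this thread confirms it was the cause): `modinfo nvidia` reads the module file on disk, not the module loaded in the running kernel. If the loaded module comes from an older build, NVML can report "Driver/library version mismatch" even though the packages all agree. The sketch below compares the loaded module version from `/sys/module/nvidia/version` with the version suffix of the NVML library path from the log above. The function names and the `NVML_LIB` variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare the NVIDIA kernel module that is actually loaded with the
# user-space NVML library on disk. All function names are illustrative.

# Extract the version suffix from a path such as
# /usr/lib64/libnvidia-ml.so.570.153.02
nvml_version_from_path() {
    local name="${1##*/}"                    # strip the directory part
    printf '%s\n' "${name#libnvidia-ml.so.}" # strip the library name prefix
}

# Print "match" or "mismatch" for two version strings.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Version of the module currently loaded in the running kernel.
# Prints "not-loaded" if the sysfs entry is missing.
loaded_version() {
    if [ -r /sys/module/nvidia/version ]; then
        cat /sys/module/nvidia/version
    else
        echo "not-loaded"
    fi
}

# Default path is the one nvidia-ctk printed above; override with NVML_LIB.
lib_path="${NVML_LIB:-/usr/lib64/libnvidia-ml.so.570.153.02}"
lib="$(nvml_version_from_path "$lib_path")"
loaded="$(loaded_version)"

echo "loaded module: $loaded"
echo "NVML library:  $lib"
compare_versions "$loaded" "$lib"
```

If this prints "mismatch" while `modinfo` shows the expected version, the running kernel is probably still using an older module, and a reboot (or reloading the nvidia modules) would be the usual next step.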

nvidia-ctk cdi generate worked fine for me on 570.153.02, though I got nvidia-container-toolkit from the @ai-ml COPR - don’t know if that makes a difference.

Well, of course. An update fixed it. Whatever it was, it was broken for a week or two, and now it just started to work again. I hate this nvidia crap. Thanks for verifying.