I made some progress. I successfully installed CUDA 10 from a local RPM. I'm still trying to figure out how to install the older driver .run file. When executing the .run file in a toolbox container I get:
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel. This may be because it
is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may
also happen if your kernel was configured without support for module unloading. Please be sure to exit any
programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs
are running, you know that your kernel supports module unloading, and you still receive this message, then
an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest
remedy is to reboot your computer.
I searched for it and found a similar issue.
In my case it's nvidia-uvm, not nvidia-drm.
I tried
sudo modprobe -r nvidia-uvm
but I still get the same error after running the .run file.
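For reference, I assume the module's use count could be checked with lsmod to see what is still holding it:
lsmod | grep nvidia
(the "Used by" column should show whether nvidia_uvm is still referenced)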
But when I try stopping systemd-logind, I get:
sudo systemctl stop systemd-logind
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
I assume that toolbox containers aren't booted with systemd as the init system, so systemd-logind can't be stopped this way.
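A quick way to confirm that assumption, I think, is to check what PID 1 actually is inside the toolbox:
ps -p 1 -o pid,comm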
Any ideas how I can manage to install driver 410 via the .run file, yum, or dnf?
Edit2:
OK, I have a suspicion. In the container, even without installing CUDA first, I get the same error. Silverblue has the newest NVIDIA drivers installed. Could that produce the error?
How is it possible that nvidia-uvm is already loaded in the container's kernel? I thought toolbox containers have their own kernel that starts from scratch when you fire them up. Right?
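If the container really shared the host kernel, I assume comparing the kernel version and loaded modules from inside and outside would show it:
uname -r                # on the host
toolbox run uname -r    # from the host, what the container reports
lsmod | grep nvidia     # modules loaded in that kernel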
Edit3:
After adding a
blacklist nvidia
line to /etc/modprobe.d/blacklist.conf, I'm getting the nvidia-drm error instead of the nvidia-uvm one.
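For reference, this is the shape of the blacklist file; only the first line is what I actually added, the others are the related modules I assume would need the same treatment:
# /etc/modprobe.d/blacklist.conf
blacklist nvidia
blacklist nvidia_uvm
blacklist nvidia_drm
blacklist nvidia_modeset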
⬢[node@toolbox Downloads]$ lsof /dev/nvidia*
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/host/run/user/42/gvfs
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /var/home/node/.local/share/containers/storage/overlay/076dec9164d5b18ee75a2a5a61c773cc027d0c45d5a1ac18ada7993d90956192/merged/run/host/run/user/42/gvfs
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/host/var/home/node/.local/share/containers/storage/overlay/076dec9164d5b18ee75a2a5a61c773cc027d0c45d5a1ac18ada7993d90956192/merged/run/host/run/user/42/gvfs
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/host/var/home/node/.local/share/containers/storage/overlay/076dec9164d5b18ee75a2a5a61c773cc027d0c45d5a1ac18ada7993d90956192/merged/var/home/node/.local/share/containers/storage/overlay/076dec9164d5b18ee75a2a5a61c773cc027d0c45d5a1ac18ada7993d90956192/merged/run/host/run/user/42/gvfs
Output information may be incomplete.
Here I tried to kill everything NVIDIA-related inside the container before running the .run file to install the driver.
⬢[node@toolbox Downloads]$ ps -ef | grep nvidia
nobody 913 2 0 03:13 ? 00:00:00 [nvidia-modeset/]
nobody 914 2 0 03:13 ? 00:00:00 [nvidia-modeset/]
nobody 1214 2 0 03:13 ? 00:00:08 [irq/106-nvidia]
nobody 1215 2 0 03:13 ? 00:00:00 [nvidia]
node 9454 8341 0 03:28 pts/2 00:00:00 grep --color=auto nvidia
⬢[node@toolbox Downloads]$ sudo pkill -f nvidia
pkill: killing pid 913 failed: Operation not permitted
pkill: killing pid 914 failed: Operation not permitted
pkill: killing pid 1214 failed: Operation not permitted
pkill: killing pid 1215 failed: Operation not permitted
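If I read the ps output right, those PIDs have a parent PID of 2 (kthreadd), i.e. they are kernel threads rather than normal processes, which would explain why pkill can't touch them. Something like this should confirm it:
ps -o pid,ppid,comm -p 913,914,1214,1215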
Edit4:
I suspected Silverblue's NVIDIA driver was affecting the NVIDIA driver installation inside the container. Inside
rpm-ostree kargs --editor
I commented out
rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1
I had added these kernel arguments on Silverblue after installing the newest NVIDIA drivers there. After I commented them out, the change got committed and I had to reboot.
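For reference, I assume the non-interactive equivalent (removing the arguments outright instead of commenting them out) would be something like:
rpm-ostree kargs --delete=rd.driver.blacklist=nouveau --delete=modprobe.blacklist=nouveau --delete=nvidia-drm.modeset=1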
What's curious is: I entered my container and ran the NVIDIA driver .run file again, and I'm getting a new kind of error now:
ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA
driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux
distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
Then further down:
WARNING: One or more modprobe configuration files to disable Nouveau are already present at:
/usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf,
/etc/modprobe.d/nvidia-installer-disable-nouveau.conf. Please be sure you have rebooted your system
since these files were written. If you have rebooted, then Nouveau may be enabled for other reasons,
such as being included in the system initial ramdisk or in your X configuration file. Please consult the
NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable
the Nouveau kernel driver.
I actually restarted the container with
podman restart TF1
4085c10869665ba9a9276bd6e8dc91fb61f147b0ae75385f9bb65fc2bfc8e429
What's bizarre is that I rechecked my kargs on Silverblue, and the entire line that I commented out with the pound sign is gone.
I'm tempted to disable both the nvidia and nouveau drivers on Silverblue and see if I can install the NVIDIA 410 driver inside the container.
The goal is to have the following structure:
Host (Silverblue): NVIDIA 440 drivers and CUDA 10.2
Container 1: NVIDIA 410, CUDA 10.0, TensorFlow 1.4
Container 2: NVIDIA 440, CUDA 10.2, TensorFlow 2.1
The existing code I'm working with demands specific versions, and I don't want to dual-boot two different systems.
Is what I'm trying to do even possible, or am I chasing something impossible?
I think I have successfully installed CUDA 10.0 in my container, and I know for a fact that I have CUDA 10.2 installed on my Silverblue host. What about the drivers?
Is it possible to have two containers, each with a different version of the NVIDIA drivers and CUDA? If yes, is it possible to run both containers in parallel?
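For what it's worth, I assume the way to confirm what each environment actually ends up with is something like this (the version.txt path assumes the default CUDA install prefix):
nvcc --version                    # CUDA toolkit version, if nvcc is on the PATH
cat /usr/local/cuda/version.txt   # toolkit version at the default install prefix
nvidia-smi                        # driver version the kernel actually has loaded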
Edit5:
I found the NVIDIA 410 driver as an RPM:
https://rpmfind.net/linux/RPM/mageia/6/x86_64/media/nonfree/backports/x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64.html
I tried installing it with yum localinstall:
⬢[node@toolbox Downloads]$ sudo yum localinstall x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64.rpm
Last metadata expiration check: 0:17:47 ago on Sun Feb 16 16:14:58 2020.
Error:
Problem: conflicting requests
- nothing provides kmod(nvidia-current.ko) = 410.57 needed by x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64
- nothing provides ldetect-lst >= 0.3.7.9-2 needed by x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64
- nothing provides xserver-abi(videodrv) < 25 needed by x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64
- nothing provides lib64vdpau1 needed by x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64
- nothing provides x11-server-common needed by x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64
- nothing provides update-alternatives needed by x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64
(try to add '--skip-broken' to skip uninstallable packages)
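Presumably I can at least list the package's requirements straight from the RPM to see how deep the Mageia-specific dependencies go:
rpm -qpR x11-driver-video-nvidia-current-410.57-1.mga6.nonfree.x86_64.rpm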