Thank you @brogos and @FranciscoD for your inputs. I tried to switch from podman to docker, but docker broke the networking of my QEMU/KVM setup. This is a known issue with a workaround. It also requires switching back to cgroups v1 as part of the docker installation. So I was not very satisfied with it.
Solution:
I was able to get podman to work using the links proposed by @brogos. The important steps are:
- Install the NVIDIA driver on the host. Currently it is at version 440.100 for Fedora 32 (a minimal install sketch follows this list).
- Install `nvidia-container-toolkit` following the instructions here. Note: you will get an error that Fedora 32 is an unsupported distribution, so just set `distribution=rhel8.2` (see the repo-setup sketch below).
- Edit `/etc/nvidia-container-runtime/config.toml` to set `no-cgroups = true` (the relevant excerpt is shown below). This stops the runtime from trying to set up device cgroups, which does not work with rootless podman.
- Whatever container image you want to run should match the CUDA version supported by the NVIDIA driver installed on the host. For driver 440.100 that is CUDA 10.2. The `nvidia/cuda:latest` docker image is already at CUDA 11, so it will not work; I was making this mistake at first.
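For reference, a minimal sketch of the driver install from RPMFusion (assuming the RPMFusion nonfree repo is already enabled; `xorg-x11-drv-nvidia-cuda` is what provides `nvidia-smi`):

```
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda
```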
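The toolkit setup then looks roughly like this. I am paraphrasing the upstream instructions from memory, so treat the repo URL as an assumption and verify it against the linked guide:

```
# Pretend to be RHEL 8.2, since Fedora 32 is not in the supported list
distribution=rhel8.2
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo \
  | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo dnf install nvidia-container-toolkit
```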
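After the edit, the relevant part of `/etc/nvidia-container-runtime/config.toml` should look like this (all other keys left at their defaults):

```
[nvidia-container-cli]
# ... other defaults unchanged ...
no-cgroups = true
```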
To test the installation you can run the following command, which uses `nvidia-smi`, provided on the host machine by the `xorg-x11-drv-nvidia-cuda` package from RPMFusion:
```
$ podman run -it --rm --security-opt=label=disable nvidia/cuda:10.2-base nvidia-smi
Sat Aug  1 15:43:00 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P8    N/A /  N/A |     36MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
If you don’t get this output, you will most likely see some error message; check that you have followed all the steps correctly. All of this is based on the discussion in this GitHub issue.
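If you need to debug, one thing you can try (my addition, not from the linked discussion) is asking the container CLI that ships with the toolkit what it sees on the host; if this fails, the problem is on the host side rather than in podman:

```
nvidia-container-cli info
```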
Alternate solution:
I came across another solution, proposed by u/Abraxis_Dragon in this comment on r/Fedora. I find it the most hassle-free, since it uses Singularity containers. I followed these steps:
- Install NVIDIA drivers from RPMFusion as explained here.
- Install Singularity, which is available in the official repos: `sudo dnf install singularity`.
- Build the TensorFlow GPU docker image into a Singularity container: `singularity build mytensorflow.sif docker://tensorflow/tensorflow:latest-gpu`.
- Run the container using the `--nv` flag to allow direct access to the NVIDIA GPU. We can then check that the GPU is actually available inside the container by running the `python3` one-liner shown in the session below:

```
$ singularity run --nv mytensorflow.sif
INFO:    Could not find any nv files on this host!

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!

Singularity> python3 -c "import tensorflow as tf; tf.config.list_physical_devices('GPU')"
2020-08-01 14:42:30.562914: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-01 14:42:32.529449: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-01 14:42:32.537876: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-01 14:42:32.538595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.0975GHz coreCount: 5 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 74.65GiB/s
2020-08-01 14:42:32.538636: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-01 14:42:32.584239: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-01 14:42:32.609743: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-08-01 14:42:32.618569: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-08-01 14:42:32.666673: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-08-01 14:42:32.676543: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-08-01 14:42:32.765228: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-08-01 14:42:32.765425: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-01 14:42:32.766202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-01 14:42:32.766547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
```
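As an extra sanity check (my own suggestion, not part of the original steps), you can run a small computation from the same prompt; with a working GPU you get the usual device-placement logs followed by a scalar result:

```
Singularity> python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
```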
So TensorFlow is actually able to use the GPU without any configuration tweaks or workarounds! I hope this helps somebody!