GPU Support for Deep Learning Frameworks (PyTorch/TensorFlow)

Can Fedora support GPU acceleration for NVIDIA cards like Ubuntu does? I’m often involved in deep learning projects using CuDNN and CUDA, which install fine on Ubuntu, but I haven’t made a full migration over to Fedora, aside from my Mac M2, where I’m just getting started with Fedora.



I’ve switched to a container workflow using @ai-ml/nvidia-container-toolkit Copr and https://containers.dev/ and it’s working fairly nicely for me.
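
For reference, setting up that Copr looks roughly like this (a sketch; the package name is assumed to match the upstream toolkit’s):

# enable the @ai-ml group Copr and install the toolkit from it
sudo dnf copr enable @ai-ml/nvidia-container-toolkit
sudo dnf install -y nvidia-container-toolkit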

If you don’t have a suitable workflow for containers, you just need to set up the CUDA repository and decide if you want to use the akmods from RPM Fusion or switch to the dkms packages from NVIDIA. When I go this route, I still use the RPM Fusion akmods and just filter out the drivers from the CUDA repo:

[cuda]
name=cuda-fedora39-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64
enabled=0
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/D42D0685.pub
excludepkgs=nvidia-*,kmod-nvidia-*

There is a good guide over at RPM Fusion, but do be aware that mixing RPM Fusion and the CUDA repo can cause package conflicts when they are not in sync. I generally run with the CUDA repo disabled, drivers filtered as illustrated above, and only enable the repo when I know there is something I need from there.
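
For example, to pull something from the CUDA repo while leaving it disabled by default (cuda-toolkit here is just an assumed example package; substitute whatever you actually need):

sudo dnf --enablerepo=cuda install cuda-toolkit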


It would be nice if we could build up some documentation. For security reasons, and because of the variation in CUDA versions and hardware, it’s better to build from source. For example, I use CUDA 11.8 and TF 2.4.

<https://www.tensorflow.org/install/source#gpu>

This is a link to the TF GPU build-from-source table. There is some keyring stuff I had to do on Ubuntu to install CuDNN.

Build-from-source documentation, rather than Docker, would be preferred for the user base.

Or a blog covering both Docker and building from source.

Getting the older CUDA might be a little more painful, but running TF 2.4 should be quick. I did a naive test with:

podman run \
    --device nvidia.com/gpu=all \
    --rm \
    docker.io/tensorflow/tensorflow:2.4.3-gpu \
    bash -c 'python -c "import tensorflow as tf; print(\"Num GPUs Available: \", len(tf.config.list_physical_devices(\"GPU\")))"'
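
With the toolkit set up, that prints a line like Num GPUs Available: 1.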

But you will see that the CUDA version reported is whatever the host driver provides:

podman run \
    --device nvidia.com/gpu=all \
    --rm \
    docker.io/tensorflow/tensorflow:2.4.3-gpu \
    nvidia-smi

Quickly looking at the NVIDIA archive, the 11.8 support was built with an F35-comparable toolchain. I’m not sure it would be happy running on a later Fedora version. You will likely need to run with the dkms drivers available in that repo, and they might not work on later kernels.

I agree it would be good to get some documentation together on how to do this.

Can Fedora support GPU acceleration for NVIDIA cards like Ubuntu does?

Overly short answer

No, not right now, in the way I think you’re asking about.

However, if you’re just looking for a writeup about how to get things working without having everything in the Fedora repos, much of what I’ve written here won’t apply.

Medium answer

There are folks looking into options to make it easier to use NVidia accelerators on Fedora, but that will likely involve containers and/or wheels before any other solutions are available. @jsteffan has gone over some options and has more experience with getting the NVidia stuff working on Fedora than I do.

Longer, overly verbose answer

The issues with NVidia acceleration, in particular, pretty much come down to three things:

  1. Fedora has policies which make distribution of proprietary, binary-only software difficult at best
  2. NVidia doesn’t support CuDNN on Fedora
  3. NVidia has placed restrictions around distribution of CuDNN in particular (but it seems to be distributed elsewhere, so I could be wrong and/or misreading the CuDNN docs)

Of those things, #1 is the only one that’s even partly in our control, but it’s still a big deal because it’s an issue with a core philosophy, and it’s why many of the existing solutions involve RPM Fusion.

NVidia has started providing CUDA binaries for F39, but CuDNN still has more limited availability (RHEL, Ubuntu, SLES, IIRC). As far as I know, you can get CuDNN working on Fedora with some hackery involving installing gcc from F37 (or older), but YMMV.

For now, the ai-ml SIG is focusing on AMD’s ROCm for HW acceleration first because it doesn’t have the same policy and distribution issues.

TensorFlow

TensorFlow is a different beast. I’m not aware of any legal/policy issues with it, but someone would have to package Bazel (and maybe a few other deps) in order to build TensorFlow for the Fedora repos, and I’m told that is a non-trivial task.

If you’re not talking about packaging TF for Fedora and just building it locally, it would be a lot less work and doing it in a container would certainly be the easiest option.
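
For illustration, a rough sketch of what a local container build might look like, assuming the TensorFlow SIG Build images on Docker Hub (the image tag is an assumption, and the exact build steps vary by TF version):

# clone TF and drop into a SIG Build container to compile it
git clone https://github.com/tensorflow/tensorflow.git
podman run -it --rm -v "$PWD/tensorflow:/tf:Z" -w /tf \
    docker.io/tensorflow/build:latest-python3.11 bash
# inside the container, run ./configure and then the bazel build for the pip package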

Let me go look at the prices of ROCm-capable cards and see how they compare.

Thank you for your replies. I will just stay with Ubuntu 22.04 LTS on WSL2/W11 for now. Since the project focus is ROCm I will have to just build future machines with this in mind.

I got PyTorch working through Podman with CUDA 12.4 on my RTX 4080, and it works really well.

I need to compile documentation and instructions on how to do this.
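
In the meantime, a minimal smoke test along these lines is enough to confirm it (the image tag is just an example from Docker Hub, not necessarily the one I used; it assumes a CDI spec has been generated for the GPU):

podman run --rm --device nvidia.com/gpu=all \
    docker.io/pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime \
    python -c "import torch; print('CUDA available:', torch.cuda.is_available())"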


@tflink TF doesn’t need to be built with Bazel; it can be built with CMake. I am just trying to figure out why there is no preexisting wheel.

Opened up an issue on GitHub.

This worked!!

Step-by-Step Instructions

Ensure Podman is Installed and Configured:
If Podman isn't installed or configured yet, follow these steps:

sudo dnf install -y podman
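
You can sanity-check the install with:

podman --version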

Install NVIDIA Container Toolkit:
Install the NVIDIA Container Toolkit to enable GPU support in Podman.

# NVIDIA's current rpm repo for the toolkit (the old nvidia-docker repo is deprecated)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
    | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
# no service restart is needed; Podman is daemonless
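
With recent toolkit versions, also generate a CDI spec so Podman can address the GPU; this is what makes the --device nvidia.com/gpu=all syntax shown earlier work:

# generate the CDI spec for the installed driver
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# list the device names the spec exposes; expect nvidia.com/gpu=all among them
nvidia-ctk cdi list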

Configure Podman for NVIDIA Runtime:
Update Podman’s configuration to use the NVIDIA runtime.

mkdir -p ~/.config/containers
cat <<EOT > ~/.config/containers/containers.conf
[engine]
runtime = "nvidia"
EOT
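
Alternatively, if the CDI spec from the toolkit step is in place, you can skip this global config and pass the GPU per container instead, e.g.:

podman run --rm --device nvidia.com/gpu=all tensorflow/tensorflow:nightly-gpu nvidia-smi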

Run TensorFlow Nightly GPU Container with Podman:

Use Podman on your host system to run the TensorFlow nightly GPU container. The --runtime=nvidia flag ensures that GPU support is enabled.

podman run --runtime=nvidia -it tensorflow/tensorflow:nightly-gpu bash

If the above command fails or --runtime is not recognized, use the --gpus all flag:

podman run --gpus all -it tensorflow/tensorflow:nightly-gpu bash

Verify TensorFlow GPU Support Inside the Container:

Once inside the container, verify that TensorFlow can access the GPU:

python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__); print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"

[aaronl@DESKTOP-G3ABECU ~]$ podman run --gpus all -it tensorflow/tensorflow:nightly-gpu bash
WARN[0000] Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable PODMAN_IGNORE_CGROUPSV1_WARNING to hide this warning.


[TensorFlow ASCII-art banner]

WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user’s userid:

$ docker run -u $(id -u):$(id -g) args…

/sbin/ldconfig.real: /usr/lib/wsl/drivers/nvmdsi.inf_amd64_23a2cede5f1383ec/libnvidia-ml.so.1 is not a symbolic link

root@b24205da9419:/# python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__); print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
2024-06-13 21:27:38.985917: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-06-13 21:27:39.009506: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
TensorFlow version: 2.18.0-dev20240612
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1718314059.869098 17 cuda_executor.cc:990] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1718314059.872207 17 cuda_executor.cc:990] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1718314059.872254 17 cuda_executor.cc:990] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Num GPUs Available: 1

Here is the documentation: https://swift-tree.dev/?p=197

I can’t network pods well enough to get the Jupyter IDE to detect the changes in pod drivers or the NVIDIA toolkit, so it’s CPU only. I give up; I am trying to swap cards now.