GPU Support for Deep Learning Frameworks (PyTorch/TensorFlow)

Can Fedora support GPU acceleration for NVIDIA cards like Ubuntu does? I’m often involved in deep learning projects using CuDNN and CUDA, which install fine on Ubuntu, but I haven’t made a full migration over to Fedora, aside from my Mac M2, where I’m just getting started with Fedora.



I’ve switched to a container workflow using @ai-ml/nvidia-container-toolkit Copr and https://containers.dev/ and it’s working fairly nicely for me.
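
For reference, setting up that Copr looks roughly like this (a sketch; the package name is assumed to match the upstream toolkit’s):

# enable the @ai-ml group Copr and install the toolkit from it
sudo dnf copr enable @ai-ml/nvidia-container-toolkit
sudo dnf install -y nvidia-container-toolkit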

If you don’t have a suitable workflow for containers, you just need to set up the CUDA repository and decide if you want to use the akmods from RPM Fusion or switch to the dkms packages from NVIDIA. When I go this route, I still use the RPM Fusion akmods and just filter out the drivers from the CUDA repo:

[cuda]
name=cuda-fedora39-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64
enabled=0
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/D42D0685.pub
excludepkgs=nvidia-*,kmod-nvidia-*

There is a good guide over at RPM Fusion, but do be aware that mixing RPM Fusion and the CUDA repo can cause package conflicts when they are not in sync. I generally run with the CUDA repo disabled, drivers filtered as illustrated above, and only enable the repo when I know there is something I need from there.
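
For example, to pull something from the CUDA repo while leaving it disabled by default (cuda-toolkit here is just an assumed example package; substitute whatever you actually need):

sudo dnf --enablerepo=cuda install cuda-toolkit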


It would be nice if we could build up some documentation. For security reasons, and because of the variation in CUDA versions and hardware, it’s better to build from source. For example, I use CUDA 11.8 and TF 2.4.

<https://www.tensorflow.org/install/source#gpu>

This is a link to the TF GPU build-from-source table. There is some keyring stuff I had to do on Ubuntu to install CuDNN.

Build-from-source documentation, rather than Docker, would be preferred for the user base.

Or a blog covering both Docker and building from source.

Getting the older CUDA might be a little more painful, but running TF 2.4 should be quick. I did a naive test with:

podman run \
    --device nvidia.com/gpu=all \
    --rm \
    docker.io/tensorflow/tensorflow:2.4.3-gpu \
    bash -c 'python -c "import tensorflow as tf; print(\"Num GPUs Available: \", len(tf.config.list_physical_devices(\"GPU\")))"'
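
With the toolkit set up, that prints a line like Num GPUs Available: 1.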

But you will see that the CUDA version reported is whatever the host driver provides:

podman run \
    --device nvidia.com/gpu=all \
    --rm \
    docker.io/tensorflow/tensorflow:2.4.3-gpu \
    nvidia-smi

Quickly looking at the NVIDIA archive, the 11.8 support was built with an F35-comparable toolchain. I’m not sure it would be happy running on a later Fedora version. You will likely need to run with the dkms drivers available in that repo, and they might not work on later kernels.

I agree it would be good to get some documentation together on how to do this.

Can Fedora support GPU acceleration for NVIDIA cards like Ubuntu does?

Overly short answer

No, not right now, in the way I think you’re asking about.

However, if you’re just looking for a writeup about how to get things working without having everything in the Fedora repos, much of what I’ve written here won’t apply.

Medium answer

There are folks looking into options to make it easier to use NVidia accelerators on Fedora, but that will likely involve containers and/or wheels before any other solutions are available. @jsteffan has gone over some options and has more experience with getting the NVidia stuff working on Fedora than I do.

Longer, overly verbose answer

The issues with NVidia acceleration, in particular, pretty much come down to three things:

  1. Fedora has policies which make distribution of proprietary, binary-only software difficult at best
  2. NVidia doesn’t support CuDNN on Fedora
  3. NVidia has placed restrictions around distribution of CuDNN in particular (but it seems to be distributed elsewhere, so I could be wrong and/or misreading the CuDNN docs)

Of those things, #1 is the only one that’s even partly in our control, but it’s still a big deal because it’s an issue with a core philosophy, and it’s why many of the existing solutions involve RPM Fusion.

NVidia has started providing CUDA binaries for F39, but CuDNN still has more limited availability (RHEL, Ubuntu, SLES, IIRC). As far as I know, you can get CuDNN working on Fedora with some hackery involving installing gcc from F37 (or older), but YMMV.

For now, the ai-ml SIG is focusing on AMD’s ROCm for HW acceleration first because it doesn’t have the same policy and distribution issues.

TensorFlow

TensorFlow is a different beast. I’m not aware of any legal/policy issues with it, but someone would have to package Bazel (and maybe a few other deps) in order to build TensorFlow for the Fedora repos, and I’m told that is a non-trivial task.

If you’re not talking about packaging TF for Fedora and just building it locally, it would be a lot less work and doing it in a container would certainly be the easiest option.
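
For illustration, a rough sketch of what a local container build might look like, assuming the TensorFlow SIG Build images on Docker Hub (the image tag is an assumption, and the exact build steps vary by TF version):

# clone TF and drop into a SIG Build container to compile it
git clone https://github.com/tensorflow/tensorflow.git
podman run -it --rm -v "$PWD/tensorflow:/tf:Z" -w /tf \
    docker.io/tensorflow/build:latest-python3.11 bash
# inside the container, run ./configure and then the bazel build for the pip package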

Let me go look at the prices of ROCm-capable cards and see how they compare.

Thank you for your replies. I will just stay with Ubuntu 22.04 LTS on WSL2/W11 for now. Since the project focus is ROCm I will have to just build future machines with this in mind.

I got PyTorch working through Podman with CUDA 12.4 on my RTX 4080, and it works really well.

I need to compile documentation and instructions on how to do this.
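
In the meantime, a minimal smoke test along these lines is enough to confirm it (the image tag is just an example from Docker Hub, not necessarily the one I used; it assumes a CDI spec has been generated for the GPU):

podman run --rm --device nvidia.com/gpu=all \
    docker.io/pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime \
    python -c "import torch; print('CUDA available:', torch.cuda.is_available())"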


@tflink TF doesn’t need to be built with Bazel; it can be built with CMake. I am just trying to figure out why there is no preexisting wheel.

Opened up an issue on GitHub.

This worked!!

Step-by-Step Instructions

Ensure Podman is Installed and Configured:
If Podman isn't installed or configured yet, follow these steps:

sudo dnf install -y podman
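
You can sanity-check the install with:

podman --version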

Install NVIDIA Container Toolkit:
Install the NVIDIA Container Toolkit to enable GPU support in Podman.

# NVIDIA's current rpm repo for the toolkit (the old nvidia-docker repo is deprecated)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
    | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
# no service restart is needed; Podman is daemonless
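
With recent toolkit versions, also generate a CDI spec so Podman can address the GPU; this is what makes the --device nvidia.com/gpu=all syntax shown earlier work:

# generate the CDI spec for the installed driver
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# list the device names the spec exposes; expect nvidia.com/gpu=all among them
nvidia-ctk cdi list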

Configure Podman for NVIDIA Runtime:
Update Podman’s configuration to use the NVIDIA runtime.

mkdir -p ~/.config/containers
cat <<EOT > ~/.config/containers/containers.conf
[engine]
runtime = "nvidia"
EOT
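
Alternatively, if the CDI spec from the toolkit step is in place, you can skip this global config and pass the GPU per container instead, e.g.:

podman run --rm --device nvidia.com/gpu=all tensorflow/tensorflow:nightly-gpu nvidia-smi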

Run TensorFlow Nightly GPU Container with Podman:

Use Podman on your host system to run the TensorFlow nightly GPU container. The --runtime=nvidia flag ensures that GPU support is enabled.

podman run --runtime=nvidia -it tensorflow/tensorflow:nightly-gpu bash

If the above command fails or --runtime is not recognized, use the --gpus all flag:

podman run --gpus all -it tensorflow/tensorflow:nightly-gpu bash

Verify TensorFlow GPU Support Inside the Container:

Once inside the container, verify that TensorFlow can access the GPU:

python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__); print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"

[aaronl@DESKTOP-G3ABECU ~]$ podman run --gpus all -it tensorflow/tensorflow:nightly-gpu bash
WARN[0000] Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable PODMAN_IGNORE_CGROUPSV1_WARNING to hide this warning.


[TensorFlow ASCII-art banner]

WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user’s userid:

$ docker run -u $(id -u):$(id -g) args…

/sbin/ldconfig.real: /usr/lib/wsl/drivers/nvmdsi.inf_amd64_23a2cede5f1383ec/libnvidia-ml.so.1 is not a symbolic link

root@b24205da9419:/# python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__); print('Num GPUs Available:', len(tf.config.experimental.list_physical_devices('GPU')))"
2024-06-13 21:27:38.985917: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-06-13 21:27:39.009506: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
TensorFlow version: 2.18.0-dev20240612
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1718314059.869098 17 cuda_executor.cc:990] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1718314059.872207 17 cuda_executor.cc:990] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1718314059.872254 17 cuda_executor.cc:990] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Num GPUs Available: 1

Here is the documentation: https://swift-tree.dev/?p=197

I can’t network pods well enough to get the Jupyter IDE to detect the changes in pod drivers or the NVIDIA toolkit, so it’s CPU only. I give up; I am trying to swap cards now.