PyTorch to Fedora Introduction

I sent the following message out to various email lists. I’ve included it here so that anyone who wants to add to the discussion can do so here.

Hello everyone,

I hope everyone had a great weekend. I’m here to introduce myself, but also to ask whether anyone would be interested in helping out on a fun side project that my colleague Teng and I have planned. I’m Kaitlyn Abdo and I’m an Associate Technical Marketing Manager at Red Hat, specifically for the AI/ML products and offerings. Teng is a Principal Technical Product Manager at Red Hat as well, but I’ll let Teng formally introduce himself :).

To give the TLDR, we want to package PyTorch for Fedora. We are looking for individuals in the community who are interested in making this happen with us. We will be hosting most of the discussion moving forward on the AI/ML discussion page on Fedora Discussion, but we will still send out major communication through the mailing lists.

We are aware of several existing SIGs (ML SIG, Heterogeneous Computing SIG, Big Data SIG) and would like to help reignite activity in a coordinated way. I talked to the leadership of the ML SIG about revamping it and possibly merging other, less active SIGs to create a broader AI/ML SIG. The ML SIG leadership agreed that this could be beneficial for communication and project purposes, so I’m open to input from both leadership and individuals in other SIGs.

Please reach out to either me at kabdo@redhat.com or Teng at tema@redhat.com if you are interested. We are in the very early stages of this and want to build a team of anyone who would be interested. Thank you for your time and we look forward to hearing from you!

Best,

Kaitlyn Abdo


Thanks for doing this! I’m really looking forward to where this can go!

I really like that. It would make development environments so much easier for research institutions to mass-deploy to students and the like.

A bit interested in how this will happen:

Packaging PyTorch is awesome! Practically, though, for the majority of PyTorch users it only makes sense if you also package CUDA – and that’s something Red Hat, CentOS and Fedora have so far left to RPM Fusion, since CUDA is non-free. And packaging CUDA only makes sense if you also package the nvidia kernel module (or rather the kernel module builder, since distributing the older binary nvidia kernel modules without nvidia’s secret source violates the kernel’s GPLv2; nvidia’s open kernel modules, for Turing, Ampere and later, are supposedly GPL-compatible. Could we get an official statement from Red Hat that they actually are? That would be very helpful for a lot of maintainers.)
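(As an aside, it’s easy to check which module flavor a given machine is actually running; a small sketch, assuming a driver is already installed – the license strings shown are what the two flavors typically declare:)

# The classic proprietary module declares license "NVIDIA", while the
# open kernel modules (Turing/Ampere and later) declare "Dual MIT/GPL":
modinfo -F license nvidia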

Leave any of these ingredients out, and people will still end up having to go to rpmfusion to get the nvidia kernel driver and CUDA. I’ll argue that

python3 -m venv pytorch-environment
. pytorch-environment/bin/activate
pip3 install torch torchvision torchaudio

is not the part where users fail; it’s the nvidia underpinnings, thanks to nvidia’s underinvestment (reminder: this is one of the largest semiconductor companies in the world by revenue) in making its software installable by standard means.
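(For what it’s worth, checking whether a pip-installed build can actually reach a GPU is itself a one-liner, and the nvidia underpinnings are exactly where this tends to come back False:)

python3 -c "import torch; print(torch.cuda.is_available())"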

In practice, for a Fedora user, that process is a bit longer than three lines (which packaging PyTorch would reduce to a much more elegant one-liner, sudo dnf install pytorch):

# install the RPM Fusion repositories
sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
## Check which of the card drivers is right for you
# check your GPU model:
/sbin/lspci | grep -e VGA
# and verify which driver generation is right for you on
# https://download.nvidia.com/XFree86/Linux-x86_64/495.44/README/supportedchips.html
# If your card is supported by the current driver generation:
sudo dnf install --refresh akmod-nvidia xorg-x11-drv-nvidia-cuda
# If it needs the previous (470xx) generation:
sudo dnf install --refresh xorg-x11-drv-nvidia-470xx akmod-nvidia-470xx xorg-x11-drv-nvidia-470xx-cuda
# Note down what's in:
/sbin/lspci | grep -e 3D 
# If there was something by nvidia, you want to read https://rpmfusion.org/Howto/Optimus

# Wait an unspecified number of minutes before rebooting, so that akmods can build the drivers,
# which it inexplicably doesn't do as part of the dnf transaction, so that failures don't
# appear as installation failures

## Read up on KMS and decide on whether you want it.
# if it causes problems, disable it 
sudo grubby --update-kernel=ALL --remove-args='nvidia-drm.modeset=1'

## Generally hope that you don't have to learn how to blacklist nouveau, which is still far too often the case
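(A sanity check after all of the above, to force the module build rather than wait for it, might look like this:)

# Build any pending kernel modules now instead of waiting:
sudo akmods --force
# After a reboot, verify the driver loaded and can see the GPU:
nvidia-smi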

I hope you understand why I’m raising the CUDA / nvidia issue this early on – unless you want to restrict yourself to AMD cards (far less popular among ML researchers) and CPU computation, the hard part has not been the actual installation of PyTorch.


Yeah, I had questions about CUDA as well, but as I understand it, you don’t have to use CUDA to use PyTorch. CPU-only and ROCm are supported hardware platforms, but that gets to your point about less popular platforms. cuDNN is another can of worms that I’ve had way more compatibility problems with, and I’m pretty sure that PyTorch requires it for full HW acceleration.
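(As an aside, a given torch build will tell you whether it was compiled against cuDNN, so this at least is easy to check:)

python3 -c "import torch; print(torch.backends.cudnn.is_available(), torch.backends.cudnn.version())"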

All that being said, I’m reminded of “make it work and once that happens, worry about making it work fast, making it pretty or anything else” akin to “crawl before you walk before you run”.

Are there existing plans around how to handle hardware acceleration? I imagine CPU-only would make sense as a starting point. After that’s done, work with the Heterogeneous Computing folks on enabling ROCm (and maybe Intel’s framework, if that makes it upstream to PyTorch). Once we have all that working, look at the can of worms that is the nvidia ecosystem to see if we have any reasonable options.
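One thing in favor of that ordering: user code is largely backend-agnostic, and as I understand it, ROCm builds of PyTorch expose the accelerator through the same torch.cuda API, so the usual device-selection idiom covers CPU-only, ROCm and CUDA builds alike:

python3 - <<'EOF'
import torch
# ROCm builds report HIP devices through torch.cuda, so this same
# idiom works unchanged on CPU-only, ROCm and CUDA builds:
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 8, device=device)
print(device, x.sum().item())
EOF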

But that’s just my thought before digging into the details of actually packaging PyTorch.


Shipping a CPU-only PyTorch with lots of fanfare will put this effort pretty solidly in the “yeah, don’t do that, use pip” region; I mean, we’re not talking about things being 2 to 5 times slower, we’re talking about orders of magnitude, often several, making a CPU-only setup mostly useless for all but the “does the tutorial and never touches PyTorch again” kind of user.

Now, I do feel sympathetic to rolling out ROCm support but no nvidia support – if nvidia wanted their stuff to work flawlessly in deployments, they could have taken steps ages ago.

Apologies for the second post here. I wanted to do a better job of gathering my thoughts but managed to write a short novel in the process.

I’m interested in seeing better AI/ML tooling support in Fedora. To be honest, I’ve pretty much given up on using Fedora with the nvidia toolchains and I moved all my AI/ML work over to Ubuntu years ago for the sake of my sanity.

I’m sorry for being a party pooper but unless I’m missing something, I don’t think that packaging PyTorch for the Fedora repos will benefit the project enough to be worth the effort that would be required. I honestly think that Fedora would be better served by focusing on documentation, support tooling, getting at least the non-nvidia stuff to work with minimal effort or maybe even providing some example models/datasets that are more advanced than the usual MNIST or Iris examples.

Examples of AI/ML support tooling are mlflow, ray and Open Data Hub, which can be really useful for AI/ML work, don’t seem to get as much focus, and don’t have the same proprietary blob problems that the base frameworks do.

As others have said, installing PyTorch is the easy part of AI/ML; “pip install torch” requires no additional packages and works in virtualenvs, so the proprietary stuff can at least be contained. Getting the HW acceleration stuff installed and actually working is the hard part.

However, for the sake of argument, let’s assume that we’re going to move forward and package PyTorch for Fedora.

I see two big problems: keeping packages current and dealing with proprietary software. If someone knows ways around these things, please chime in.

Keeping PyTorch Packages Current

torch seems to release every 2 months or so, and it looks like they’ve been doing major releases every 4-6 months (see the release history on PyPI), at least for the last two major releases (1.13.0 and 2.0.0).

I see four options here:

  1. We get an exception allowing major-version updates of PyTorch within a Fedora release. This would potentially break existing projects through regular updates, which seems less than optimal.
  2. Build compat packages for new versions as they’re released (see the sketch after this list).
  3. Use flatpaks instead of RPMs (if that would work with the nvidia stuff, I don’t know flatpak well enough to say).
  4. Just accept that there will be some time during which the latest Fedora release will not have the latest PyTorch.
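To illustrate option 2, compat packages would let two streams coexist side by side; the names below are purely illustrative, not existing packages:

# Hypothetical package names, for illustration only:
sudo dnf install python3-pytorch        # tracks the current major release, e.g. 2.x
sudo dnf install python3-pytorch1.13    # parallel-installable compat package for older projects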

Unless there are more options in handling this, which route are we planning on?

Proprietary Software

This has been touched on earlier in the thread but proprietary software is, IMHO, the biggest problem here. I agree with the sentiment of “if I’m restricted to using CPU only, I’ll go use Ubuntu” and that’s what I have done in the past.

There are plenty of problems here, but IMHO nvidia is the biggest obstacle to getting AI/ML tools to work on Fedora, because nvidia hardware is what people generally use.

The nvidia AI/ML support toolchain

In order to take advantage of nvidia hardware acceleration, PyTorch requires the nvidia blob driver, CUDA and cuDNN. There are existing solutions for getting at least one version of the nvidia blob driver installed on Fedora so I won’t touch on that much here because that’s also the easiest part of getting the whole CUDA/cuDNN stack working.

It looks like PyTorch doesn’t have the same strict version requirements that I’ve seen with TensorFlow, but this is still a huge potential problem because cuDNN isn’t compatible with all versions of glibc or gcc. The last time I tried getting cuDNN working on anything in the RHEL ecosystem (2 or so years ago, I think), the only thing that could run cuDNN was RHEL/CentOS 7. Everything else (RHEL 8, any Fedora supported at the time) didn’t have versions of gcc and glibc that would allow cuDNN to work.

I think CUDA also has requirements, but I’ve personally had far more problems with cuDNN, as nvidia seems to design its dependencies around Ubuntu releases and whatever versions of gcc, glibc etc. those releases contain.
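(At least checking whether a given Fedora’s toolchain falls inside cuDNN’s published support matrix is cheap to do up front:)

# Compare these against the gcc/glibc versions in nvidia's cuDNN support matrix:
gcc --version | head -n1
rpm -q glibc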

nvidia alternatives (Intel, AMD, etc.)

I have no practical experience with AMD and Intel’s toolchains beyond reading docs to decide if I wanted to attempt to get them working in Fedora so this will be brief. I assume that folks who have more experience in this area will chime in.

Even if we ignore the nvidia problem, I’m not 100% clear on whether ROCm works without any proprietary blobs. The docs I’m finding from AMD indicate that some closed source software is required but there is a ROCm packaging effort in Fedora so I hope they know something I don’t.

From what I can tell, Intel’s toolchain (OpenVINO) doesn’t require proprietary blobs but it has two different problems:

  1. nvidia and AMD hardware are generally more powerful than Intel GPUs
  2. PyTorch has no official support for Intel’s toolchain, and forks are required for this acceleration to work.

At the moment, neither OpenVINO nor ROCm is packaged in our Fedora repositories.

Google has TPUs and Amazon has Trainium accelerators for use in their clouds. Either of these could be a reasonable target if the bits needed to support them in PyTorch are acceptably licensed, but they still have the problem of not being supported upstream, AFAIK.

Docker

One popular method of making the nvidia problem easier to deal with is nvidia-docker.

At one point, I remember the TensorFlow docs recommending nvidia-docker for people who wanted HW acceleration but that appears not to be the case any more.

At a glance, it appears as if nvidia has licensed all the bits of nvidia-docker under the Apache License 2.0, so that’s not a problem, but Docker is. I have no idea how much work it would take to get nvidia-docker working in Fedora, or if it’s even possible.
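For what it’s worth, nvidia-docker’s successor is the nvidia-container-toolkit, which on Fedora would more naturally pair with podman via CDI; a rough sketch, assuming the toolkit and driver are already installed:

# Generate a CDI spec describing the installed GPUs:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Run an upstream PyTorch image with the GPU passed through:
podman run --rm --device nvidia.com/gpu=all docker.io/pytorch/pytorch \
    python3 -c "import torch; print(torch.cuda.is_available())"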

Do we really need HW Acceleration?

This has also been touched on elsewhere in the thread. I assert that HW acceleration for PyTorch will be needed if we want people to use the packages for more than toy examples.

As anecdotal evidence, I’m currently working on 2 models built with PyTorch. One is a very small NLP model tailored for a very specific purpose; I don’t always use HW acceleration for it because it trains on my smaller datasets in less than 2 minutes. I only consider using HW acceleration when I’m running it against my largest dataset, but even then, it’s not that bad. Note that this model is very small and not a great example.

The other model I’m working on is a relatively simple computer vision model (3 CNN layers, 1 FC layer at the moment) but requires large images in order to get decent performance. It takes about 5 minutes to train with a GPU, and I don’t know how long it takes to train CPU-only because I haven’t had the patience to let it finish a single epoch of training. The only time I run that model CPU-only is when I need to debug something deep in the inner workings of the model.

The models I’m working on right now are tiny compared to many of the models out there, but I’m already at the point where I have little interest in or patience for non-accelerated training.

How modular is PyTorch?

I haven’t looked at building PyTorch from source but I’m concerned that it might not work well with the approach of having the proprietary bits built separately.

Something like having python-pytorch-core from the Fedora repos installed and a python-pytorch-nvidia package from some non-Fedora source.

Just the coordination needed to make sure that the python-pytorch-nvidia available elsewhere always works with the python-pytorch-core provided in the Fedora repos is a bit intimidating.
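For anyone who does dig in: from what I can tell from upstream docs, PyTorch’s setup.py gates each backend behind build-time environment flags rather than runtime plugins, which is exactly what would make a separately shipped nvidia add-on package awkward. A CPU-only build is said to look roughly like:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# Backends are compile-time switches understood by setup.py:
USE_CUDA=0 USE_ROCM=0 USE_CUDNN=0 python3 setup.py bdist_wheel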

EDIT1: removing some extra and adding a few missing words to complete thoughts


While PyTorch with CUDA is the most common setup, it is not the only option. PyTorch with ROCm is also available and works in practice with a lot of AI/ML stuff. Technically there is also Apple silicon, but I don’t know how that works in practice.

From what I’ve seen, getting PyTorch to compile with different versions of ROCm (necessary to support both older and the newest hardware) may also be necessary.

Hello all,

I work on Meta’s Linux Userspace team. We maintain a lot of Meta’s FLOSS projects in Fedora, but some of the concerns already fleshed out by others above have caused us to steer clear of attempting to package PyTorch in the past.

If the community wants to try to get this into Fedora with the ROCm and CPU backends, I am happy to help ensure any packaging-related PRs get looked at by the PyTorch team, and to help with package maintenance.

It might be worth trying to maintain a fully fledged package in RPM Fusion at the same time, or at least make sure any packaging effort there does not conflict with this?


It looks like it’s documented as “typically, 3 releases a year”. They do have stable / beta / prototype labels for documented features, and say that stable ones should stay backwards-compatible. I think an exception is probably the best option: keep the current Fedora release updated to the latest, but only update the N-1 Fedora release for important security issues, if any.

Hey there guys, this is my first time actually contributing to Fedora (therefore, it’s possible I could make some mistakes while talking here), but I’m willing to help with whatever is possible. I currently have 2 nvidia GPUs, so in case CUDA gets packaged, I could help test it with a few cards. Not the most powerful, but at the very least compatible with nvidia’s current(ish) drivers.

I agree that we will probably need extra packages to account for how commonly hardware acceleration is used alongside PyTorch. How feasible and/or problematic would it be for the different packages involved to live in different repositories? I’d imagine having everything as close together as possible would be the best-case scenario, but again, I’m talking mostly as a rookie here.


I think one can already use the provided CUDA packages, as noted here:

https://rpmfusion.org/Howto/CUDA

The page also notes other issues with using CUDA—compiler versions and so on.

Is there reason to assume that our userbase would be occupied with high-performance NN stuff? Are there applications that perform tolerably on modern CPUs? (My personal experience says “yes” but that doesn’t constitute a census.)

We’re not talking “high-performance NN stuff”; we’re talking about “basic stuff and upwards”. CPUs simply do not suffice beyond absolute toy-example, entry-level learning tasks. That’s why deep learning saw a sudden flare in popularity when GPGPU computation became possible: even mighty 192-thread CPUs with excellent memory bandwidth are simply much worse at this than a 400€ GPU.
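(The gap is easy to demonstrate on any machine that has both; a crude matmul micro-benchmark along these lines is usually enough to show it:)

python3 - <<'EOF'
import time, torch

def bench(device):
    x = torch.randn(4096, 4096, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        y = x @ x  # result discarded; we only care about wall time
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

print("cpu:", bench("cpu"))
if torch.cuda.is_available():
    print("gpu:", bench("cuda"))
EOF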

For whom else would you want to go through the effort of packaging? If you need no accelerator support at all, pip install torch already does what users want, and interferes with nothing on Fedora.

The docs I’m finding from AMD indicate that some closed source software is required but there is a ROCm packaging effort in Fedora so I hope they know something I don’t.

I don’t believe any proprietary software from ROCm is required to use pytorch.

As far as I know, ROCm only has a few non-free items:

  • “rocm-llvm-alt”, which is a fork of LLVM with some non-free extensions. I believe it’s not strictly required for pytorch.
  • “aql-profile”, which I believe is just used for profiling and debugging, and is not required for pytorch.
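(Incidentally, a torch build will report which flavor it is, since ROCm builds carry a HIP version string:)

# Prints a HIP version on ROCm builds and "None" on CUDA/CPU-only builds:
python3 -c "import torch; print(torch.version.hip)"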

For Fedora, it looks like we can just start hacking away at the requirements listed here:

To break them down:

set(hip_DIR ${ROCM_PATH}/lib/cmake/hip)

This is in review
https://bugzilla.redhat.com/show_bug.cgi?id=2209759

set(hsa-runtime64_DIR ${ROCM_PATH}/lib/cmake/hsa-runtime64)
set(AMDDeviceLibs_DIR ${ROCM_PATH}/lib/cmake/AMDDeviceLibs)
set(amd_comgr_DIR ${ROCM_PATH}/lib/cmake/amd_comgr)

These are already in Fedora:
(hsa-runtime) Overview - rpms/rocm-runtime - src.fedoraproject.org
(AMDDeviceLibs) Overview - rpms/rocm-device-libs - src.fedoraproject.org
(amd_comgr) Overview - rpms/rocm-compilersupport - src.fedoraproject.org

set(rocrand_DIR ${ROCM_PATH}/lib/cmake/rocrand)
set(hiprand_DIR ${ROCM_PATH}/lib/cmake/hiprand)
set(rocblas_DIR ${ROCM_PATH}/lib/cmake/rocblas)
set(miopen_DIR ${ROCM_PATH}/lib/cmake/miopen)
set(rocfft_DIR ${ROCM_PATH}/lib/cmake/rocfft)
set(hipfft_DIR ${ROCM_PATH}/lib/cmake/hipfft)
set(hipsparse_DIR ${ROCM_PATH}/lib/cmake/hipsparse)
set(rccl_DIR ${ROCM_PATH}/lib/cmake/rccl)
set(rocprim_DIR ${ROCM_PATH}/lib/cmake/rocprim)
set(hipcub_DIR ${ROCM_PATH}/lib/cmake/hipcub)
set(rocthrust_DIR ${ROCM_PATH}/lib/cmake/rocthrust)

And the above are still missing. I’ll likely be tackling rocrand once rocm-hip (rocclr) is accepted into Fedora.


I’m glad to hear that my fears of required proprietary software were unfounded.

In the interest of getting stuff done, I have rocm-cmake pretty much ready for review and plan to submit it later today.

I’m going to start working on rocblas next, but I won’t be able to test anything ROCm-related until my new-to-me ROCm-supported card arrives next week. I don’t plan on submitting anything more than the rocm-cmake package for review until after I’ve been able to test them.


Very nice, please keep me posted and I’ll review it, as it’s needed for rocrand.

I have a test build available if you want to borrow from it:
https://copr.fedorainfracloud.org/coprs/mystro256/rocm-hip/package/rocm-cmake/

It’s a bit old because I ran into issues packaging rocm-hip originally and ended up spending the last little while working with upstream to make hip easier to package in Fedora, which has only recently made good headway.

Feel free to take anything you want. I’m more interested in rocrand as a simple way to verify that hip works well when I update it, but I likely won’t be able to take on all of it. I’m open to co-maintainership though.

Fun timing: I just noticed that you had a rocm-cmake build in your copr when I went poking around to learn more about how you’ve been building stuff. I don’t have any strong feelings about the spec file I’ve written if yours has already been used. I don’t see any major differences between the two, but I do prefer the way you did some things, so I’ll be borrowing some parts at a minimum.

I’m not sure how much bandwidth I’m going to have for this either, but we’ll see how everything works out and whether we can convince others to join the effort.

If I haven’t said it before, thanks for getting the ROCm packaging effort started. Just the wiki page with the packaging table is helpful for getting started.


Package review request for rocm-cmake has been submitted:

https://bugzilla.redhat.com/show_bug.cgi?id=2211987
SPEC File: https://tflink.fedorapeople.org/packages/rocm-cmake/rocm-cmake.spec
SRPM: https://tflink.fedorapeople.org/packages/rocm-cmake/rocm-cmake-5.5.1-1.fc38.src.rpm
COPR Build: Build 5999442 in tflink/rocm-packaging


Because of various circumstances, I’ve become interested in packaging hipblas on Rawhide, which from what I understand has dependencies on rocblas, rocsolver, rocsparse, Tensile, and maybe more. Is work still happening on packaging rocblas? I’ve played around with building Tensile on Rawhide, so I think I can help out.

I also have a 6600 XT that I can use to do some small-scale testing.


Feel free to add yourself here to track progress:
https://fedoraproject.org/wiki/SIGs/HC

rocsparse requires rocprim, which I’ve almost finished, and I will be submitting rocprim for package review later today. Feel free to CC me (mystro256) and Tom Rix (trix) if you make packaging requests for ROCm-related bits. We’ve been trying to get things moving since rocm-hip was introduced. Feel free to use my rocrand package as a template for introducing new ROCm libraries.

hipblas has a lot of inter-ROCm dependencies, so it might take some time, but the more of ROCm we get packaged, the closer we get to having it in Rawhide.
