Apologies for the second post here. I wanted to do a better job of gathering my thoughts but managed to write a short novel in the process.
I’m interested in seeing better AI/ML tooling support in Fedora. To be honest, I’ve pretty much given up on using Fedora with the nvidia toolchains and I moved all my AI/ML work over to Ubuntu years ago for the sake of my sanity.
I’m sorry for being a party pooper but unless I’m missing something, I don’t think that packaging PyTorch for the Fedora repos will benefit the project enough to be worth the effort required. I honestly think Fedora would be better served by focusing on documentation, support tooling, getting at least the non-nvidia stuff working with minimal effort, or maybe even providing some example models/datasets that are more advanced than the usual MNIST or Iris examples.
Examples of AI/ML support tooling are mlflow, ray and open data hub: they can be really useful for AI/ML work, they don’t seem to get as much attention, and they don’t have the same proprietary-blob problems that the base frameworks do.
As others have said, installing PyTorch is the easy part of AI/ML: “pip install torch” doesn’t require any additional system packages and it works in virtualenvs, so the proprietary stuff can at least be contained. Getting the HW acceleration stuff installed and actually working is the hard part.
However, for the sake of argument, let’s assume that we’re going to move forward and package PyTorch for Fedora.
I see two big problems: keeping packages current and dealing with proprietary software. If someone knows ways around these things, please chime in.
Keeping PyTorch Packages Current
torch seems to release every 2 months or so, and judging by the release history on PyPI, they’ve been doing major releases every 4-6 months for the last two major versions (1.13.0 and 2.0.0).
I see four options here:
- We get an exception to change major versions of PyTorch within a Fedora release. This could break existing projects through regular updates, which seems less than optimal.
- Build compat packages for new versions as they’re released.
- Use flatpaks instead of RPMs (if that would even work with the nvidia stuff; I don’t know flatpak well enough to say).
- Just accept that there will be some time during which the latest Fedora release will not have the latest PyTorch.
Unless there are more options in handling this, which route are we planning on?
Dealing with Proprietary Software
This has been touched on earlier in the thread, but proprietary software is, IMHO, the biggest problem here. I agree with the sentiment of “if I’m restricted to using CPU only, I’ll go use Ubuntu”, and that’s what I have done in the past.
There are plenty of problems here, but IMHO nvidia is the biggest obstacle to getting AI/ML tools working on Fedora, because nvidia hardware is what people generally use.
The nvidia AI/ML support toolchain
In order to take advantage of nvidia hardware acceleration, PyTorch requires the nvidia blob driver, CUDA and cuDNN. There are existing solutions for getting at least one version of the nvidia blob driver installed on Fedora, so I won’t touch on that much here; it’s also the easiest part of getting the whole CUDA/cuDNN stack working.
It looks like PyTorch doesn’t have the same strict version requirements I’ve seen with TensorFlow, but this is still a huge potential problem because cuDNN isn’t compatible with all versions of glibc or gcc. The last time I tried getting cuDNN working on anything in the RHEL ecosystem (2 or so years ago, I think), the only thing that could run it was RHEL/CentOS 7. Everything else (RHEL 8, any Fedora supported at the time) didn’t have versions of gcc and glibc that would allow cuDNN to work.
I think CUDA has similar requirements, but I’ve personally had far more problems with cuDNN, as nvidia seems to design its dependencies around Ubuntu releases and whatever versions of gcc, glibc, etc. those releases contain.
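To make the kind of version gate I’m describing concrete, here’s a minimal sketch of a glibc compatibility check. The minimum version shown is purely illustrative (an assumption for the example, not a real cuDNN requirement):

```python
import platform

# Illustrative only: NOT a real cuDNN constraint.
ASSUMED_CUDNN_MIN_GLIBC = "2.27"

def parse_version(version):
    """Turn a string like "2.35" or "2.35.1" into a comparable tuple (2, 35)."""
    return tuple(int(part) for part in version.split(".")[:2])

def meets_minimum(installed, minimum):
    """True if the installed version satisfies the minimum version."""
    return parse_version(installed) >= parse_version(minimum)

def system_glibc_ok(minimum=ASSUMED_CUDNN_MIN_GLIBC):
    """Check the running system's glibc against the assumed minimum."""
    libc, version = platform.libc_ver()
    return libc == "glibc" and meets_minimum(version, minimum)
```

The real situation is worse than this sketch suggests, of course: glibc can’t be swapped out per-application the way a Python package can, which is exactly why these constraints end up pinning you to a whole distro release.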
nvidia alternatives (Intel, AMD, etc.)
I have no practical experience with AMD and Intel’s toolchains beyond reading docs to decide if I wanted to attempt to get them working in Fedora so this will be brief. I assume that folks who have more experience in this area will chime in.
Even if we ignore the nvidia problem, I’m not 100% clear on whether ROCm works without any proprietary blobs. The docs I’m finding from AMD indicate that some closed source software is required but there is a ROCm packaging effort in Fedora so I hope they know something I don’t.
From what I can tell, Intel’s toolchain (OpenVINO) doesn’t require proprietary blobs but it has two different problems:
- nvidia and AMD hardware are generally more powerful than Intel GPUs
- PyTorch has no official support for Intel’s toolchain, so forks are required for this acceleration to work.
At the moment, neither OpenVINO nor ROCm is packaged in the Fedora repositories.
Google has TPUs and Amazon has Trainium accelerators for use in their clouds. Either one of these could be reasonable targets if the bits needed to support them in PyTorch are acceptably licensed but they still have the problem of not being supported upstream, AFAIK.
nvidia-docker
One popular method of making the nvidia problem easier to deal with is nvidia-docker.
At one point, I remember the TensorFlow docs recommending nvidia-docker for people who wanted HW acceleration but that appears not to be the case any more.
At a glance, it appears that nvidia has licensed all the bits of nvidia-docker under Apache 2.0, so that’s not a problem, but Docker itself is. I have no idea how much work it would take to get nvidia-docker working on Fedora, or if it’s even possible.
Do we really need HW Acceleration?
This has also been touched on elsewhere in the thread. I assert that HW acceleration for PyTorch will be needed if we want people to use the packages for more than toy examples.
As anecdotal evidence, I’m currently working on 2 models built with PyTorch. One is a very small NLP model tailored for a very specific purpose; I don’t always use HW acceleration for it because it trains on my smaller datasets in less than 2 minutes. I only consider using HW acceleration when I’m running it against my largest dataset, and even then it’s not that bad. Note that this model is very small and not a great example.
The other model I’m working on is a relatively simple computer vision model (3 CNN layers, 1 FC layer at the moment), but it requires large images in order to get decent performance. It takes about 5 minutes to train with a GPU; I don’t know how long it takes to train CPU-only because I haven’t had the patience to let it finish even a single epoch. The only time I run that model CPU-only is when I need to debug something deep in the inner workings of the model.
The models I’m working on right now are tiny compared to many of the models out there, but I’m already at the point where I have little interest in or patience for non-accelerated training.
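The switching between accelerated and CPU-only runs that I describe above is usually just a small fallback helper like the following; this is a generic PyTorch pattern, not anything Fedora-specific, and it degrades to CPU when torch (or a usable GPU) is absent:

```python
def pick_device():
    """Return "cuda" when a CUDA-enabled torch is usable, otherwise "cpu".

    Degrades gracefully when torch isn't installed at all, so a script can
    still report something sensible on a minimal or CPU-only box.
    """
    try:
        import torch  # may be absent on a minimal install
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

A training script would then move the model and tensors over with something like `model.to(pick_device())`, which is why CPU-only packages still run, just painfully slowly.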
How modular is PyTorch?
I haven’t looked at building PyTorch from source but I’m concerned that it might not work well with the approach of having the proprietary bits built separately.
Something like having python-pytorch-core installed from the Fedora repos and a python-pytorch-nvidia package from some non-Fedora source. Just the coordination needed to make sure that the python-pytorch-nvidia available elsewhere always works with the python-pytorch-core provided in the Fedora repos is a bit intimidating.
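For illustration, that coordination problem boils down to someone maintaining (and everyone trusting) a compatibility mapping between the two packages. The package names match the hypothetical split above, and the versions here are made up for the example:

```python
# Hypothetical mapping from python-pytorch-nvidia versions to the
# python-pytorch-core versions each one is known to work with.
# These version pairs are invented for illustration.
NVIDIA_ADDON_COMPAT = {
    "2.0.1": {"2.0.0", "2.0.1"},
    "1.13.1": {"1.13.0", "1.13.1"},
}

def addon_supports_core(addon_version, core_version, table=NVIDIA_ADDON_COMPAT):
    """True if the out-of-repo addon claims support for the in-repo core."""
    return core_version in table.get(addon_version, set())
```

Keeping a table like this accurate across every Fedora release, every update, and every nvidia release cycle is exactly the coordination burden that worries me.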
EDIT1: removing some extra and adding a few missing words to complete thoughts