Experimenting with Fedora PyTorch distributed from a PIP repository

Hey,

Once you play with LLMs it becomes clear that the world really needs a stable distribution of PyTorch with builds that actually work with various accelerators out of the box … and I’m really glad to see the work being done in this SIG to bring that into Fedora.

I noticed that most people will use a PIP repository regardless (e.g. on Ubuntu, despite PyTorch being packaged up in debs). So I wonder what would happen if we took the same content and packaged it up as RPMs, a PIP repository, a Conda repository … and maybe some containers … all under the Fedora umbrella … then what would people actually use? I’d like to experiment with answering this, and figure it out empirically.

I see Mark McLoughlin[1] and Brian Stinson have been playing with this idea too.

Is this SIG a decent place to experiment with this, and have related discussions? Or is the PyTorch SIG a better home? If this ends up being a wildly successful way to consume Python-related content from Fedora, then it may end up in the Python SIG, but I’d like to start more focused.

Stef


  1. GitHub - markmc/rebuilding-the-wheel ↩︎


I’ve often had bad luck with PIP when there’s some binary dependency — sometimes things fail to build properly, and I have no idea if it’s consistent or repeatable. As things stand, I tend to prefer Fedora RPMs when available, and want those for production because of the reproducibility issue. But, when I’m just hacking on something and want, say, waffle charts and there’s no Fedora package, pip install is a million times easier.

I’d like to be able to not care — that is, have both things work as reliably and easily, and ideally be interchangeable.

I think what you’re suggesting of having our own set of python-native packages might address the problems I’ve seen. And, maybe @dcantrell’s experiment with pip-aware-DNF could play in somewhere here?

Yeah, the goal would be to have both work reliably and easily … but for Fedora to provide PIP builds that are stable (in relation to one another, and to the accelerators) so that developers can include them in their dependency trees. Let me sketch a bit more around the goal … and post it here.

pip-aware-DNF sounds interesting. I should catch up on that.
Stef


What’s the use case you have in mind here? In my experience, attempting to target research use cases via packaging is a bit of a fool’s errand – despite best efforts, people will likely always prefer to use conda or pip as they’re iterating quickly / trying to reproduce some paper / etc. Where packaging can help (and what IMO Fedora should aim to target) is production use cases, where someone wants to use an LLM or some AI-powered tool in a supportable way without having to spin up a whole devel environment.

In practice, the AI/ML SIG, PyTorch SIG and HC SIG are pretty much the same. There aren’t enough people involved to justify the overhead of too much separation.

I don’t see any issue with trying to build wheels but I do admit that I don’t understand what we’d be gaining over the upstream pytorch wheels.

FWIW, we have made progress on HW acceleration for AMD cards. As of last week, the rawhide builds of pytorch do seem to be working with rocm acceleration. It’s a pre-release of pytorch and the builds definitely need more testing but from an initial look, they are working.
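For anyone wanting to sanity-check those rawhide builds, here is a minimal smoke test. It's a sketch, not an official check: it assumes a PyTorch build where `torch.version.hip` is set on ROCm (HIP) builds, which is how upstream PyTorch distinguishes them, and it degrades gracefully when torch isn't installed at all.

```python
# Hedged smoke test for a ROCm-enabled PyTorch build.
# Assumes standard upstream PyTorch attributes: torch.version.hip is a
# version string on ROCm/HIP builds and None on CUDA/CPU builds.
import importlib.util


def torch_rocm_summary() -> str:
    """Return a one-line status string describing ROCm availability."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check works without torch
    if getattr(torch.version, "hip", None) is None:
        return "torch installed, but not a ROCm (HIP) build"
    if not torch.cuda.is_available():  # torch.cuda maps to HIP on ROCm builds
        return "ROCm build, but no usable GPU detected"
    return f"ROCm {torch.version.hip}, device: {torch.cuda.get_device_name(0)}"


print(torch_rocm_summary())
```

On a working rawhide setup with an AMD card this should print the HIP version and device name; anywhere else it explains what's missing instead of tracebacking.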


Yes, I think there are distinct use cases with distinct requirements and typically different solutions:

  • Have an application stack which you can deploy and update consistently.
  • Have an application which you can run consistently.

In the first case, you’d want distro packages which integrate with everything and are kept secure through updates.

In the second case, you’d want an isolated environment with pinned dependencies so that your application keeps running and producing reproducible results.

In that second case, people typically start with packaged dependencies, get lost in version dependencies, try pip, get lost in version conflicts or problems with binary dependencies, try conda or containers … E.g., imagine you develop on your machine and want/need to put your “app” on a HPC cluster. There’s basically no way to avoid conda or containers (apptainer/singularity).

… and I’d put PyTorch firmly in case 2, but what do I know about your (OP) use case :slight_smile:

Agree


Davide Cavalca:

What’s the use case you have in mind here? In my experience, attempting to target research use cases via packaging is a bit of a fool’s errand – despite best efforts, people will likely always prefer to use conda or pip as they’re iterating quickly / trying to reproduce some paper / etc. Where packaging can help (and what IMO Fedora should aim to target) is production use cases, where someone wants to use an LLM or some AI-powered tool in a supportable way without having to spin up a whole devel environment.

So the basic value that Fedora has for packaging PyTorch (and similar ML dependencies) is to sort out the current rather wild interdependency between kernel, accelerator driver stack (like ROCM), and application stack (like Pytorch). Someone should be able to rely on the fact that the combination they get from Fedora “just works” and at the same time is reasonably “cutting edge”. Kudos to folks in this SIG for pulling that off with ROCM and PyTorch.

Agree Davide, that Fedora RPMs are really amazing for production use cases.

Fedora is also used by developers, and for ML use cases the interplay between a PIP requirements.txt and Fedora RPMs starts to get more difficult. Among my myriad other ML dependencies (some of which are not in Fedora, e.g. llama_cpp_python, langchain, gguf), I want to have a requirements.txt that pulls a “torch” that “just works” with my accelerator and Fedora kernel.
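To make that concrete, here is a sketch of the requirements.txt I'd like to be able to write. The index URL is entirely made up for illustration — no such Fedora-hosted PIP index exists today; the point is that torch would resolve from a Fedora index matched to the installed kernel and accelerator stack, while the rest resolves from PyPI as usual:

```text
# Hypothetical Fedora PIP index for accelerator-matched builds.
# The URL below is invented for illustration only.
--extra-index-url https://example.fedoraproject.org/pip/rocm/
torch
llama_cpp_python
langchain
gguf
```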

Stef


One difference we will have between the ROCm stacks of Fedora and AMD: they may look the same, but each depends on the set of GPU targets it was built for. If your base library is not built for a specific GPU target, it will likely just fail; you can monkey around, set the GPU to a target that was built for via an environment variable, and hope you get lucky. Fedora tries to build for as many GPUs as possible to limit this bad user experience, but we still miss some GPUs. If we do, file a bz and I will try to enable it.
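For reference, the "monkey around and hope" trick usually means setting `HSA_OVERRIDE_GFX_VERSION`, a real ROCm runtime variable that makes the runtime treat the card as a different gfx target. This sketch just shows the mechanics; the value `10.3.0` (pretend the card is gfx1030) is a commonly cited example, not a guarantee for any particular card, and it must be set before the ROCm runtime (e.g. via `import torch`) is loaded:

```python
# Unsupported workaround sketch: override the gfx target the ROCm runtime
# sees. Must run BEFORE anything loads the ROCm runtime (e.g. import torch).
import os


def override_gfx_target(version: str = "10.3.0") -> str:
    """Set HSA_OVERRIDE_GFX_VERSION and return the active value.

    "10.3.0" asks the runtime to treat the GPU as gfx1030; whether the
    resulting kernels actually run on your card is pure luck.
    """
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = version
    return os.environ["HSA_OVERRIDE_GFX_VERSION"]


print(override_gfx_target())  # then import torch and hope you get lucky
```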


Is there a document tracking which GPUs are currently supported? (And if there are different levels of support or compatibility, what the differences are?)

On the AMD side there is this
https://rocm.docs.amd.com/en/docs-5.5.1/release/gpu_os_support.html
I have not reviewed whether this is exactly what is in their stack, but I suspect that is what they would support: the gfx9 family.
On the Fedora side, it would be nice if this was documented somewhere. In reality it is controlled by some RPM macros … see:
$ rpm -qli rocm-rpm-macros-modules | grep modulefile
/usr/share/modulefiles/rocm
/usr/share/modulefiles/rocm/default
/usr/share/modulefiles/rocm/gfx10
/usr/share/modulefiles/rocm/gfx11
/usr/share/modulefiles/rocm/gfx1100
/usr/share/modulefiles/rocm/gfx1101
/usr/share/modulefiles/rocm/gfx1102
/usr/share/modulefiles/rocm/gfx1103
/usr/share/modulefiles/rocm/gfx8
/usr/share/modulefiles/rocm/gfx9
/usr/share/modulefiles/rocm/gfx906
/usr/share/modulefiles/rocm/gfx908
/usr/share/modulefiles/rocm/gfx90a
The most important is ‘default’ because that lands in the normal install location. It is:

$ cat /usr/share/modulefiles/rocm/default
#%Module 1.0
# ROCm module for use with the ‘environment-modules’ package
conflict rocm
setenv ROCM_BIN /usr/bin
setenv ROCM_LIB /usr/lib64
setenv ROCM_GPUS "gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102"

The variable ROCM_GPUS is the list most of the packages are built for. We need some consistency improvements, but this is what I would expect to work as a default on Fedora. gfx10 and gfx11 are for the most part the consumer cards.
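So a quick way to check whether a given card's target is covered by a build is just a membership test on that semicolon-separated list. A minimal sketch, using the default list from the modulefile above:

```python
# Sketch: check whether a gfx target is in a ROCM_GPUS build list, as
# exported by the Fedora rocm-rpm-macros modulefiles.
def gpu_supported(target: str, rocm_gpus: str) -> bool:
    """True if `target` appears in the semicolon-separated ROCM_GPUS list."""
    return target in rocm_gpus.split(";")


# Default list from /usr/share/modulefiles/rocm/default on Fedora.
DEFAULT_GPUS = "gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102"

print(gpu_supported("gfx1100", DEFAULT_GPUS))  # True
print(gpu_supported("gfx906", DEFAULT_GPUS))   # False: needs the gfx9 module
```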

As a side note, when we document that list, we should make sure that we translate from the shader ISA names to the model numbers most people use. At this point, I think we’ve both memorized the translation, but I suspect that someone wanting to know if their card is supported is going to look at that list and be confused :slight_smile:
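Something like the table below is what I mean — an illustrative subset of the gfx-target-to-marketing-name translation (these pairings are well known, but any published version should be verified against AMD's documentation before we commit to it):

```python
# Illustrative subset of the shader-ISA -> model-name translation we'd want
# in the docs. Verify against AMD's documentation before publishing.
GFX_TO_MODELS = {
    "gfx906": "Radeon VII / Radeon Pro VII (Vega 20)",
    "gfx908": "Instinct MI100",
    "gfx90a": "Instinct MI210 / MI250",
    "gfx1010": "Radeon RX 5700 / 5700 XT (Navi 10)",
    "gfx1030": "Radeon RX 6800 / 6900 XT (Navi 21)",
    "gfx1100": "Radeon RX 7900 XT / XTX (Navi 31)",
}


def friendly_name(gfx: str) -> str:
    """Translate a gfx target to the model names most users recognize."""
    return GFX_TO_MODELS.get(gfx, f"unknown target {gfx!r}")


print(friendly_name("gfx1100"))
```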

:eyes: Excuse me … some of those do not work.

gfx9 support is finicky … the RX 580 and RX 5700 XT, for example, are notoriously bad to configure. Remember that AMD’s APUs are on the list for gfx9, and the Vega ones have bad support. A ton of work to :crossed_fingers:t5: get right. You also have to build PyTorch with the right OpenCL/ROCm version, which AMD has conveniently removed from their site (ROCm 3.0~).

Yeah, there are certainly holes in the list of what actually works. AMD’s support list for ROCm has always been small, but they’ve never really stopped people from building for “unsupported” cards.

Similarly, we as the Fedora ROCm packagers aren’t making promises that every card in those families will work with ROCm but we are building the binaries such that they can at least try to run. As our testing infrastructure improves, we’ll have more data on exactly what works and what doesn’t but I doubt that filling in our test matrix with older hardware is going to be a high priority.

gfx9 is a weird thing, though. The family consists of old stuff, but it also consists of the entire MI series of accelerators. So you have stuff like the MI25 (gfx900) that barely works, the Radeon Pro VII (gfx906) which works well and has been fully supported by AMD for a long time, the MI210/MI250 (gfx90a), and up to the brand-new, nigh-unobtainium MI300.

From what I’ve seen of ROCm as an outsider, AMD’s primary focus for ROCm at the moment seems to be the newer end of gfx9 (gfx908, gfx90a) which is what is going into datacenters. Pretty much everything else is at a lower priority, even if the shader ISA family appears newer. The rest of the ROCm supported list is at a slightly lesser priority and everything else is YMMV. When Tom talks about gfx9 support, I’m pretty sure he’s referring to the newer end of the family.

For what it’s worth, I’ve had a decent experience getting ROCm to work on Fedora with cards I have access to (gfx906, gfx1100) but I’ve also gone out of my way to find cards that are on AMD’s supported list for ROCm.


We’ll need to get this SIG better connected with some Docs Team folks!

We’ve created a docs repo but haven’t populated it with anything yet. I assume that the plan is to use the same templates/generator that the docs folks are using.


FWIW, the steps are here: Create a new documentation module :: Fedora Docs

Yeah, I’ve put together a stub, but we’re waiting on Issue #11883: remap gitlab permission groups for the ai-ml-sig - fedora-infrastructure - Pagure.io to be able to actually push to this repo.