Setting up Fedora CI for ROCm

With more ROCm packages landing in rawhide, I would like to see if we can set up some CI pipelines for testing on real HW. I’m not sure where to start, but I assume @tflink and @mattdm could help guide the discussion a bit. I think I could get my hands on some AMD HW, but the machines and Fedora infrastructure are where I’m a bit in the dark.

Let me know if there’s anything that I can do to get this going.

I think there are several parts to this, assuming that the issue of obtaining specialized hardware is taken care of (for ROCm and/or any other stacks we may pursue in the future). The order may seem backwards, but I think it’ll make more sense as I expand on them:

  1. How will we coordinate/run the testing?
  2. Where will those host machines live?
  3. Where do we get host machines to facilitate testing?

How Will We Coordinate/Run Testing of ROCm?

I’m aware of 3 existing systems for running automated tests in Fedora at the moment: fedora-ci, zuul and openqa.

As far as I know, none of those systems are set up for running hardware-specific tests in Fedora beyond arch. I’ve heard that upstream OpenQA might add some support for that in the future. I’m not sure whether the fedora-ci backing systems have any concept of hardware-specific testing, but if they do, I don’t believe any of that is set up in Fedora at this time. @adamwill and @mvadkert would be the best references for this. The zuul setup we have access to is hosted outside of Fedora, so I doubt it is an option for specialized hardware at the moment, and I’m not clear on whether upstream zuul has the support for hardware testing that we’d need.

If my suspicions are correct and the three existing systems are not options at this time, I can think of four remaining options for automated testing of ROCm:

  1. Ask Red Hat to host the hardware and use their systems to coordinate and run the HW specific tests
    • This has the obvious problems of access for non-RH folks and overall visibility
  2. Put together a beaker instance
    • I’m not aware of another setup that can do hardware coordination and management and is relatively compatible with Fedora from the get-go.
    • I did put together a beaker instance for Fedora in the past but it was never used
  3. Work to add support to one of the existing systems
  4. Roll our own setup
    • Even if we minimize the amount of work (e.g. by using an existing runner and building upon an existing base system), this would be a lot of work.

If we’re choosing between the options that don’t exist yet, my instinct would be to start exploring the possibility of setting up a beaker instance in/for Fedora.

No matter what we do, there are options for getting results where they need to be. Whether there’s available bandwidth to maintain that glue code or if they’re needed at this time are different questions that can be left for later.

Where Will the Test Infrastructure Live?

The easiest answer is “in Fedora infra” but I have zero visibility into how much rack space they have available or if this is a potential option. I assume that @kevin is the best reference for whether hosting more test machines in Fedora infra is an option.

If that’s not an option, we’d have to find another host whether that’s Red Hat, some other company or a contributor. There are issues with any of those options and I’d rather not explore alternatives until we have an answer to whether Fedora infra is an option.

Where Will the Host Machines Come From?

No matter what we do, some form of host machine will be required. The specifics of what we’re looking for will depend on the answers to how and where the tests are being run but unless there’s hardware lying around that I’m not aware of, it’s going to be a budget or a sponsor issue.

I’m of the mind to leave this alone until we answer the other two questions. Once we have some (at least potential) answers to those, we can worry about finding sponsors for the host hardware.

There are some tests that might not need AMD hardware. For example, building a simple program just to verify that the compiler and the other tools can find each other on the filesystem.

Clang, for example, will verify it can find a valid gcc installation just by running clang -v, and I believe it can do the same for ROCm installs too. I think these tests add value and you could start running them today with existing infrastructure.
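A hardware-free check along those lines could look roughly like the sketch below. It assumes that clang -v reports what it found on lines of the form "<label> installation: <path>" (for example "Selected GCC installation: ..."). Whether a ROCm/HIP line shows up in the same form on a given Fedora build is an assumption to verify. The function names are made up for illustration.

```python
import re
import shutil
import subprocess

# Matches lines such as "Selected GCC installation: /usr/lib/gcc/..." and
# captures the label and the first path-like token after the colon.
INSTALL_LINE = re.compile(r"^(?P<label>.*installation): (?P<path>\S+)", re.MULTILINE)


def find_installations(text):
    """Return {label: path} for every "... installation: <path>" line in text."""
    return {
        match.group("label"): match.group("path").rstrip(",")
        for match in INSTALL_LINE.finditer(text)
    }


def run_clang_verbose(clang="clang"):
    """Run `clang -v` and return its output, or None if clang is not installed."""
    if shutil.which(clang) is None:
        return None
    result = subprocess.run([clang, "-v"], capture_output=True, text=True, check=False)
    # The driver writes its banner and search results to stderr, so keep both.
    return result.stdout + result.stderr


if __name__ == "__main__":
    output = run_clang_verbose()
    if output is None:
        print("clang not found; skipping")
    else:
        for label, path in find_installations(output).items():
            print(f"{label}: {path}")
```

A test could then fail when the expected label is missing from the returned dictionary, which needs nothing more than an x86 machine with the packages installed.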

That’s a very good point. My brain went straight to the hardware tests since those are going to be the most complicated and take the most work.

Is Test Management Tool :: Fedora Docs the way to set that up?

Is AWS enough for AMD? If so, I am happy to onboard you to Testing Farm, and there is a new reservation workflow ready for you!

https://www.youtube.com/watch?v=nS0Z1OilOas&t=18485s

Onboarding:
https://docs.testing-farm.io/general/0.1/onboarding.html

You can catch me on Fedora Chat under mvadkert for more info

Is AWS enough for AMD?

Not sure, can we get AMD HW on the AWS systems?

To be very clear, I am not an infrastructure expert. I’d like to get the conversation going, but I’d need some more experienced people to set up the machines and whatnot, then I will be happy to try to acquire the necessary HW and write the test scripts.

I have experience using zuul with Docker containers, but zuul is mostly a black box to me: I stick containers and test scripts into it, and it pops out results that I can read.

Right now when I pull in updates for rocm-runtime or rocclr (OpenCL/HIP), I only do basic tests on a single RX 6750, and some limited tests on a raven APU. Currently the testing is very bare and often error prone, so having CI is a big plus to improve the ROCm experience in Fedora.

This is what AWS offers in terms of AMD HW

Currently they seem to have:

NVIDIA GPUs (G4dn) or AMD GPUs (G4ad).

Would that be any help?

I just tried provisioning a g4ad.xlarge machine with Fedora 37, and this is the output of cpuinfo & lspci:

# cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 49
model name	: AMD EPYC 7R32
stepping	: 0
microcode	: 0x8301055
cpu MHz		: 3299.322
cache size	: 512 KB
[root@ip-172-31-35-106 ~]# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:1e.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 12 [Radeon Pro V520/V540] (rev c3)
00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller

If that is enough, we can help set up CI against these machines …
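For tests that should only run when an AMD GPU is actually present, the lspci output above suggests a simple guard. The sketch below is one possible approach, not something Testing Farm provides: it parses lspci text and returns the AMD graphics devices it finds. The function name and the choice to match on the "[AMD/ATI]" vendor tag are illustrative assumptions.

```python
import re

# PCI class names under which lspci lists graphics devices.
GPU_CLASSES = ("VGA compatible controller", "Display controller", "3D controller")

# Example line:
# "00:1e.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 12 ..."
LSPCI_LINE = re.compile(r"^(?P<slot>\S+) (?P<cls>[^:]+): (?P<desc>.+)$")


def amd_gpus(lspci_output):
    """Return (slot, description) for each AMD graphics device in lspci output."""
    found = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match is None:
            continue
        # Require both a graphics device class and the AMD vendor tag, so the
        # Amazon VGA controller in the listing above is not counted.
        if match.group("cls") in GPU_CLASSES and "[AMD/ATI]" in match.group("desc"):
            found.append((match.group("slot"), match.group("desc")))
    return found
```

Fed the listing above, this should report only the Navi 12 device at 00:1e.0. A test script could call it on the output of lspci and skip the GPU tests when the list is empty.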

Ah, Navi 12, not bad. Not fantastic, but it’ll make a good basic test platform if it’s possible to tie it into src.fedoraproject.org pull requests. I’m assuming I can get one of these machines through Testing Farm? If so, I’ll start the onboarding process.

It would still be good to explore ways of getting RH hosted HW testing, but the AWS instance might suffice for now. I’ll only really need it maybe 3-6 hours every few weeks, but if RHEL adopts ROCm, there’ll be a need for much more HW testing.

@mystro256 we can test already against RH hosted HW in Testing Farm, but we would need to sort out the permissions, as it needs to be at least locked down for PR approvals.

@mystro256 I will talk to @msrb about how to route your component to the machines. Which one is it?

You will need to add some tests via tmt to dist-git:

OK, we discussed it, and the only thing we need is the name of the component we should enable it for.

Then the added tmt tests will run against the machine type I posted - g4ad.

Our generic HW requirements do not yet support expressing that you want a machine with a GPU.

I filed that requirement here:

So I have a bunch of ROCm components that would benefit from testing.

Would benefit from AMD GPU HW for testing (in order of priority):
https://src.fedoraproject.org/rpms/rocclr
https://src.fedoraproject.org/rpms/rocm-runtime
https://src.fedoraproject.org/rpms/rocrand
https://src.fedoraproject.org/rpms/rocminfo
https://src.fedoraproject.org/rpms/hsakmt

Does not need HW (just needs an x86 CPU):
https://src.fedoraproject.org/rpms/rocm-device-libs
https://src.fedoraproject.org/rpms/rocm-compilersupport (still figuring out what test cases are applicable, but shouldn’t need a GPU)

There are more components coming, but this is a good start I think.

Thanks so much for the help!

Thanks, we will go with the list you provided. We are adding the config to make this possible; it might take a day or two. I will let you know.

Glad to help. As it turns out, we have reached a point where some nice things are easily possible; the community is just not aware of them yet :slight_smile:

Thanks again, I didn’t think it would be so simple :slight_smile:

Is there any documentation on how to write test cases to hook into Fedora PRs? I only see Jenkins, GitLab, and GitHub.

Scratch that, I missed your message with:

I’ll reach out if I have questions

No worries, yes, that is our quickstart guide for Fedora.

More about tmt on its official page, especially the guide:

https://tmt.readthedocs.io/en/stable/guide.html

Basically, you should be able to work on tests locally with tmt until you know they work well.
If you have a GPU on your local machine, you can just use the local provisioner.
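As a rough illustration of that local workflow, the sketch below builds the tmt command line that runs all steps with the local provisioner. The helper name is made up for this example. Note that the local provisioner runs tests directly on your own machine, so it is best kept to tests you trust.

```python
import shlex


def tmt_run_command(how="local"):
    """Build a `tmt run` command that executes all steps with the given provisioner."""
    # "--all" enables every step; "provision --how local" runs the tests
    # directly on the current host instead of a VM or container.
    return ["tmt", "run", "--all", "provision", "--how", how]


if __name__ == "__main__":
    # Print the command rather than executing it, so it can be reviewed first.
    print(shlex.join(tmt_run_command()))
```

Running the printed command from the root of the repository that holds the tmt metadata should execute the plans against the local GPU.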

Sorry, I was unable to deliver it before my 3-week PTO, so I am resurrecting the work now that I am back :slight_smile:

I filed the issue here:

https://pagure.io/fedora-ci/general/issue/422

To track the progress.

@tflink, hey! I suppose this is what you were talking about WRT the AWS resources.

Yeah, this is the one.