ROCm Testing Setup

We’ve touched on the topic of testing ROCm in the past, and I want to continue that discussion in a more specific way. I’m deliberately avoiding any discussion of where tests should run; for the moment, I’m working with the assumption that Testing Farm + EC2 (and by extension, tft for the high-level specification and launching) is going to be sufficient for our needs, at least in the short term.

I’m starting to look at automated testing of ROCm packages and the first questions I’m hitting are:

  1. What do we want to run for testing?
  2. How do we want to run the tests?
  3. Where should the test code live?

What do we want to run for testing?

I think the easiest thing to do, and the best place to start, is with the test suites that are part of the ROCm components. The rocm-examples repository might be another interesting place to try, but I see that as less valuable than the component test suites and something that could be added later.

How do we want to run the tests?

The two solutions linked to from the Fedora CI docs are beakerlib and avocado. restraint is likely another option as it’s designed to be a drop-in replacement for beakerlib.

In my mind, there’s still a non-zero chance that we’ll end up wanting to run at least some of the tests in beaker, and I can’t think of a good reason not to use beakerlib/restraint, so I’m leaning towards one of them. Any other thoughts on what to use as a framework/runner/whatever your preferred terminology is for this type of tool?
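For anyone who hasn’t used it, a beakerlib test is just a shell script built from a handful of library functions. Here’s a minimal sketch of what one of these tests could look like; the package name and the test command are placeholders, not the actual ROCm test suites:

```bash
# Minimal beakerlib skeleton; "rocrand" and the test command are placeholders.
. /usr/share/beakerlib/beakerlib.sh || exit 1

rlJournalStart
    rlPhaseStartSetup
        rlAssertRpm "rocrand"        # fail early if the package isn't installed
    rlPhaseEnd

    rlPhaseStartTest "smoke"
        rlRun "true" 0 "Stand-in for the real test command"
    rlPhaseEnd
rlJournalPrintText
rlJournalEnd
```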

Where should the test code live?

My first thought is a project in the ai/ml group in gitlab but pagure is another option. I don’t have any terribly strong opinions on this so long as it works and everyone who needs to can access the test code.


I’d lean towards something beaker-compatible (beakerlib/restraint), the ai/ml group in gitlab, the ai/ml copr for builds, and tito for packaging. This is the stack I’ve been working towards for PyTorch.

After doing a bit more research, I started working on getting some tests put together.

For now, at least, they live on gitlab.

If you try to run the tests, remember that tmt isn’t exactly forthcoming about what it does with output from tests that are run locally, and it doesn’t show any live output from the things it’s running. You should find the output in /var/tmp/tmt/, but in all honesty, I’d suggest running the individual test with beakerlib at this point - a single test being run locally, not in a VM and without automation, doesn’t really gain anything from tmt.
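Roughly, the two ways of running things look like this - treat it as a sketch, since the test path is a placeholder and the output file name may vary between tmt versions:

```bash
# Run the plans on the local machine via tmt (no VM); may need root.
tmt run --all provision --how local

# The output ends up under tmt's workdir rather than on the console.
find /var/tmp/tmt -name 'output.txt'

# Or skip tmt and run a single beakerlib test directly (path is a placeholder).
sudo dnf install -y beakerlib
cd tests/rocrand && sudo bash ./runtest.sh
```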

The idea was to start with something that’s already packaged and has an upstream test suite - I decided on rocrand and the test does run but I’m unclear on whether the failures that it reports are valid or not.

I’m not quite convinced that the methodology I used is as good as it could and should be. The tests are not built in koji, so what I ended up doing is getting the major and minor versions from the ENVR, cloning the upstream repo, and building the tests against the -devel package. After the tests are built on the test machine, they are run and the results are supposed to be returned to the runner.
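In rough shell terms, the flow is something like the following - a sketch rather than the exact test code, and the repo URL, branch naming, and cmake option are assumptions that will vary by component:

```bash
# Derive major.minor from the installed package (example: rocrand).
ver=$(rpm -q --qf '%{VERSION}' rocrand)      # e.g. 5.6.0
majmin=${ver%.*}                             # e.g. 5.6

# Clone the matching upstream branch; URL and branch scheme are assumptions.
git clone --depth 1 --branch "rocm-${majmin}.x" https://github.com/ROCm/rocRAND.git

# Build the upstream test suite against the installed -devel package and run it.
cmake -S rocRAND -B build -DBUILD_TEST=ON    # option name is an assumption
cmake --build build --parallel
ctest --test-dir build --output-on-failure
```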

I see two options if the methodology needs to change:

  1. build the tests in koji as a <component>-tests subpackage
    • I don’t know how common the issue is, but rocfft’s test suite is not simple to package because something is setting a standard rpath, which fails the build.
    • This exact issue may be limited to rocfft and might be solvable, but if there are other issues, is it worth spending the time to address them?
    • Is there a relevant packaging policy about packaging tests which can’t be run at build time?
  2. extract the tarball+patches from the srpm and build the tests locally on test machines using that (potentially patched) tarball instead of trying to directly use upstream’s git repo (a rough sketch of this is below)
    • This avoids any potential problems with building tests in koji, and if we end up needing to patch the tests, it keeps those patches with the tests and wouldn’t require new builds to apply them
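Something along these lines for option 2, using rocrand as the example - the package name is a stand-in and the details of a tests-only build will differ per component:

```bash
# Fetch and unpack the source package on the test machine.
dnf download --source rocrand                # fetch the SRPM
sudo dnf builddep -y rocrand-*.src.rpm       # install the build dependencies
rpm -i rocrand-*.src.rpm                     # unpack into ~/rpmbuild/SOURCES

# Run only %prep so the tarball is extracted and the Fedora patches are applied.
rpmbuild -bp ~/rpmbuild/SPECS/rocrand.spec

# The patched tree under ~/rpmbuild/BUILD/ can then be used to build the tests
# against the installed -devel package.
```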

Are there thoughts on whether making a bunch of assumptions and trying to run the tests from git is good enough, whether one of the two other options I listed is better, or whether there is some other solution that I haven’t thought of?

I haven’t looked at tito for quite a while but I should probably look at it again.

Otherwise, I ended up doing pretty much the same thing after realizing that restraint isn’t packaged in Fedora and didn’t look like it would be worth the trouble to deal with.

I figure we may end up moving the tests to the dist-git test namespace but that’s a conversation and decision for another day :slight_smile:

I did some more testing to see if this will run on an EC2 g4ad instance (g4ad.xl to be exact), and it does run, but it fails quite badly - this will need more investigation to see why things are failing. I don’t see all of those failures when running locally on a gfx906.

Running the whole test took 17.5 minutes (no package installs); 15 of those were spent building the upstream test suite. As a reference, my local dev system (Ryzen 5700X) took 6.5 minutes to run the same test.

It probably doesn’t matter for the moment, but if we ever end up in a place where we’re running these tests on multiple platforms, it might make sense to take a closer look at building the tests in koji - assuming that the other ROCm components have similar or longer build times for their tests, anyway.


Out of curiosity, do you know what driver stack/version you were using?

The g4 (NVIDIA & AMD) and g5 (NVIDIA only, right now) instances do look useful for getting access to at least some hardware for testing in an automated fashion.

I was using the amdgpu driver in rawhide. I assume that the Amazon/AMD-provided drivers don’t work on Fedora, but to be honest, I didn’t try. I remember having some trouble getting them to work with RHEL during my initial testing, but I don’t recall the specific issues.

At this point, I’m waiting for the 5.7 packages to land in rawhide before I do more testing.