Can you run Nvidia GPU workloads on FCOS?

Yes, it is possible using Nvidia’s container toolkit and their driver container.

This is not trivial, however, as Fedora CoreOS is not one of the supported platforms.

Since the Nvidia driver container image is kernel-specific, there is also a degree of risk when leaning into the OS auto-update model: an update can move the host to a kernel for which no driver image has been built yet. You can of course defer OS updates to mitigate this.
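
One way to defer updates, as a minimal sketch: FCOS updates are driven by Zincati, which reads drop-in files from /etc/zincati/config.d. The file name and the function name below are my own choices, and switching updates off entirely is only one option (an update window would be another).

```shell
#!/bin/sh
# Directory Zincati reads drop-in configuration from on FCOS.
ZINCATI_DROPIN_DIR="${ZINCATI_DROPIN_DIR:-/etc/zincati/config.d}"

# Write a drop-in that switches automatic updates off.
# The file name is arbitrary (an assumption), only the .toml suffix matters.
disable_auto_updates() {
    mkdir -p "${ZINCATI_DROPIN_DIR}"
    cat > "${ZINCATI_DROPIN_DIR}/90-disable-auto-updates.toml" <<'EOF'
[updates]
enabled = false
EOF
}
```

Call disable_auto_updates as root and then restart zincati.service. Remove the file again once a driver image for the newer kernel is available.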

Below I reference forks of the Nvidia projects that can be used as a starting point if you want to schedule GPU workloads. At the time of writing they are being used successfully to run GPU workloads on the most recent versions of the next, testing and stable streams.

The driver container project fork automatically rebuilds images for the three FCOS streams daily and makes them available in the project’s built-in public GitLab Docker registry.

NVIDIA_DRIVER_VERSION=470.82.00 # Current minimum for the latest version of all streams.
# Source os-release in a subshell so only VERSION_ID is read and nothing else leaks into the environment.
FEDORA_VERSION_ID=$(. /etc/os-release && echo "${VERSION_ID}")
FEDORA_KERNEL_VERSION=$(uname -r)
docker pull "registry.gitlab.com/container-toolkit-fcos/driver:${NVIDIA_DRIVER_VERSION}-fedora${FEDORA_VERSION_ID}-${FEDORA_KERNEL_VERSION}"
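
Because the tag embeds the kernel release, it can be useful to check that an image exists for a given kernel before relying on it. The following is a rough sketch only: the function names are hypothetical, the tag layout is copied from the pull command above, and docker manifest inspect may need a reasonably recent Docker CLI.

```shell
#!/bin/sh
# Compose the driver image reference from driver version ($1),
# Fedora release ($2) and kernel release ($3).
driver_image() {
    echo "registry.gitlab.com/container-toolkit-fcos/driver:$1-fedora$2-$3"
}

# Succeed only if the registry serves a manifest for the image ($1).
driver_image_exists() {
    docker manifest inspect "$1" >/dev/null 2>&1
}

# Example usage (not run here):
#   image=$(driver_image 470.82.00 35 "$(uname -r)")
#   driver_image_exists "${image}" || echo "no driver image for $(uname -r) yet"
```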

The container runtime project fork (together with additional supporting project forks) builds a series of RPMs that can be layered via rpm-ostree, either at the command line or, preferably, via systemd. It auto-publishes a GitLab page that has a sample Ignition config here.

Hopefully these will be of use to someone starting down the path of scheduling GPU workloads on their Fedora CoreOS-based infrastructure.
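
For the command-line route, layering could look roughly like the sketch below. The package name is a placeholder (I am assuming it matches the upstream nvidia-container-toolkit name, so substitute whatever RPMs the fork actually publishes), and layer_package and DRY_RUN are hypothetical helpers of mine rather than anything from the project.

```shell
#!/bin/sh
# Placeholder package name (assumption); replace with the fork's RPMs.
TOOLKIT_PKG="${TOOLKIT_PKG:-nvidia-container-toolkit}"

# Layer a package with rpm-ostree unless it is already installed.
# With DRY_RUN=1 the command is printed instead of executed.
layer_package() {
    pkg="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "rpm-ostree install --idempotent --reboot ${pkg}"
        return 0
    fi
    if rpm -q "${pkg}" >/dev/null 2>&1; then
        return 0
    fi
    # --idempotent: do nothing if the package is already layered.
    # --reboot: reboot into the new deployment once it is staged.
    rpm-ostree install --idempotent --reboot "${pkg}"
}

# Example usage (not run here):
#   layer_package "${TOOLKIT_PKG}"
```

Running the same function from a oneshot systemd unit gives the systemd variant. The idempotent install keeps repeated boots harmless.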

Note that the driver project now publishes to Docker Hub (benefiting from security scanning), and the supported GPU driver versions have moved forward across all streams. The CI/CD pipeline now pre-compiles kernel modules and stores them in the kernel-specific images for faster startup.

The container-toolkit fork has also been brought up to v1.9.0, which brings cgroup v2 support.

The container-toolkit fork has been brought up to v1.10.0, and x86 fedora37 artifacts are now produced and validated. aarch64 artifacts are also produced but not validated.

Meanwhile, the upstream NVIDIA driver project has accepted a pull request, so it now builds a fedora driver container image (not yet published to NCR but available in their GitLab repo), and work seems to be afoot to support fedora in the container-toolkit (fedora35 artifacts added in v1.11.0+).

Currently the most up-to-date NVIDIA drivers, e.g. 515.65.01, will only install and run successfully up to Fedora 37 kernel 5.19.16. From a Fedora CoreOS perspective this luckily still allows an OS version that includes fixes for the OpenSSL 3.0 vulnerability. Subsequent 6.x kernels fail on kernel header pre-compilation. There seem to be plenty of issues with 6.x kernels reported on the NVIDIA user forums, albeit for non-datacenter GPUs.

I clicked the link “Fifo Phonics / container-toolkit · GitLab”, but the web server returned a web page with the text “404 Page Not Found”.

Sorry. A better link.

The newest NVIDIA data center drivers, e.g. 470.161.03, 510.108.03 and 515.86.01, address the compilation issues on 6.x kernels across all current FCOS streams.

A few notes:

  • Published Docker image tags are now convergent with those required by the NVIDIA Container Operator.
  • Additional kernel-specific Docker image tags (not NCO convergent) contain pre-compiled kernel headers for faster startup.
  • The 525 driver is now produced and runs fine on the latest FCOS streams.
  • Since kernel 6.1.x it seems necessary to load the video kernel module manually (modprobe video); otherwise symbol lookup failures occur at driver container runtime. The recommended systemd unit in the documentation has been updated accordingly.
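
As an illustration of the last point, here is a small sketch that loads the video module only on kernels at or above 6.1. It is not the project's systemd unit. The function names are hypothetical, the 6.1 threshold is taken from the note above, and the version comparison relies on GNU sort -V.

```shell
#!/bin/sh
# True if kernel release $1 is at least version $2 (GNU sort -V ordering).
kernel_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "${lowest}" = "$2" ]
}

# Load the video module where the note above reports it is needed.
# Must run as root, before the driver container starts.
load_video_module_if_needed() {
    if kernel_at_least "$(uname -r)" "6.1"; then
        modprobe video
    fi
}
```

Loading the module unconditionally would probably be harmless too. The guard only documents which kernels the workaround is aimed at.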