Can you run Nvidia GPU workloads on FCOS?

Yes, it is possible using Nvidia’s container toolkit and their driver container.

This is not trivial, however, as Fedora CoreOS is not one of the supported platforms.

Since the Nvidia driver container image is kernel-specific, there is also a degree of risk when leaning into the OS auto-update model: an update can bring in a kernel for which no matching driver image exists yet. You can of course defer OS updates to mitigate this.
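As a minimal sketch of deferring updates, the function below writes a Zincati drop-in that turns automatic updates off. The function name and the optional directory argument are my own choices for illustration; the drop-in path and the `[updates]` / `enabled` key follow the Fedora CoreOS auto-update documentation as far as I know.

```shell
#!/bin/sh
# Write a Zincati drop-in that disables automatic OS updates.
# Usage: disable_auto_updates [config_dir]
# config_dir defaults to /etc/zincati/config.d (needs root there).
disable_auto_updates() {
    conf_dir="${1:-/etc/zincati/config.d}"
    mkdir -p "$conf_dir" || return 1
    cat > "$conf_dir/90-disable-auto-updates.toml" <<'EOF'
[updates]
enabled = false
EOF
}
```

After calling it as root, restarting zincati.service should make the setting take effect. Removing the file and restarting the service again re-enables updates once a matching driver image is available.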

Below I link to forks of the Nvidia projects that can serve as a starting point if you want to schedule GPU workloads. At the time of writing they are being used to run GPU workloads on the most recent versions of next, testing and stable.

The driver container project fork automatically rebuilds images for the three FCOS streams daily and makes them available in the project’s built-in public GitLab Docker registry.

NVIDIA_DRIVER_VERSION=470.82.00 # Current minimum for the latest version of all streams.
FEDORA_VERSION_ID=$(grep '^VERSION_ID=' /etc/os-release | cut -d= -f2)
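To show how these two values might be combined with the running kernel release to pick a kernel-specific image, here is a rough sketch. The tag layout produced by `driver_image_tag` is purely hypothetical and is not taken from the fork; check the registry for the naming it actually uses. Both function names are mine.

```shell
#!/bin/sh
# Print VERSION_ID from an os-release file (default /etc/os-release).
# The file is sourced in a subshell so its variables do not leak.
fedora_version_id() {
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$VERSION_ID" )
}

# Compose an image tag from driver version, Fedora version and kernel release.
# HYPOTHETICAL layout: <driver>-<kernel>-fedora<version>; the real registry
# may name its tags differently.
driver_image_tag() {
    driver_version="$1"
    fedora_version="$2"
    kernel_release="$3"
    printf '%s-%s-fedora%s\n' "$driver_version" "$kernel_release" "$fedora_version"
}
```

On a host you would call it as driver_image_tag "$NVIDIA_DRIVER_VERSION" "$(fedora_version_id)" "$(uname -r)" and compare the result against the tags actually published.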

The container runtime project fork (together with additional supporting project forks) builds a series of RPMs that can be layered via rpm-ostree at the command line or, preferably, via a systemd unit. It auto-publishes a GitLab page that has a sample Ignition config here.

Hopefully these will be of use to someone starting down the path of scheduling GPU workloads on their Fedora CoreOS-based infrastructure.


Note that the driver project now publishes to Docker Hub (benefiting from security scanning) and the supported GPU driver versions have moved forward across all streams. The CI/CD pipeline now pre-compiles the kernel modules and stores them in the kernel-specific images for faster startup.

The container-toolkit fork has also been brought up to v1.9.0, which adds cgroup v2 support.
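If you are unsure whether a node is actually running cgroup v2, a quick check along these lines may help. It relies on GNU `stat` reporting `cgroup2fs` for a unified hierarchy and `tmpfs` for the older layout; the function names are illustrative only.

```shell
#!/bin/sh
# Map a filesystem type name (as printed by `stat -fc %T`) to a cgroup version.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo v2 ;;      # unified hierarchy
        tmpfs)     echo v1 ;;      # legacy hierarchy mounted under a tmpfs
        *)         echo unknown ;;
    esac
}

# Report the cgroup version for a mount point (default /sys/fs/cgroup).
cgroup_version() {
    fs_type=$(stat -fc %T "${1:-/sys/fs/cgroup}") || return 1
    classify_cgroup_fs "$fs_type"
}
```

Running cgroup_version on the host prints v2 when the unified hierarchy is mounted.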

The container-toolkit fork has now been brought up to v1.10.0, and x86 fedora37 artifacts are produced and validated. aarch64 artifacts are also produced but not validated.

Meanwhile, the upstream NVIDIA driver project has accepted a pull request, so it now builds a Fedora driver container image (not yet published to NCR but available in their GitLab repo), and work seems to be underway to support Fedora in the container-toolkit (fedora35 artifacts added in v1.11.0+).