Hello!
I’ve been having issues with my CUDA drivers, and I’ve finally resorted to asking here. I’m trying to use CUDA with Blender and PyTorch, but something is wrong with my driver setup. Whenever I try to run a CUDA program, I get an error along the lines of:
failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
To get more information, I ran deviceQuery from the CUDA samples[1] and this gave me very strange results.
The first time I run it after a restart, I get this output:
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL
but after that, every time I run it, I get this:
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL
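For anyone who wants to reproduce this without building the samples, here is a minimal Python sketch that calls cuInit through ctypes and names the codes seen above. It assumes the driver library is installed as libcuda.so.1. The helper names probe_cuinit and describe are made up for illustration, and the table only covers the codes from this post.

```python
import ctypes

# Only the codes that appear in this post, as (runtime API name, driver API name).
# deviceQuery reports runtime API codes; cuInit reports driver API codes.
KNOWN_CODES = {
    0: ("cudaSuccess", "CUDA_SUCCESS"),
    3: ("cudaErrorInitializationError", "CUDA_ERROR_NOT_INITIALIZED"),
    999: ("cudaErrorUnknown", "CUDA_ERROR_UNKNOWN"),
}


def describe(code):
    """Return a readable label for a numeric CUDA error code."""
    if code in KNOWN_CODES:
        runtime_name, driver_name = KNOWN_CODES[code]
        return f"{code}: {runtime_name} / {driver_name}"
    return f"{code}: not in this small table"


def probe_cuinit(lib_name="libcuda.so.1"):
    """Call cuInit(0) and return its result code, or None if the library is missing.

    lib_name is an assumption about where the driver library lives.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    result = probe_cuinit()
    if result is None:
        print("libcuda.so.1 could not be loaded")
    else:
        print("cuInit returned", describe(result))
```

Running this right after a reboot and again a few seconds later should show whether the driver API itself flips between codes, independent of deviceQuery.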
After some digging online, I found that unloading and reloading the nvidia_uvm module with rmmod and modprobe sometimes fixes this temporarily. Immediately after I do that, I get this (I think ideal) output from deviceQuery:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2070 Super with Max-Q Design"
CUDA Driver Version / Runtime Version 12.3 / 12.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7974 MBytes (8361738240 bytes)
(040) Multiprocessors, (064) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1155 MHz (1.15 GHz)
Memory Clock rate: 5501 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 65536 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.2, NumDevs = 1
Result = PASS
but after a couple of seconds it always goes back to error 3 and then 999.
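The module reload described above can be sketched as a small Python helper. This is only an illustration of the rmmod and modprobe sequence, not a fix. The function names are hypothetical, and by default it only prints the commands, because actually running them needs root.

```python
import shlex
import subprocess


def reload_commands(module="nvidia_uvm"):
    """Return the unload/reload command pair for a kernel module."""
    return [["rmmod", module], ["modprobe", module]]


def reload_module(module="nvidia_uvm", dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in reload_commands(module):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if rmmod or modprobe fails,
            # for example when the module is still in use.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    reload_module()  # dry run: prints the two commands without executing them
```

Calling reload_module(dry_run=False) as root would perform the same reload I have been doing by hand.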
Here is the output of lsmod | grep -e "nvidia":
nvidia_uvm 3522560 0
nvidia_drm 118784 4
nvidia_modeset 1585152 3 nvidia_drm
nvidia 62394368 42 nvidia_uvm,nvidia_modeset
i2c_nvidia_gpu 12288 0
video 77824 3 msi_wmi,i915,nvidia_modeset
and the output of dnf list installed "*nvidia*":
akmod-nvidia.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
kmod-nvidia-6.6.4-200.fc39.x86_64.x86_64 3:545.29.06-1.fc39 @@commandline
nvidia-gpu-firmware.noarch 20231111-1.fc39 @updates
nvidia-modprobe.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
nvidia-persistenced.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
nvidia-settings.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.i686 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-kmodsrc.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.i686 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.x86_64 3:545.29.06-1.fc39 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-power.x86_64
and nvidia-smi:
Wed Dec 13 02:54:51 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2070 ... On | 00000000:01:00.0 Off | N/A |
| N/A 50C P0 14W / 80W | 4MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3090 G /usr/bin/gnome-shell 2MiB |
+---------------------------------------------------------------------------------------+
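As a side note on the lsmod output above, the third column is the use count, and rmmod normally refuses to unload a module whose count is not 0. A minimal sketch of reading those columns in Python follows. The function name parse_lsmod is made up, and the sample text is copied from the lsmod output in this post.

```python
def parse_lsmod(text):
    """Map module name -> (size, use_count, [modules that depend on it])."""
    modules = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the "Module Size Used by" header.
        if len(parts) < 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        users = parts[3].split(",") if len(parts) > 3 else []
        modules[parts[0]] = (int(parts[1]), int(parts[2]), users)
    return modules


# Sample copied from the lsmod output in this post.
SAMPLE = """\
nvidia_uvm 3522560 0
nvidia_drm 118784 4
nvidia_modeset 1585152 3 nvidia_drm
nvidia 62394368 42 nvidia_uvm,nvidia_modeset
"""

mods = parse_lsmod(SAMPLE)
# Modules with a use count of 0 are the ones rmmod can normally unload.
idle = [name for name, info in mods.items() if info[1] == 0]
print(idle)  # prints ['nvidia_uvm']
```

In my case only nvidia_uvm shows a use count of 0, which matches the fact that it is the one module I can unload and reload while the desktop is running.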
I’ve tried fully reinstalling the nvidia drivers from rpmfusion following this guide.
I’m running Fedora 39 Workstation (kernel version 6.6.4-200.fc39.x86_64) on an MSI gaming laptop with a GeForce RTX 2070 Super, and I use the Intel integrated GPU for my primary display. I don’t use Secure Boot.
Thanks in advance for your help!
[1] The samples are actually a pain to build, because they require GCC 12.