CUDA Initialisation Errors

Hello!

I’ve been having issues with CUDA and have finally resorted to asking here. I’m trying to use CUDA with Blender and PyTorch, but something is wrong with my driver setup. Whenever I try to run a CUDA program, I get an error along the lines of:

failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

To get more information, I ran deviceQuery from the CUDA samples[1] and this gave me very strange results.

The first time I run it after a restart, I get this output:

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

but after that, every time I run it, I get this:

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL
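For reference, the numbers deviceQuery prints after "returned" are CUDA Runtime API error codes (the cudaError_t enum). A minimal sketch of a lookup for the two codes seen here, plus two other common ones for contrast; the helper name describe is hypothetical and the table is deliberately incomplete:

```python
# Partial table of CUDA Runtime API error codes (cudaError_t).
# Only the codes relevant to this thread are listed; it is not exhaustive.
CUDA_RUNTIME_ERRORS = {
    0: "cudaSuccess",
    3: "cudaErrorInitializationError",   # first run after a restart
    35: "cudaErrorInsufficientDriver",   # driver older than the runtime
    100: "cudaErrorNoDevice",            # no CUDA-capable device detected
    999: "cudaErrorUnknown",             # every later run
}


def describe(code: int) -> str:
    """Return the enum name for a code, or a note that it is not in the table."""
    return CUDA_RUNTIME_ERRORS.get(code, f"code {code} not in this table")


if __name__ == "__main__":
    for code in (3, 999):
        print(code, describe(code))
```

Neither code points at a specific cause: 3 only says the driver and runtime could not be initialised, and 999 is the generic fallback.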

After some digging online, I found that unloading and reloading the nvidia_uvm module with rmmod and modprobe sometimes fixes this temporarily. Immediately after doing that, I get this (I think ideal) output from deviceQuery:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2070 Super with Max-Q Design"
  CUDA Driver Version / Runtime Version          12.3 / 12.2
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7974 MBytes (8361738240 bytes)
  (040) Multiprocessors, (064) CUDA Cores/MP:    2560 CUDA Cores
  GPU Max Clock rate:                            1155 MHz (1.15 GHz)
  Memory Clock rate:                             5501 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.2, NumDevs = 1
Result = PASS

but it always subsequently goes back to the error 3 and then 999 after a couple of seconds.
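The temporary workaround described above (unload nvidia_uvm, then load it again) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the commands need root, and rmmod will refuse to unload the module while something is still using it.

```python
import subprocess
from typing import List


def uvm_reload_commands(module: str = "nvidia_uvm") -> List[List[str]]:
    """Commands that unload and then reload a kernel module (root required)."""
    return [["rmmod", module], ["modprobe", module]]


def reload_uvm(dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False."""
    for argv in uvm_reload_commands():
        if dry_run:
            print("would run:", " ".join(argv))
        else:
            # check=True raises CalledProcessError if the command fails,
            # for example when the module is still in use.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    reload_uvm(dry_run=True)
```

Since the fix only lasts a couple of seconds here, this is useful mainly for reproducing the working state on demand, not as a solution.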

Here is my output of lsmod | grep -e "nvidia"

nvidia_uvm           3522560  0
nvidia_drm            118784  4
nvidia_modeset       1585152  3 nvidia_drm
nvidia              62394368  42 nvidia_uvm,nvidia_modeset
i2c_nvidia_gpu         12288  0
video                  77824  3 msi_wmi,i915,nvidia_modeset

and the output of dnf list installed "*nvidia*"

akmod-nvidia.x86_64                                                     3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
kmod-nvidia-6.6.4-200.fc39.x86_64.x86_64                                3:545.29.06-1.fc39                                @@commandline                   
nvidia-gpu-firmware.noarch                                              20231111-1.fc39                                   @updates                        
nvidia-modprobe.x86_64                                                  3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
nvidia-persistenced.x86_64                                              3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
nvidia-settings.x86_64                                                  3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia.x86_64                                              3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda.x86_64                                         3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.i686                                      3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.x86_64                                    3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-kmodsrc.x86_64                                      3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.i686                                           3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.x86_64                                         3:545.29.06-1.fc39                                @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-power.x86_64         

and nvidia-smi

Wed Dec 13 02:54:51 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P0              14W /  80W |      4MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3090      G   /usr/bin/gnome-shell                          2MiB |
+---------------------------------------------------------------------------------------+

I’ve tried fully reinstalling the nvidia drivers from rpmfusion following this guide.

I’m running Fedora 39 Workstation (kernel version 6.6.4-200.fc39.x86_64) on an MSI gaming laptop with a GeForce RTX 2070 Super, and I use the Intel integrated GPU for my primary display. I don’t use Secure Boot.

Thanks in advance for your help!


  1. which are actually a pain to build, because they require GCC 12

You might be seeing something similar to what I found with CUDA and the NVIDIA 545 drivers.

My CUDA apps stopped working entirely when the NVIDIA drivers were updated to 545, which also updated CUDA to 12.3.

What I did to recover was really simple: I downgraded the NVIDIA drivers back to version 535.129.

  1. sudo dnf remove \*nvidia\*545\*
  2. sudo dnf install akmod-nvidia-535.129\* xorg-x11-drv-nvidia-cuda-535.129\* nvidia-{persistenced,settings}-535.129\*
  3. wait at least 5 minutes after the installation in step 2 completes, then reboot

This downgraded the NVIDIA driver to 535.129 and CUDA to 12.2.
After the reboot my CUDA apps worked again.
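The wait in step 3 is most likely there to give akmods time to build the kernel module for the new driver version. One way to confirm the build finished before rebooting is to ask modinfo which version of the nvidia module is installed. A small sketch, where the helper names are hypothetical and the check only compares the major branch number:

```python
import subprocess
from typing import Optional


def nvidia_module_version() -> Optional[str]:
    """Return the output of `modinfo -F version nvidia`, or None if unavailable."""
    try:
        result = subprocess.run(
            ["modinfo", "-F", "version", "nvidia"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # modinfo is missing, or no nvidia module has been built yet.
        return None
    return result.stdout.strip() or None


def matches_branch(version: Optional[str], branch: str) -> bool:
    """True when a version such as '535.129.03' belongs to branch '535'."""
    return version is not None and version.split(".")[0] == branch


if __name__ == "__main__":
    version = nvidia_module_version()
    print("installed nvidia module:", version)
    print("is 535 branch:", matches_branch(version, "535"))
```

If the function returns None right after step 2, the module has probably not been built yet and rebooting at that point would leave the system without a working driver.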

Thanks for the reply. I just tried this, and unfortunately it doesn’t work - I have the exact same behaviour as before. It did however successfully downgrade the CUDA version to 12.2.

For reference, this is now my output for dnf list installed "*nvidia*":

Installed Packages
akmod-nvidia.x86_64                                             3:535.129.03-1.fc39                         @rpmfusion-nonfree
kmod-nvidia-6.6.6-200.fc39.x86_64.x86_64                        3:535.129.03-1.fc39                         @@commandline     
nvidia-gpu-firmware.noarch                                      20231111-1.fc39                             @updates          
nvidia-persistenced.x86_64                                      3:535.129.03-1.fc39                         @rpmfusion-nonfree
nvidia-settings.x86_64                                          3:535.129.03-1.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia.x86_64                                      3:535.129.03-2.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia-cuda.x86_64                                 3:535.129.03-2.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia-cuda-libs.i686                              3:535.129.03-2.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia-cuda-libs.x86_64                            3:535.129.03-2.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia-kmodsrc.x86_64                              3:535.129.03-2.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia-libs.i686                                   3:535.129.03-2.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia-libs.x86_64                                 3:535.129.03-2.fc39                         @rpmfusion-nonfree
xorg-x11-drv-nvidia-power.x86_64                                3:535.129.03-2.fc39                         @rpmfusion-nonfree

I just updated to the 6.6.6 kernel and now the nvidia 545 driver and cuda 12.3 are working properly for me.

Just made the upgrade to 6.6.6 and driver version 545.29, and still have the same issue, unfortunately. Is there any way I can get more information on what exactly is going wrong when the initialisation error occurs? I wasn’t able to locate any useful error logs in any of the usual places.
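One way to take Blender, TensorFlow and the CUDA samples out of the picture is to call cuInit directly from the driver library and ask the driver for the name and description of whatever code comes back. A minimal sketch using ctypes, assuming the driver library is installed under the usual name libcuda.so.1; the function name probe_cuinit is made up for this example:

```python
import ctypes
from typing import Optional, Tuple


def probe_cuinit(libname: str = "libcuda.so.1") -> Optional[Tuple[int, str, str]]:
    """Call cuInit(0) and return (code, name, description).

    Returns None when the driver library cannot be loaded at all.
    """
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        return None

    # Driver API signatures: CUresult cuInit(unsigned int),
    # CUresult cuGetErrorName(CUresult, const char**), same for cuGetErrorString.
    cuda.cuInit.argtypes = [ctypes.c_uint]
    cuda.cuInit.restype = ctypes.c_int
    for fn in (cuda.cuGetErrorName, cuda.cuGetErrorString):
        fn.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
        fn.restype = ctypes.c_int

    code = cuda.cuInit(0)

    name = ctypes.c_char_p()
    text = ctypes.c_char_p()
    cuda.cuGetErrorName(code, ctypes.byref(name))
    cuda.cuGetErrorString(code, ctypes.byref(text))

    def decode(ptr: ctypes.c_char_p) -> str:
        # The driver leaves the pointer NULL for codes it does not recognise.
        return ptr.value.decode() if ptr.value else "?"

    return code, decode(name), decode(text)


if __name__ == "__main__":
    print(probe_cuinit())
```

This will not explain the failure by itself, but it gives a reproducer with no application in between, which makes it easier to line up the exact time of the failure with the kernel log.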

Specifically which app is showing the error?
Are there any related entries in journalctl for the specific time that may show a clue?
What exactly is seen on the screen that shows this error?

You keep saying "initialisation error" but have not given details of exactly what you are doing or which error messages you are seeing.

I’m sorry I haven’t been specific - it’s because there’s barely any error message at all, except for the one I already posted. For example, in Blender when I open the preferences to set the GPU to CUDA, the error shows up on the console as:

CUDA cuInit: Unknown error

and in-app:

[image: screenshot of the in-app error, not preserved]

During TensorFlow initialization, the output is:

2023-12-16 09:35:02.218854: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:276] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2023-12-16 09:35:02.218993: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: fedora
2023-12-16 09:35:02.219066: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: fedora
2023-12-16 09:35:02.219339: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 545.29.6
2023-12-16 09:35:02.219423: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 545.29.6
2023-12-16 09:35:02.219461: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:309] kernel version seems to match DSO: 545.29.6

and when trying to use NVENC in HandBrake:

[09:36:55] encavcodecInit: H.265 (Nvidia NVENC)
[09:36:55] encavcodec: encoding at rc=vbr, multipass=fullres, 7.00
[09:36:55] encavcodec: encoding with stored aspect 1/1
[hevc_nvenc @ 0x7f4d50781e80] dl_fn->cuda_dl->cuInit(0) failed -> CUDA_ERROR_UNKNOWN: unknown error
[09:36:57] encavcodecInit: avcodec_open failed
[09:36:57] Failure to initialise thread 'FFMPEG encoder (libavcodec)'

and there is nothing that stands out in the output of journalctl -b | grep -i drm, journalctl -b | grep -i nvidia or journalctl -b | grep -i cuda, either at startup or when a CUDA initialisation error occurs. And once again, thanks for trying to help - I appreciate it.
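The NVIDIA kernel module usually tags its kernel messages with the prefix NVRM, and GPU faults are reported as Xid events. Such lines do not necessarily contain the words "nvidia", "drm" or "cuda", so the greps above may miss them. A rough sketch that filters the current boot's kernel log for these markers; the pattern and function names are assumptions for illustration, not an official list:

```python
import re
import subprocess
from typing import List

# Markers that commonly appear in NVIDIA driver kernel messages.
GPU_LOG_PATTERN = re.compile(r"NVRM|Xid|nvidia[-_]uvm", re.IGNORECASE)


def gpu_log_lines(log_text: str) -> List[str]:
    """Keep only the lines that match one of the GPU-related markers."""
    return [line for line in log_text.splitlines() if GPU_LOG_PATTERN.search(line)]


def kernel_log() -> str:
    """Kernel messages for the current boot, or '' if journalctl is unavailable."""
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    for line in gpu_log_lines(kernel_log()):
        print(line)
```

Running it once right after boot and again right after a failed cuInit, then comparing the two outputs, would show whether the driver logs anything at the moment the error appears.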

There was a thread recently about Blender not working, and I seem to recall it was related to CUDA. I can’t find it right now, but you may be able to find it by searching for "blender" here.

@rollonit I have exactly the same issue on a ThinkPad P14s with an RTX A500.

Have you managed to fix it?

Unfortunately, I haven’t had any luck so far - tried some of the solutions for other CUDA issues on here and none of them worked. Eventually ended up using Windows for the thing I needed.

I’m currently on 545.29, BTW - and I’m still looking for ways to fix this.