Fedora 38 Failing To Boot With NVIDIA Drivers After Software Update

Full disclosure: I also posted this elsewhere, but I thought I might get a more specialized answer here.

Hello all, I am new here, but I have gotten help from others’ posts here before.

I have had NVIDIA drivers working on Fedora with Secure Boot for a while. After a software update the other day (the kind that asks before shutdown whether you want to update, even as an unprivileged user [does this only update flatpaks?]), I have had difficulty booting. Watching the boot log, I noticed that it got stuck on NetworkManager-dispatcher.service. After I disabled that, it got stuck on gdm.service (GNOME Display Manager). When I removed the boot parameters rhgb quiet, it seemed to get stuck on something to do with networking, but there were also warnings about the NVIDIA device. I then removed all the GRUB 2 boot parameters (except the kernel spec) and it booted, although the second monitor now shows white, or flickers in and out of white, on half the screen. It is bootable now, but I would like to use the NVIDIA card. Here are the parameters:

rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 initcall_blacklist=simpledrm_platform_driver_init rhgb quiet initcall_blacklist=simpledrm_platform_driver_init rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1

I tried removing different combinations of these parameters, but the best I seem to be able to do is boot with no NVIDIA driver.

Many of these seem to be duplicated, and I am not sure whether that has anything to do with it. I saw a post on the NVIDIA forums about this and tried something similar, but it did not seem to help.
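To see what the parameter list looks like without the repeats, a small shell sketch like the one below can help. The function name dedupe_params is made up for illustration; it only prints a cleaned-up list and does not change the boot configuration. Duplicated parameters are usually harmless, so this is about readability rather than a confirmed fix.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fedora or GRUB): print each
# whitespace-separated kernel parameter once, keeping the first
# occurrence and the original order.
dedupe_params() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | awk 'NF && !seen[$0]++' | paste -sd ' ' -
}

# The parameter line as posted above.
params='rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 initcall_blacklist=simpledrm_platform_driver_init rhgb quiet initcall_blacklist=simpledrm_platform_driver_init rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1'

dedupe_params "$params"
```

The duplicates most likely come from the GRUB_CMDLINE_LINUX line in /etc/default/grub; on Fedora the grubby tool is one way to add or remove kernel arguments, but check its man page before changing anything.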

Any ideas or steps I should take next?

It seems to randomly work all of a sudden about once every 10 boots or so, either with unmodified boot parameters or with modified ones, and I don’t know why. I have tried repeating the same boot parameters, and it doesn’t work.

So sometimes it boots with the NVIDIA driver working, but most of the time not, and I don’t know how to control it or what causes this.

Also, I would use NVK, but I may want to use CUDA. (P.S. Might there be a way to choose either one as a boot option, or to use NVK as a fallback?)

P.P.S. Is there an open-source alternative to CUDA, like HIP, that I could use with my NVIDIA card? Preferably one compatible with NVK (I am having a lot of issues with the proprietary drivers).

Thanks!

NVK is a work-in-progress NVIDIA driver; are you sure that is what you mean?

Are you using the rpmfusion nvidia driver?
If so did you wait for the driver to be built for the new kernel?
You can see if the build is still running using systemctl list-jobs
and look for an akmods job running.
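As a rough sketch of that check, the snippet below filters the job list for anything mentioning akmods. The function name akmods_job_listed is made up for illustration, and it assumes the build job's unit name contains "akmods" (for example akmods.service).

```shell
#!/bin/sh
# Hypothetical helper: reads `systemctl list-jobs` output on stdin and
# succeeds (exit status 0) if any line mentions an akmods unit.
akmods_job_listed() {
    grep -q 'akmods'
}

# Example use on a live system (left as a comment so nothing runs here):
#   if systemctl list-jobs --no-pager | akmods_job_listed; then
#       echo "akmods is still building, wait before rebooting"
#   else
#       echo "no akmods job listed"
#   fi
```

If no job is listed, that only means nothing is building right now; it does not by itself confirm that a module was built for the running kernel.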

Try running sudo akmods --force --rebuild, then reboot.

For example on my system:

$ sudo akmods --force --rebuild
Checking kmods exist for 6.6.7-200.fc39.x86_64             [  OK  ]
Building and installing nvidia-kmod                        [  OK  ]

These blacklist entries are only there to stop the nouveau driver from loading before the rpmfusion driver has a chance to load. If the rpmfusion driver fails to load, a systemd service (nvidia-fallback.service) loads the nouveau driver after the failure.

You should leave those entries in place.

What is the output of systemctl status nvidia-fallback.service?

Install inxi and show us the output of inxi -FGxx so we can see what the state of your system is and what hardware you have.

Thanks for the reply!

It does not say there are any jobs running. This is the output when I boot without the boot parameters I posted:

System:
  Host: fedora Kernel: 6.6.8-100.fc38.x86_64 arch: x86_64 bits: 64
    compiler: gcc v: 2.39-16.fc38 Desktop: GNOME v: 44.7 tk: GTK v: 3.24.38
    wm: gnome-shell dm: GDM Distro: Fedora release 38 (Thirty Eight)
Graphics:
  Device-1: NVIDIA GA106M [GeForce RTX 3060 Mobile / Max-Q] vendor: Lenovo
    driver: nouveau v: kernel arch: Ampere pcie: speed: 8 GT/s lanes: 8 ports:
    active: none
  Device-2: AMD Cezanne [Radeon Vega Series / Radeon Mobile Series]
    vendor: Lenovo driver: amdgpu v: kernel arch: GCN-5 pcie: speed: 8 GT/s
    lanes: 16 ports: active: eDP-2
  Display: wayland server: X.org v: 1.20.14 with: Xwayland v: 22.1.9
    compositors: 1: gnome-shell 2: xfwm driver: X: loaded: amdgpu,nvidia
    unloaded: fbdev,modesetting,nouveau,vesa alternate: nv dri: radeonsi
    gpu: amdgpu display-ID: 0

akmods can’t find the NVIDIA driver in this state. It also says --rebuild is an unknown option.

When I run sudo dnf install kmod-nvidia-470xx-6.6.8-100.fc38.x86_64.x86_64, it seems to be installed already, and it seems to match my kernel version. akmod-nvidia-470xx.x86_64 seems to be installed as well.

And I have been following the development of NVK. I may have misremembered its readiness from the latest update (I thought I had read that NVK is the default on some distros when it said “This will also signal distros to go ahead and ship NVK by default.”). But even if it is only ready soon, it may be something I try.

Check what --help says. I get this:

$ akmods --help
Checks the akmod packages and rebuilds them if needed

Usage: akmods [OPTIONS]

Options:
 --force             -- try all, even if they failed earlier
 --kernels <kernel>  -- build and install only for kernel <kernel>
                        (formatted the same as 'uname -r' would produce)
 --rebuild           -- rebuild all, even if they are up to date
 --akmod <akmod>     -- build and install only akmod <akmod>

For reference, these are the RPMs that I have installed. I have a desktop RTX 3060.

$ rpm -qa | grep -e akmod -e nvidia
libva-nvidia-driver-0.0.11-1.fc39.x86_64
nvidia-modprobe-545.29.06-1.fc39.x86_64
nvidia-settings-545.29.06-1.fc39.x86_64
akmods-0.5.8-6.fc39.noarch
kmod-nvidia-6.6.6-200.fc39.x86_64-545.29.06-1.fc39.x86_64
kmod-nvidia-6.6.7-200.fc39.x86_64-545.29.06-1.fc39.x86_64
xorg-x11-drv-nvidia-kmodsrc-545.29.06-2.fc39.x86_64
xorg-x11-drv-nvidia-cuda-libs-545.29.06-2.fc39.x86_64
xorg-x11-drv-nvidia-libs-545.29.06-2.fc39.i686
xorg-x11-drv-nvidia-libs-545.29.06-2.fc39.x86_64
xorg-x11-drv-nvidia-power-545.29.06-2.fc39.x86_64
xorg-x11-drv-nvidia-545.29.06-2.fc39.x86_64
akmod-nvidia-545.29.06-2.fc39.x86_64
kmod-nvidia-6.6.8-200.fc39.x86_64-545.29.06-2.fc39.x86_64

Likewise, I’m very interested in this work. As I understand it, we will get incremental improvements that add support for more NVIDIA hardware and more Vulkan versions.

That blog post says they are up to NVIDIA Turing, which is the GTX 16 and RTX 20 hardware; it does not cover the RTX 30 yet.

NVK improvements get added to the kernel and Mesa, so you get the work as Fedora packages newer kernel and Mesa releases. I don’t think there is anything special to do but may be wrong, just use the nouveau driver.

This is what I get from --help:

❯ akmods --help
Checks the akmod packages and rebuilds them if needed

Usage: akmods [OPTIONS]

Options:
 --force             -- try all, even if they failed earlier
 --kernels <kernel>  -- build and install only for kernel <kernel>
                        (formatted the same as 'uname -r' would produce)
 --akmod <akmod>     -- build and install only akmod <akmod>

A while ago I changed some setting somewhere to disable dynamic kernel modules from being loaded… I don’t remember what, though; I don’t think it is kernel.modules_disabled. I removed the ro option from the boot options (before the parameters I posted), but it does not seem to change anything.

Here is the result of the command you posted to search the RPMs:

❯ rpm -qa | grep -e akmod -e nvidia
akmods-0.5.7-10.fc38.noarch
xorg-x11-drv-nvidia-470xx-libs-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-kmodsrc-470.223.02-1.fc38.x86_64
nvidia-settings-470xx-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-470.223.02-1.fc38.x86_64
akmod-nvidia-470xx-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-cuda-libs-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-cuda-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-power-470.223.02-1.fc38.x86_64
nvidia-persistenced-545.29.06-1.fc38.x86_64
nvidia-gpu-firmware-20231211-1.fc38.noarch
kmod-nvidia-470xx-6.6.6-100.fc38.x86_64-470.223.02-1.fc38.x86_64
kmod-nvidia-470xx-6.6.7-100.fc38.x86_64-470.223.02-1.fc38.x86_64
kmod-nvidia-470xx-6.6.8-100.fc38.x86_64-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-kmodsrc-545.29.06-2.fc38.x86_64

I think I may be a little too excited for NVK, or I was not paying enough attention when I went through the article. I thought it supported Turing and later, but it says: Turing and later GPUs sometime in February.

So I look forward to it. I am curious, though, whether it will be compatible with something like HIP. AFAIK it won’t have CUDA support, and I have compute tasks. I am thinking of trying to link it against cupy (in the future).

I don’t think there is anything special to
do but may be wrong, just use the nouveau driver.

Excellent! I keep having problems every few months with the NVIDIA driver.

I run nouveau until I encounter something that needs Nvidia, then switch back to nouveau when that something is done.

I might do that, but I want to use the NVIDIA card for things, and the nouveau driver also doesn’t tend to play nice with a second monitor for me (which is very helpful to me right now). :frowning:

The problem seems to be the nvidia driver.

The nvidia 470xx driver was available long before the RTX 3060 GPU was in use, and does not support wayland at all. The 470 driver may not support the 3060 at all.

I would suggest removing all nvidia 470 driver modules and installing the latest driver in its place so the GPU is properly supported and wayland is also supported.
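Before removing anything, it can help to see which driver branches the installed packages actually belong to. The sketch below is only an illustration: the function name driver_versions is made up, and it assumes NVIDIA driver versions look like a three-digit major number followed by one or two dotted parts (470.223.02, 545.29.06), which is what the package lists in this thread show.

```shell
#!/bin/sh
# Hypothetical helper: reads package names (e.g. from `rpm -qa`) on stdin
# and prints the distinct NVIDIA driver versions found in them.
# Kernel versions (6.6.8-100) and date-style versions (20231211) do not
# match the pattern, so they are ignored.
driver_versions() {
    grep 'nvidia' | grep -oE -e '-[0-9]{3}\.[0-9]+(\.[0-9]+)?-' | tr -d '-' | sort -u
}

# Example use on a live system (left as a comment so nothing runs here):
#   rpm -qa | driver_versions
```

More than one line of output means packages from different driver branches are installed side by side. Whether that matters depends on which packages they are, so treat it as a hint rather than a diagnosis.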

sudo dnf remove \*nvidia\*470xx\* --noautoremove should handle removing all the older drivers.
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda should then install the current nvidia drivers which are the 545.29.06 version

For some reason, this is one of those occasions where the NVIDIA driver has loaded at boot without my changing the boot config, and it seems to be working, so I am posting the results of some of the commands you wanted me to run:

❯ systemctl status nvidia-fallback.service
○ nvidia-fallback.service - Fallback to nouveau as nvidia did not load
     Loaded: loaded (/usr/lib/systemd/system/nvidia-fallback.service; disabled; preset: disabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: inactive (dead)
  Condition: start condition failed at Mon 2024-01-01 11:59:07 EST; 13min ago
             └─ ConditionPathExists=!/sys/module/nvidia was not met

Jan 01 11:59:07 fedora systemd[1]: nvidia-fallback.service - Fallback to nouveau as nvidia did not load was skipped because of an unmet condition check (ConditionPathExists=!/sys/module/nvidia).
❯ rpm -qa | grep -e akmod -e nvidia
akmods-0.5.7-10.fc38.noarch
xorg-x11-drv-nvidia-470xx-libs-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-kmodsrc-470.223.02-1.fc38.x86_64
nvidia-settings-470xx-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-470.223.02-1.fc38.x86_64
akmod-nvidia-470xx-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-cuda-libs-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-cuda-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-470xx-power-470.223.02-1.fc38.x86_64
nvidia-persistenced-545.29.06-1.fc38.x86_64
nvidia-gpu-firmware-20231211-1.fc38.noarch
kmod-nvidia-470xx-6.6.6-100.fc38.x86_64-470.223.02-1.fc38.x86_64
kmod-nvidia-470xx-6.6.7-100.fc38.x86_64-470.223.02-1.fc38.x86_64
kmod-nvidia-470xx-6.6.8-100.fc38.x86_64-470.223.02-1.fc38.x86_64
xorg-x11-drv-nvidia-kmodsrc-545.29.06-2.fc38.x86_64
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   39C    P8    10W /  N/A |     18MiB /  5946MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

inxi -FGxx (some info removed)

System:
  Host: Shows something to do with my router and laptop name ? o.o different than last time Kernel: 6.6.8-100.fc38.x86_64
    arch: x86_64 bits: 64 compiler: gcc v: 2.39-16.fc38 Desktop: GNOME v: 44.7
    tk: GTK v: 3.24.38 wm: gnome-shell dm: GDM Distro: Fedora release 38
    (Thirty Eight)
CPU:
  Info: 8-core model: AMD Ryzen 7 5800H with Radeon Graphics bits: 64
    type: MT MCP arch: Zen 3 rev: 0 cache: L1: 512 KiB L2: 4 MiB L3: 16 MiB
  Speed (MHz): avg: 703 high: 3207 min/max: 400/4463 cores: 1: 400 2: 2444
    3: 400 4: 400 5: 3207 6: 400 7: 400 8: 400 9: 400 10: 400 11: 400 12: 400
    13: 400 14: 400 15: 400 16: 400 bogomips: 102209
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: NVIDIA GA106M [GeForce RTX 3060 Mobile / Max-Q] vendor: Lenovo
    driver: nvidia v: 470.223.02 arch: Ampere pcie: speed: 2.5 GT/s lanes: 8
    ports: active: none
  Device-2: AMD Cezanne [Radeon Vega Series / Radeon Mobile Series]
    vendor: Lenovo driver: amdgpu v: kernel arch: GCN-5 pcie: speed: 8 GT/s
    lanes: 16 ports: active: eDP-1 
  Display: wayland server: X.org v: 1.20.14 with: Xwayland v: 22.1.9
    compositors: 1: gnome-shell 2: xfwm driver: X: loaded: amdgpu,nvidia
    unloaded: fbdev,modesetting,nouveau,vesa alternate: nv dri: radeonsi
    gpu: amdgpu,nvidia,nvidia-nvswitch display-ID: 0
  Monitor-1: DP-4 model: HP LA2205 res: WxH dpi: 90 diag: 558mm (22")
  Monitor-2: eDP-1 model: AU Optronics res: WxH dpi: 142
    diag: 394mm (15.5")
  API: OpenGL v: 4.6 vendor: amd mesa v: 23.1.9 glx-v: 1.4 es-v: 3.2
    direct-render: yes renderer: AMD Radeon Graphics (renoir LLVM 16.0.6 DRM
    3.54 6.6.8-100.fc38.x86_64) 
API: EGL Message: EGL data requires eglinfo. Check --recommends.

I think this boot occurred after an update, but I searched through some logs and couldn’t find what was updated. I am just trying to find a pattern.

I did stop updating a while ago because updates tended to cause breakage more often (the last version I tried, I couldn’t get to work; I don’t know whether I couldn’t get it to work with Secure Boot or in general), and I downgraded to 470xx because it seemed to be favored/more stable for some reason (where I had to install from the NVIDIA website before, I could dnf install it, IIRC).

I may give this a go, but before I begin, I just wanted to ask whether there is anything else I need to install to support Wayland, given this command:

sudo dnf install akmods-nvidia xorg-x11-drv-nvidia-cuda

No, nothing else should be needed. I did note a typo in the command I provided above and corrected it.

It is always critical when installing new drivers and/or kernels that you wait several minutes after the update completes before rebooting. The drivers are built and installed in the background so a quick reboot may interrupt the build and cause the driver to fail.

Thanks! Will do

Patience is needed when updating from the command-line, but I’ve been wondering if future updates using the Gnome Software Manager GUI may reboot too quickly.

I suspect that it will and does when installing or updating packages that require compiling of a driver/kernel module since it seems to not be aware of the background operations and does not wait for the return.

At the command line the user can see when the command completes or at least delay the reboot for a reasonable time. With the software gui it seems set up that certain things trigger a reboot without delaying.

sudo dnf remove \*nvidia\*470xx\* --noautoremove should handle removing all the older drivers.
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda should then install the current nvidia drivers which are the 545.29.06 version

Seems to have worked for now, thank you

I suspect that it will and does when installing or updating packages that require compiling of a driver/kernel module since it seems to not be aware of the background operations and does not wait for the return.

At the command line the user can see when the command completes or at least delay the reboot for a reasonable time. With the software gui it seems set up that certain things trigger a reboot without delaying.

Interesting. I have tried to update through the GUI before, and I have also used dnf update to update software, but will dnf update the kernel/Fedora version too?

Patience is needed when updating from the command-line, but I’ve been wondering if future updates using the Gnome Software Manager GUI may reboot too quickly.

I wonder if this is why I have had trouble updating the NVIDIA driver in the past.

With dnf you can update everything, at your leisure. This includes the ability to do a full version upgrade. In fact, if you are currently using F38 and wish to upgrade to F39 the instructions are here.

possibly