In my previous threads I couldn’t boot into the latest kernels because of the nVidia driver. Last time, I fixed it by installing the RPMFusion nVidia driver so I don’t have to recompile the driver with each kernel update. This worked for the kernel update that happened next, but not for the one that happened after that one. Now it freezes up during boot, again.
This problem started about a week and a half ago; I was on a vacation for most of that time so I’m just now getting around to trying to fix this.
Now, it’s harder to find the problem because I can’t find anything specific in the error log that’s causing the problem. So last night I compared error logs from a successful boot on the old kernel and an unsuccessful boot on the new kernel, side by side. The first big difference between the two, unsurprisingly, concerned nVidia. The following lines appear in the successful boot, but not in the unsuccessful one:
Feb 02 21:40:57 localhost.localdomain kernel: nvidia: loading out-of-tree module taints kernel.
Feb 02 21:40:57 localhost.localdomain kernel: nvidia: module license ‘NVIDIA’ taints kernel.
Feb 02 21:40:57 localhost.localdomain kernel: Disabling lock debugging due to kernel taint
Feb 02 21:40:57 localhost.localdomain kernel: nvidia: module verification failed: signature and/or required key missing - tainti>
Feb 02 21:40:57 localhost.localdomain kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 511
Feb 02 21:40:57 localhost.localdomain kernel:
Feb 02 21:40:57 localhost.localdomain kernel: nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:o>
Feb 02 21:40:57 localhost.localdomain kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 495.46 Wed Oct 27 16:31:33 UTC 20>
Feb 02 21:40:57 localhost.localdomain kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 495.>
Feb 02 21:40:57 localhost.localdomain kernel: [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
Feb 02 21:40:57 localhost.localdomain kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more th>
Feb 02 21:40:57 localhost.localdomain kernel: caller _nv000717rm+0x1ad/0x200 [nvidia] mapping multiple BARs
Feb 02 21:40:57 localhost.localdomain systemd-udevd: nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidi>
Feb 02 21:40:58 localhost.localdomain systemd-udevd: nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/>
Feb 02 21:40:58 localhost.localdomain kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 0
So, when it tries to boot on the new kernel, it’s just not starting the nVidia processes. Does anyone know why this might be? I’m totally stumped.
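The side-by-side comparison described above can be partly automated. This is only a minimal sketch, assuming the systemd journal keeps logs from earlier boots and that GNU grep and sed are available. The function names strip_journal_prefix and nvidia_only_in_first are made up for illustration.

```shell
# Strip the "Mon DD HH:MM:SS hostname " prefix that journalctl's default
# output puts on each line, so two boots can be compared on message text.
strip_journal_prefix() {
    sed -E 's/^[A-Z][a-z]{2} +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} [^ ]+ //'
}

# Print the lines mentioning nvidia (case-insensitive) that occur in the
# first file but not, as an identical whole line, in the second file.
nvidia_only_in_first() {
    grep -i nvidia "$1" | grep -vxFf "$2"
}

# Usage sketch (the file names are arbitrary):
#   journalctl -b -1 --no-pager | strip_journal_prefix > good-boot.txt
#   journalctl -b -0 --no-pager | strip_journal_prefix > bad-boot.txt
#   nvidia_only_in_first good-boot.txt bad-boot.txt
```

Which boot offset is the good one depends on the order you booted in; journalctl --list-boots shows the offsets. Lines that carry a process ID or device number can differ between boots even when nothing is wrong, so treat the result as a starting point rather than a verdict.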
Very simple, really.
If you run dmesg right after a failed boot, like this, it may give you the answer: dmesg | grep -iE "secure|nvidia"
Please post that output so we can verify it.
What I anticipate is that it will say something like “secure boot enabled”.
If that is the case, then you need to boot into bios and disable secure boot, then boot and the nvidia driver should load properly.
The explanation is that when secure boot is enabled, the kernel refuses to load an unsigned (therefore insecure) module. It is refusing to load the nvidia modules for that reason.
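A small helper can make the secure boot check above a yes/no answer. This is a sketch only: the function name secure_boot_state is made up, and the exact wording of the kernel's message is an assumption, so the match is deliberately loose.

```shell
# Read kernel log text on stdin and report whether any line mentions
# secure boot being enabled. The kernel's exact wording is an assumption;
# the match is case-insensitive and allows "secureboot" or "secure boot".
secure_boot_state() {
    if grep -qiE 'secure ?boot.*enabled'; then
        echo "enabled"
    else
        echo "not reported as enabled"
    fi
}

# Usage sketch (dmesg may need root on some systems):
#   sudo dmesg | secure_boot_state
# Where mokutil is installed, it reports the firmware state directly:
#   mokutil --sb-state
```

If the helper prints "not reported as enabled", that only means the log text did not say so; the BIOS setup screen remains the authoritative place to check.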
In your initial post the above is what you said. The discussion says you sometimes get the driver to load and sometimes not. We don’t even know what kernel works and what doesn’t, nor how you can sometimes get nvidia to load or not.
Please provide enough detail for us to even be able to evaluate the situation. Some of the things that can help are:
1. uname -a output
2. inxi -Fzxx output
3. Update status. Hopefully fully updated just before you do steps 1 & 2.
4. Discussion about what you do to achieve success (or not) in loading nvidia drivers
5. journalctl -b -0 | grep -iE "secure|nvidia|fail" output for when nvidia actually loads, and again for when it fails to load. You can change the number in the “-b -0” to give us the different perspectives, since that number tells journalctl to give the output of the current boot (-0), the last boot (-1), or further back (-2 … etc.)
6. sudo dnf list installed \*nvidia\*
Some of that (possibly 4) may need to be put on pastebin due to the size, and you should then paste the links here.
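The items requested above could be gathered into one file for pasting. This is a rough sketch, not a definitive script: run_section and collect_report are hypothetical helper names, and it assumes inxi and dnf are installed and that the journal still holds the previous boot.

```shell
# Run one command and print its output under a heading. A missing tool or
# a failing command is noted in the report instead of stopping it.
run_section() {
    title=$1
    shift
    printf '===== %s =====\n' "$title"
    "$@" 2>&1 || printf '(exited with status %s: %s)\n' "$?" "$*"
    printf '\n'
}

# Collect the requested details. Note that grep exits with status 1 when
# nothing matches, which shows up as an "exited with status 1" note.
collect_report() {
    run_section "uname -a" uname -a
    run_section "inxi -Fzxx" inxi -Fzxx
    run_section "installed nvidia packages" dnf list installed '*nvidia*'
    run_section "current boot, filtered" \
        sh -c 'journalctl -b -0 --no-pager | grep -iE "secure|nvidia|fail"'
    run_section "previous boot, filtered" \
        sh -c 'journalctl -b -1 --no-pager | grep -iE "secure|nvidia|fail"'
}

# Usage sketch (the output file name is arbitrary):
#   collect_report > nvidia-report.txt
```

Quoting the pattern as '*nvidia*' serves the same purpose as the backslashes in \*nvidia\*: it stops the shell from expanding the asterisks against file names in the current directory before dnf sees them.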
So far I have only been able to speak in generic terms about getting the nvidia drivers to load since you have been unwilling to share details as to exactly what is happening.
Again, I’m sorry for causing any confusion. I didn’t know what information you needed.
The kernel version that was released near the end of January is the last one that works with my machine. I think–I’m at work right now so I can’t get the number–that it was 5.16.7. Any time I load it with this kernel, it works fine and all the drivers load.
The kernel update that came out a week later, something like the 2nd of February, I think it was 5.16.8, doesn’t work. The drivers never load.
The last time I had a problem, it was due to having installed the proprietary nVidia drivers which did not recompile on a kernel update. Replacing them with RPMfusion solved that problem, and the next kernel update, from .6 to .7, went smoothly.
I’ll get the other stuff after I get home later today.
The fact that the kernel update caused the nvidia driver to not work, even with having installed the rpmfusion drivers leads me to question exactly what packages are installed. To identify the packages and what may be the actual problem please post the output of dnf list installed \*nvidia\* as previously requested. The rest of the output requested will assist as well.
So I just booted it up into command line and did a fresh update which installed the latest kernel. After rebooting it again into the new kernel, I saw something in the bootup lines scrolling up the screen that wasn’t there before:
[FAILED] Failed to start nvidia-powerd service.
See ‘systemctl status nvidia-powerd.service’ for details.
Then it booted straight into command line mode, interesting because I didn’t tell it to do that this time. So first I did “dnf list installed *nvidia*” and then I entered the command it told me to…
As for the other requested info…
Linux localhost.localdomain 5.16.8-200.fc35.x86_64 #1 SMP PREEMPT Tue Feb 8 20:58:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
inxi -Fzxx (Returned a whole page of output but the relevant line says)
Device-1: NVIDIA GP104 [GeForce GTX 1060 6GB] vendor: Gigabyte driver: nvidia v: 510.47.03
bus-ID: 02:00.0 chip-ID: 10de:1b83 class-ID: 0300
That’s all I have right now, I’m still trying to figure out how to pipe all the mass of dmesg output into a file I can read.
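For getting dmesg output into a readable file, plain redirection (dmesg > dmesg.txt) is usually enough. The sketch below goes one step further and keeps both the full log and a filtered copy. The function name save_log and the file names are made up for illustration.

```shell
# Read log text on stdin. Write the complete text to the file named by $1
# and only the lines mentioning "secure" or "nvidia" to the file named by $2.
save_log() {
    tee "$1" | grep -iE 'secure|nvidia' > "$2"
}

# Usage sketch (dmesg may need root on some systems):
#   sudo dmesg | save_log dmesg-full.txt dmesg-nvidia.txt
#   less dmesg-full.txt
```

The filtered file ends up empty if nothing matched, which is itself useful to report.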
OK this is weird. I just booted it into the latest kernel, making sure not to boot into command line, and it…worked. EVEN THOUGH it still said during boot “Failed to start nvidia-powerd service.” Running systemctl status nvidia-powerd.service in a Terminal window produced the exact same results as in the photo above when I ran it in command line. So something’s still not quite right.
So I just did a dmesg (in KDE, with the nVidia driver loaded) and within the output I found:
[ 13.037269] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card1/input11
[ 13.037377] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card1/input12
[ 13.037482] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card1/input13
[ 13.037574] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card1/input14
[ 13.037653] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card1/input15
[ 13.037729] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card1/input16
[ 13.037807] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card1/input17
[ 13.979847] nvidia: loading out-of-tree module taints kernel.
[ 13.979872] nvidia: module license ‘NVIDIA’ taints kernel.
[ 13.979874] Disabling lock debugging due to kernel taint
[ 13.992772] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 14.035597] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[ 14.037000] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 14.236535] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.47.03 Mon Jan 24 22:58:54 UTC 2022
[ 15.164331] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 15.186915] nvidia-uvm: Loaded the UVM driver, major device number 509.
[ 15.283932] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 510.47.03 Mon Jan 24 22:51:43 UTC 2022
[ 15.394997] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[ 15.632966] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000dffff window]
[ 15.632971] caller _nv000651rm+0x1ad/0x200 [nvidia] mapping multiple BARs
Just a suggestion, but if you had a lot of updates at once, then the kernel and nvidia may have gotten updated at the same time. You might try removing nvidia, rebooting, making sure everything is up to date, then installing the latest nvidia driver and rebooting to see if that fixes the issue. I had an issue with the 5.16 kernel when nvidia was updated to the 510 driver, which, after a couple of days and updating everything, is back to normal now.
To go along with the suggestions of @straycat & @oprizal it seems something may have gotten slightly corrupted. I would suggest that you do a little bit of reinstalling to make sure everything is actually cleanly installed and the nvidia modules are clean and properly built.
First boot to the latest kernel which the pic shows as 5.16.8, then do sudo dnf remove kmod-nvidia-5.16.8 to remove the modules for the kernel you say it won’t load for. Follow that with sudo dnf reinstall akmod-nvidia xorg-x11-drv-nvidia* to reinstall all the nvidia packages and rebuild the modules for the currently running kernel.
Those steps will first make certain that all the nvidia packages are clean and properly installed so the modules will be rebuilt for that kernel version.
@computersavvy sudo dnf remove kmod-nvidia-5.16.8 only returns “No packages marked for removal. Dependencies resolved. Nothing to do. Complete!” I thought “maybe that should say akmod instead of kmod” and that got the same result. Maybe this is part of the problem?
I went over to the Nvidia site and found a new driver version was released only two days ago on the 14th.
I’ll have to finish this later, during the afternoon.
That lsmod output shows the nvidia driver is properly loaded and functioning, and you have 3 kernel versions installed, as I expected.
A newer driver from nvidia will very quickly make its way to rpmfusion so that is only a matter of days before it will update automatically.
In the meantime, did you retry the failed sudo dnf remove kmod-nvidia-5.16.8 with a trailing wildcard, like this: sudo dnf remove kmod-nvidia-5.16.8* and did you try the sudo dnf reinstall akmod-nvidia xorg-x11-drv-nvidia* command (while booted to the 5.16.8 kernel) to see its output?
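Why the trailing wildcard matters can be illustrated with plain shell pattern matching. This is a simplified sketch: name_matches is a made-up helper, the package name is an assumption built from the kernel release in the uname -a output, and dnf's real matching also looks at version and release fields.

```shell
# Return success if the name in $1 matches the shell-style pattern in $2.
# This imitates glob matching on the package name only.
name_matches() {
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Assumed name of the locally built module package (not taken from the
# thread's dnf output, which is not shown).
pkg="kmod-nvidia-5.16.8-200.fc35.x86_64"

name_matches "$pkg" "kmod-nvidia-5.16.8"  && echo "exact: match" || echo "exact: no match"
name_matches "$pkg" "kmod-nvidia-5.16.8*" && echo "glob: match"  || echo "glob: no match"
```

Under that assumption the exact string does not match and the wildcard form does, which would be consistent with the "No packages marked for removal" result. Running rpm -qa 'kmod-nvidia*' would show the real package names on the machine.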
I just did the remove and reinstall. After rebooting I thought it had failed to boot again, but it turns out it was just taking a very long time, and everything still works fine. I note that, just like the last couple successful boots, I still see “nvidia-powerd failed to start” in the bootup messages. Maybe that’s something that’ll get resolved when the driver updates?
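To look into the remaining nvidia-powerd message, it may help to list which nvidia-related units systemd considers failed and then read that unit's own log. This is a sketch only: failed_nvidia_units is a hypothetical helper, and it assumes the usual column layout of systemctl list-units with the unit name in the first column.

```shell
# Read the output of `systemctl list-units --failed --plain --no-legend`
# on stdin and print the names of failed units whose name contains "nvidia".
failed_nvidia_units() {
    awk '$1 ~ /nvidia/ { print $1 }'
}

# Usage sketch:
#   systemctl list-units --failed --plain --no-legend | failed_nvidia_units
# Then read what the service logged during the current boot:
#   journalctl -b -u nvidia-powerd.service --no-pager
```

The journal lines for the unit usually say why it exited, which is more specific than the one-line "Failed to start" message shown during boot.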