I just installed Fedora on my SSD, but it crashes constantly. I really don’t know how to debug this, so my gut instinct is to reinstall the system. Any suggestions would be appreciated!
Information that might be related:
The system log said something about the kernel being tainted. The report is here.
When the system crashes, it first showed the following.
I got these logs but I forgot where I got them (maybe from journalctl):
[ 308.588450] nouveau 0000:01:00.0: chrome[8123]: VMM allocation failed: -110
[ 329.696370] nouveau 0000:01:00.0: DRM: failed to idle channel 8 [DRM]
WARNING: CPU: 8 PID: 591 at drivers/pci/pci.c:2269 pci_disable_device+0x88/0x90
With Linux, reinstalling often just loops back to the same problem without providing any details that would help resolve the issue, so please collect some data first. Since you have errors for sda3, it is possible the drive is failing. You should use the Live Installer to copy important data to another location.
Assuming you are using Fedora Workstation, the Live Installer has GNOME Disks, which you can use to check the health of your SSD. You can also use the terminal in the Live Installer to install inxi and run inxi -Fzxx to get a (sanitized) hardware summary that will help with troubleshooting. Post the output as pre-formatted text using the </> button from the top line of the text entry panel.
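A minimal sketch of those terminal steps from the Live session (assuming the live session has network access):

sudo dnf install inxi   # install the inxi reporting tool in the live environment
inxi -Fzxx              # full, sanitized hardware summary to post here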
That means you have installed drivers that are not from kernel.org. The inxi report will tell us more about that.
Not necessarily, although this is the reason in most cases. cat /proc/sys/kernel/tainted outputs the value of the taint, and that value tells more about it. There are other reasons for a tainted kernel too: something could be broken. Here is something to interpret the number that the above cat command outputs: Tainted kernels — The Linux Kernel documentation . If it is any value other than 0, the output of journalctl -k --no-hostname --boot=0 will contain details about it.
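As a rough sketch, you can list which taint bits are set and then look each bit number up in the linked documentation (for example, 512 corresponds to bit 9, which means the kernel issued a warning):

t=$(cat /proc/sys/kernel/tainted)
# walk the bitmask and report each bit that is set
for i in $(seq 0 17); do
  [ $(( (t >> i) & 1 )) -eq 1 ] && echo "taint bit $i is set"
done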
You might also use smartctl (there is plenty of documentation about this command) to check the affected drive(s). Maybe one of the drives has logged errors that indicate a hardware failure or the like.
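For example (assuming the drive is /dev/sda; adjust the device name to your system):

sudo smartctl -a /dev/sda   # overall health, attributes and error log
sudo smartctl -x /dev/sda   # extended output, if the drive supports it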
As George indicated: get more data and check out the cause before just reinstalling. If the drive is failing, a reinstallation might even work for some time, but then you might experience severe data loss at any time. In any case, more data is necessary.
The result of cat /proc/sys/kernel/tainted is 512.
For smartctl, I got health status OK. Here is the result:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.9-200.fc41.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: J.ZAO QL
Product: SERIES 2TB SSD
Revision: 1.00
Compliance: SPC-4
User Capacity: 2,048,408,248,320 bytes [2.04 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: Solid State Device
Logical Unit id: 0x3001237923792379
Serial number: 0000000000000000
Device type: disk
Local Time is: Sun Jan 19 00:23:07 2025 +08
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C
Error Counter logging not supported
Device does not support Self Test logging
I got perplexed by the journalctl results and am still trying to figure out which details are needed. I found this ACPI error but don’t know whether it’s related.
Jan 19 06:22:11 kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.TXHC.RHUB.SS03._UPC], AE_ALREADY_EXISTS (20240827/dswload2-326)
Jan 19 06:22:11 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20240827/psobject-220)
If the output is 512, there is a kernel warning somewhere in the journal. It is delimited by two lines: kernel: ------------[ cut here ]------------
and something like kernel: ---[ end trace 0000000000000000 ]---
(of course both lines begin with their respective timestamps)
The warning is what is in between these two lines (so there are several more lines between the two). This warning can be related to your issue, but it does not have to be.
You can put the warning here (please put it into a code box); maybe it contains an indication.
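A rough way to pull those blocks out of the current boot’s kernel log (just a sketch):

journalctl -k --no-hostname --boot=0 | sed -n '/cut here/,/end trace/p'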
However, as far as the possibility of hardware failure is concerned, Barry’s suggestion might be more indicative.
Though my earlier smartctl -a test said SMART support is: Enabled, when I try smartctl with either the -a or -x parameter now, it says SMART support is: Disabled. I can’t enable it with --smart=on or -s on. If I use -d it will show NVMe device successfully opened, but SMART support is still disabled.
These are the warnings I found between cut here and end trace (I can’t post the full content between these two lines, since it exceeds the word limit):
Jan 18 22:26:41 kernel: WARNING: CPU: 2 PID: 5426 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:26:57 kernel: WARNING: CPU: 4 PID: 5427 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:27:13 kernel: WARNING: CPU: 4 PID: 5426 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:27:28 kernel: WARNING: CPU: 1 PID: 5427 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:27:50 kernel: WARNING: CPU: 1 PID: 127 at drivers/pci/pci.c:2269 pci_disable_device+0x88/0x90
Jan 19 00:12:35 kernel: WARNING: CPU: 4 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Jan 19 00:12:54 kernel: WARNING: CPU: 7 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 19 00:12:54 kernel: WARNING: CPU: 7 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Jan 19 00:12:55 kernel: WARNING: CPU: 4 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Jan 19 00:13:14 kernel: WARNING: CPU: 3 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 19 00:13:14 kernel: WARNING: CPU: 2 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Does that mean something is wrong with the nouveau driver? Should I try to reinstall it?
Please let me know if you need additional information, and thanks again for your help!
My guess, given the lack of SMART support, is that the device is either failing or of poor quality.
Something is wrong with the nvidia driver.
The driver is part of the linux kernel and does not need reinstalling.
Make sure you have the nvidia firmware installed.
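A quick check from the terminal might look like this (the package name nvidia-gpu-firmware is my assumption for Fedora’s split firmware packages; adjust it if your release names it differently):

rpm -q nvidia-gpu-firmware || sudo dnf install nvidia-gpu-firmware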
Given you have disk issues you might have damage to key system files.
Sorry, I’m slow, but if I have damage to key system files, does that mean I need to reinstall? Or repair the system partition? Or is it an issue with the SSD? Thanks for your help!
nouveau is only used if the rpmfusion nvidia drivers are not loaded.
Something is wrong with your rpmfusion driver install.
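To see what is actually installed and which kernel module is in use, something like this should help (just a sketch, not specific to any driver version):

rpm -qa '*nvidia*'                  # RPM Fusion nvidia packages, if any are installed
lsmod | grep -E 'nouveau|nvidia'    # which GPU module is currently loaded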
I do not know, there is not enough information.
But if the SSD is failing, then a reinstall will not help in the long run.
If you have to replace the SSD, consider getting a better-quality one.
I guess you should try a reinstall at this point and see if that helps.
Well, I used to prefer nouveau even when I had nvidia hardware, as it was more stable and taken into account in our QA. I would not say this is a mistake, but …
… yes, there is an issue with nouveau. If I read the timestamps correctly, I assume there are many errors that all contain such a warning? I am wondering because each kernel warning usually has many lines (all logged at the same second, +/- 1). In your post, I see many one-line warnings spread over different periods; that’s why I ask. I am not sure if I can help with a nouveau issue, but you might check your preferred search engine and search for that exact line. Maybe that adds some information, and people more experienced with nvidia hardware might make more sense out of a full warning with all its entries. You might also provide a full log of a boot through an external link or in a code box.
However, without further indication, I would guess that the likelihood that the nouveau issue is the cause of the sda3 failure is below 50%, although both might just be symptoms of a different cause. In any case, I would consider a disk failure that affects system files, including those of nouveau (as Barry already suggested), more likely than the other way around.
I agree with that, but retain and provide a full log of journalctl -k --no-hostname --boot=0 and of journalctl --no-hostname --boot=0 (the latter without -k); see the command sketch at the end of this post. We do not yet need the latter, but you might already provide the first, just to have a skim. If this does not add information, we will see anyway how the new installation develops… But keep in mind:
→ Your SMART output is a little odd, so I would emphasize good backups, even if the system seems to work fine after the next installation.
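To capture both logs into files that are easy to attach (the file names are just examples):

journalctl -k --no-hostname --boot=0 > kernel-boot.log   # kernel messages of the current boot
journalctl --no-hostname --boot=0 > full-boot.log        # full journal of the current boot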
Given that the SSD doesn’t fully support S.M.A.R.T., it is either a low-quality drive or it has broken. You could check with the vendor to see if there is a firmware update. Since it appears to be a very recent model, it might be replaceable under warranty. At work we bought hard disks by the case; it was not unusual to encounter a bad drive, and most vendors not named IBM were quick to replace them.