I just installed Fedora on my SSD, but it crashes constantly. I really don’t know how to debug this, so my gut instinct is to reinstall the system. Any suggestions would be appreciated!
Information that might be related:
The system log said something about the kernel being tainted. The report is here.
When the system crashes, it first showed the following.
I got these logs but I forgot where I got them (maybe from journalctl):
[ 308.588450] nouveau 0000:01:00.0: chrome[8123]: VMM allocation failed: -110
[ 329.696370] nouveau 0000:01:00.0: DRM: failed to idle channel 8 [DRM]
WARNING: CPU: 8 PID: 591 at drivers/pci/pci.c:2269 pci_disable_device+0x88/0x90
With Linux, reinstalling often just loops back to the same problem without providing any details that would help resolve the issue, so please collect some data first. Since you have errors for sda3, it is possible the drive is failing. You should use the Live Installer to copy important data to another location.
Assuming you are using Fedora Workstation, the Live Installer has GNOME Disks, which you can use to check the health of your SSD. You can also use the terminal in the Live Installer to install inxi and run inxi -Fzxx to get a (sanitized) hardware summary that will help with troubleshooting. Post the output as pre-formatted text using the </> button from the top line of the text entry panel.
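A minimal sketch of those terminal steps from the Live session (assuming the live session has network access):

sudo dnf install inxi   # install the inxi reporting tool in the live environment
inxi -Fzxx              # full, sanitized hardware summary to post here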
That means you have installed drivers that are not from kernel.org. The inxi report will tell us more about that.
Not necessarily, although this is the reason in most cases. cat /proc/sys/kernel/tainted outputs the value of the taint, and that value tells more about it. There are other reasons for a tainted kernel too: something could be broken. Here is something to interpret the number that the above cat command outputs: Tainted kernels — The Linux Kernel documentation . If it is any value other than 0, the output of journalctl -k --no-hostname --boot=0 will contain details about it.
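As a rough sketch, you can list which taint bits are set and then look each bit number up in the linked documentation (for example, 512 corresponds to bit 9, which means the kernel issued a warning):

t=$(cat /proc/sys/kernel/tainted)
# walk the bitmask and report each bit that is set
for i in $(seq 0 17); do
  [ $(( (t >> i) & 1 )) -eq 1 ] && echo "taint bit $i is set"
done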
You might also use smartctl (there is plenty of documentation about this command) to check the affected drive(s). Maybe one of the drives has logged errors that indicate a hardware failure or the like.
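For example (assuming the drive is /dev/sda; adjust the device name to your system):

sudo smartctl -a /dev/sda   # overall health, attributes and error log
sudo smartctl -x /dev/sda   # extended output, if the drive supports it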
As George indicated: get more data and check out the cause before just reinstalling. If the drive is failing, a reinstallation might even work for some time, but then you might experience severe data loss at any time. In any case, more data is necessary.
The result of cat /proc/sys/kernel/tainted is 512.
For smartctl, I got health status OK. Here is the result:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.9-200.fc41.x86_64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: J.ZAO QL
Product: SERIES 2TB SSD
Revision: 1.00
Compliance: SPC-4
User Capacity: 2,048,408,248,320 bytes [2.04 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: Solid State Device
Logical Unit id: 0x3001237923792379
Serial number: 0000000000000000
Device type: disk
Local Time is: Sun Jan 19 00:23:07 2025 +08
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C
Error Counter logging not supported
Device does not support Self Test logging
I got perplexed by the journalctl results and am still trying to figure out which details are needed. I found this ACPI error but don’t know whether it’s related.
Jan 19 06:22:11 kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.TXHC.RHUB.SS03._UPC], AE_ALREADY_EXISTS (20240827/dswload2-326)
Jan 19 06:22:11 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20240827/psobject-220)
If the output is 512, there is a kernel warning somewhere in the journal. It is delimited by two lines: kernel: ------------[ cut here ]------------
and something like kernel: ---[ end trace 0000000000000000 ]---
(of course both lines begin with their respective timestamps)
The warning is what is in between these two lines (so there are several more lines between the two). This warning can be related to your issue, but it does not have to be.
You can put the warning here (please put it into a code box); maybe it contains an indication.
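A rough way to pull those blocks out of the current boot’s kernel log (just a sketch):

journalctl -k --no-hostname --boot=0 | sed -n '/cut here/,/end trace/p'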
However, as far as the possibility of hardware failure is concerned, Barry’s suggestion might be more indicative.
Though my earlier smartctl -a test said SMART support is: Enabled, when I try smartctl with either the -a or -x parameter now, it says SMART support is: Disabled. I can’t enable it with --smart=on or -s on. If I use -d it will show NVMe device successfully opened, but SMART support is still disabled.
These are the warnings I found between cut here and end trace (I can’t post the full content between these two lines, since it exceeds the word limit):
Jan 18 22:26:41 kernel: WARNING: CPU: 2 PID: 5426 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:26:57 kernel: WARNING: CPU: 4 PID: 5427 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:27:13 kernel: WARNING: CPU: 4 PID: 5426 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:27:28 kernel: WARNING: CPU: 1 PID: 5427 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 18 22:27:50 kernel: WARNING: CPU: 1 PID: 127 at drivers/pci/pci.c:2269 pci_disable_device+0x88/0x90
Jan 19 00:12:35 kernel: WARNING: CPU: 4 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Jan 19 00:12:54 kernel: WARNING: CPU: 7 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 19 00:12:54 kernel: WARNING: CPU: 7 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Jan 19 00:12:55 kernel: WARNING: CPU: 4 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Jan 19 00:13:14 kernel: WARNING: CPU: 3 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:114 r535_gsp_msgq_wait+0x1af/0x1d0 [nouveau]
Jan 19 00:13:14 kernel: WARNING: CPU: 2 PID: 9810 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x167/0x170 [nouveau]
Does that mean something is wrong with the nouveau driver? Should I try to reinstall it?
Please let me know if you need additional information, and thanks again for your help!
My guess, given the lack of SMART support, is that the device is either failing or of poor quality.
Something is wrong with the nvidia driver.
The driver is part of the linux kernel and does not need reinstalling.
Make sure you have the nvidia firmware installed.
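A quick check from the terminal might look like this (the package name nvidia-gpu-firmware is my assumption for Fedora’s split firmware packages; adjust it if your release names it differently):

rpm -q nvidia-gpu-firmware || sudo dnf install nvidia-gpu-firmware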
Given you have disk issues you might have damage to key system files.
Sorry, I’m slow, but if I have damage to key system files, does that mean I need to reinstall? Or repair the system partition? Or is it an issue with the SSD? Thanks for your help!
nouveau is only used if the rpmfusion nvidia drivers are not loaded.
Something is wrong with your rpmfusion driver install.
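To see what is actually installed and which kernel module is in use, something like this should help (just a sketch, not specific to any driver version):

rpm -qa '*nvidia*'                  # RPM Fusion nvidia packages, if any are installed
lsmod | grep -E 'nouveau|nvidia'    # which GPU module is currently loaded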
I do not know, there is not enough information.
But if the SSD is failing, then a reinstall will not help in the long run.
If you have to replace the SSD, consider getting a better-quality one.
I guess you should try a reinstall at this point and see if that helps.
Well, I used to prefer nouveau even when I had nvidia hardware, as it was more stable and taken into account in our QA. I would not say this is a mistake, but …
… yes, there is an issue with nouveau. If I read the timestamps correctly, I assume there are many errors that all contain such a warning? I am wondering because each kernel warning usually has many lines (all logged at the same second, +/- 1). In your post, I see many one-line warnings spread over different periods; that’s why I ask. I am not sure if I can help with a nouveau issue, but you might check your preferred search engine and search for that exact line. Maybe that adds some information, and people more experienced with nvidia hardware might make more sense out of a full warning with all its entries. You might also provide a full log of a boot through an external link or in a code box.
However, without further indication, I would guess that the likelihood that the nouveau issue is the cause of the sda3 failure is below 50%, although both might just be symptoms of a different cause. In any case, I would consider a disk failure that affects system files, including those of nouveau (as Barry already suggested), more likely than the other way around.
I agree with that, but retain and provide a full log of journalctl -k --no-hostname --boot=0 and of journalctl --no-hostname --boot=0 (the latter without -k); see the command sketch at the end of this post. We do not yet need the latter, but you might already provide the first, just to have a skim. If this does not add information, we will see anyway how the new installation develops… But keep in mind:
→ Your SMART output is a little odd, so I would emphasize good backups, even if the system seems to work fine after the next installation.
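To capture both logs into files that are easy to attach (the file names are just examples):

journalctl -k --no-hostname --boot=0 > kernel-boot.log   # kernel messages of the current boot
journalctl --no-hostname --boot=0 > full-boot.log        # full journal of the current boot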
Given that the SSD doesn’t fully support S.M.A.R.T., it is either a low-quality drive or it has broken. You could check with the vendor to see if there is a firmware update. Since it appears to be a very recent model, it might be replaceable under warranty. At work we bought hard disks by the case; it was not unusual to encounter a bad drive, and most vendors not named IBM were quick to replace them.