GPU is causing the system to freeze

Hello,

For the past few days, my Fedora 44 has been freezing occasionally and then displaying a black screen. Nothing responds anymore. This forces me to hard reboot the computer.

Here is part of the output of inxi -Fzxx:

System:
  Kernel: 7.0.4-200.fc44.x86_64 arch: x86_64 bits: 64 compiler: gcc v: 16.1.1
  Desktop: GNOME v: 50.1 tk: GTK v: 3.24.52 wm: gnome-shell dm: 1: GDM
    2: LightDM note: stopped Distro: Fedora Linux 44 (Workstation Edition)
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: TUF GAMING B550-PLUS WIFI II v: Rev X.0x
    serial: <superuser required> part-nu: SKU Firmware: UEFI
    vendor: American Megatrends v: 3405 date: 12/13/2023
CPU:
  Info: 16-core model: AMD Ryzen 9 5950X bits: 64 type: MT MCP arch: Zen 3+
    rev: 2 cache: L1: 1024 KiB L2: 8 MiB L3: 64 MiB
  Speed (MHz): avg: 1746 min/max: 582/5086 boost: enabled cores: 1: 1746
    2: 1746 3: 1746 4: 1746 5: 1746 6: 1746 7: 1746 8: 1746 9: 1746 10: 1746
    11: 1746 12: 1746 13: 1746 14: 1746 15: 1746 16: 1746 17: 1746 18: 1746
    19: 1746 20: 1746 21: 1746 22: 1746 23: 1746 24: 1746 25: 1746 26: 1746
    27: 1746 28: 1746 29: 1746 30: 1746 31: 1746 32: 1746 bogomips: 217182
  Flags-basic: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a
    ssse3 svm
Graphics:
  Device-1: NVIDIA AD104 [GeForce RTX 4070 SUPER] vendor: CardExpert
    driver: nvidia v: 595.71.05 arch: Lovelace pcie: speed: 2.5 GT/s lanes: 16
    ports: active: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 07:00.0
    chip-ID: 10de:2783
  Display: wayland server: X.org v: 1.21.1.22 with: Xwayland v: 24.1.11
    compositor: gnome-shell driver: X: loaded: nvidia unloaded: modesetting
    alternate: fbdev,nouveau,nv,vesa gpu: nv_platform,nvidia,nvidia-nvswitch
    display-ID: 0
  Monitor-1: HDMI-A-1 model: MBU27 res: 3840x2160 dpi: 163 diag: 685mm (27")
  API: OpenGL v: 4.6.0 vendor: nvidia v: 595.71.05 glx-v: 1.4
    direct-render: yes renderer: NVIDIA GeForce RTX 4070 SUPER/PCIe/SSE2
    display-ID: :0.0
  API: Vulkan v: 1.4.341 surfaces: N/A device: 0 type: discrete-gpu
    driver: nvidia device-ID: 10de:2783 device: 1 type: cpu
    driver: mesa llvmpipe device-ID: 10005:0000
  API: EGL Message: EGL data requires eglinfo. Check --recommends.
  Info: Tools: api: glxinfo,vulkaninfo gpu: nvidia-settings x11: xdriinfo,
    xdpyinfo, xprop, xrandr
Info:
  Memory: total: 64 GiB note: est. available: 62.66 GiB used: 6.31 GiB (10.1%)

Here is an excerpt from journalctl -b -1 at the moment the crash occurs:

mai 10 14:13:35 fedora kernel: NVRM: GPU at PCI:0000:07:00: GPU-20a73d3d-c881-b6e0-3de3-aa3ab417fb9e
mai 10 14:13:35 fedora kernel: NVRM: Xid (PCI:0000:07:00): 62, 323f0f30 00006a80 00000000 20315e48 20314ad2 20314c30 2031338c 203139f4
mai 10 14:13:35 fedora kernel: NVRM: GPU0 _kgspRpcGspEventPmuHalted: Received signal from GSP that PMU has halted.
mai 10 14:13:35 fedora kernel: NVRM: Xid (PCI:0000:07:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
mai 10 14:13:39 fedora touchegg.desktop[5290]: Error connecting to Touchégg daemon: Could not connect: Connection refused
mai 10 14:13:39 fedora touchegg.desktop[5290]: Reconnecting in 5 seconds...
mai 10 14:13:44 fedora touchegg.desktop[5290]: Error connecting to Touchégg daemon: Could not connect: Connection refused
mai 10 14:13:44 fedora touchegg.desktop[5290]: Reconnecting in 5 seconds...
mai 10 14:13:48 fedora kernel: NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
mai 10 14:13:49 fedora touchegg.desktop[5290]: Error connecting to Touchégg daemon: Could not connect: Connection refused
mai 10 14:13:49 fedora touchegg.desktop[5290]: Reconnecting in 5 seconds...
mai 10 14:13:51 fedora at-spi2-registryd[5133]: Disabling unresponsive app with pid 5066
mai 10 14:13:51 fedora kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
mai 10 14:13:51 fedora kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259
mai 10 14:13:51 fedora kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:1375
...

Here is my /etc/default/grub file:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="nvidia-drm.modeset=1 rhgb quiet rd.driver.blacklist=nouveau,nova_core modprobe.blacklist=nouveau,nova_core"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

How do I fix this? Thank you.

You posted this with a single backtick rather than as pre-formatted text with three backticks, so it quotes terribly. However, the issue seems to be that the GPU's power management unit (PMU) halted (the Xid 62 line, followed by "PMU has halted"), after which the driver reports that a GPU reset is required (Xid 154), and that reset appears to fail.
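
If you want to see how often this happens, a rough sketch like the one below could pull just the Xid numbers out of the kernel log. The function name xid_codes is made up for illustration, and the pattern assumes the log lines look like the "NVRM: Xid (PCI:...): 62, ..." lines in your excerpt.

```shell
# xid_codes: read kernel log lines on stdin and print the Xid number
# from every line shaped like "NVRM: Xid (PCI:0000:07:00): 62, ...".
# (Hypothetical helper, not part of any NVIDIA or Fedora tool.)
xid_codes() {
    sed -n 's/.*NVRM: Xid (PCI:[^)]*): \([0-9]*\),.*/\1/p'
}

# Usage, counting each Xid number seen during the previous boot:
#   journalctl -k -b -1 --no-pager | xid_codes | sort -n | uniq -c
```

Running the same pipeline against a few older boots (-b -2, -b -3, and so on) may show whether every freeze starts with the same Xid 62.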

If I had to put money on it, I’d say it is a driver issue, and I’d try rolling back to a previous driver. I see you’re on 595; if this started recently, maybe try something from the 580 range. Try the 595 drivers again in a few months, once a few more bugs have been worked out.
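
Before rolling back, it may help to confirm which driver branch is actually loaded and what other versions your repositories offer. This is only a sketch: driver_branch is a made-up helper, and the akmod-nvidia package name is an assumption (it is the RPM Fusion name; yours may differ depending on how the driver was installed).

```shell
# driver_branch VERSION: print the branch (major) number of an NVIDIA
# driver version string, e.g. the "595" in "595.71.05".
# (Hypothetical helper for illustration only.)
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Usage on the affected machine (read-only, nothing here changes packages):
#   driver_branch "$(modinfo -F version nvidia)"
#   dnf list --showduplicates akmod-nvidia   # package name is an assumption
```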

@jeje777
You posted the outputs with [CODE] tags, which did not retain the on-screen formatting and gave all the pasted text a glaring red background.
I edited your post and converted all the code blocks to preformatted text blocks so they are formatted as they appeared on-screen. To add those tags, highlight the pasted text and click the </> button on the toolbar of the text entry window.
Please always use the preformatted text tags for data you copy & paste.

Looking at the data you have posted, I see a couple of things I would suggest changing.

  1. Remove nvidia-drm.modeset=1 from your kernel command line. This can be done with sudo grubby --update-kernel=ALL --remove-args="nvidia-drm.modeset"
  2. Your firmware (BIOS) seems out of date: Firmware: UEFI vendor: American Megatrends v: 3405 date: 12/13/2023. ASUS shows a newer firmware for that motherboard, dated 2026.
    TUF GAMING B550-PLUS WIFI II | Motherboards | ASUS Global

Please take care of both of those and post the results.
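
After rebooting, both changes can be checked from a terminal. The sketch below is one way to do it; has_arg is a made-up helper name, and it simply tests whether an argument is still present in a kernel command line string.

```shell
# has_arg CMDLINE NAME: succeed if a kernel argument called NAME is
# present in CMDLINE, either bare or in NAME=value form.
# (Hypothetical helper for illustration only.)
has_arg() {
    case " $1 " in
        *" $2 "*|*" $2="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage after the reboot:
#   has_arg "$(cat /proc/cmdline)" nvidia-drm.modeset && echo "still set"
#   sudo grubby --info=DEFAULT            # shows the args stored for the default kernel
#   cat /sys/class/dmi/id/bios_version /sys/class/dmi/id/bios_date
```

If the first command prints nothing and the firmware version is newer than 3405, both changes took effect.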