I915 Error: GUC: TLB invalidation Causes OS to Hang

A number of users, including myself, have been experiencing an issue where the OS becomes unresponsive: nothing is clickable, although the mouse pointer still moves. A reboot is required to restore functionality. This happens approximately once a day, often after waking from sleep. The journald error log is included below, along with reports of this issue and attempted fixes from other users.

The maintainer of i915-sriov-dkms (see strongtz/i915-sriov-dkms issue #118 on GitHub, "[BUG] Kernel 6.5.7-200.fc38.x86_64, invalidation response timeout- crash when using virtualized gpu") traced the fix for this issue to the commit "drm/i915: CTB TLB invalidation fix on VM" (intel/linux-intel-lts@c75552e on GitHub). The commit is summarized below:

The GuC firmware had defined the interface for Translation Look-Aside Buffer (TLB) invalidation. We should use this interface when invalidating the engine and GuC TLBs. Add additional functionality to intel_gt_invalidate_tlb, invalidating the GuC TLBs and falling back to GT invalidation when the GuC is disabled. The invalidation is done by sending a request directly to the GuC tlb_lookup that invalidates the table. The invalidation is submitted as a wait request and is performed in the CT event handler. This means we cannot perform this TLB invalidation path if the CT is not enabled. If the request isn’t fulfilled in two seconds, this would constitute an error in the invalidation as that would constitute either a lost request or a severe GuC overload.
With this new invalidation routine, we can perform GuC-based GGTT invalidations. GuC-based GGTT invalidation is incompatible with MMIO invalidation so we should not perform MMIO invalidation when GuC-based GGTT invalidation is expected.
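
To make the "wait request" with a two-second timeout easier to picture, here is a toy Python model of that pattern. It is only an analogy and not the kernel code: the class `InvalidationTracker` and its methods are hypothetical names I made up, and the real driver does this in C inside i915. Each request gets a sequence number, the response handler marks it complete, and a waiter that sees no response before the deadline reports a timeout (the situation the log lines complain about).

```python
import threading


class InvalidationTracker:
    """Toy model of request/response tracking by sequence number (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_seqno = 0
        self._pending = {}  # seqno -> Event set when the response arrives

    def send(self):
        """Register a new request and return its sequence number."""
        with self._lock:
            seqno = self._next_seqno
            self._next_seqno += 1
            self._pending[seqno] = threading.Event()
        return seqno

    def on_response(self, seqno):
        """Called by the (simulated) event handler when a response arrives."""
        with self._lock:
            event = self._pending.get(seqno)
        if event is not None:
            event.set()

    def wait(self, seqno, timeout_s=2.0):
        """Return True if the response arrived in time, False on timeout."""
        with self._lock:
            event = self._pending[seqno]
        arrived = event.wait(timeout_s)
        with self._lock:
            del self._pending[seqno]
        return arrived
```

In this model, a lost response and a handler that is too busy to answer look the same to the waiter: `wait` simply returns `False`.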

What would be the best course of action to resolve this issue for regular users, ideally without rebuilding the kernel?

System Info
OS: Fedora Linux 42 (Workstation Edition)
Firmware Version: R2DET38W (1.23)
Kernel Version: Linux 6.14.11-300.fc42.x86_64
Windowing System: Wayland
CPU: Intel Core Ultra 7 155H
iGPU: Intel Arc™ graphics
GPU: NVIDIA RTX 500 Ada Generation Laptop GPU 4GB GDDR6 (Nvidia driver installed)

journald error log from when the OS becomes unresponsive:


Jul 02 16:16:33 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85162
Jul 02 16:16:35 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85163
Jul 02 16:16:38 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85165
Jul 02 16:16:38 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85164
Jul 02 16:16:39 fedora kernel: Fence expiration time out i915-0000:00:02.0:gnome-shell[3281]:92f2!
Jul 02 16:16:40 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85167
Jul 02 16:16:40 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85166
Jul 02 16:16:42 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85168
Jul 02 16:16:42 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85169
Jul 02 16:16:44 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85170
Jul 02 16:16:46 fedora kernel: i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: TLB invalidation response timed out for seqno 85171
Jul 02 16:16:47 fedora kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000
Jul 02 16:16:47 fedora kernel: i915 0000:00:02.0: [drm] GT0: Resetting chip for stopped heartbeat on rcs0
Jul 02 16:16:47 fedora kernel: i915 0000:00:02.0: [drm] GT0: GuC firmware i915/mtl_guc_70.bin version 70.44.1
Jul 02 16:16:47 fedora kernel: i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
Jul 02 16:16:47 fedora kernel: i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
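
For anyone comparing their own logs, a small Python sketch like the one below can summarize a saved kernel log. The function name `summarize_i915_log` is my own, and the regular expressions are based only on the message wording shown in the log above, so they may need adjusting if another kernel version words these messages differently.

```python
import re

# Patterns taken from the wording of the log lines above (an assumption
# about message format, not a stable kernel interface).
TIMEOUT_RE = re.compile(r"GUC: TLB invalidation response timed out for seqno (\d+)")
HANG_RE = re.compile(r"GPU HANG: ecode (\S+)")


def summarize_i915_log(log_text):
    """Count TLB invalidation timeouts and report the first GPU hang code."""
    seqnos = []
    hang_ecode = None
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            seqnos.append(int(match.group(1)))
            continue
        match = HANG_RE.search(line)
        if match and hang_ecode is None:
            hang_ecode = match.group(1)
    return {
        "timeouts": len(seqnos),
        "first_seqno": min(seqnos) if seqnos else None,
        "last_seqno": max(seqnos) if seqnos else None,
        "hang_ecode": hang_ecode,
    }
```

The input can be the text of kernel messages from the boot that hung, for example the output of `journalctl -k -b -1` saved to a file and read with `open(path).read()`.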

This issue has been reported elsewhere on the web, along with attempted fixes, none of which were successful. They are documented below:

Did it actually disable GuC in dmesg? That’s what I would have tried first, but maybe newer GPUs require GuC and can’t disable it like that?
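
One way to check this from a saved kernel log is sketched below in Python. The function name `guc_status` is hypothetical, and the patterns assume the same message wording as the "GuC firmware ... version" and "GUC: submission enabled" lines in the log from the original post; if those lines are missing after setting a GuC-related boot option, that would suggest the option took effect, but I have not verified how every kernel version reports this.

```python
import re

# Patterns assume the message wording seen in the log posted above.
FIRMWARE_RE = re.compile(r"GuC firmware (\S+) version (\S+)")
SUBMISSION_RE = re.compile(r"GUC: submission enabled")


def guc_status(log_text):
    """Report the GuC firmware and submission state found in kernel log text."""
    firmware = None
    version = None
    submission_enabled = False
    for line in log_text.splitlines():
        match = FIRMWARE_RE.search(line)
        if match:
            # Keep the most recent firmware line (it repeats after a GPU reset).
            firmware, version = match.group(1), match.group(2)
        if SUBMISSION_RE.search(line):
            submission_enabled = True
    return {
        "firmware": firmware,
        "version": version,
        "submission_enabled": submission_enabled,
    }
```

Feed it the current boot's kernel messages, for example the saved output of `journalctl -k -b`, and compare the result before and after changing the setting.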

I’d also try disabling IOMMU and CPU virtualization in the BIOS.