Hey. While waiting for someone to help me fix this properly, I monitored the problem as it happened multiple times on the 7.0.4 Fedora 44 kernel. I used ChatGPT to help with the tooling, scripts, and diagnostics, as I am in way over my head here. If logs or other output from that conversation would be useful, let me know and I will add them here. After one to two hours of monitoring the failure and feeding the AI the results, it summed things up like this:
Findings from troubleshooting so far:
The failure does not look like a normal GNOME, shell, or application-level freeze. The strongest evidence points to a kernel/block-layer stall. During the failure, the system becomes unusable, commands hang, shutdown/reboot hangs, and later services such as logind, polkit, PackageKit, flatpak-system-helper, and fprintd start timing out. These service failures appear to be secondary symptoms, not the primary cause.
The most important kernel evidence is that the system reports RCU stalls involving the block layer timeout worker:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks
Tasks blocked on level-0 rcu_node (CPUs 0-15): P1214
task:kworker/6:1H state:R
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:blk_mq_timeout_work+0x4d/0x1e0
A later SysRq dump still showed the same area involved:
task:kworker/6:1H state:R
Workqueue: kblockd blk_mq_timeout_work
? blk_mq_timeout_work+0x61/0x1e0
The workqueue dump also showed:
workqueue kblockd
in-flight: 1214:blk_mq_timeout_work for 215s
pending: blk_mq_timeout_work
This suggests the block layer timeout path itself is getting stuck or not making progress, which then causes wider system stalls.
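To make this check repeatable across boots, a saved SysRq/workqueue dump can be scanned for work items that have been in flight for a long time. This is only a minimal sketch: it assumes each entry looks like the line quoted above (`in-flight: <pid>:<function> for <N>s`), and the function name `long_running_work` and the 30-second threshold are my own choices, not anything the kernel defines. Real dumps can list several items per line, which this does not handle.

```python
import re

# Matches lines such as "in-flight: 1214:blk_mq_timeout_work for 215s".
# Assumption: one work item per "in-flight:" line, as in the excerpt above.
IN_FLIGHT = re.compile(r"in-flight:\s*(\d+):(\w+)\s+for\s+(\d+)s")


def long_running_work(log_text, min_seconds=30):
    """Return (pid, function, seconds) for items in flight at least min_seconds."""
    hits = []
    for match in IN_FLIGHT.finditer(log_text):
        pid = int(match.group(1))
        func = match.group(2)
        secs = int(match.group(3))
        if secs >= min_seconds:
            hits.append((pid, func, secs))
    return hits


# The three lines from the workqueue dump quoted above.
sample = """workqueue kblockd
in-flight: 1214:blk_mq_timeout_work for 215s
pending: blk_mq_timeout_work
"""

print(long_running_work(sample))
```

Feeding it the text of `journalctl -k` from a failed boot should list every such entry, which makes it easier to compare how long the timeout worker was stuck in each incident.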
During the failure, PSI showed very high I/O pressure, often around or above 90%:
IO:
some avg10 ~= 90%+
full avg10 ~= 85-90%+
At the same time, CPU and memory pressure were low or normal. This suggests the system was not primarily CPU-starved or out of memory. It was waiting on I/O/kernel progress.
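For anyone who wants to log these pressure numbers over time, the PSI files are easy to parse. The sketch below turns the contents of a /proc/pressure/* file into a dictionary. The function name `parse_psi` is my own, and the sample values are illustrative numbers in the range described above, not an exact capture from my machine.

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {}
        for field in fields:
            key, value = field.split("=")
            result[kind][key] = float(value)
    return result


# Illustrative sample only (not a real capture).
sample = (
    "some avg10=91.20 avg60=74.05 avg300=38.11 total=412345678\n"
    "full avg10=87.50 avg60=70.30 avg300=35.90 total=398765432\n"
)

psi = parse_psi(sample)
print(psi["some"]["avg10"], psi["full"]["avg10"])
```

On a live system the input would come from something like `open("/proc/pressure/io").read()`, and the same function works for the cpu and memory files.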
iostat did not show the NVMe device as saturated. Device throughput and %util were low or modest while system-wide %iowait was very high. That is an important mismatch: the system behaves as if I/O is blocked, but the disk is not simply busy doing a large amount of normal I/O.
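The mismatch can also be checked without iostat, by sampling /proc/diskstats twice and comparing the result with the PSI value. This is a rough sketch, assuming the standard /proc/diskstats layout where the tenth statistic after the device name is the time spent doing I/O in milliseconds. The helper names, the sample lines, and the thresholds (80% pressure, 50% utilization) are all my own assumptions for illustration.

```python
def io_ticks_ms(diskstats_line):
    """Return the 'time spent doing I/O' counter (ms) from one /proc/diskstats line."""
    fields = diskstats_line.split()
    # fields[0:3] are major, minor, device name; the 10th statistic is index 12.
    return int(fields[12])


def util_percent(before_line, after_line, interval_ms):
    """Approximate device utilization between two samples taken interval_ms apart."""
    busy_ms = io_ticks_ms(after_line) - io_ticks_ms(before_line)
    return 100.0 * busy_ms / interval_ms


def looks_like_stall(io_full_avg10, util, pressure_min=80.0, util_max=50.0):
    """True when I/O pressure is high but the device itself is mostly idle."""
    return io_full_avg10 >= pressure_min and util <= util_max


# Hypothetical samples for one device, taken 1000 ms apart.
before = "259 0 nvme0n1 1000 0 80000 500 2000 0 160000 900 0 1200 1400"
after = "259 0 nvme0n1 1010 0 80800 505 2020 0 161600 910 0 1250 1460"

util = util_percent(before, after, 1000)
print(util, looks_like_stall(87.5, util))
```

With these made-up numbers the device is about 5% busy while pressure is near 90%, which is the pattern I saw: blocked I/O progress rather than a saturated disk.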
The NVMe health data did not show obvious media failure:
Critical Warning: 0x00
Available Spare: 100%
Percentage Used: 0%
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Btrfs device stats were also clean:
write_io_errs: 0
read_io_errs: 0
flush_io_errs: 0
corruption_errs: 0
generation_errs: 0
So the data collected so far does not indicate a clearly failing SSD or filesystem corruption. The SSD/controller/firmware/kernel interaction may still be involved, but the SMART/NVMe/Btrfs counters do not show direct media errors.
The system was tested with NVMe APST disabled:
nvme_core.default_ps_max_latency_us=0
The failure still occurred.
The system was also tested with PCIe ASPM disabled via kernel parameter:
pcie_aspm=off
The kernel acknowledged this:
PCIe ASPM is disabled
However, lspci -vv still showed ASPM/L1 substates enabled on the NVMe endpoint and its parent bridge, apparently due to firmware/BIOS configuration:
LnkCtl: ASPM L1 Enabled
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
The kernel also logged:
ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
FADT indicates ASPM is unsupported, using BIOS configuration
So pcie_aspm=off may not fully disable link power states that the firmware has already configured on this machine.
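To confirm which devices still have ASPM enabled after a boot parameter change, the LnkCtl lines in saved `lspci -vv` output can be collected per device. This is a small sketch, assuming the usual lspci layout where each device starts on an unindented line and its capability lines are indented. The function name `aspm_states` is my own, and the device descriptions in the sample are placeholders, not copied from my output.

```python
import re

# Matches "LnkCtl: ASPM L1 Enabled; ..." but not LnkCap or LnkCtl2 lines.
LNKCTL = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+)")


def aspm_states(lspci_text):
    """Map each PCI address in `lspci -vv` output to its LnkCtl ASPM setting."""
    states = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device, address is the first token.
            device = line.split()[0]
            continue
        match = LNKCTL.search(line)
        if match and device:
            states[device] = match.group(1).strip()
    return states


# Placeholder sample shaped like lspci -vv output (descriptions are hypothetical).
sample = "\n".join([
    "00:02.4 PCI bridge: (hypothetical bridge description)",
    "\t\tLnkCap:\tPort #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us",
    "\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes",
    "03:00.0 Non-Volatile memory controller: (hypothetical NVMe description)",
    "\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes",
    "\t\tLnkCtl2: Target Link Speed: 16GT/s",
])

print(aspm_states(sample))
```

Any device reported as something other than "Disabled" is still using ASPM regardless of what the kernel command line says.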
The NVMe controller is detected as:
Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less)
The actual drive model reported by SMART/inxi is:
Samsung SSD 990 EVO Plus 2TB
The NVMe is attached under:
0000:00:02.4 -> 0000:03:00.0
The issue is much easier to reproduce on newer Fedora kernels, especially 7.0.4-200.fc44.x86_64. Kernel 6.19.13-300.fc44.x86_64 appears much more stable, although it has not been proven to be unaffected.
Tools and commands used during troubleshooting:
journalctl
journalctl -k -f
journalctl --list-boots
SysRq via /proc/sysrq-trigger
ps
awk
grep
sed
tail
stat
iostat / sysstat
iotop
/proc/pressure/io
/proc/pressure/cpu
/proc/pressure/memory
smartctl
nvme smart-log
nvme error-log
btrfs device stats
btrfs filesystem usage
lspci -vv
lspci -tv
readlink on /sys/bus/pci/devices
cat /proc/cmdline
cat /proc/sys/kernel/sysrq
sysctl
grubby
Kernel/debug settings used:
log_buf_len=32M
kernel.sysrq=1
nvme_core.default_ps_max_latency_us=0
pcie_aspm=off
rhgb quiet removed
persistent journal enabled via /var/log/journal
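Since it is easy to boot the wrong entry while switching kernels, it helps to verify that the boot parameters listed above are actually active. The sketch below checks a kernel command line string for the required parameters and for `rhgb`/`quiet`, which should be absent. The function name `check_cmdline` and the sample command line are hypothetical. kernel.sysrq=1 is left out because it is a sysctl here, not a boot parameter.

```python
REQUIRED = [
    "log_buf_len=32M",
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
]
FORBIDDEN = ["rhgb", "quiet"]


def check_cmdline(cmdline, required=REQUIRED, forbidden=FORBIDDEN):
    """Return (missing, unexpected) parameters for a kernel command line string."""
    params = cmdline.split()
    missing = [p for p in required if p not in params]
    unexpected = [p for p in forbidden if p in params]
    return missing, unexpected


# Hypothetical command line: one required parameter missing, "quiet" still set.
sample = "root=UUID=hypothetical ro log_buf_len=32M pcie_aspm=off quiet"

print(check_cmdline(sample))
```

On the running system the input would be the contents of /proc/cmdline. Two empty lists mean the debug settings are in effect for that boot.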
Current working hypothesis:
The freeze is most likely caused by a kernel/block-layer/NVMe-related stall, not by ordinary userspace load. The strongest clue is kblockd blk_mq_timeout_work being stuck/in-flight for hundreds of seconds while RCU stalls are reported. The high I/O pressure with low actual NVMe utilization suggests blocked kernel I/O progress rather than normal disk saturation.