I am experiencing the problem that my Fedora 31 system freezes completely at least once a day, to the point where I have to use the power button (SSH and the magic SysRq keys don’t always work, or maybe I am doing something wrong there).
The sound of the video was stuck in a loop of maybe two seconds.
Here is the output of journalctl -b -1 -p 3 for the last boot where it froze:
-- Logs begin at Sun 2019-12-22 21:08:35 CET, end at Wed 2020-01-08 22:32:15 CET. --
Jan 08 22:24:29 localhost.localdomain systemd-udevd[646]: /usr/lib/udev/rules.d/65-md-incremental.rules:28 Invalid value "/sbin/mdadm -I $env{DEV>
Jan 08 22:24:29 localhost.localdomain systemd-udevd[646]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:5 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:24:29 localhost.localdomain systemd-udevd[646]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:6 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Not valid error log pointer 0x00000000 for Init uCode
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Fseq Registers:
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xC0BF2650 | FSEQ_ERROR_CODE
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xA04E95F3 | FSEQ_TOP_INIT_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xABBD9C69 | FSEQ_CNVIO_INIT_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0000A056 | FSEQ_OTP_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xDDFFF6A7 | FSEQ_TOP_CONTENT_VERSION
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x664E56AB | FSEQ_ALIVE_TOKEN
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xB1097904 | FSEQ_CNVI_ID
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0xC22312E4 | FSEQ_CNVR_ID
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x03000000 | CNVI_AUX_MISC_CHIP
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0BADCAFE | CNVR_AUX_MISC_CHIP
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: SecBoot CPU1 Status: 0x3040001, CPU2 Status: 0x0
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Failed to start INIT ucode: -110
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Firmware not running - cannot dump error
Jan 08 22:24:31 localhost.localdomain kernel: iwlwifi 0000:03:00.0: Failed to run INIT ucode: -110
Jan 08 22:24:32 localhost.localdomain kernel: Bluetooth: hci0: command 0xfc09 tx timeout
Jan 08 22:24:40 localhost.localdomain kernel: Bluetooth: hci0: Failed to send firmware signature (-110)
Jan 08 22:24:55 localhost.localdomain gdm-password][1444]: gkr-pam: unable to locate daemon control file
Jan 08 22:27:25 localhost.localdomain systemd[1455]: Failed to start Mark boot as successful.
What else could be useful to narrow down the source of the problem?
Thank you for the help in advance!
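For anyone collecting the same logs: several lines above end in ">" because the pager cut them off at the terminal width. A minimal sketch for saving the untruncated output of the previous boot, assuming a persistent journal. The function name save_prev_boot_logs, the JOURNALCTL override and the output file names are made up for illustration.

```shell
# Sketch: save the logs of the previous boot to files, without pager truncation.
# JOURNALCTL can be overridden (hypothetical knob, handy for dry runs).
JOURNALCTL="${JOURNALCTL:-journalctl}"

# Usage: save_prev_boot_logs [output-directory]
save_prev_boot_logs() {
    outdir="${1:-.}"
    mkdir -p "$outdir"
    # -b -1      : previous boot
    # -p 3       : priority "err" and more severe
    # --no-pager : print full lines instead of cutting them at the terminal width
    "$JOURNALCTL" -b -1 -p 3 --no-pager > "$outdir/errors-prev-boot.txt"
    # -k restricts the output to kernel messages of that boot (like dmesg)
    "$JOURNALCTL" -k -b -1 --no-pager > "$outdir/kernel-prev-boot.txt"
}

# Run manually after a freeze and reboot, for example:
#   save_prev_boot_logs ~/freeze-logs
```

The resulting text files can be attached to a post in full, which avoids the cut-off lines.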
Thank you very much for your suggestions! I am using a ThinkPad X270 with the following specs:
Skylake i5-6300U
8GB RAM
Intel Dual Band Wireless-AC 8260
ADATA SX6000PNP 1TB NVMe disk
Software side:
Dual boot with Windows 10
Linux 5.4.7-200.fc31.x86_64
Fedora 31 (Workstation Edition)
And if I now run the right journalctl command, I get:
Jan 08 22:28:26 localhost.localdomain systemd-udevd[635]: /usr/lib/udev/rules.d/65-md-incremental.rules:28 Invalid value "/sbin/mdadm -I $env{DEV>
Jan 08 22:28:26 localhost.localdomain systemd-udevd[635]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:5 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:28:26 localhost.localdomain systemd-udevd[635]: /usr/lib/udev/rules.d/99-vmware-scsi-udev.rules:6 Invalid value "/bin/sh -c 'echo 180 >>
Jan 08 22:28:41 localhost.localdomain gdm-password][1496]: gkr-pam: unable to locate daemon control file
Jan 08 22:28:42 localhost.localdomain systemd[1508]: Failed to start Application launched by gnome-session-binary.
Jan 08 22:28:43 localhost.localdomain systemd[1508]: Failed to start Application launched by gnome-session-binary.
Jan 08 22:30:54 localhost.localdomain systemd[1508]: Failed to start Mark boot as successful.
That does not look any more helpful to me. Also, I don’t think the RAM is the problem, since Fedora sometimes freezes just a few minutes into a session. My guess is that it has something to do with either Firefox (I am using it most of the times a freeze happens) or the NVMe SSD, as I installed that one myself.
Before I opened this thread, I searched for other people’s solutions. But without knowing the cause, I only found many different suggestions, and that did not get me any further.
Kernel 5.4.10 is in updates-testing and should go stable soon (maybe Monday). If you want to try it sooner, you can upgrade just the kernel while enabling the updates-testing repo.
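A minimal sketch of that check, assuming GNU sort with version sorting (-V) is available. The helper names version_lt and kernel_base_version are made up for illustration, and the dnf line is one common way to pull only the kernel packages from updates-testing, not the only one.

```shell
# Sketch: is the running kernel older than 5.4.10? If so, print an upgrade command.

# Succeeds (exit 0) if version $1 sorts strictly before version $2.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Strip the release suffix, e.g. 5.4.7-200.fc31.x86_64 -> 5.4.7
kernel_base_version() {
    printf '%s\n' "${1%%-*}"
}

running="$(kernel_base_version "$(uname -r)")"
if version_lt "$running" "5.4.10"; then
    echo "Running kernel $running is older than 5.4.10. To upgrade only the kernel:"
    echo "  sudo dnf upgrade --enablerepo=updates-testing 'kernel*'"
fi
```

The --enablerepo option only applies to that single dnf run, so the rest of the system stays on the stable repositories.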
I cannot say at what point exactly the system froze, but both the “nvme timeout” and the “USB power management unreliable” messages look suspicious to me. Does anybody know more about these logs and could enlighten me?
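To compare boots quickly, the suspicious messages can be filtered out of the kernel log. A minimal sketch: the function name suspicious_lines and the patterns are my own assumptions, picked to match the messages mentioned here, and are not an exhaustive list of storage or USB errors.

```shell
# Sketch: pull storage- and USB-related warnings out of a kernel log.
# Reads log text on stdin, so it works with dmesg, journalctl -k, or a saved file.
suspicious_lines() {
    grep -iE 'nvme.*(timeout|abort)|usb.*unreliable|corrupted or uncleanly'
}

# Typical use after rebooting from a freeze (previous boot, kernel messages only):
#   journalctl -k -b -1 --no-pager | suspicious_lines
# Same filter on the current boot, for comparison with a boot that did not freeze:
#   journalctl -k -b 0 --no-pager | suspicious_lines
```

If the same lines show up for boots with and without a freeze, they are probably not the cause on their own.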
@vits95 I followed your instructions, hope this is better now:
[ 0.469905] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.469905] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[ 0.470244] #3
[ 0.475823] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[ 2.178713] usb: port power management may be unreliable
[ 2.630185] usb 1-7: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[ 2.630190] usb 1-7: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 3.889116] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[ 3.889553] acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[ 11.314725] printk: systemd: 23 output lines suppressed due to ratelimiting
[ 42.396646] nvme nvme0: I/O 832 QID 2 timeout, aborting
[ 42.528797] nvme nvme0: Abort status: 0x0
[ 42.542773] kauditd_printk_skb: 19 callbacks suppressed
[ 42.915296] systemd-journald[621]: File /run/log/journal/24a9fb3a012f46d894763e6ec4bfa78c/system.journal corrupted or uncleanly shut down, renaming and replacing.
[ 43.594218] resource sanity check: requesting [mem 0xfed10000-0xfed15fff], which spans more than pnp 00:08 [mem 0xfed10000-0xfed13fff]
[ 43.594227] caller snb_uncore_imc_init_box+0x6c/0xb0 [intel_uncore] mapping multiple BARs
[ 43.675869] uvcvideo 1-8:1.0: Entity type for entity Extension 4 was not initialized!
[ 43.675872] uvcvideo 1-8:1.0: Entity type for entity Extension 3 was not initialized!
[ 43.675874] uvcvideo 1-8:1.0: Entity type for entity Processing 2 was not initialized!
[ 43.675876] uvcvideo 1-8:1.0: Entity type for entity Camera 1 was not initialized!
[ 44.128425] thermal thermal_zone3: failed to read out thermal zone (-61)
[ 45.738681] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring
[ 46.007657] iwlwifi 0000:03:00.0: FW already configured (0) - re-configuring
[ 48.298823] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[ 66.936633] systemd-journald[621]: File /var/log/journal/24a9fb3a012f46d894763e6ec4bfa78c/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
But I also compared these journal entries to those of a boot where I had no freeze, and there is no difference. So I guess the cause cannot be found here…
Also, my Windows 10, which is installed in parallel, does not show this problem, so sending the laptop to a service center would probably not help much. I am going to run the SMART test next, but first I need a Fedora stick, it seems.
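A rough sketch of reading the SMART data with smartctl (from the smartmontools package) and picking out single counters. The device path /dev/nvme0 is an assumption based on the "nvme0" name in the kernel log, and the helper name smart_field is made up for illustration.

```shell
# Sketch: print the value of one field from smartctl output read on stdin.
# Fields look like "Name:   value", so the line is split at the first ": ".
smart_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

# Typical use (needs root, device path is an assumption):
#   sudo smartctl -a /dev/nvme0 | smart_field "Unsafe Shutdowns"
#   sudo smartctl -a /dev/nvme0 | smart_field "Media and Data Integrity Errors"
```

An "Unsafe Shutdowns" count that grows after every freeze would only confirm the forced power-offs, while a non-zero "Media and Data Integrity Errors" value would point more directly at the disk.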
Arch Wiki, Talk: Lenovo ThinkPad X1 Carbon (Gen 6)
“NVMe unsafe shutdowns
Does anyone else get non-zero “Unsafe Shutdowns” on NVMe (Model Number: LENSE30512GMSP34MEAT3TA)? smartctl on mine laptop constantly reports those, also I get frequent FS corruption errors after reboot :/”
You mentioned “first i need a fedora stick it seems”. I have no idea how it is handled by default, but possibly you need to check the file system?
See, I had freezes (and some other issues) from using the proprietary blob. It was a long time ago.
I also experienced shutdowns because of overheating. Cleaning the laptop and switching from the proprietary blob to free drivers helped me out.
PS: This year I even used the proprietary drivers without issue (but only for a few weeks) before they started to glitch again for me (all of this on the same laptop).
Which makes it possible to modify the TDP and the processor frequency according to demand. In some types of CPU, TDP jumps appear to be poorly managed by some kernel versions and may create instability. That being said, you can test the following:
If you have other (older) versions of the kernel installed, try booting them and see whether the problem is reproduced.
If you know your way around your BIOS, you may want to temporarily deactivate options such as turbo boost or idle states, to verify that the instability is not caused by how a certain Linux kernel manages these options.
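Turbo boost can also be checked and switched off temporarily from Linux, without going into the BIOS. A minimal sketch, assuming the CPU is driven by intel_pstate, which exposes the no_turbo file used below. The helper names turbo_status and set_no_turbo and the NO_TURBO override are made up for illustration.

```shell
# Sketch: inspect and temporarily toggle Intel turbo boost from Linux.
# Assumes the intel_pstate driver; the setting is reset on reboot.
NO_TURBO="${NO_TURBO:-/sys/devices/system/cpu/intel_pstate/no_turbo}"

# Prints "disabled", "enabled", or "unknown" if the file is not readable.
turbo_status() {
    if [ ! -r "$NO_TURBO" ]; then
        echo "unknown"
    elif [ "$(cat "$NO_TURBO")" = "1" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

# Writing 1 disables turbo boost, writing 0 enables it again (needs root).
# Usage: set_no_turbo 1
set_no_turbo() {
    echo "$1" | sudo tee "$NO_TURBO" > /dev/null
}

echo "Turbo boost: $(turbo_status)"

# To see which kernel versions are installed (older ones can be picked
# in the boot menu):
#   rpm -q kernel
```

If the freezes stop with turbo boost disabled this way, that is a cheap hint before changing anything permanently in the BIOS.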
@xtym I turned off the Intel SpeedStep Technology in the BIOS and now observe better responsiveness of the system and no freezes. Your suggestion seems to have solved the problem so far!
I will mark your answer as the solution if no problems show up in the next three days.
While reading up, I found older reports of problems with SpeedStep as well:
Apparently the issues described there are fixed by now, but SpeedStep Technology has potential for problems.
One explanation I can think of: degradation of the silicon chip or the power electronics may increase the risk of complications, which would also explain why my laptop was originally working fine.