Fedora 40 freezing and then crashing

I’m having issues with Fedora freezing and then ultimately restarting. I switched from Windows to Fedora last week and am currently dual booting. I have been trying to troubleshoot this for a couple of days now, without any success. It happens intermittently: sometimes it works for hours, and sometimes it crashes within 15 minutes of startup. What I’ve noticed is that the CPU spikes to 100% during some mundane task, and then everything freezes and crashes (I’ve caught that in System Monitor). The only common denominator I can think of is that I had Librewolf open each time it crashed (I have it open all the time).

Logs aren’t showing anything useful (at least to me). I get the same set of errors on every boot; I’ll post them below. Even though this sounds like a hardware issue, it doesn’t happen on Windows. Any help is appreciated.

These are the errors I have in the log after login:
Jul 27 15:23:42 fedora kernel: x86/cpu: SGX disabled by BIOS.
Jul 27 13:23:46 fedora kernel:
Jul 27 13:23:46 fedora bluetoothd[913]: Failed to set mode: Failed (0x03)
Jul 27 13:23:46 fedora abrtd[1065]: ‘/var/spool/abrt/oops-2024-07-26-04:49:29-1271-0’ is not a problem directory
Jul 27 13:23:48 fedora /usr/bin/nvidia-powerd[925]: Found unsupported configuration. Exiting…
Jul 27 13:24:08 fedora gdm-password][1960]: gkr-pam: unable to locate daemon control file
Jul 27 13:24:08 fedora gdm[1341]: Gdm: on_display_added: assertion ‘GDM_IS_REMOTE_DISPLAY (display)’ failed
Jul 27 13:24:11 fedora systemd[1977]: Failed to start app-gnome-gnome\x2dkeyring\x2dpkcs11-2341.scope - Application launched by gnome-session-binary.
Jul 27 13:24:11 fedora systemd[1977]: Failed to start app-gnome-gnome\x2dkeyring\x2dsecrets-2336.scope - Application launched by gnome-session-binary.
Jul 27 13:24:11 fedora systemd[1977]: Failed to start app-gnome-gnome\x2dkeyring\x2dssh-2338.scope - Application launched by gnome-session-binary.
Jul 27 13:24:14 fedora gdm[1341]: Gdm: on_display_removed: assertion ‘GDM_IS_REMOTE_DISPLAY (display)’ failed

System info:
System Details Report
Hardware Information:

**Hardware Model:** Gigabyte Technology Co., Ltd. Z390 AORUS MASTER
**Memory:** 32.0 GiB
**Processor:** Intel® Core™ i7-9700K × 8
**Graphics:** NVIDIA GeForce RTX™ 2080
**Disk Capacity:** 4.5 TB

Software Information:
**Firmware Version:** F10
**OS Name:** Fedora Linux 40 (Workstation Edition)
**OS Build:** (null)
**OS Type:** 64-bit
**GNOME Version:** 46
**Windowing System:** X11
**Kernel Version:** Linux 6.9.10-200.fc40.x86_64

I’ve tried both Wayland and X11; it happens on both. I’ll post the output of journalctl --no-hostname | grep -iE 'error|warn|critical' in the next reply, since it doesn’t fit here.

Here is the output. Sorry, it didn’t fit because of the character limit. I’ve copied only the errors for the last boot where it happened:
journalctl --no-hostname | grep -iE 'error|warn|critical'

That requires us to download the file before it can be reviewed. Many of us will not do that, since we do not want to download files onto our systems. If you use fpaste instead, the output is visible as a web page without any download.

This command does not limit the output to a single boot. If you modify it as follows, only entries from the most recent boot (-b 0, the currently running one) are included:
journalctl -b 0 --no-hostname | grep -iE 'error|warn|critical'
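
A minimal sketch of how the boot selection could be handled, assuming the freeze ends in a reboot so that the crash is recorded in the previous boot (-b -1) rather than the current one (-b 0). The helper name log_problems is hypothetical; it only wraps the grep pattern already used above.

```shell
# Hypothetical helper: keep only lines that mention error, warn, or critical,
# using the same case-insensitive pattern as the command above.
log_problems() {
    grep -iE 'error|warn|critical'
}

# Usage (not run here):
#   journalctl --list-boots                          # see which offset is which boot
#   journalctl -b 0  --no-hostname | log_problems    # current boot
#   journalctl -b -1 --no-hostname | log_problems    # boot before the current one
```

journalctl also accepts -p err to filter by message priority instead of matching text, which may cut down on hits from ordinary messages that merely contain the word "warn".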

I’ve manually copied and pasted the info for the last boot only, since there were too many characters. Is this OK?
https://paste.centos.org/view/5201c2eb
The link now contains the journalctl -b 0 --no-hostname | grep -iE 'error|warn|critical' output.

Very early in that output I see a machine check exception (MCE) (lines 2 and 10-13).

MCE events are, as shown in the log, a hardware issue and must be identified and fixed.

Line 1 names a PCIe port (device) that might be the cause.
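
To pull just the hardware-related lines out of a long log, something like the following could work. The pattern is an assumption about how the kernel labels these messages (mce:, Machine Check, Hardware Error, pcieport), and hw_errors is a hypothetical helper name.

```shell
# Hypothetical helper: keep only lines that look like machine check or
# PCIe port messages. The pattern is a guess at common kernel wording and
# may need adjusting for a particular log.
hw_errors() {
    grep -iE 'mce:|machine check|hardware error|pcieport'
}

# Usage (not run here): kernel messages only (-k) from the previous boot.
#   journalctl -k -b -1 --no-hostname | hw_errors
```

Restricting the search to kernel messages keeps desktop noise such as gdm and keyring failures out of the result.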

It seems to be either my NVMe SSD, which I bought a couple of days ago, or Intel’s PCI bridge. I’ve found another article with a similar issue: Pcieport error (PoisonedTLP+ SwTrigger)

lspci -tv

           +-1b.4-[03]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/

lspci -vv

00:1b.4 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0) (prog-if 00 [Normal decode])
	Subsystem: Gigabyte Technology Co., Ltd Device 5001
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 123
	Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
	I/O behind bridge: [disabled] [16-bit]
	Memory behind bridge: 54200000-542fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Express (v2) Root Port (Slot+), IntMsgNum 0
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+ TEE-IO-
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #21, Speed 8GT/s, Width x4, ASPM not supported
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x4
			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #24, PowerLimit 25W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet- LinkState+
		RootCap: CRSVisible-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
			 IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00258  Data: 0000
	Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Device 5001
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UESvrt:	DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
		RootCmd: CERptEn+ NFERptEn+ FERptEn+
		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
			 FirstFatal- NonFatalMsg- FatalMsg- IntMsgNum 0
		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
	Capabilities: [140 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [150 v1] Precision Time Measurement
		PTMCap: Requester- Responder+ Root+
		PTMClockGranularity: 4ns
		PTMControl: Enabled+ RootSelected+
		PTMEffectiveGranularity: Unknown
	Capabilities: [220 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [250 v1] Downstream Port Containment
		DpcCap:	IntMsgNum 0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
		DpcCtl:	Trigger:1 Cmpl- INT+ ErrCor- PoisonedTLP- SwTrigger- DL_ActiveErr-
		DpcSta:	Trigger- Reason:00 INT- RPBusy- TriggerExt:00 RP PIO ErrPtr:1f
		Source:	0000
	Kernel driver in use: pcieport
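
The link between the root port and the device behind it can also be read off mechanically from the "Bus: primary=..., secondary=..." line in the listing above. This is only a sketch: secondary_bus is a hypothetical helper, and it assumes lspci -vv prints that line in the format shown.

```shell
# Hypothetical helper: read `lspci -vv` output for one bridge on stdin and
# print its secondary bus number, i.e. the bus the attached device sits on.
secondary_bus() {
    sed -n 's/.*secondary=\([0-9a-fA-F]*\).*/\1/p' | head -n 1
}

# Usage (not run here):
#   bus=$(lspci -vv -s 00:1b.4 | secondary_bus)   # 00:1b.4 is the root port above
#   lspci -s "${bus}:"                            # list every device on that bus
```

For the bridge shown here the secondary bus is 03, which matches the [03] next to the Samsung NVMe controller in the lspci -tv tree.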

I’m out of my depth here. I’ve also found this patch and tried to test it, but the mockbuild command won’t run: [PATCHv4] pcie: Add driver for Downstream Port Containment - Patchwork
These are the docs I’ve tried to follow to set this up: https://docs.fedoraproject.org/en-US/quick-docs/kernel-testing-patches/
Am I at least moving in the right direction?

Next, I set up rasdaemon following this guide: Monitoring ECC memory on Linux with rasdaemon | Just another blog. Now I’ll wait for the crash to happen again. At the moment ras-mc-ctl --summary returns no errors; I’m assuming a crash has to happen first for anything to be logged.

An MCE means the CPU detected and reported a hardware error.

If you are overclocking or undervolting the CPU, turn that off.
Otherwise it may be the NVMe drive, if it is causing power issues.

Also check that the CPU is not running hot.

The sensors command will report CPU temp for you.
Install the lm_sensors package and run sudo sensors-detect to set it up.
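
Once sensors works, a small filter can make it easier to watch for overheating. This is a rough sketch that assumes the coretemp-style "Core N:" lines that Intel CPUs usually produce; other drivers label their readings differently. max_core_temp is a hypothetical helper name.

```shell
# Hypothetical helper: read `sensors` output on stdin and print the highest
# "Core N" temperature in degrees Celsius.
max_core_temp() {
    awk '/^Core [0-9]+:/ {
        t = $3                  # third field is the reading, sign and unit included
        gsub(/[^0-9.]/, "", t)  # keep only digits and the decimal point
        if (t + 0 > max) max = t + 0
    }
    END { if (max != "") print max }'
}

# Usage (not run here):
#   sensors | max_core_temp
#   watch -n 2 sensors        # or simply refresh the full output every 2 seconds
```

Running this in a terminal while reproducing the freeze shows whether the temperature climbs just before the crash.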

There was no OC on the CPU, and I run far more CPU-intensive tasks on Windows without ever running into any issues. This was literally crashing while opening a browser, so I highly doubt it was running hot.

What I did do, though, was flash my BIOS from F10 to F11, and it seems to be fixed now. The patch notes say that they’ve “Fixed CPU Vcore and power behavior”. Flashing reset my BIOS to defaults, though, so I had to set the XMP profile once again.

Also, FYI, the Proton Drive file I linked in my first reply can be viewed without downloading it. I understand why you want people to use fpaste, though.

Actually, no, it cannot. At least not on my system.