My PC crashes and reboots often when streaming video, and sometimes fails to power on: the red LED (indicating that it’s plugged in and power is available) turns green (indicating power on), but the screen remains black and nothing else happens. In this state, I can’t turn it off using the power button and have to unplug it.
I had the same hardware issues under the previous OS, so it’s now definitely confirmed not to be software related. They included crashes when streaming video (a machine check error is produced) and complete failure to power on.
Now Fedora is being kind enough to show me this problem report after a crash:
I checked, and mcelog is actually installed, so my first question is, how can I make it start during boot? I checked its man page and wasn’t able to find that info.
My goal is to find out whether the broken part is the easily replaceable RAM or SSD, or something that can’t be replaced (in which case I’d have to scrap this old box completely).
I’ve probably already excluded the SSD though, since I was using a different one with Manjaro before, encountered the same issues, and swapped it out for this exact reason. I also replaced the power supply; it has an external transformer like a laptop because it’s a mini PC.
Any help is greatly appreciated. I’m guessing the best course of action would be to make mcelog more helpful, but other suggestions are welcome. (By the way, I’m not very familiar with all the workings of Fedora yet, just figuring out dnf and such.)
Hmmm.
You said it is a PC and that unplugging is needed to power it off, yet the inxi output shows that it is a laptop.
Is the battery dead and not properly charging? If so then the hardware issue may be contributing to the power problems (and potentially the machine checks). A dead battery that may have one or more shorted cells may easily pull the voltage provided by the charging system below that needed for normal operation.
My suggestion would be to replace the battery and get the charging issue fixed, or replace the machine.
So it crashed again, and this time, with the mcelog daemon now enabled, the error report has this output:
The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Mai 11 21:34:53 fedora mcelog[704]: Hardware event. This is not a software error.
Mai 11 21:34:53 fedora mcelog[704]: MCE 0
Mai 11 21:34:53 fedora mcelog[704]: CPU 1 BANK 1
Mai 11 21:34:53 fedora mcelog[704]: MISC 86 ADDR 271dd4c0
Mai 11 21:34:53 fedora mcelog[704]: TIME 1683833685 Thu May 11 21:34:45 2023
Mai 11 21:34:53 fedora mcelog[704]: MCG status:
Mai 11 21:34:53 fedora mcelog[704]: MCi status:
Mai 11 21:34:53 fedora mcelog[704]: Uncorrected error
Mai 11 21:34:53 fedora mcelog[704]: Error enabled
Mai 11 21:34:53 fedora mcelog[704]: MCi_MISC register valid
Mai 11 21:34:53 fedora mcelog[704]: MCi_ADDR register valid
Mai 11 21:34:53 fedora mcelog[704]: Processor context corrupt
Mai 11 21:34:53 fedora mcelog[704]: SRAR
Mai 11 21:34:53 fedora mcelog[704]: MCA: Data CACHE Level-0 Snoop Error
Mai 11 21:34:53 fedora mcelog[704]: STATUS bf80000000000184 MCGSTATUS 0
Mai 11 21:34:53 fedora mcelog[704]: MCGCAP c08 APICID 2 SOCKETID 0
Mai 11 21:34:53 fedora mcelog[704]: MICROCODE f0
Mai 11 21:34:53 fedora mcelog[704]: CPUID Vendor Intel Family 6 Model 142 Step 9
There’s something funny here, I’ve come across it already:
Handle 0x003E, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003D
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 8 GB
Form Factor: SODIMM
Set: None
Locator: ChannelA-DIMM0
Bank Locator: BANK 0
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2667 MT/s
Manufacturer: Samsung
Serial Number: 32237AFE
Asset Tag: 9876543210
Part Number: M471A1K43CB1-CTD
Rank: 1
Configured Memory Speed: 2133 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Handle 0x003F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003D
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: ChannelB-DIMM0
Bank Locator: BANK 2
Type: Unknown
Type Detail: None
Bank 0 has an 8 GB module, Bank 2 is empty, and there is no Bank 1 anywhere in the entire output (but Bank 1 is always the one coming up in the errors).
I dunno, could this be actually referring to the CPU (maybe there are also banks for that? I’ve looked at it though, it’s pretty much soldered into the mobo)??
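For what it’s worth, the BANK in the mcelog lines is the number of a machine-check register bank inside the CPU, not the DIMM "Bank Locator" that dmidecode prints, so the two numberings are not expected to line up. The STATUS value from the report can be decoded by hand as a cross-check on what mcelog prints. Below is a minimal Python sketch of that, assuming the bit layout of IA32_MCi_STATUS and the cache-hierarchy error code format (000F 0001 RRRR TTLL) from Intel’s machine-check documentation. The function and table names are made up for illustration, and only status bits 63 to 57 and the low 16 bits are handled.

```python
# Sketch: decode a few fields of an IA32_MCi_STATUS value as printed by mcelog.
# Only cache-hierarchy error codes (000F 0001 RRRR TTLL) are recognised;
# other error classes (bus, TLB, memory controller) are left undecoded.

STATUS_FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# RRRR, TT and LL sub-fields of a cache-hierarchy error code.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def decode_status(status: int) -> dict:
    """Return the set flags and, if applicable, the cache error fields."""
    flags = [name
             for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code lives in bits 15:0
    result = {"flags": flags, "mca_code": code, "cache_error": None}
    # Ignore bit 12 (the F bit) when matching the 0000 0001 prefix.
    if (code & 0xEF00) == 0x0100:
        result["cache_error"] = {
            "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
            "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
            "level": LEVEL[code & 0x3],
        }
    return result


# STATUS values copied from the mcelog reports in this thread.
for raw in ("bf80000000000184", "bf80000000000114"):
    print(raw, decode_status(int(raw, 16)))
```

Decoded this way, the value bf80000000000184 gives a data cache, level 0, snoop request, which agrees with the "MCA: Data CACHE Level-0 Snoop Error" line. That points at the CPU’s cache hierarchy as the unit reporting the error, though it does not by itself say what caused it.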
I don’t know. A little googling suggests that dmidecode can give misleading information if the system is configured to interleave/stripe the memory chips so they behave as one larger and faster chip.
I would guess that your system is configured to treat two chips as a single dual-channel chip. Can you turn off memory interleaving in your BIOS? If so, does the readout then show two separate memory chips?
It might just be the way the chip is designed to function. I don’t know that much about it. But it looks like you have a bad memory chip. I would suggest running Memtest86+ and seeing what that reports.
Yeah, if you have another chip to try using in the system for a while, that would surely yield different results if the problem is with the current memory chip.
EDIT: Haha, alright it happened right away with the new memory module:
The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Mai 12 01:39:08 fedora mcelog[702]: Hardware event. This is not a software error.
Mai 12 01:39:08 fedora mcelog[702]: MCE 0
Mai 12 01:39:08 fedora mcelog[702]: CPU 1 BANK 1
Mai 12 01:39:08 fedora mcelog[702]: MISC 86 ADDR 1ea9a1780
Mai 12 01:39:08 fedora mcelog[702]: TIME 1683848340 Fri May 12 01:39:00 2023
Mai 12 01:39:08 fedora mcelog[702]: MCG status:
Mai 12 01:39:08 fedora mcelog[702]: MCi status:
Mai 12 01:39:08 fedora mcelog[702]: Uncorrected error
Mai 12 01:39:08 fedora mcelog[702]: Error enabled
Mai 12 01:39:08 fedora mcelog[702]: MCi_MISC register valid
Mai 12 01:39:08 fedora mcelog[702]: MCi_ADDR register valid
Mai 12 01:39:08 fedora mcelog[702]: Processor context corrupt
Mai 12 01:39:08 fedora mcelog[702]: SRAR
Mai 12 01:39:08 fedora mcelog[702]: MCA: Data CACHE Level-0 Read Error
Mai 12 01:39:08 fedora mcelog[702]: STATUS bf80000000000114 MCGSTATUS 0
Mai 12 01:39:08 fedora mcelog[702]: MCGCAP c08 APICID 2 SOCKETID 0
Mai 12 01:39:08 fedora mcelog[702]: MICROCODE f0
Mai 12 01:39:08 fedora mcelog[702]: CPUID Vendor Intel Family 6 Model 142 Step 9
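To compare this report with the earlier one side by side, the relevant fields can be pulled out of the mcelog lines with a few regular expressions. The following is only a rough Python sketch: the helper name `parse_event` is hypothetical, and the input is limited to the lines copied from the two reports in this thread (with the journal prefix removed).

```python
import re


def parse_event(lines):
    """Collect CPU, machine-check bank, address and MCA text from one event."""
    text = "\n".join(lines)
    cpu_bank = re.search(r"CPU (\d+) BANK (\d+)", text)
    addr = re.search(r"MISC [0-9a-f]+ ADDR ([0-9a-f]+)", text)
    mca = re.search(r"MCA: (.+)", text)
    return {
        "cpu": int(cpu_bank.group(1)),
        "bank": int(cpu_bank.group(2)),
        "addr": int(addr.group(1), 16),
        "mca": mca.group(1).strip(),
    }


# Lines taken from the two crash reports (old module, then new module).
old_module = [
    "CPU 1 BANK 1",
    "MISC 86 ADDR 271dd4c0",
    "MCi_ADDR register valid",
    "MCA: Data CACHE Level-0 Snoop Error",
]
new_module = [
    "CPU 1 BANK 1",
    "MISC 86 ADDR 1ea9a1780",
    "MCi_ADDR register valid",
    "MCA: Data CACHE Level-0 Read Error",
]

first = parse_event(old_module)
second = parse_event(new_module)

# Same CPU and same machine-check bank in both events?
same_unit = (first["cpu"], first["bank"]) == (second["cpu"], second["bank"])
print("same CPU and bank:", same_unit)
print("addresses:", hex(first["addr"]), hex(second["addr"]))
print("errors:", first["mca"], "|", second["mca"])
```

Both events come from CPU 1, BANK 1 and report a level 0 data cache error, while the addresses differ. Since the memory module was swapped between the two crashes, that may be a hint that the module is not the only suspect, but two events are not much to go on.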
Use a magnifier to check that contacts aren’t damaged.
Consider using contact cleaner and enhancer.
Due to the number of computers in autos, you can now buy “contact cleaner/enhancer” at auto parts stores. I used to buy it at Radio Shack. I just looked on Amazon, and contact enhancers are commonly bought together with canned air.
So is everybody in agreement that it’s a RAM issue, and that I should be able to get rid of it with a good RAM chip and clean contacts (I’ve never encountered problems with dusty contacts before though, but who knows)?
I do have a bottle of isopropyl ready. Would you say that’ll be enough?
It is possible the problem is with the socket or a component that was damaged by ESD or overheating. Running memtest86+ may help pin down the issue. Can you test the RAM in a different machine? Isopropyl is a cleaner, and should be followed by enhancer.
But the live media allows installing additional software while running.
It is also possible to get a live bootable ISO of memtest86+ that could be put onto a USB drive and booted to perform the test.