Machine check error: How to diagnose this hardware problem (find out which component is broken)?

My PC often crashes and reboots while streaming video, and sometimes fails to power on: the red LED (indicating that it’s plugged in and power is available) turns green (indicating power on), but the screen remains black and nothing else happens. In this state, I can’t turn it off using the power button and have to unplug it.

I just installed Fedora 38 on this machine:

[ben@fedora ~]$ inxi --basic
System:
  Host: fedora Kernel: 6.2.14-300.fc38.x86_64 arch: x86_64 bits: 64
    Desktop: KDE Plasma v: 5.27.4 Distro: Fedora release 38 (Thirty Eight)
Machine:
  Type: Laptop Mobo: ZOTAC model: ZBOX-CI527/CI547 v: XX
    serial: <superuser required> UEFI: American Megatrends v: B331P206
    date: 04/25/2017
CPU:
  Info: dual core Intel Core i5-7200U [MT MCP] speed (MHz): avg: 600
    min/max: 400/3100
Graphics:
  Device-1: Intel HD Graphics 620 driver: i915 v: kernel
  Display: x11 server: X.Org v: 1.20.14 with: Xwayland v: 22.1.9 driver: X:
    loaded: modesetting unloaded: fbdev,vesa dri: iris gpu: i915
    resolution: 1920x1080~60Hz
  API: OpenGL v: 4.6 Mesa 23.0.3 renderer: Mesa Intel HD Graphics 620 (KBL
    GT2)
Network:
  Device-1: Intel Ethernet I219-LM driver: e1000e
  Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    driver: r8169
  Device-3: Intel Wireless 3165 driver: iwlwifi
  Device-4: Intel Bluetooth wireless interface type: USB driver: btusb
Drives:
  Local Storage: total: 489.05 GiB used: 8.64 GiB (1.8%)
Info:
  Processes: 433 Uptime: 47m Memory: 7.65 GiB used: 4.42 GiB (57.7%)
  Shell: Bash inxi: 3.3.26

I’ve been having hardware issues with it under the previous OS as well (so it’s now definitely confirmed that it’s not software related), including crashes when streaming video (a machine check error is produced) and complete failure to power on.

Now Fedora is being kind enough to show me this problem report after a crash:

I checked, and mcelog is actually installed, so my first question is, how can I make it start during boot? I checked its man page and wasn’t able to find that info.

My goal is to find out if the broken part is the easily replaceable RAM or SSD, or something non-replaceable (in which case I’d have to scrap this old box completely :frowning_face: ).
I’ve probably already excluded the SSD though, since I was using a different one with Manjaro before, encountered the same issues, and swapped it out for this exact reason. I also replaced the power supply; it has an external transformer like a laptop because it’s a mini PC.

Any help is greatly appreciated, I’m guessing the best course of action would be to make mcelog more helpful, but other suggestions are welcome. (btw I’m not very familiar with all the workings of Fedora yet, just figuring out dnf and such)

$ sudo systemctl enable mcelog.service should enable the service. systemctl status mcelog.service will tell you if it is working.

Hmmm.
You stated PC, and indicated that unplugging is needed to power off, yet the inxi output shows that it is a laptop.

Is the battery dead and not properly charging? If so then the hardware issue may be contributing to the power problems (and potentially the machine checks). A dead battery that may have one or more shorted cells may easily pull the voltage provided by the charging system below that needed for normal operation.

My suggestion would be to replace the battery and get the charging issue fixed, or else replace the machine.

It’s a mini pc (I swear it has no battery) with a laptop chipset, maybe that’s why inxi thinks it is one.

So it crashed again, and this time (with the mcelog daemon now enabled) the error report has this output:

The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Mai 11 21:34:53 fedora mcelog[704]: Hardware event. This is not a software error.
Mai 11 21:34:53 fedora mcelog[704]: MCE 0
Mai 11 21:34:53 fedora mcelog[704]: CPU 1 BANK 1
Mai 11 21:34:53 fedora mcelog[704]: MISC 86 ADDR 271dd4c0
Mai 11 21:34:53 fedora mcelog[704]: TIME 1683833685 Thu May 11 21:34:45 2023
Mai 11 21:34:53 fedora mcelog[704]: MCG status:
Mai 11 21:34:53 fedora mcelog[704]: MCi status:
Mai 11 21:34:53 fedora mcelog[704]: Uncorrected error
Mai 11 21:34:53 fedora mcelog[704]: Error enabled
Mai 11 21:34:53 fedora mcelog[704]: MCi_MISC register valid
Mai 11 21:34:53 fedora mcelog[704]: MCi_ADDR register valid
Mai 11 21:34:53 fedora mcelog[704]: Processor context corrupt
Mai 11 21:34:53 fedora mcelog[704]: SRAR
Mai 11 21:34:53 fedora mcelog[704]: MCA: Data CACHE Level-0 Snoop Error
Mai 11 21:34:53 fedora mcelog[704]: STATUS bf80000000000184 MCGSTATUS 0
Mai 11 21:34:53 fedora mcelog[704]: MCGCAP c08 APICID 2 SOCKETID 0
Mai 11 21:34:53 fedora mcelog[704]: MICROCODE f0
Mai 11 21:34:53 fedora mcelog[704]: CPUID Vendor Intel Family 6 Model 142 Step 9

Also this info:

There’s more output below that, let me know if you think there might be useful info there.

Can anyone glean from this which part seems to be the problem?

I’m not familiar with reading these MCE logs, but the following looks like a RAM chip and address to me.

Mai 11 21:34:53 fedora mcelog[704]: CPU 1 BANK 1
Mai 11 21:34:53 fedora mcelog[704]: MISC 86 ADDR 271dd4c0

If you run sudo dmidecode, do you see a line that reads Bank Locator: CPU 1 Bank 1 and additional information about that RAM chip?
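
To cut the dmidecode output down to just the memory slots, something like the following may help. It is only a sketch: mem_slots is a made-up helper name, and it assumes the usual dmidecode layout in which each slot is a "Memory Device" block with Locator, Bank Locator and Size fields.

```shell
# mem_slots (hypothetical helper): read dmidecode output on stdin and print
# one "Locator | Bank Locator | Size" line per Memory Device entry.
mem_slots() {
    awk '
        # Print the entry collected so far, if any, and reset the state.
        function emit() {
            if (in_dev) printf "%s | %s | %s\n", loc, bank, size
            in_dev = 0
        }
        /^Handle /             { emit(); next }
        /^Memory Device/       { emit(); in_dev = 1; loc = bank = size = ""; next }
        in_dev && /^[ \t]*Locator:/      { sub(/^[^:]*:[ \t]*/, ""); loc  = $0 }
        in_dev && /^[ \t]*Bank Locator:/ { sub(/^[^:]*:[ \t]*/, ""); bank = $0 }
        in_dev && /^[ \t]*Size:/         { sub(/^[^:]*:[ \t]*/, ""); size = $0 }
        END { emit() }
    '
}

# Typical use (needs root; the output depends on the machine):
#   sudo dmidecode --type 17 | mem_slots
```

DMI type 17 is the "Memory Device" record type, so --type 17 already drops most unrelated entries before the helper runs.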

There’s something funny here, I’ve come across it already:

Handle 0x003E, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x003D
	Error Information Handle: Not Provided
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 8 GB
	Form Factor: SODIMM
	Set: None
	Locator: ChannelA-DIMM0
	Bank Locator: BANK 0
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2667 MT/s
	Manufacturer: Samsung
	Serial Number: 32237AFE
	Asset Tag: 9876543210
	Part Number: M471A1K43CB1-CTD    
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V

Handle 0x003F, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x003D
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: ChannelB-DIMM0
	Bank Locator: BANK 2
	Type: Unknown
	Type Detail: None

Bank 0 has an 8GB module, Bank 2 is empty and there is no Bank 1 - not contained in the entire output (but it’s always the one coming up in the errors).

I dunno, could this be actually referring to the CPU (maybe there are also banks for that? :grimacing: I’ve looked at it though, it’s pretty much soldered into the mobo)??
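
One way to check is to decode the STATUS value from the log by hand and see what the CPU itself reported. Below is a rough sketch, assuming the bit layout of the IA32_MCi_STATUS register and the "cache hierarchy" compound error code format (000F 0001 RRRR TTLL) described in the Intel Software Developer's Manual; decode_mci_status is a made-up helper name. As far as I understand, the BANK that mcelog prints is the number of the machine-check register bank inside the CPU that logged the event, not a DIMM bank as listed by dmidecode.

```shell
# decode_mci_status (hypothetical helper): decode the flag bits and the
# cache-hierarchy error code from a hex MCi_STATUS value as printed by mcelog.
decode_mci_status() (
    # Left-pad to 16 hex digits and split into two 32-bit halves, so the
    # arithmetic does not depend on how the shell treats 64-bit overflow.
    s=$(printf '%16s' "$1" | tr ' ' '0')
    hi=$((0x${s%????????}))   # bits 63..32
    lo=$((0x${s#????????}))   # bits 31..0

    # Value of a single status bit in the range 32..63.
    bit() { echo $(( (hi >> ($1 - 32)) & 1 )); }

    printf 'VAL=%s OVER=%s UC=%s EN=%s MISCV=%s ADDRV=%s PCC=%s S=%s AR=%s\n' \
        "$(bit 63)" "$(bit 62)" "$(bit 61)" "$(bit 60)" "$(bit 59)" \
        "$(bit 58)" "$(bit 57)" "$(bit 56)" "$(bit 55)"

    code=$((lo & 0xFFFF))     # architectural MCA error code, bits 15..0
    # Cache hierarchy errors match 000F 0001 RRRR TTLL (bit 12 is ignored).
    if [ $((code & 0xEF00)) -eq $((0x0100)) ]; then
        case $(( (code >> 2) & 3 )) in
            0) tt=instruction ;; 1) tt=data ;; 2) tt=generic ;; *) tt=reserved ;;
        esac
        case $((code & 3)) in
            3) ll=LG ;; *) ll=L$((code & 3)) ;;
        esac
        case $(( (code >> 4) & 15 )) in
            0) rr=ERR ;; 1) rr=RD ;; 2) rr=WR ;; 3) rr=DRD ;; 4) rr=DWR ;;
            5) rr=IRD ;; 6) rr=PREFETCH ;; 7) rr=EVICT ;; 8) rr=SNOOP ;;
            *) rr=unknown ;;
        esac
        printf 'MCA %04x: cache hierarchy error, %s %s %s\n' "$code" "$tt" "$ll" "$rr"
    else
        printf 'MCA %04x: not a cache hierarchy code (not decoded here)\n' "$code"
    fi
)

decode_mci_status bf80000000000184
```

For the STATUS from the log above, this reports UC, EN, MISCV, ADDRV, PCC, S and AR as set, and decodes the code 0184 as a data cache, level 0, snoop request, which matches the "Data CACHE Level-0 Snoop Error" line that mcelog printed.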

I don’t know. A little googling suggests that dmidecode can give misleading information if the system is configured to interleave/stripe the memory chips so they behave as one larger and faster chip.

One of the comments in this stackexchange post suggests that lshw may give better results.

lshw -class memory

Well according to lshw, Bank 1 is the empty one:

$ sudo lshw -class memory
  *-firmware                
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: B331P206
       date: 04/25/2017
       size: 64KiB
       capacity: 16MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 3d
       slot: System board or motherboard
       size: 8GiB
     *-bank:0
          description: SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2667 MHz (0,4 ns)
          product: M471A1K43CB1-CTD
          vendor: Samsung
          physical id: 0
          serial: 32237AFE
          slot: ChannelA-DIMM0
          size: 8GiB
          width: 64 bits
          clock: 2667MHz (0.4ns)
     *-bank:1
          description: [empty]
          physical id: 1
          slot: ChannelB-DIMM0
  *-cache:0
       description: L1 cache
       physical id: 41
       slot: L1 Cache
       size: 128KiB
       capacity: 128KiB
       capabilities: synchronous internal write-back unified
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 42
       slot: L2 Cache
       size: 512KiB
       capacity: 512KiB
       capabilities: synchronous internal write-back unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: 43
       slot: L3 Cache
       size: 3MiB
       capacity: 3MiB
       capabilities: synchronous internal write-back unified
       configuration: level=3
  *-memory UNCLAIMED
       description: Memory controller
       product: Sunrise Point-LP PMC
       vendor: Intel Corporation
       physical id: 1f.2
       bus info: pci@0000:00:1f.2
       version: 21
       width: 32 bits
       clock: 33MHz (30.3ns)
       configuration: latency=0
       resources: memory:df344000-df347fff

Hmm, but the slot shows ChannelB-DIMM0. For comparison, my PC shows two distinct “DIMMs”.

[/root]# lshw -class memory
  *-firmware                
       description: BIOS
       vendor: LENOVO
       physical id: 0
       version: O2HKT61A
       date: 06/17/2019
       size: 64KiB
       capacity: 8MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
  *-cache:0
       description: L1 cache
       physical id: 22
       slot: L1 CACHE
       size: 160KiB
       capacity: 160KiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 23
       slot: L2 CACHE
       size: 1MiB
       capacity: 1MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=2
  *-memory
       description: System Memory
       physical id: 38
       slot: System board or motherboard
       size: 16GiB
     *-bank:0
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
          product: CMK16GX4M2B3200C16
          vendor: Corsair
          physical id: 0
          serial: 00000000
          slot: DIMM 0
          size: 8GiB
          width: 64 bits
          clock: 2133MHz (0.5ns)
     *-bank:1
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
          product: CMK16GX4M2B3200C16
          vendor: Corsair
          physical id: 1
          serial: 00000000
          slot: DIMM 1
          size: 8GiB
          width: 64 bits
          clock: 2133MHz (0.5ns)

I would guess that your system is configured to treat two chips as a single dual-channel chip. Can you turn off memory interleaving in your BIOS? If so, does the readout then show two separate memory chips?

I can’t find any option for that :frowning:

Would you expect the readout to be the same if I put that RAM in the other slot?
I’m also wondering if I should try swapping it out for a different module…

It might just be the way the chip is designed to function. I don’t know that much about it. But it looks like you have a bad memory chip. I would suggest running Memtest86+ and seeing what that reports.

Yeah, if you have another chip to try using in the system for a while, that would surely yield different results if the problem is with the current memory chip.

Okay I swapped out the memory chip for this:

 *-firmware                
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: B331P206
       date: 04/25/2017
       size: 64KiB
       capacity: 16MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 3d
       slot: System board or motherboard
       size: 16GiB
     *-bank:0
          description: SODIMM DDR4 Synchronous Unbuffered (Unregistered) 3200 MHz (0,3 ns)
          product: 16ATF2G64HZ-3G2E1
          vendor: Micron
          physical id: 0
          serial: 21828463
          slot: ChannelA-DIMM0
          size: 16GiB
          width: 64 bits
          clock: 3200MHz (0.3ns)
     *-bank:1
          description: [empty]
          physical id: 1
          slot: ChannelB-DIMM0

Will see if it still brings the crashes…

EDIT: Haha, alright it happened right away with the new memory module:

The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Mai 12 01:39:08 fedora mcelog[702]: Hardware event. This is not a software error.
Mai 12 01:39:08 fedora mcelog[702]: MCE 0
Mai 12 01:39:08 fedora mcelog[702]: CPU 1 BANK 1
Mai 12 01:39:08 fedora mcelog[702]: MISC 86 ADDR 1ea9a1780
Mai 12 01:39:08 fedora mcelog[702]: TIME 1683848340 Fri May 12 01:39:00 2023
Mai 12 01:39:08 fedora mcelog[702]: MCG status:
Mai 12 01:39:08 fedora mcelog[702]: MCi status:
Mai 12 01:39:08 fedora mcelog[702]: Uncorrected error
Mai 12 01:39:08 fedora mcelog[702]: Error enabled
Mai 12 01:39:08 fedora mcelog[702]: MCi_MISC register valid
Mai 12 01:39:08 fedora mcelog[702]: MCi_ADDR register valid
Mai 12 01:39:08 fedora mcelog[702]: Processor context corrupt
Mai 12 01:39:08 fedora mcelog[702]: SRAR
Mai 12 01:39:08 fedora mcelog[702]: MCA: Data CACHE Level-0 Read Error
Mai 12 01:39:08 fedora mcelog[702]: STATUS bf80000000000114 MCGSTATUS 0
Mai 12 01:39:08 fedora mcelog[702]: MCGCAP c08 APICID 2 SOCKETID 0
Mai 12 01:39:08 fedora mcelog[702]: MICROCODE f0
Mai 12 01:39:08 fedora mcelog[702]: CPUID Vendor Intel Family 6 Model 142 Step 9

But it is a different memory address. Maybe you have two bad memory chips? :person_shrugging:
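
To put the two ADDR values side by side, they can be converted from hex to an offset in MiB. This is just a small sketch; addr_to_mib is a made-up helper name, and it assumes ADDR is a physical byte address printed in hex without a 0x prefix, as in the mcelog lines above.

```shell
# addr_to_mib (hypothetical helper): convert a hex physical address, as
# printed in the mcelog ADDR field, to whole MiB (rounded down).
addr_to_mib() {
    echo $(( 0x$1 / 1048576 ))
}

addr_to_mib 271dd4c0    # ADDR from the crash with the 8 GiB module
addr_to_mib 1ea9a1780   # ADDR from the crash with the 16 GiB module
```

The first address is about 625 MiB into physical memory and the second about 7849 MiB, so the two events are far apart in the address space. Both were logged by CPU 1, BANK 1, and both were classified as Data CACHE Level-0 errors.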

Make sure there’s no dust in ram slots (you can use isopropyl to clean the contacts) and test them both with fresh (supporting UEFI)

When installing memory you should take anti-static measures.

Use a magnifier to check that contacts aren’t damaged.

Consider using contact cleaner and enhancer.
Due to the number of computers in autos, you can now buy “contact cleaner/enhancer” at auto parts stores. I used to buy it at Radio Shack. I just looked on Amazon, and contact enhancers are commonly bought together with canned air.

So is everybody in agreement that it’s a RAM issue, and that I should be able to get rid of it with a good RAM chip and clean contacts (I’ve never encountered problems with dusty contacts before though, but who knows)?

I do have a bottle of Isopropyl ready. Would you say it’ll be enough?

It is possible the problem is with the socket or a component that was damaged by ESD or overheating. Running memtest86+ may help pin down the issue. Can you test the RAM in a different machine? Isopropyl is a cleaner, and should be followed by enhancer.

Yeah I can, but the question is how? The Fedora live medium doesn’t have it.

But the live media allows installing additional software while running.
It is also possible to get a live bootable iso of memtest86+ that could be put onto a usb drive and booted to perform the test.
