MCE hardware errors on Fedora 36, Ryzen 5 7600X

fransje · February 23, 2023, 10:46am

I’ve been chasing a hardware error for some time, with little progress. On boot, I get the following message:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 31: ffffffffffffffff
kernel: mce: [Hardware Error]: TSC 0 MISC ff1fffffffffffff SYND ffffffffffffffff IPID ffffffffffffffff 
kernel: mce: [Hardware Error]: PROCESSOR 2:a60f12 TIME 1677145174 SOCKET 0 APIC 2 microcode a601203
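
The fields in these lines can be decoded mechanically. The following is a minimal sketch, not an official tool: `parse_mce` and `decode_cpuid` are hypothetical helper names, and the family/model split assumes the standard CPUID signature layout (extended fields applied when the base family is 0xF). It also flags a status word of all ones, which may indicate that the register read returned no meaningful data rather than a real error signature.

```python
import re
from datetime import datetime, timezone

ALL_ONES_64 = 0xFFFFFFFFFFFFFFFF

# Log lines as quoted in the post above.
SAMPLE = [
    "mce: [Hardware Error]: Machine check events logged",
    "mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 31: ffffffffffffffff",
    "mce: [Hardware Error]: TSC 0 MISC ff1fffffffffffff SYND ffffffffffffffff IPID ffffffffffffffff",
    "mce: [Hardware Error]: PROCESSOR 2:a60f12 TIME 1677145174 SOCKET 0 APIC 2 microcode a601203",
]


def decode_cpuid(eax):
    """Split a CPUID signature into family, model and stepping."""
    base_family = (eax >> 8) & 0xF
    base_model = (eax >> 4) & 0xF
    ext_family = (eax >> 20) & 0xFF
    ext_model = (eax >> 16) & 0xF
    if base_family == 0xF:
        # Extended fields only apply when the base family is saturated.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family, model = base_family, base_model
    return {"family": family, "model": model, "stepping": eax & 0xF}


def parse_mce(lines):
    """Extract bank, status, CPU signature, time and microcode from MCE log lines."""
    text = " ".join(lines)
    bank = re.search(r"Bank (\d+): ([0-9a-f]+)", text)
    proc = re.search(r"PROCESSOR (\d+):([0-9a-f]+)", text)
    when = re.search(r"TIME (\d+)", text)
    ucode = re.search(r"microcode ([0-9a-f]+)", text)
    stamp = datetime.fromtimestamp(int(when.group(1)), tz=timezone.utc)
    return {
        "bank": int(bank.group(1)),
        "status_all_ones": int(bank.group(2), 16) == ALL_ONES_64,
        "cpu": decode_cpuid(int(proc.group(2), 16)),
        "time_utc": stamp.isoformat(),
        "microcode": int(ucode.group(1), 16),
    }


if __name__ == "__main__":
    for key, value in parse_mce(SAMPLE).items():
        print(f"{key}: {value}")
```

For the lines above, the signature `a60f12` decodes to family 0x19, model 0x61, stepping 2, and the `TIME` field is a Unix timestamp from the morning of February 23, 2023 (UTC).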

Here is a short hardware overview:

System:
  Kernel: 6.1.12-100.fc36.x86_64 arch: x86_64 bits: 64
    Distro: Fedora release 36 (Thirty Six)
Machine:
  Type: Desktop Mobo: ASRock model: X670E Pro RS
    UEFI: American Megatrends LLC. v: 1.11 date: 10/21/2022
CPU:
  Info: 6-core model: AMD Ryzen 5 7600X bits: 64 type: MT MCP cache: L2: 6 MiB

The issue has been present over a range of kernels, from 5.19 to the recent 6.1.12. I also tried booting a live CD of an Ubuntu system I had lying around, with kernel 5.15, with the same result.

What has been tried so far:

A CMOS reset after various kernel updates (in case it was due to some kernel bug, with the error being logged somewhere and therefore showing up even though the issue was fixed.)
MemTest86+ has been run for 4 passes, without errors.
I ran stressapptest for 1 hour without errors.
I ran mprime (Prime95) in blend mode for about 2 hours, without errors.

A good step forward would be to understand if I am facing a hardware error (that could be fixed under warranty) or a software error.

I should add that my system has been stable so far.

Thank you for your help.

caesar · February 23, 2023, 5:39pm

Good description and hardware debugging. A Google search suggests a BIOS bug in your case. Start by disabling:
Global C-state control
in the BIOS, and see what happens.
Actually, the message seems harmless, since you haven’t noticed anything unusual, but you never know.
Keep us informed.

fransje · February 24, 2023, 1:37pm

Disabling Global C-state control did not help, unfortunately. From what I could find, it only seemed to help when the MCE error occurred on hibernation.

caesar · February 24, 2023, 6:07pm

As you have the same error (exactly the same?) across kernels and distributions, it must be a hardware problem: maybe the motherboard or the CPU, some bent pins, etc.
Here https://is.gd/tUobwo is a good discussion on the subject.
You could try running with a single memory module at a time and see what happens, because of:
Bank 31: ffffffffffffffff
yet it sounds strange.

fransje · February 24, 2023, 7:36pm

I was in contact with two other people who had exactly the same error, with the same bank number. Yet we all had different motherboard manufacturers, different RAM manufacturers, and different models of the latest Zen 4 generation CPUs.

I would say that the probability that we all bent the same CPU pins to get the same error is fairly close to zero…

In another Arch thread, they mentioned changing the bootloader configuration to load amd-ucode before the initramfs. But since the microcode revision in my error is the same as the one the kernel loads later on (as per journalctl), I don’t know if that would change anything.
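
That comparison can be scripted. This is a rough sketch under stated assumptions: the function names are hypothetical, and it assumes `/proc/cpuinfo` reports the revision on lines of the form `microcode : 0x...`, as on x86 Linux systems. It compares the revision printed in the MCE line with the revision(s) currently reported for each core.

```python
import re
from pathlib import Path


def mce_microcode(mce_text):
    """Return the microcode revision printed in an MCE log line, or None."""
    match = re.search(r"microcode ([0-9a-fA-F]+)", mce_text)
    return int(match.group(1), 16) if match else None


def cpuinfo_microcode(cpuinfo_text):
    """Return the set of microcode revisions reported across all cores."""
    values = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(value, 16) for value in values}


if __name__ == "__main__":
    logged = mce_microcode(
        "PROCESSOR 2:a60f12 TIME 1677145174 SOCKET 0 APIC 2 microcode a601203"
    )
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.is_file():
        running = cpuinfo_microcode(cpuinfo.read_text())
        print("logged: ", hex(logged))
        print("running:", sorted(hex(rev) for rev in running))
        print("match:  ", running == {logged})
```

If the two revisions match, early microcode loading is unlikely to be the missing piece, which is the reasoning given above.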

I can try pulling out a memory module.

caesar · February 24, 2023, 8:29pm

This reminds me of the problems some people had with first-generation Ryzen and Linux. It eventually turned out to be a silicon glitch in early units, which AMD replaced.

I still do not understand whether you have encountered other problems besides that error message, which is sometimes actually harmless.
In any case, in parallel with other testing, I would write to ASRock and AMD informing them of what is going on, and ask the two other people to do the same.
I would also try other live distributions, with older kernels, in case it is a software problem.

fransje · February 27, 2023, 1:44pm

Thank you for your reply.

I haven’t encountered any issues so far, but at the price point of my system, I don’t want to risk something showing up down the line once the hardware happens to be out of warranty. Better safe than sorry.

I’ll try contacting both ASRock and AMD.

Kind regards.

fransje · February 27, 2023, 2:10pm

Well, before writing to ASRock, I re-re-checked their BIOS updates and found that they (finally, after 3 months) released a BIOS update last week with an AGESA update. That fixed the issue.
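
To confirm which firmware is actually running after such an update, the DMI fields exposed under `/sys/class/dmi/id` can be read without extra tools. A minimal sketch, assuming a Linux system that exposes those files; `read_bios_info` is a hypothetical helper name, and the output does not include the AGESA version, which is not among the fields read here.

```python
from pathlib import Path


def read_bios_info(dmi_dir="/sys/class/dmi/id"):
    """Read BIOS vendor, version and date from the DMI sysfs directory.

    Fields that are missing or unreadable are returned as None.
    """
    info = {}
    for field in ("bios_vendor", "bios_version", "bios_date"):
        path = Path(dmi_dir) / field
        try:
            info[field] = path.read_text().strip()
        except OSError:
            info[field] = None
    return info


if __name__ == "__main__":
    for key, value in read_bios_info().items():
        print(f"{key}: {value}")
```

Comparing `bios_version` and `bios_date` before and after flashing shows whether the new firmware took effect; the hardware overview in the first post lists v 1.11 dated 10/21/2022 as the starting point.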

caesar · February 27, 2023, 8:51pm

Glad you solved it. It was a hardware problem, as expected.
The moral of this story is: never be an early adopter.
