MCE hardware errors on Fedora 36, Ryzen 5 7600X

I’ve been chasing a hardware error for some time, with little progress. On boot, I get the following message:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 31: ffffffffffffffff
kernel: mce: [Hardware Error]: TSC 0 MISC ff1fffffffffffff SYND ffffffffffffffff IPID ffffffffffffffff 
kernel: mce: [Hardware Error]: PROCESSOR 2:a60f12 TIME 1677145174 SOCKET 0 APIC 2 microcode a601203
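For anyone comparing their own log against this one, here is a minimal Python sketch that pulls the fields out of the lines above. The names `parse_mce` and `all_ones` are made up for illustration, and the regular expressions only assume the line layout shown in this post. Note that the status, SYND and IPID values have every bit set, so they do not decode to any specific error; what that implies about the cause is not something this sketch can tell you.

```python
import re
from datetime import datetime, timezone

# Lines copied from the log in this post.
LOG = """\
kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 31: ffffffffffffffff
kernel: mce: [Hardware Error]: TSC 0 MISC ff1fffffffffffff SYND ffffffffffffffff IPID ffffffffffffffff
kernel: mce: [Hardware Error]: PROCESSOR 2:a60f12 TIME 1677145174 SOCKET 0 APIC 2 microcode a601203
"""


def parse_mce(text):
    """Extract the fields of interest from the kernel's MCE lines (hypothetical helper)."""
    fields = {}
    m = re.search(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)", text)
    if m:
        fields["cpu"] = int(m.group(1))
        fields["bank"] = int(m.group(2))
        fields["status"] = int(m.group(3), 16)
    # MISC, SYND and IPID are printed as hexadecimal register values.
    for name in ("MISC", "SYND", "IPID"):
        m = re.search(name + r" ([0-9a-f]+)", text)
        if m:
            fields[name.lower()] = int(m.group(1), 16)
    m = re.search(r"TIME (\d+)", text)
    if m:
        # TIME is a Unix timestamp in seconds.
        fields["time"] = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    m = re.search(r"microcode ([0-9a-f]+)", text)
    if m:
        fields["microcode"] = int(m.group(1), 16)
    return fields


def all_ones(value, bits=64):
    """True if every one of the low `bits` bits is set."""
    return value == (1 << bits) - 1


if __name__ == "__main__":
    info = parse_mce(LOG)
    for key in ("status", "misc", "synd", "ipid"):
        print(key, hex(info[key]), "all ones" if all_ones(info[key]) else "")
    print("logged at", info["time"].isoformat())
```

On a live system the same text could be fed in from `journalctl -k` output instead of the hard-coded `LOG` string.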

Here is a short hardware overview:

System:
  Kernel: 6.1.12-100.fc36.x86_64 arch: x86_64 bits: 64
    Distro: Fedora release 36 (Thirty Six)
Machine:
  Type: Desktop Mobo: ASRock model: X670E Pro RS
    UEFI: American Megatrends LLC. v: 1.11 date: 10/21/2022
CPU:
  Info: 6-core model: AMD Ryzen 5 7600X bits: 64 type: MT MCP cache: L2: 6 MiB

The issue has been present across a range of kernels, from 5.19 to the recent 6.1.12. I also tried booting a live CD of an Ubuntu system I had lying around, with kernel 5.15, with the same result.

What has been tried so far:

  • A CMOS reset after various kernel updates (in case a kernel bug had caused the error to be stored somewhere, so that it kept showing up even after the bug was fixed).
  • MemTest86+ has been run for 4 passes, without errors.
  • I ran stressapptest for 1 hour without errors.
  • I ran mprime (Prime95) in blend mode for about 2 hours, without errors.

A good step forward would be to understand if I am facing a hardware error (that could be fixed under warranty) or a software error.

I should add that my system has been stable so far.

Thank you for your help.

Good description and hardware debugging. A Google search suggests a BIOS bug in your case. Start by disabling:
Global C-state Control
in the BIOS, and see what happens.
The message actually seems harmless, since you haven't noticed anything unusual, but you can never be sure.
Keep us informed.

Unfortunately, disabling Global C-state Control did not help. From what I could find, it only seemed to help when the MCE error occurred on hibernation.

As you have the same error (exactly the same?) across kernels and distributions, it definitely has to be a hardware problem: maybe the motherboard or the CPU, some bent pins, etc.
Here https://is.gd/tUobwo is a good discussion on the subject.
You could try running with a single memory module at a time and see what happens, because of:
Bank 31: ffffffffffffffff
although that does sound strange.

I was in contact with two other guys who had exactly the same error, with the same bank. Yet we all had different motherboard manufacturers, RAM manufacturers, and different models of the latest Zen 4 generation CPUs.

I would say that the probability that we all bent the same CPU pins to get the same error is fairly close to zero…

In another Arch thread, they mentioned changing the bootloader configuration to load amd-ucode before the initramfs. But since the microcode version mentioned in my error is the same as the one loaded by the kernel later on (as per journalctl), I don't know if that would change anything.
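The comparison I did by eye can be sketched in Python as follows. This is only an illustration: I read the revision from journalctl, whereas the sketch uses the `microcode` field of `/proc/cpuinfo` as a stand-in source, and the function names and the `SAMPLE` text are hypothetical.

```python
import re


def cpuinfo_microcode(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    found = re.findall(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.M)
    return {int(value, 16) for value in found}


def microcode_matches(mce_revision_hex, cpuinfo_text):
    """True if every core reports the revision printed in the MCE line."""
    return cpuinfo_microcode(cpuinfo_text) == {int(mce_revision_hex, 16)}


# Hypothetical two-core excerpt; on a real machine use open("/proc/cpuinfo").read().
SAMPLE = (
    "processor\t: 0\n"
    "microcode\t: 0xa601203\n"
    "processor\t: 1\n"
    "microcode\t: 0xa601203\n"
)

if __name__ == "__main__":
    # "a601203" is the value from the "microcode" field of the MCE message.
    print(microcode_matches("a601203", SAMPLE))
```

If the two values match, as they did for me, loading the microcode earlier would install the same revision, which is why I doubt it would make a difference.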

I can try pulling out a memory module.

This reminds me of problems that some people had with first-generation Ryzen and Linux. It eventually turned out to be a silicon defect in the early units, which AMD replaced:

I still do not understand whether you have encountered other problems besides that error message, which is sometimes actually harmless.
In any case, in parallel with other testing, I would write to ASRock and AMD to inform them of what is going on, and ask the two other guys to do the same.
I would also try other live distributions, with older kernels, in case it is a software problem.

Thank you for your reply.

I haven’t encountered any issues so far, but at the price point of my system, I don’t want to risk something showing up down the line once the hardware happens to be out of warranty. Better safe than sorry.

I’ll try contacting both ASRock and AMD.

Kind regards.

Well, before writing to ASRock, I re-re-checked their BIOS updates, and found that they (finally, after 3 months) released a BIOS update last week with an AGESA update. That fixed the issue.

Glad you solved it. It was a hardware problem, as expected.
The moral of this story is: never be an early adopter.