System runs, but getting hardware error every 5 minutes

Hi,

I am running this little Lenovo box as my main home server

OS: Fedora Linux 43 (Server Edition) x86_64
Host: 11JN000CGE (ThinkCentre M75q Gen 2)
Kernel: Linux 6.19.8-200.fc43.x86_64
CPU: AMD Ryzen 5 PRO 5650GE
GPU: AMD Radeon Vega Series / Radeon Vega Mobile Series [Integrated]
Memory: 64 GiB

It ran flawlessly for about two years and still seems to function normally; I have not had any crashes or application errors.

The errors below only started after upgrading the system to kernel 6.19.6:

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: Deferred error, no action required.

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[-|-|-|AddrV|PCC|-|SyndV|UECC|Deferred|Poison|Scrub]: 0x8724bb0800000000

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: Error Addr: 0x015ba3f300000020

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: IPID: 0x000700b040000000, Syndrome: 0x0000000000000042

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: L3 Cache Ext. Error Code: 0

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: cache level: RESV, tx: INSN

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: Deferred error, no action required.

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|SyndV|UECC|Deferred|Poison|-]: 0x8724ba8800000000

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: Error Addr: 0x0156ec4800000020

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: IPID: 0x000700b040000000, Syndrome: 0x0000000000000042

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: L3 Cache Ext. Error Code: 0

Message from syslogd@homeserver at Mar 22 18:03:31 ...
 kernel:[Hardware Error]: cache level: RESV, tx: INSN
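
As a cross-check, the flags in brackets can be recovered from the raw status values themselves. Below is a minimal Python sketch, not the kernel's actual decoder: the bit positions are my reading of the MCI_STATUS_* masks in the Linux kernel headers and should be treated as an assumption, and the names FLAGS and decode_status are made up for illustration.

```python
# Assumed bit positions, based on the Linux kernel's MCI_STATUS_* masks.
# They may not apply to other CPU families.
FLAGS = {
    63: "Val",
    62: "Over",
    61: "UC",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    55: "TCC",
    53: "SyndV",
    44: "Deferred",
    43: "Poison",
    40: "Scrub",
}


def decode_status(status):
    """Return the names of the flags set in a 64-bit MCA status value."""
    names = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    # Bits 46:45 are read together: 2 means corrected ECC, otherwise uncorrected.
    ecc = (status >> 45) & 0x3
    if ecc:
        names.append("CECC" if ecc == 2 else "UECC")
    return names


if __name__ == "__main__":
    # The two raw values from the log above.
    for bank, raw in (("MC11", 0x8724BB0800000000), ("MC14", 0x8724BA8800000000)):
        print(bank, "|".join(decode_status(raw)))
```

Under those assumptions, both values decode to the flags the kernel printed (plus the Val bit, which the kernel does not show in the brackets), with Scrub set only in the MC11 value.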

Does this really mean my CPU is cooked? I’ve run a few benchmarks that fully utilized both the CPU and GPU, but I do not get any errors when the CPU runs at 100%. They seem to appear only when it is idle or nearly idle.

The problem is that Lenovo claims I have no warranty left, and AMD refers me to Lenovo since this is an OEM CPU.

I really would love to not have to fork out another $250 to replace it with the same CPU, so I hope someone with a bit more knowledge than me can tell me what this actually means!

I’d appreciate any pointers to what is going on!

Thank you!

You probably need to start by posting the full messages from the journal, rather than whatever truncated output this is.

That is all that was in the log, I am not truncating anything.

But I went back to the oldest kernel on the system, 6.18.6, and it’s been running for an hour now without any errors.

I have no idea how you managed to get that output from journalctl.

Regardless, if this is resolved with a change of kernel, maybe you should consider reporting a kernel regression so that the root cause can be addressed.

You posted messages sent to the terminal.
That is related to, but not exactly the same as, the full log that will be in the system journal.

The logs seem to indicate that there is a hardware issue with the CPU.

Try getting the logs with sudo journalctl -b 0 -p err
What does that report when you boot from the kernel that creates the errors?

The terminal output was the same apart from the timestamps. I checked with sudo journalctl -p 3 -b, which gave me the same error messages.

It’s quite late and I really have to go to bed, but I will update to a 6.19.x kernel again when I have time to provide data for a regression report.

We are wondering whether the ends of the error messages were truncated.

Try piping journalctl via cat to stop it truncating.

e.g. sudo journalctl -p 3 -b | cat

If I were you, I’d run something like stress-ng --cpu 8 --timeout 300s and make sure it all executes properly, everything passes, and no errors are reported. Capture the journal output for each run to offer as evidence.

Maybe run this on both the latest kernel and 6.18.6 to demonstrate that one kernel throws these errors and the other doesn’t. If they both throw errors, then I would hazard a guess that you have a potentially failing CPU. According to the log you provided, it’s correcting these errors, but the cache line gets poisoned, which I fear may take it out of the mix until the next CPU reset.
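
To make the comparison between the two kernels concrete, the status lines could be counted per CPU and bank from a saved copy of each boot's journal. This is only a rough sketch: the regular expression is based solely on the message format quoted earlier in this thread, and count_mce_events and the script name are hypothetical.

```python
import re
import sys
from collections import Counter

# Matches the status line format quoted in this thread, e.g.
# "[Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[...]: 0x...".
STATUS_RE = re.compile(r"\[Hardware Error\]: CPU:(\d+) .*?(MC\d+)_STATUS")


def count_mce_events(lines):
    """Count machine-check status lines per (CPU, bank) pair."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for (cpu, bank), n in sorted(count_mce_events(fh).items()):
            print(f"CPU {cpu} {bank}: {n} event(s)")
```

For example, save the output of sudo journalctl -k -b -p 3 --no-pager to a file under each kernel, then run the script (say, python3 count_mce.py kernel-6.19.log) on each file and compare the counts.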

I will try to do that tomorrow.

As I stated above, I ran quite a few tests (stress-ng, 7z b, and a few Phoronix Test Suite tests) and they all passed.

stress-ng: info:  [350365] setting to a 5 mins run per stressor
stress-ng: info:  [350365] dispatching hogs: 12 cpu
stress-ng: warn:  [350365] WARNING! using HPET clocksource (refer to /sys/devices/system/clocksource/clocksource0), this may impact benchmarking performance
stress-ng: info:  [350365] skipped: 0
stress-ng: info:  [350365] passed: 12: cpu (12)
stress-ng: info:  [350365] failed: 0
stress-ng: info:  [350365] metrics untrustworthy: 0
stress-ng: info:  [350365] successful run completed in 5 mins

It is only a kernel error, so there is no need to stress test your CPU or anything. A few people have been having trouble with 6.19.

But it’s an error that the kernel is reporting about hardware failing…

Does the mobo have the latest firmware?

I thought I had the latest, but there is a new one from January 2026. I just updated it and am monitoring.

Edit: the system has been running for 30 minutes now without errors; yesterday that would have meant 6 entries in the log.

Thank you all for the help!
