I am running this little Lenovo box as my main home server:
OS: Fedora Linux 43 (Server Edition) x86_64
Host: 11JN000CGE (ThinkCentre M75q Gen 2)
Kernel: Linux 6.19.8-200.fc43.x86_64
CPU: AMD Ryzen 5 PRO 5650GE
GPU: AMD Radeon Vega Series / Radeon Vega Mobile Series [Integrated]
Memory: 64 GiB
It ran flawlessly for about two years and still seems to function normally: I have not had any crashes or applications causing errors.
The messages below only started appearing after I upgraded the system to kernel 6.19.6:
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: Deferred error, no action required.
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[-|-|-|AddrV|PCC|-|SyndV|UECC|Deferred|Poison|Scrub]: 0x8724bb0800000000
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: Error Addr: 0x015ba3f300000020
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: IPID: 0x000700b040000000, Syndrome: 0x0000000000000042
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: L3 Cache Ext. Error Code: 0
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: cache level: RESV, tx: INSN
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: Deferred error, no action required.
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|SyndV|UECC|Deferred|Poison|-]: 0x8724ba8800000000
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: Error Addr: 0x0156ec4800000020
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: IPID: 0x000700b040000000, Syndrome: 0x0000000000000042
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: L3 Cache Ext. Error Code: 0
Message from syslogd@homeserver at Mar 22 18:03:31 ...
kernel:[Hardware Error]: cache level: RESV, tx: INSN
Does this really mean my CPU is cooked? I’ve run a few benchmarks that fully utilized both the CPU and the GPU, but I do not get any errors when the CPU runs at 100%. They only seem to appear when it is idle or nearly idle.
The problem is that Lenovo claims I have no warranty left, and AMD refers me to Lenovo since this is an OEM CPU.
I really would love to not have to fork out another $250 to replace it with the same CPU, so I hope someone with a bit more knowledge than me can tell me what this actually means!
I have no idea how you managed to get that output from journalctl.
Regardless, if this is resolved with a change of kernel, maybe you should consider reporting a kernel regression so that the root cause can be addressed.
If I were you, I’d run something like stress-ng --cpu 8 --timeout 300s and make sure it all executes properly, everything passes and no errors are reported. Capture the journal output for each run to offer as evidence.
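A rough sketch of how a run and its kernel messages could be captured together, in case it helps. The function names and the returned dictionary layout are made up for illustration; the only external commands used are stress-ng with the options above and journalctl with -k, --no-pager and --since. Reading the kernel journal may need root or membership in the systemd-journal group.

```python
import platform
import subprocess
from datetime import datetime


def hardware_error_lines(journal_text):
    """Keep only the kernel log lines that report a hardware error."""
    return [line for line in journal_text.splitlines() if "[Hardware Error]" in line]


def stress_and_capture(stress_cmd=("stress-ng", "--cpu", "8", "--timeout", "300s")):
    """Run the stress command, then collect kernel messages logged since it started."""
    # journalctl --since accepts a "YYYY-MM-DD HH:MM:SS" timestamp.
    started = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    stress = subprocess.run(list(stress_cmd), capture_output=True, text=True)
    journal = subprocess.run(
        ["journalctl", "-k", "--no-pager", "--since", started],
        capture_output=True,
        text=True,
    )
    return {
        "kernel": platform.release(),           # kernel release currently booted
        "stress_exit_code": stress.returncode,  # non-zero means stress-ng itself failed
        "stress_output": stress.stdout + stress.stderr,
        "hardware_errors": hardware_error_lines(journal.stdout),
    }
```

Calling stress_and_capture() once per booted kernel and saving the returned dictionary would give one report per kernel to put side by side.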
Maybe run this for the latest kernel and 6.18.6 to demonstrate that one kernel throws these errors and the other doesn’t. If they both throw errors, then I would hazard a guess that you have a potentially failing CPU. According to the log you provided, it’s correcting these errors, but the cache line gets poisoned, which I fear may take it out of the mix until the next CPU reset.
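For anyone who wants to read the flags straight from the raw STATUS values in the log, here is a small decoder sketch. The bit positions are my assumption, taken from the MCI_STATUS_* definitions in the Linux kernel's x86 mce.h as I remember them; they do reproduce the flags the kernel printed for both values above. FLAG_BITS and decode_status are names invented for this example.

```python
# Assumed bit positions (MCI_STATUS_* in the Linux kernel's x86 mce.h).
FLAG_BITS = {
    "Val": 63,
    "Over": 62,
    "UC": 61,
    "En": 60,
    "MiscV": 59,
    "AddrV": 58,
    "PCC": 57,
    "TCC": 55,
    "SyndV": 53,
    "CECC": 46,
    "UECC": 45,
    "Deferred": 44,
    "Poison": 43,
    "Scrub": 40,
}


def decode_status(status):
    """Return the names of the flags set in a 64-bit MCA STATUS value."""
    return [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]


# The two values from the log: MC11_STATUS and MC14_STATUS.
for bank, value in (("MC11", 0x8724BB0800000000), ("MC14", 0x8724BA8800000000)):
    print(bank, "|".join(decode_status(value)))
```

With those assumed positions, both values decode to Val, AddrV, PCC, SyndV, UECC, Deferred and Poison, and only the MC11 value also has Scrub set, which matches the bracketed flags in the log.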
As I stated above, I ran quite a few tests (stress-ng, 7z b and a few Phoronix suite tests) and they all passed.
stress-ng: info: [350365] setting to a 5 mins run per stressor
stress-ng: info: [350365] dispatching hogs: 12 cpu
stress-ng: warn: [350365] WARNING! using HPET clocksource (refer to /sys/devices/system/clocksource/clocksource0), this may impact benchmarking performance
stress-ng: info: [350365] skipped: 0
stress-ng: info: [350365] passed: 12: cpu (12)
stress-ng: info: [350365] failed: 0
stress-ng: info: [350365] metrics untrustworthy: 0
stress-ng: info: [350365] successful run completed in 5 mins