Hi fellow Fedora folks!
First off, I want to thank everyone in advance for any help you can give on this matter. I know this is not a Fedora-specific issue, but I have to start somewhere…
In addition, I must apologize if this gets a little long-winded.
I recently assembled a new computer, currently running Fedora 29, mainly for Blender and ray tracing. However, I am running into a nasty issue.
Under certain conditions I am getting machine check exceptions (MCEs). The system freezes for a second or two, then reboots. After the reboot, the journal has the following errors logged:
mce: [Hardware Error]: Machine check events logged
Nov 06 09:59:19 localhost.localdomain kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: b200000000070005
Nov 06 09:59:19 localhost.localdomain kernel: mce: [Hardware Error]: TSC 0
Nov 06 09:59:19 localhost.localdomain kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1573052358 SOCKET 0 APIC 0 microcode 200004d
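For anyone reading along, the status value on the Bank 0 line can be unpacked roughly as follows. This is only a sketch: it assumes the value is a standard IA32_MCi_STATUS register and uses the architectural bit layout and the "simple error code" table from the Intel SDM. The function name `decode_mci_status` and the returned dictionary are made up for illustration, and the model-specific bits are left undecoded.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed in the kernel's mce line.
# Only the architectural flag bits and the simple MCA error codes are handled;
# anything else is reported as "not decoded here".

FLAG_BITS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Simple MCA error codes (bits 15:0) from the Intel SDM.
SIMPLE_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
    0x0400: "internal timer error",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status word into flags and error-code fields."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    return {
        "flags": flags,
        "mca_code": mca_code,
        "mca_meaning": SIMPLE_CODES.get(mca_code, "not decoded here; see the SDM"),
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    info = decode_mci_status(0xB200000000070005)  # value from the log above
    for flag in info["flags"]:
        print("flag:", flag)
    print("MCA error code: 0x%04x (%s)" % (info["mca_code"], info["mca_meaning"]))
    print("model-specific code: 0x%04x" % info["model_specific_code"])
```

For the value in the log this reports VAL, UC, EN and PCC set, with MCA error code 0x0005. An uncorrected error with the processor context marked corrupt would be consistent with the machine resetting instead of logging a panic, but the decode alone does not say which component is at fault.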
I am not overclocking in any way, using BIOS defaults.
Temperatures seem to be reasonable.
Here is my config:
Intel(R) Core™ i9-7920X CPU @ 2.90GHz (24 thread)
Asus Prime X299-A motherboard
16 GB Corsair DDR4-2666 DRAM
Nvidia Driver ver 418.74
Cooler Master Liquid cooler
Linux zeus.acetech.net 5.0.13-200.fc29.x86_64 #1 SMP Mon May 6 00:49:54 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
So far the problem shows up only in Blender; however, I don’t use this machine for much else other than games, which run fine. During a render the failure may come quickly or after an hour, and with some renders it never happens at all. In other words, it does not fail at the same place every time. However, with the one specific render I’m currently working on, it is very reproducible.
I do not believe it’s a temperature issue. The CPU case temperature never gets above 70C, and the crash has also happened during other operations in Blender where the CPU was under very low thermal stress.
I have tried running mcelog in daemon mode before the crash, but it does not seem to log anything: no /var/log/mcelog is ever created. Perhaps it never gets the chance, or I am using mcelog incorrectly.
Another interesting thing is that no kernel panics are logged, just a lockup and reboot, after which the journal only contains the MCE errors as seen above.
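Since mcelog has not produced a log file, the journal lines are the only record so far. Below is a rough sketch for pulling the fields out of saved kernel log text (for example, output captured from `journalctl -k`). The name `parse_mce_lines` and the record layout are hypothetical, the regexes only cover the two line shapes shown in this post, and the family/model/stepping split assumes the number after `PROCESSOR 0:` is a standard CPUID signature in hex.

```python
import re
from datetime import datetime, timezone

# Patterns for the two kernel line shapes quoted in this post.
BANK_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)")
PROC_RE = re.compile(
    r"PROCESSOR (\d+):([0-9a-fA-F]+) TIME (\d+) SOCKET (\d+) "
    r"APIC ([0-9a-fA-F]+) microcode ([0-9a-fA-F]+)"
)


def split_cpuid_signature(sig: int) -> dict:
    """Split a CPUID signature into family, model and stepping."""
    family = (sig >> 8) & 0xF
    model = (sig >> 4) & 0xF
    if family in (0x6, 0xF):
        model |= ((sig >> 16) & 0xF) << 4  # extended model bits
    if family == 0xF:
        family += (sig >> 20) & 0xFF  # extended family bits
    return {"family": family, "model": model, "stepping": sig & 0xF}


def parse_mce_lines(text: str) -> list:
    """Collect one record per 'Bank' line, enriched by the following PROCESSOR line."""
    records = []
    current = None
    for line in text.splitlines():
        if "mce: [Hardware Error]" not in line:
            continue
        m = BANK_RE.search(line)
        if m:
            current = {
                "cpu": int(m.group(1)),
                "bank": int(m.group(2)),
                "status": int(m.group(3), 16),
            }
            records.append(current)
            continue
        m = PROC_RE.search(line)
        if m and current is not None:
            current.update(split_cpuid_signature(int(m.group(2), 16)))
            current["time_utc"] = datetime.fromtimestamp(int(m.group(3)), tz=timezone.utc)
            current["socket"] = int(m.group(4))
            current["microcode"] = int(m.group(6), 16)
    return records


if __name__ == "__main__":
    sample = (
        "kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: b200000000070005\n"
        "kernel: mce: [Hardware Error]: TSC 0\n"
        "kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1573052358 "
        "SOCKET 0 APIC 0 microcode 200004d\n"
    )
    for rec in parse_mce_lines(sample):
        print(rec)
```

Collecting records this way makes it easier to compare several crashes and see whether the same bank and status value come up every time.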
So basically my questions are:
Is this definitely a hardware issue? If so, since everything is still under warranty, I need to find the culprit so I can get a replacement…
Is this a CPU-only issue? Perhaps a firmware update is needed?
Could it be a RAM issue? I will be running memtest86 overnight tonight to rule that out.
My internet searches also found this interesting tidbit which seems to apply to me:
“Are you overclocked? Even if you aren’t doing it intentionally, some of the X299 motherboards have BIOS bugs that improperly (or not at all) implement the AVX and AVX512 clock offsets. This causes the processor to try to run AVX/AVX512 at much higher speeds than the processor is specified for which may cause instability. In effect, many X299 motherboards are improperly overclocking the processor out-of-box because they don’t follow Intel’s specifications.”
So I am not sure how to proceed, and any suggestions are greatly appreciated.
It sucks that I spent all these $$ on a nice rig that won’t do what it was specifically purchased to do.