How to find a hardware error (system freeze, automatic reboot)

Hello,

This is my first post, so I hope it's OK here.

I have the following problems once or twice a day, and only a hard reset gets the machine back:

  • system freeze, or
  • automatic system reboot

The log says: mce: [Hardware Error]: Machine check events logged

It only happens while I am using a browser (Firefox, Chrome) or Citrix Workspace.
I can play online games for hours and it never happens.

How can I find out whether it is a hardware problem, and what is wrong?
(The problem also occurs with different kernels and Nvidia drivers, and on Fedora 36, Linux Mint, and Manjaro.)

Thank you in advance,
Peter

My system:

            .',;::::;,'.                frodo@fedora 
         .';:cccccccccccc:;,.            ------------ 
      .;cccccccccccccccccccccc;.         OS: Fedora Linux 37 (Workstation Edition) x86_64 
    .:cccccccccccccccccccccccccc:.       Kernel: 6.1.15-200.fc37.x86_64 
  .;ccccccccccccc;.:dddl:.;ccccccc;.     Uptime: 22 mins 
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.    Packages: 2505 (rpm), 39 (flatpak) 
.:ccccccccccccc;KMMc;cc;xMMc:ccccccc:.   Shell: bash 5.2.15 
,cccccccccccccc;MMM.;cc;;WW::cccccccc,   Resolution: 2560x1440 
:cccccccccccccc;MMM.;cccccccccccccccc:   DE: GNOME 43.3 
:ccccccc;oxOOOo;MMM0OOk.;cccccccccccc:   WM: Mutter 
cccccc:0MMKxdd:;MMMkddc.;cccccccccccc;   WM Theme: Adwaita 
ccccc:XM0';cccc;MMM.;cccccccccccccccc'   Theme: Adwaita [GTK2/3] 
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;    Icons: ePapirus [GTK2/3] 
ccccc;0MNc.ccc.xMMd:ccccccccccccccc;     Terminal: gnome-terminal 
cccccc;dNMWXXXWM0::cccccccccccccc:,      CPU: AMD Ryzen 5 1600X (12) @ 3.600GHz 
cccccccc;.:odl:.;cccccccccccccc:,.       GPU: NVIDIA GeForce GTX 1050 Ti 
:cccccccccccccccccccccccccccc:'.         Memory: 2474MiB / 15924MiB 
.:cccccccccccccccccccccc:;,..
  '::cccccccccccccc::;,.          

BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.1.15-200.fc37.x86_64 root=UUID=8c7a27a0-bb7b-4484-9ebe-d6156adda441 ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 initcall_blacklist=simpledrm_platform_driver_init

You seem to have ruled out most software, although different Linux distributions may all use the same buggy firmware blob. Consider trying nouveau.

Concentrate first on easily fixable problems: cables, connectors, power, and memory. Search here for advice on memtest86+; if you are using UEFI Secure Boot, you need a recent version.

I live in an area surrounded by ocean and have colleagues who take computers to sea on research vessels, so I see connections start to fail after a few years. The big name-brand vendors have improved their connectors, but sometimes poor connectors end up in high-end gear. Now that automobiles depend on computers, auto parts stores sell “contact enhancer”. Just reseating connections after inspecting them for signs of corrosion often gets things working for a short time; contact enhancer helps to keep connections good for a long time.

The easy way to check a power supply is to swap it with a known good unit. Modern manufacturing tolerances are such that power supplies often fail just after the warranty ends. I always buy a second laptop power supply, which is easier now that many laptops can use USB-C power.

Hello,

I have found a solution for the freeze problem: disable C-states in the BIOS.
ASUS UEFI Bios > Advanced\AMD CBS > Global C-state control = Disabled
ASUS UEFI Bios > Advanced\AMD CBS > Power Supply Idle Control = Typical Current Idle
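To see from the running system which idle states the kernel currently exposes, something like the following Python sketch may help. It reads the standard Linux cpuidle sysfs directory for cpu0; the function name `read_idle_states` is made up for this example, and whether the BIOS settings above change what appears here depends on the board and the idle driver, so treat the output as a hint only.

```python
from pathlib import Path

# Standard Linux sysfs location for the idle states of the first CPU.
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def read_idle_states(cpuidle_dir=CPUIDLE_DIR):
    """Return a list of (directory, state name, disabled) tuples.

    Hypothetical helper for illustration. Returns an empty list when the
    directory does not exist (for example, when no cpuidle driver is loaded).
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []

    # Keep only entries named "state<N>" and sort them numerically.
    state_dirs = [
        p for p in base.iterdir()
        if p.name.startswith("state") and p.name[5:].isdigit()
    ]
    state_dirs.sort(key=lambda p: int(p.name[5:]))

    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        # "disable" holds 0 when the state is usable, non-zero otherwise.
        disabled = disable_file.exists() and disable_file.read_text().strip() != "0"
        states.append((state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    found = read_idle_states()
    if not found:
        print("No cpuidle states exposed for cpu0")
    for dirname, name, disabled in found:
        print(f"{dirname}: {name}" + (" (disabled)" if disabled else ""))
```

Running it once before and once after changing the BIOS settings gives two lists to compare.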

Additional Information:
It seems there is a problem with AMD Ryzen CPUs and C-states. See Bugzilla and the ArchWiki.

Before I changed the BIOS settings, journalctl showed this error message; now it seems fixed.
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
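For anyone who wants to check their own logs for the same kind of messages, here is a minimal Python sketch that filters kernel log text for lines tagged `[Hardware Error]` or `[Firmware Bug]`, the two tags quoted in this thread. The function name `find_flagged_lines` and the script name used below are made up for the example, and the sketch only finds lines; it does not interpret them.

```python
import re
import sys

# The two kernel message tags that appear in this thread.
PATTERN = re.compile(r"\[(Hardware Error|Firmware Bug)\]")


def find_flagged_lines(lines):
    """Return the lines that carry one of the tags, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    # Read kernel log text from standard input and print only flagged lines.
    for flagged in find_flagged_lines(sys.stdin):
        print(flagged)
```

Saved as, say, `scan_log.py`, it could be fed the kernel messages of the previous boot with `journalctl -k -b -1 | python3 scan_log.py`, assuming the journal is kept across reboots.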
