Fedora 30 randomly freezes

I have an Intel NUC-KIT NUC8I7BEH with the most recent BIOS, a SATA SSD and a 970 EVO Plus NVMe M.2 SSD.

The SATA SSD holds a bootable Fedora 30 (upgraded from 29). The NVMe SSD holds Fedora 30 Silverblue.

Both Fedora installations freeze at least once a day, every day. When it happens, the mouse, keyboard and network all stop responding, the machine cannot be pinged, and the disk activity LED stays dark.

I disconnected everything except the mouse, keyboard and monitor.

Very often fsck says the filesystem has to be repaired. I have already lost data (my Firefox profile).

What can I do?

If these freezes continue/persist I have to say good-bye to Fedora. Very sad.

I tried Windows 10 To Go. No freezes.

To begin with, please check the system logs to see what it says at the time of the freeze.
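
After a hard freeze, the messages of interest are in the journal of the previous boot, which can be read once the machine is back up. A minimal sketch, assuming the journal is persistent (a /var/log/journal directory exists); the function name logs_before_freeze is made up for illustration:

```shell
# Show what the previous boot logged right before the machine went down.
# Assumes a persistent journal; with a volatile journal "-b -1" finds nothing.
logs_before_freeze() {
    # List the recorded boots so the frozen session can be identified.
    journalctl --list-boots --no-pager
    # Kernel messages of the previous boot, priority "warning" and worse.
    sudo journalctl -k -b -1 -p warning --no-pager
    # The last 100 lines of the full journal of the previous boot.
    sudo journalctl -b -1 -n 100 --no-pager
}
```

If the freeze corrupted the filesystem, the last few seconds before the hang may be missing from this output.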

Checking logs is always good advice, however if the machine is freezing so completely that the filesystem is corrupted, I wouldn’t be surprised if there’s nothing useful in the logs — whatever message might be output at the time of the freeze probably didn’t get a chance to get written to disk.

It sounds very much like a hardware issue, quite possibly overheating. Several Amazon users have reported overheating issues with these models, on Fedora specifically.

One thing I’d suggest, if you haven’t already, is to install and configure the lm_sensors monitoring package, along with its logging daemon.

# 1. Install the packages
$ sudo dnf install lm_sensors lm_sensors-sensord
# 2. Interactively configure the necessary drivers
$ sudo sensors-detect
# 3. Enable and start the logging daemon
$ sudo systemctl enable sensord
$ sudo systemctl start sensord

Before step 3, you might want to edit /etc/sysconfig/sensord and lower the LOG_INTERVAL from the default 20m to something like 5m (or even 2m or 1m, temporarily), to have a better chance at capturing a picture of the hardware state shortly before a freeze.
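
That edit could be scripted roughly as follows. This is only a sketch: it assumes the file contains a line starting with LOG_INTERVAL= as described above, and the helper name set_log_interval is hypothetical.

```shell
# Hypothetical helper: rewrite the LOG_INTERVAL line of a sensord config file.
# Assumes the file has a line starting with "LOG_INTERVAL=" (check yours first).
set_log_interval() {
    file="$1"      # e.g. /etc/sysconfig/sensord
    interval="$2"  # e.g. 5m
    # Keep a backup as <file>.bak, then replace the whole LOG_INTERVAL line.
    sed -i.bak "s/^LOG_INTERVAL=.*/LOG_INTERVAL=${interval}/" "$file"
}
```

The real file is owned by root, so the one-line equivalent would be something like sudo sed -i.bak 's/^LOG_INTERVAL=.*/LOG_INTERVAL=5m/' /etc/sysconfig/sensord, followed by sudo systemctl restart sensord if the daemon is already running.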

After step 2, you’ll know the sensors are configured correctly if the sensors command outputs at least somewhat useful data. What you actually get can vary wildly. On my two machines, I get:

$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +38.0°C  (high = +80.0°C, crit = +99.0°C)
Core 0:        +38.0°C  (high = +80.0°C, crit = +99.0°C)
Core 1:        +36.0°C  (high = +80.0°C, crit = +99.0°C)
Core 2:        +34.0°C  (high = +80.0°C, crit = +99.0°C)
Core 3:        +37.0°C  (high = +80.0°C, crit = +99.0°C)
# vs
$ sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:   +27.0°C  
Core0 Temp:   +19.0°C  
Core1 Temp:   +36.0°C  
Core1 Temp:   +23.0°C  

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +13.0°C  (crit = +70.0°C)

it8716-isa-0290
Adapter: ISA adapter
in0:          +1.10 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in1:          +2.50 V  (min =  +0.00 V, max =  +4.08 V)
in2:          +1.79 V  (min =  +0.00 V, max =  +4.08 V)
in3:          +3.36 V  (min =  +0.00 V, max =  +4.08 V)
in4:          +3.02 V  (min =  +0.00 V, max =  +4.08 V)
in5:          +1.17 V  (min =  +0.00 V, max =  +4.08 V)
in6:          +2.93 V  (min =  +0.00 V, max =  +4.08 V)
in7:          +3.02 V  (min =  +0.00 V, max =  +2.03 V)  ALARM
Vbat:         +2.05 V  
fan1:        1824 RPM  (min =    0 RPM)
fan2:         883 RPM  (min =    0 RPM)
temp1:        +13.0°C  (low  = +127.0°C, high = +65.0°C)  sensor = thermal diode
temp2:        +34.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp3:        +25.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
intrusion0:  ALARM

The former machine has an unsupported hardware monitoring chip, so all I get is the CPU thermal monitoring. The latter machine uses an it87 hardware sensor, so I get a wealth of readouts (even if the ranges aren’t properly configured to monitor them).

If the machine is overheating, you may be able to install and use the cpupower command to increase the CPU throttling. Or, there may be options in the BIOS to adjust how reactive the system is to thermal load.
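
As a rough sketch of the cpupower route: the function below caps the maximum CPU frequency so the package produces less heat. The 2.0GHz default is an arbitrary illustration rather than a tuned value, the function name is made up, and on Fedora the cpupower command is, as far as I know, shipped in the kernel-tools package.

```shell
# Sketch: cap the maximum CPU frequency to reduce heat output.
# The default of 2.0GHz is an illustrative assumption, not a recommendation.
cap_cpu_freq() {
    max_freq="${1:-2.0GHz}"
    # Show the current frequency limits and governor before changing anything.
    cpupower frequency-info --policy
    # Lower the upper frequency limit; this does not persist across reboots.
    sudo cpupower frequency-set --max "$max_freq"
}
```

Calling cap_cpu_freq 1.8GHz would apply a lower cap. Since the setting is lost at reboot, it is a cheap experiment for seeing whether the freezes correlate with temperature at all.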

I keep “journalctl --all --follow” and “dmesg --follow --reltime” running to monitor the system, and I examined their output after the freezes. There were kernel messages about CPU throttling, but there are fewer of them since the last BIOS update. Intel has withdrawn the previous BIOS versions for these NUCs.

I installed lm_sensors with the daemon:
Mai 28 12:00:23 f30 sensord[1150]: Chip: coretemp-isa-0000
Mai 28 12:00:23 f30 sensord[1150]: Adapter: ISA adapter
Mai 28 12:00:23 f30 sensord[1150]: Package id 0: 61.0 C
Mai 28 12:00:23 f30 sensord[1150]: Core 0: 55.0 C
Mai 28 12:00:23 f30 sensord[1150]: Core 1: 57.0 C
Mai 28 12:00:23 f30 sensord[1150]: Core 2: 57.0 C
Mai 28 12:00:23 f30 sensord[1150]: Core 3: 55.0 C
Mai 28 12:00:23 f30 sensord[1150]: Chip: acpitz-acpi-0
Mai 28 12:00:23 f30 sensord[1150]: Adapter: ACPI interface
Mai 28 12:00:23 f30 sensord[1150]: temp1: -263.2 C
Mai 28 12:00:23 f30 sensord[1150]: temp2: 27.8 C
Mai 28 12:00:23 f30 sensord[1150]: Chip: iwlwifi-virtual-0
Mai 28 12:00:23 f30 sensord[1150]: Adapter: Virtual device
Mai 28 12:00:23 f30 sensord[1150]: Error getting sensor data: iwlwifi/#0: Can’t read
Mai 28 12:00:23 f30 sensord[1150]: sensor read error (-1)

sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +42.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +41.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +40.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +40.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +40.0°C (high = +100.0°C, crit = +100.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: -263.2°C
temp2: +27.8°C (crit = +119.0°C)

iwlwifi-virtual-0
Adapter: Virtual device
temp1: N/A

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1: +41.0°C

The temp1 reading of -263.2 °C is obviously wrong.

I never changed the cooling options in the BIOS, but maybe that happened with the BIOS updates. I also want to ask the Intel community how to keep the NUC cooled as well as possible, for example by cleaning out dust.

Thanks for your help.

The CPU temperature is about 40 °C when the system freezes. It happens while I am moving the mouse cursor.

Silverblue froze once again. I lost a lot of work. This shouldn’t happen.

I don’t think it’s because of heating.

I was editing a text and had been doing the same thing for minutes. There was no heavy load on the system.

What can I do?

Very bad news from Intel: Linux on the Intel NUC is not supported by Intel!

Is this true?

Ugh, well, if that’s what they say :slightly_frowning_face:

Please do follow up with them to clarify what they mean if you can. It is worth testing with a suggested OS, though—just to verify that your hardware is not at fault.

No Linux on Intel NUCs? Really?

Linux* Support for Intel® NUC

http://compatibleproducts.intel.com/ProductDetails?activeModule=Intel®%20NUC#

And what about Intel’s own Linux distribution? According to its About page (About | Clear Linux* Project), the Clear Linux team uses multiple methods to optimize for performance on Intel products.

I’ll test my Intel NUC with Clear Linux, which is distributed by Intel, and report back.

I had similar types of behavior from my AMD desktop when I first built it. I ended up fixing it by changing some BIOS settings related to sleep states. Could be something similar for Intel?

Maybe it’s due to a Samsung SSD with bad firmware.
Search for: samsung ssd freezing

Since I updated the firmware on the Samsung SSDs, there seem to be no more freezes.
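
For anyone wanting to compare their drive against the vendor's current release, the firmware revision a drive reports can be read with smartctl. A small sketch, assuming the smartmontools package is installed; the function name and the device paths in the note below are examples only:

```shell
# Print the firmware revision a drive reports (SATA or NVMe).
# Requires smartmontools; pass the device node as the first argument.
drive_firmware() {
    sudo smartctl -i "$1" | grep -i 'firmware version'
}
```

On a machine like the one in this thread that might be drive_firmware /dev/sda for the SATA SSD and drive_firmware /dev/nvme0 for the NVMe SSD, though the actual device names can differ.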

Well, that’s great news! The 970 Evo is a pretty new device, so I can buy that it needed a firmware patch to avoid throwing the motherboard into a bad state. Hopefully there haven’t been any more hangs?

To avoid this firmware states of hardware used running Fedora should be tested/checked as deeply as possible and warnings should be given.

A wonderful tool exists for this purpose: the Linux Vendor Firmware Service
https://fwupd.org/

This service could be extended to at least give a warning when hardware with dangerous (freeze-inducing) firmware is present in a system.

Interesting idea, but how would one go about testing ALL the firmware that is out there? Testing only the software that is included in each release is a mammoth task.

https://fedoraproject.org/wiki/QA:Release_validation_test_plan

(You can help! Join the Fedora QA team!: QA/Join - Fedora Project Wiki)

LVFS is also developed by a Fedora/Red Hat contributor :wink: . It aims to provide firmware updates, not to check firmware for errors.

You can read about all the work that goes into even getting firmware listed using LVFS here:

https://blogs.gnome.org/hughsie/

Firmware is really something the manufacturer should be verifying. Your best bet in any of these cases is to push the manufacturer to fix their firmware/hardware. After all, you did pay them for it :slight_smile:

Sorry. My sentence “To avoid this firmware states of hardware used running Fedora should be tested/checked as deeply as possible and warnings should be given.” is bad. I am not a native English speaker; I speak German.

I meant “firmware states (versions) should be checked”, not “firmware (contents)”.

I wanted to propose a service in Fedora that checks for important firmware updates, not one that checks the contents of the new firmware. It would work the same way as for CPU microcode.

For example, my Fedora 30 Silverblue system has two Samsung SSDs. The service would read the firmware version of my SSDs and look up whether an important or urgent firmware update is available. If so, it would warn me to update to the new Samsung firmware urgently.

There are websites that try to collect as much information as possible about newly available firmware:
https://www.station-drivers.com/index.php?option=com_remository&Itemid=353&func=select&id=309&lang=fr . Maybe Fedora knows of better sources of information about firmware versions.

No worries, most of us are not native English speakers either :slight_smile:

The LVFS service aims to do just this, but they contact the vendors directly. You can see what progress they’ve made here:

https://fwupd.org/lvfs/vendorlist

Unfortunately, it isn’t enough to be aware of new versions of firmware. The manufacturers have to test and release updates so that these can be used with their hardware. That’s what LVFS is trying to do.
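
On the client side, LVFS is queried through fwupd and its fwupdmgr command. A minimal sketch of checking whether anything is on offer for the hardware in a given machine (the function name is made up). Nothing in this thread confirms that a particular SSD vendor publishes through LVFS, so a device may be listed without any updates, or not listed at all.

```shell
# Ask fwupd (the LVFS client) whether any device has a firmware update on offer.
check_firmware_updates() {
    fwupdmgr refresh       # fetch the current LVFS metadata
    fwupdmgr get-devices   # devices fwupd recognises, with their firmware versions
    fwupdmgr get-updates   # updates LVFS offers for those devices, if any
}
```

An offered update can then be applied with fwupdmgr update.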

So after completing these three steps, where will the logs be saved? There is no obvious file in /var/log.

The logs from sensord? They’re in the journal.

sudo journalctl -t sensord -e

To view them starting from the most recent. You’ll only be browsing the latest 1000 lines by default. Add -b to view everything since the last reboot, or -n 10000 or whatever (# of output lines) to look farther back in time. Alternatively, you can use the --since and --until flags to set a specific timeframe.
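
A couple of variations on that command, as a sketch. The function names are made up, the timestamps in the note below are placeholders, and -b -1 assumes the journal survives reboots:

```shell
# Last sensord readings logged during the previous boot,
# i.e. the ones closest to the freeze that forced the reboot.
readings_before_freeze() {
    lines="${1:-40}"
    sudo journalctl -t sensord -b -1 -n "$lines" --no-pager
}

# sensord readings inside an explicit time window.
readings_between() {
    sudo journalctl -t sensord --since "$1" --until "$2" --no-pager
}
```

For example, readings_between "2019-05-28 11:30" "2019-05-28 12:05" would show the window around the log excerpt posted earlier in this thread.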

I’ve also had lots of random freezes due to my NVMe SSD. This solved the issue for me: [SOLVED] Install Fedora on SSD nVME Western Digital

Add "nvme_core.default_ps_max_latency_us=5500" to the kernel startup parameters in GRUB.

I’m not sure whether this applies to your issue, as you seem to have the same problem with your SATA SSD, but maybe it helps others who come to this thread.
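
Editing the entry in the GRUB menu only lasts for one boot. A sketch of making the parameter permanent, assuming grubby on a traditional Fedora install and rpm-ostree on Silverblue; the function names are made up, and the value 5500 is simply the one quoted above:

```shell
KARG="nvme_core.default_ps_max_latency_us=5500"

# Traditional Fedora install: grubby adds the argument to every boot entry.
add_karg_grubby() {
    sudo grubby --update-kernel=ALL --args="$KARG"
}

# Silverblue: kernel arguments are managed by rpm-ostree
# and take effect with the next deployment (after a reboot).
add_karg_silverblue() {
    rpm-ostree kargs --append="$KARG"
}

# After rebooting, check that the running kernel was started with the argument.
karg_is_active() {
    grep -qF "$KARG" "${1:-/proc/cmdline}"
}
```

After a reboot, karg_is_active && echo active confirms that the setting took effect.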

Mmm, the Arch Linux wiki has good information (as usual) on NVMe APST (the Autonomous Power State Transition behavior configured by that default_ps_max_latency_us tunable), starting at the APST section I linked to but also continuing into the “Troubleshooting” section that follows.

Long story short, while setting arbitrarily-high values like 5500 probably works, they offer some details on how the nvme get-feature command can be used to examine the device’s APST behavior and determine the best value for a particular SSD, should the timeouts need to be adjusted. And, worst-case, nvme_core.default_ps_max_latency_us can apparently be set to 0 to completely disable APST on systems where it’s a problem. Also good to know, I’d think.
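
A rough sketch of that inspection, assuming the nvme-cli package is installed and the controller is /dev/nvme0 (the device name and the function name are assumptions):

```shell
# Inspect APST on one NVMe controller (requires nvme-cli).
show_apst() {
    dev="${1:-/dev/nvme0}"
    # Feature 0x0c is Autonomous Power State Transition; -H decodes it.
    sudo nvme get-feature "$dev" -f 0x0c -H
    # Power state table: keep the lines carrying entry/exit latencies (enlat/exlat).
    sudo nvme id-ctrl "$dev" | grep 'exlat'
    # Current value of the kernel tunable in microseconds, if the module exposes it.
    param=/sys/module/nvme_core/parameters/default_ps_max_latency_us
    if [ -r "$param" ]; then cat "$param"; fi
}
```

As far as I understand the kernel's behavior, power states whose combined entry and exit latency exceeds the tunable are left out of APST, so comparing the enlat and exlat figures against the configured value shows which states the drive is allowed to enter.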