NVMe fails randomly

succinea · February 2, 2025, 3:23pm

The slot looks snug. I did a bit of cleaning, and the temperature of the NVMe on a fresh boot dropped to 35 C (from 40 C).

It ran for 45 minutes without any errors, so the cause must have been overheating… Thank you all!

william8000 · February 5, 2025, 5:25am

A good NVMe drive should be able to run at around 60 or 70 C.

Switching from LUKS to hardware encryption uses less CPU, which keeps the entire system cooler. Most recent NVMe drives always encrypt their data. Enabling encryption just means protecting the internal key with a password, so it has no run-time cost.

theqlp · February 5, 2025, 6:13am

Are there any BIOS settings that control other portions of PCIe power management that may or may not ignore the parameters that are being set?

Espionage724 · February 5, 2025, 8:05am

Here’s another kernel boot option:

nvme.noacpi=1

Succinea nebula:

The NVMe just crashed again, and after powering it off using the power switch and booting right back up, the number of unsafe shutdowns went up
Data Units Read				: 13,034,382 (6.67 TB)
Data Units Written			: 16,280,338 (8.34 TB)
host_read_commands			: 143,639,501
host_write_commands			: 461,202,839
controller_busy_time			: 941
power_cycles				: 1,574
power_on_hours				: 3,776
unsafe_shutdowns			: 107

Emanuele:

If we compare it with my NVMe drive (29), which has been in use for three years, the unsafe shutdowns on your NVMe seem high.

Data Units Read                         : 58.219.808 (29,81 TB)
Data Units Written                      : 54.265.003 (27,78 TB)
host_read_commands                      : 1.461.223.268
host_write_commands                     : 1.253.834.617
controller_busy_time                    : 1.948
power_cycles                            : 1.960
power_on_hours                          : 6.17
unsafe_shutdowns                        : 29

I have this for a SAMSUNG PM9A1 1TB; I’m confident there’s no way I did anything close to that amount of unsafe shutdowns (guessing it just counts most/all power-offs)

Data Units Read				: 46,224,411 (23.67 TB)
Data Units Written			: 62,394,716 (31.95 TB)
host_read_commands			: 595,358,752
host_write_commands			: 671,779,027
controller_busy_time			: 3,910
power_cycles				: 1,318
power_on_hours				: 13,179
unsafe_shutdowns			: 505
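To compare the three drives on a more even footing, here is a minimal Python sketch that relates unsafe_shutdowns to power_cycles. It only uses the two counters quoted above; the function names are made up for illustration, and it assumes both counters are whole numbers whose "," or "." characters are thousands separators.

```python
import re


def counter(smart_log: str, field: str) -> int:
    """Return one integer counter from 'name : value' smart-log text.

    Assumption: the value is a whole number, so every non-digit character
    (',' or '.' thousands separators, spaces) can simply be dropped.
    """
    for line in smart_log.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == field:
            return int(re.sub(r"\D", "", value))
    raise KeyError(field)


def unsafe_shutdown_ratio(smart_log: str) -> float:
    """Fraction of power cycles that the drive recorded as unsafe shutdowns."""
    return counter(smart_log, "unsafe_shutdowns") / counter(smart_log, "power_cycles")


# The two relevant counters from each drive quoted in this thread.
logs = {
    "succinea": "power_cycles : 1,574\nunsafe_shutdowns : 107",
    "Emanuele": "power_cycles : 1.960\nunsafe_shutdowns : 29",
    "Espionage724": "power_cycles : 1,318\nunsafe_shutdowns : 505",
}

for owner, log in logs.items():
    print(f"{owner}: {unsafe_shutdown_ratio(log):.1%} of power cycles were unsafe")
```

By this measure the drive with 505 unsafe shutdowns has by far the highest ratio, so the raw count alone does not say much about whether a drive is failing.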

succinea · February 5, 2025, 2:07pm

The NVMe dropped after a while of use; the 40 C reading was taken just after boot. I think the temperature likely increased significantly after that.

There haven’t been any more crashes since cleaning.

succinea · February 5, 2025, 2:12pm

I am not aware of any.

However, lspci -vvv does show ASPM L1 Enabled for 01:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller BG5 (DRAM-less) (prog-if 02 [NVM Express])
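For anyone who wants to check the same thing across all devices, a rough Python sketch that pulls the ASPM state out of saved lspci -vvv output could look like this. The function name is made up, and the sample LnkCtl line is an assumption about what lspci typically prints, not a copy of the actual output from this machine.

```python
import re


def aspm_by_device(lspci_output: str) -> dict[str, str]:
    """Map each PCI slot address to the ASPM state found on its LnkCtl line."""
    states = {}
    device = None
    for line in lspci_output.splitlines():
        # Device header lines start in column 0; detail lines are indented.
        if line and not line[0].isspace():
            device = line.split()[0]
        match = re.search(r"LnkCtl:\s*ASPM ([^;]+);", line)
        if match and device:
            states[device] = match.group(1).strip()
    return states


# Assumed sample in the usual lspci -vvv layout (illustrative only).
sample = (
    "01:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller\n"
    "\tCapabilities: [70] Express (v2) Endpoint, MSI 00\n"
    "\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+\n"
)

print(aspm_by_device(sample))
```

To try it on real data, save the output first (for example with sudo lspci -vvv > lspci.txt) and pass the file contents to aspm_by_device.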

succinea · February 5, 2025, 2:21pm

Not sure exactly how the issue was fixed, but the NVMe no longer crashes!

For the future DenverCoder9:

Try cleaning the fan.
Try adding the following to your kernel command line: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off iommu.strict=1 pcie_port_pm=off cpufreq.default_governor=performance
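After rebooting, it is worth confirming that the options actually made it onto the running kernel's command line. Below is a small Python sketch for that; missing_params is a made-up helper name, and the list uses pcie_aspm=off, which is the spelling documented in the kernel parameter list.

```python
from pathlib import Path

# Options suggested in this thread (pcie_aspm is the documented spelling).
SUGGESTED = [
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "iommu.strict=1",
    "pcie_port_pm=off",
    "cpufreq.default_governor=performance",
]


def missing_params(cmdline: str, wanted=SUGGESTED) -> list[str]:
    """Return the wanted options that do not appear in the given command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        missing = missing_params(proc_cmdline.read_text())
        print("missing:", " ".join(missing) if missing else "none")
```

Any option printed as missing was not passed to the running kernel, which usually means the bootloader configuration was not updated or the option was misspelled.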
