NVMe fails randomly

The slot looks snug. I did a bit of cleaning, and the NVMe temperature on a fresh boot has dropped to 35 C (from 40 C) :smiley:

It ran for 45 minutes without any errors, so the cause must have been overheating… Thank you all!


A good NVMe drive should be able to run at around 60 to 70 C.

Switching from LUKS to hardware encryption uses less CPU, which keeps the entire system cooler. Most recent NVMe drives always encrypt their data. Enabling encryption just means protecting the internal key with a password, so it has no run-time cost.

Are there any BIOS settings controlling other parts of PCIe power management that might ignore the kernel parameters being set?

Here’s another kernel boot option:

nvme.noacpi=1

Here is what I get for a SAMSUNG PM9A1 1TB; I’m confident there’s no way I caused anything close to that number of unsafe shutdowns :stuck_out_tongue: (guessing it just counts most/all power-offs)

Data Units Read				: 46,224,411 (23.67 TB)
Data Units Written			: 62,394,716 (31.95 TB)
host_read_commands			: 595,358,752
host_write_commands			: 671,779,027
controller_busy_time			: 3,910
power_cycles				: 1,318
power_on_hours				: 13,179
unsafe_shutdowns			: 505
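
If anyone wants to sanity-check counters like these, here is a rough Python sketch. It assumes the usual NVMe convention that one data unit is 1,000 blocks of 512 bytes; the function names are made up for illustration and are not part of any tool.

```python
# Assumption: one NVMe "data unit" is 1,000 blocks of 512 bytes,
# as in the standard SMART / Health Information log.
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_tb(units: int) -> float:
    """Convert a data-unit counter to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def unsafe_shutdown_ratio(unsafe_shutdowns: int, power_cycles: int) -> float:
    """Fraction of power cycles that were logged as unsafe shutdowns."""
    return unsafe_shutdowns / power_cycles


if __name__ == "__main__":
    # Values copied from the log above.
    print(f"read:    {data_units_to_tb(46_224_411):.2f} TB")
    print(f"written: {data_units_to_tb(62_394_716):.2f} TB")
    print(f"unsafe:  {unsafe_shutdown_ratio(505, 1_318):.0%} of power cycles")
```

With the numbers above this gives 23.67 TB read and 31.95 TB written, matching the log, and about 38% of power cycles counted as unsafe.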

The NVMe dropped out after a while of use; the 40 C reading was taken just after boot. I think the temperature likely increased significantly after that :slightly_smiling_face:

There haven’t been any more crashes since cleaning :smiley:


I am not aware of any.

However, lspci -vvv does show ASPM L1 Enabled for 01:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller B G5 (DRAM-less) (prog-if 02 [NVM Express]) :confused:
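
To check the ASPM state of every device at once rather than reading through the whole dump, something like the following Python sketch might help. It assumes the lspci -vvv output has a "LnkCtl: ASPM ...;" line under each device header, which is how it looks on my systems but may differ between lspci versions; the aspm_states helper is hypothetical.

```python
import re
import sys
from typing import Dict


def aspm_states(lspci_output: str) -> Dict[str, str]:
    """Map each PCI address to the ASPM text found on its LnkCtl line."""
    states: Dict[str, str] = {}
    device = None
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            # Device header lines start in column 0 with the PCI address.
            device = line.split()[0]
            continue
        # Assumed layout: "LnkCtl: ASPM L1 Enabled; RCB 64 bytes, ..."
        match = re.search(r"LnkCtl:\s*ASPM ([^;]+);", line)
        if match and device:
            states[device] = match.group(1).strip()
    return states


if __name__ == "__main__":
    # Usage: sudo lspci -vvv | python3 this_script.py
    for address, state in aspm_states(sys.stdin.read()).items():
        print(address, "ASPM", state)
```

Devices without a LnkCtl line (or when lspci is run without root and hides the capabilities) simply do not show up in the result.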

Not sure exactly how the issue was fixed, but the NVMe no longer crashes!

For the future DenverCoder9:

  • Try cleaning the fan
  • Try adding the following to your kernel command line: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off iommu.strict=1 pcie_port_pm=off cpufreq.default_governor=performance
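
After rebooting it is worth confirming that the parameters actually reached the kernel, since a bootloader config that was never regenerated silently drops them. A minimal Python sketch, assuming a Linux system with /proc/cmdline; the missing_params helper is hypothetical, and the ASPM switch is spelled pcie_aspm here because that is the name used in the kernel's parameter documentation.

```python
from pathlib import Path
from typing import List

# Parameters from the list above (ASPM switch spelled pcie_aspm).
WANTED = [
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "iommu.strict=1",
    "pcie_port_pm=off",
    "cpufreq.default_governor=performance",
]


def missing_params(cmdline: str, wanted: List[str] = WANTED) -> List[str]:
    """Return the wanted parameters that do not appear on the command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline").read_text()
    missing = missing_params(cmdline)
    if missing:
        for param in missing:
            print("missing:", param)
    else:
        print("all parameters are present")
```

This only checks that the strings are on the command line; it does not prove the kernel or firmware honoured them.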