NVMe fails randomly

When?

I'm unsure exactly when the behavior started, but I am fairly certain it was after a kernel update.

What?

The entire OS becomes unresponsive after a while, and almost every command fails with an “Input/output error”.

Usually dmesg is unavailable, but on a single occasion I was able to run it, and it produced the following logs:

[ 6967.738389] usb 3-2: USB disconnect, device number 2
[ 7020.150488] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 7020.150505] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 7020.150511] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[ 7020.189528] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 7020.189902] nvme nvme0: Disabling device after reset failure: -19
[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199642] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199642] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199669] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199673] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199682] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199683] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199700] BTRFS: error (device nvme0n1p3) in btrfs_commit_transaction:2523: errno=-5 IO failure (Error while writing out transaction)
[ 7020.199719] BTRFS info (device nvme0n1p3 state E): forced readonly
[ 7020.199727] BTRFS warning (device nvme0n1p3 state E): Skipping commit of aborted transaction.
[ 7020.199733] BTRFS error (device nvme0n1p3 state EA): Transaction aborted (error -5)
[ 7020.199739] BTRFS: error (device nvme0n1p3 state EA) in cleanup_transaction:2017: errno=-5 IO failure
[ 7020.200300] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Journal file corrupted, rotating.
[ 7020.200388] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Read-only file system
[ 7020.200545] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error
[ 7020.200953] systemd-journald[758]: Failed to write entry to /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal (24 items, 777 bytes) despite vacuuming, ignoring: Input/output error
[ 7020.201388] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Read-only file system
[ 7020.201417] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error
[ 7020.201423] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: IO error, rotating.
[ 7020.201427] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up.
[ 7020.201570] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Read-only file system
[ 7020.201596] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error
[ 7020.201601] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: IO error, rotating.
[ 7020.201604] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up.
[ 7020.201756] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: IO error, rotating.
[ 7020.201759] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up.
[ 7020.203462] EXT4-fs (nvme0n1p2): shut down requested (2)
[ 7020.203466] Aborting journal on device nvme0n1p2-8.
[ 7020.203475] Buffer I/O error on dev nvme0n1p2, logical block 131072, lost sync page write
[ 7020.203479] JBD2: I/O error when updating journal superblock for nvme0n1p2-8.
[ 7033.466494] coredump: 12409(packagekitd): |/usr/lib/systemd/systemd-coredump pipe failed
[ 7084.796899] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Read-only file system (Dropped 37 similar message(s))
[ 7084.796926] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error (Dropped 37 similar message(s))
[ 7084.796936] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: IO error, rotating. (Dropped 36 similar message(s))
[ 7084.796940] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up. (Dropped 36 similar message(s))
[ 7085.296436] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Read-only file system
[ 7085.296490] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error
[ 7085.296512] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: IO error, rotating.
[ 7085.296520] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up.
[ 7085.797846] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Read-only file system
[ 7085.797866] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error
[ 7085.797875] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: IO error, rotating.
[ 7085.797878] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up.
[ 7148.108645] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/user-1000.journal: Read-only file system (Dropped 10 similar message(s))
[ 7148.108673] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error (Dropped 10 similar message(s))
[ 7148.108698] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/user-1000.journal: IO error, rotating. (Dropped 10 similar message(s))
[ 7148.108701] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up. (Dropped 10 similar message(s))
[ 7153.919719] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: Read-only file system
[ 7153.919742] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error
[ 7153.919752] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/system.journal: IO error, rotating.
[ 7153.919756] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up.
[ 7157.820330] systemd-journald[758]: Failed to rotate /var/log/journal/304ba8ecff994284be958829d59c977c/user-1000.journal: Read-only file system
[ 7157.820360] systemd-journald[758]: Failed to vacuum /var/log/journal/304ba8ecff994284be958829d59c977c, ignoring: Input/output error
[ 7157.820371] systemd-journald[758]: /var/log/journal/304ba8ecff994284be958829d59c977c/user-1000.journal: IO error, rotating.
[ 7157.820376] systemd-journald[758]: Suppressing rotation, as we already rotated immediately before write attempt. Giving up.

I have tried the following kernel parameters, but the issue still persists:

  • amd_iommu=off
  • nvme_core.default_ps_max_latency_us=0
  • mem_sleep_default=deep
  • pcie_aspm=off
  • pcie_port_pm=off
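
For reference, a sketch of how these all fit on one line in `/etc/default/grub` on a GRUB-based system (the regeneration command and paths vary by distro, and "quiet" here stands in for whatever defaults are already present):

```shell
# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=off nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

# Then regenerate the config and reboot, e.g.:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # Fedora and derivatives
#   sudo update-grub                              # Debian/Ubuntu
# Verify after reboot with: cat /proc/cmdline
```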

I do see these warnings on a fresh boot:

 3:13:44 PM kernel: leds platform::micmute: Setting an LED's brightness failed (-5)
 3:13:44 PM kernel: mt7921e 0000:02:00.0: sar cnt = 0
 3:13:43 PM kernel: i2c i2c-6: Failed! (01)
 3:13:43 PM kernel: leds platform::micmute: Setting an LED's brightness failed (-5)
 3:13:43 PM kernel: Bluetooth: hci0: Failed to claim iso interface
 8:43:38 PM kernel: pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -19
 8:43:37 PM kernel: ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20240827/psobject-220)
 8:43:37 PM kernel: ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20240827/psobject-220)
 8:43:37 PM kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPC0.EC0], AE_NOT_FOUND (20240827/dswload2-162)

Is this a firmware bug? Are there any known workarounds?

These messages seem suspicious:

Btrfs detects write errors and forces the filesystem into read-only mode to prevent data corruption:

[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199614] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199642] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199642] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199669] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199673] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199682] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199683] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
[ 7020.199700] BTRFS: error (device nvme0n1p3) in btrfs_commit_transaction:2523: errno=-5 IO failure (Error while writing out transaction)
[ 7020.199719] BTRFS info (device nvme0n1p3 state E): forced readonly
[ 7020.199727] BTRFS warning (device nvme0n1p3 state E): Skipping commit of aborted transaction.
[ 7020.199733] BTRFS error (device nvme0n1p3 state EA): Transaction aborted (error -5)
[ 7020.199739] BTRFS: error (device nvme0n1p3 state EA) in cleanup_transaction:2017: errno=-5 IO failure

Run a disk check. I do it with the “nvme” command:
sudo nvme smart-log /dev/nvme0n1

A rather large number of unsafe shutdowns… :thinking:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 100 °F (311 K)
available_spare				: 100%
available_spare_threshold		: 50%
percentage_used				: 1%
endurance group critical warning summary: 0
Data Units Read				: 13,015,646 (6.66 TB)
Data Units Written			: 16,268,886 (8.33 TB)
host_read_commands			: 143,292,198
host_write_commands			: 460,840,341
controller_busy_time			: 940
power_cycles				: 1,572
power_on_hours				: 3,775
unsafe_shutdowns			: 106
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Temperature Sensor 1			: 100 °F (311 K)
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0

More information about your system would be nice.

Can you provide more information about your NVMe?

The output from sudo smartctl -x /dev/nvme0 should be sufficient.

Thanks

Yes, I agree, the value is high. From my understanding, the causes can vary: an unstable power supply, overheating, motherboard issues, poor contact in the slot, etc.
The drive seems to be in good condition.


Sure!

Model Number:                       KBG50ZNS512G NVMe KIOXIA 512GB
Serial Number:                      [REDACTED]
Firmware Version:                   11200109
PCI Vendor/Subsystem ID:            0x1e0f
IEEE OUI Identifier:                0x8ce38e
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            8ce38e 1001276f62
Local Time is:                      Sun Feb  2 18:10:31 2025 IST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.10W       -        -    0  0  0  0        1       1
 1 +     2.60W       -        -    1  1  1  1        1       1
 2 +     1.30W       -        -    2  2  2  2        1       1
 3 -   0.0500W       -        -    3  3  3  3      800    1000
 4 -   0.0030W       -        -    4  4  4  4     3000   44900

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          50%
Percentage Used:                    1%
Data Units Read:                    13,026,919 [6.66 TB]
Data Units Written:                 16,276,391 [8.33 TB]
Host Read Commands:                 143,537,397
Host Write Commands:                461,029,411
Controller Busy Time:               941
Power Cycles:                       1,573
Power On Hours:                     3,776
Unsafe Shutdowns:                   106
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Aborted: Controller Reset              3774            -     -   -   -    -
 1   Short             Completed without error                2311            -     -   -   -    -
 2   Short             Completed without error                  17            -     -   -   -    -
 3   Short             Completed without error                   0            -     -   -   -    -

unstable power supply

Well, after the file system is forced read-only, I have to use the power button to forcefully turn off the system; perhaps it’s just that?

overheating

I’ll try monitoring the temperature of the NVMe.

That is not necessarily an issue. All my SSD/NVMe drives have some unsafe shutdowns.

Keep track of how often the count increases. It should be rare.
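
One way to keep track (a minimal sketch; it assumes the `nvme-cli` package provides the `nvme smart-log` output shown above, and the log path is arbitrary):

```shell
#!/bin/sh
# Append a timestamped unsafe_shutdowns reading to a log file; run it
# periodically (e.g. from cron) and watch whether the number climbs.
log_unsafe_shutdowns() {
    # $1: the full text of `sudo nvme smart-log /dev/nvme0n1`
    count=$(printf '%s\n' "$1" | awk -F: '/unsafe_shutdowns/ {gsub(/[ \t,]/, "", $2); print $2}')
    echo "$(date +%Y-%m-%dT%H:%M:%S) $count"
}
# Usage on a real system (uncomment):
# log_unsafe_shutdowns "$(sudo nvme smart-log /dev/nvme0n1)" >> ~/unsafe_shutdowns.log
```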


I might be wrong, but this indicates a usage of 1% of the spare space, so your NVMe might be at risk.

nvme shows a temperature of 100 °C. Mea culpa, I confused Celsius with Fahrenheit; all good here :smiley:

Temperature Sensor 1			: 100 °F (311 K)

while smartctl shows a temperature of 40 °C:

Temperature Sensor 1:               40 Celsius

That’s a bit weird.

I also have 69 unsafe shutdowns on my relatively new laptop (about 4 months old).

I have been assuming that 100% means no spares have been used.
Is the 1% the amount of the flash that has been used?

nvme shows a temperature of 100 °C

100 °Fahrenheit :slightly_smiling_face: (approximately 40 °Celsius) (just saw the edited message)

Looks like nvme shows the temperature in Fahrenheit and smartctl in Celsius.
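
The two readings agree once converted; a quick sanity check with integer shell arithmetic (311 K is the value from the smart-log above):

```shell
# Kelvin -> Celsius -> Fahrenheit, integer approximation
k=311
c=$(( k - 273 ))            # 311 K is roughly 38 °C (smartctl later reads 40 °C)
f=$(( c * 9 / 5 + 32 ))     # roughly 100 °F, matching the nvme smart-log output
echo "${k} K ~ ${c} C ~ ${f} F"
```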

I also have 69 unsafe shutdowns on my relatively new laptop (about 4 months old).

Ah, the NVMe has been in use for about 1-2 years, so I suppose the number of unsafe shutdowns is understandable :slightly_smiling_face:

I will keep a close eye on its increase, though.

I don’t think so.

My 1 TB NVMe holds about 500 GB of data, and smartctl reports:

Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    7.473.527 [3,82 TB]
Data Units Written:                 7.505.983 [3,84 TB]

Edit: I found this:

Percentage Used: Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the NVM subsystem has been consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed 100. Percentages greater than 254 shall be represented as 255. This value shall be updated once per power-on hour (when the controller is not in a sleep state).

Source: https://media.kingston.com/support/downloads/MKP_521.6_SMART-DCP1000_attribute.pdf

So, yeah, I guess I am wrong. Sorry for that.


See this Reddit post, which explains that “Percentage used” is the amount of the device’s endurance that has been consumed, based on the amount of data written.

Available Spares will be 100% when none have been used.

The NVMe just crashed again, and after powering it off using the power switch and booting right back up, the number of unsafe shutdowns went up :confused:

Data Units Read				: 13,034,382 (6.67 TB)
Data Units Written			: 16,280,338 (8.34 TB)
host_read_commands			: 143,639,501
host_write_commands			: 461,202,839
controller_busy_time			: 941
power_cycles				: 1,574
power_on_hours				: 3,776
unsafe_shutdowns			: 107

Almost forgot: there is another thread with a similar issue:

maybe it is related somehow…

So try to set the power governor to “performance” and see what happens.
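
For reference, a minimal sketch of switching the governor by writing sysfs directly (this is what `cpupower frequency-set -g performance` does under the hood; the second argument exists only so the function can be exercised against a fake sysfs tree):

```shell
#!/bin/sh
# Set the cpufreq scaling governor for every CPU core.
# $1: governor name (e.g. performance, powersave)
# $2: sysfs root, defaulting to /sys (overridable for testing)
set_governor() {
    gov=$1
    root=${2:-/sys}
    for f in "$root"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] && echo "$gov" > "$f"
    done
    return 0
}
# On a real system (needs root):
#   set_governor performance
```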

maybe it is related somehow…

Possibly! My current battery is quite low, and the governor was set to powersave!

Just changed them to performance, will see if it crashes again :slightly_smiling_face:

If we compare it with my NVMe drive (29), which has been in use for three years, the unsafe shutdowns on your NVMe seem high.

emanu@fedora ~> sudo nvme smart-log /dev/nvme0
[sudo] password di emanu: 
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 34 °C (307 K)
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 3%
endurance group critical warning summary: 0
Data Units Read                         : 58.219.808 (29,81 TB)
Data Units Written                      : 54.265.003 (27,78 TB)
host_read_commands                      : 1.461.223.268
host_write_commands                     : 1.253.834.617
controller_busy_time                    : 1.948
power_cycles                            : 1.960
power_on_hours                          : 6.170
unsafe_shutdowns                        : 29
media_errors                            : 0
num_err_log_entries                     : 1

Unstable power refers to unexpected variations in the voltage or current supplied to electronic components. This can include voltage spikes, sudden drops (brownouts), complete interruptions (blackouts), or excessive ripple in the power supply. In computers, unstable power can cause malfunctions, sudden reboots, or unsafe shutdowns, potentially damaging components like the NVMe, which may suffer from data corruption or accelerated wear.

This is a possible cause that comes to mind; it could also be that the NVMe is not properly seated in the slot or other causes as well. It needs to be investigated and monitored to see if the unsafe shutdowns increase in a short period of time.

But these errors in your log are suspicious:

[ 7020.150488] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 7020.150505] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 7020.150511] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[ 7020.189528] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 7020.189902] nvme nvme0: Disabling device after reset failure: -19

Hmm, that wasn’t it; it crashed yet again. Unsafe shutdowns is now at 108 :frowning:

Interesting; taking it apart to check now. Should be a good opportunity to clean the fans as well…

I expect the count to increase when you use the power switch to force a power down.
