Does smartd output mean my ssd is nearly dead?

Does this output from smartctl -a /dev/sda mean my SSD is nearly dead, or could something else be wrong? It seems to be working, but some .jpg images from my cell phone have gone from a visible picture to “black”, which suggests random data loss. I figure it could also be due to occasionally forcing the power off, which is one of the worst things for an SSD.

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.5-200.fc35.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     Crucial_CT512MX100SSD1
Serial Number:    14390D53BD4A
LU WWN Device Id: 5 00a075 10d53bd4a
Firmware Version: MU01
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Dec  2 11:39:12 2021 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		( 2380) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   6) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x0035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       54201
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1124
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   098   098   000    Old_age   Always       -       77
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       86
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       4403
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   063   042   000    Old_age   Always       -       37 (Min/Max 24/58)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       53
202 Percent_Lifetime_Remain 0x0031   098   098   000    Pre-fail  Offline      -       2
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       25200587182
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       804096620
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       1800610916

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     54201         -
# 2  Extended offline    Completed without error       00%     54201         -
# 3  Short offline       Completed without error       00%     54200         -
# 4  Vendor (0xff)       Completed without error       00%     54094         -
# 5  Vendor (0xff)       Completed without error       00%     54082         -
# 6  Vendor (0xff)       Completed without error       00%     53812         -
# 7  Vendor (0xff)       Completed without error       00%     53799         -
# 8  Vendor (0xff)       Completed without error       00%     53532         -
# 9  Vendor (0xff)       Completed without error       00%     53432         -
#10  Vendor (0xff)       Completed without error       00%     52901         -
#11  Vendor (0xff)       Completed without error       00%     51782         -
#12  Vendor (0xff)       Completed without error       00%     51416         -
#13  Vendor (0xff)       Completed without error       00%     51378         -
#14  Vendor (0xff)       Completed without error       00%     51262         -
#15  Vendor (0xff)       Completed without error       00%     51208         -
#16  Vendor (0xff)       Completed without error       00%     51175         -
#17  Vendor (0xff)       Completed without error       00%     51051         -
#18  Vendor (0xff)       Completed without error       00%     50909         -
#19  Vendor (0xff)       Completed without error       00%     50833         -
#20  Vendor (0xff)       Completed without error       00%     50812         -
#21  Vendor (0xff)       Completed without error       00%     50733         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Completed [00% left] (57881389-57946924)
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Disk first went online 2/1/2015. I figure that’s about the average lifetime for an SSD. Sorry if this is not the correct place for this question.

It seems to mostly be telling you it has lived a long and fruitful life. :smile:

That being said, I personally wouldn’t store any important data on a drive that is 6.5 years old.

Why is that?
Do SSD cells degrade over time?
(Not to be confused with cells that are holding data while the drive sits offline for a long time.)

This drive has written only ~12.9 TB, and attribute 202 means to me: 98 % remaining.

from Specs sheet:

  • Life Expectancy (MTTF): 1.5 million hours => ~4 % used only
  • Endurance: 72TB total bytes written (TBW) => ~17 % used only (in 6.5 years !!!)
  • and regarding Power Loss: => Advanced Features: Power Loss Protection
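
The percentages above can be re-derived from the raw values in the first post. This is only a rough sketch: it assumes attribute 246 counts 512-byte sectors (which is what the vendor tool labels it as), uses decimal terabytes, and treats the MTTF figure as a simple ratio the way the list above does. The variable names are made up for illustration.

```python
# Raw values copied from the smartctl output in the first post.
SECTOR_BYTES = 512                    # logical sector size
total_lbas_written = 25_200_587_182   # attribute 246, Total_LBAs_Written
power_on_hours = 54_201               # attribute 9, Power_On_Hours

# Figures quoted from the spec sheet.
RATED_TBW_BYTES = 72e12               # 72 TB total bytes written
MTTF_HOURS = 1.5e6                    # 1.5 million hours

bytes_written = total_lbas_written * SECTOR_BYTES
tb_written = bytes_written / 1e12
tbw_used_pct = 100 * bytes_written / RATED_TBW_BYTES
mttf_used_pct = 100 * power_on_hours / MTTF_HOURS
power_on_years = power_on_hours / (24 * 365.25)

print(f"written:        {tb_written:.1f} TB")
print(f"TBW used:       {tbw_used_pct:.1f} %")
print(f"MTTF ratio:     {mttf_used_pct:.1f} %")
print(f"powered on for: {power_on_years:.1f} years")
```

With these inputs the drive comes out at roughly 12.9 TB written, a little under 18 % of the rated endurance, and a bit more than 6 years of power-on time.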

I would check that drive with the vendor tool.
It’s on the same webpage (alas the German one, and alas switching it to English here flips it back to German …)

If feasible and the drive is not down, I would

  • secure erase the drive and
  • setup 10 % Over-Provisioning and
  • have an eye on attribute 202

This is what the “vendor tool” output when run under windows10:

Storage Executive
Current Firmware: MU01

ID   Description                          Attribute Data  Units
1    Raw Read Error Rate                  0               Errors/Page
5    Retired NAND Blocks                  0               NAND Blocks
9    Power On Hours Count                 54203           Hours
12   Power Cycle Count                    1124            Cycles
171  Program Fail Count                   0               NAND Page Program Failures
172  Erase Fail Count                     0               NAND Block Erase Failures
173  Average Block-Erase Count            77              Erases
174  Unexpected Power Loss Count          86              Unexpected Power Loss events
180  Unused reserved block count          4403            Blocks
183  SATA Interface Downshift             0               Downshifts
184  Error Correction Count               0               Correction Events
187  Reported Uncorrectable Errors        0               ECC Correction Failures
194  Enclosure Temperature                36              Current Temperature (C)
                                          58              Highest Lifetime Temperature (C)
196  Reallocation Event Count             0               Events
197  Current Pending Sector Count         0               512 Byte Sectors
198  SMART Off-line Scan Uncorrable Err   0               Errors
199  Ultra-DMA CRC Error Count            53              Errors
202  Percentage Lifetime Used             2               % Lifetime Used
206  Write Error Rate                     0               Program Fails/MB
210  RAIN Successful Recovery Page Count  0               TUs successfully recovered by RAIN
246  Cumulative Host Write Sector Count   25205421856     512 Byte Sectors
247  Host Program Page Count              804247711       NAND Page
248  FTL Program Page Count               1800625838      NAND Page


It’s totally different from what smartd reports on Linux, assuming 100 is the “WORST” value. I’m not sure what “100=WORST” means.

https://stackoverflow.com/questions/37172824/multi-line-blockquote-without-blank-line


Column "current" means the current value. 
Usually it is at 100 when everything is ok. 
Higher values often mean that the attribute has 
never been updated (implies 100).

The column "worst" tells you what worst value SMART 
has ever assigned to this attribute.

"threshold" is the absolute health threshold 
and indicates the value at/below which SMART 
considers the attribute a failure. 
Most attributes that have a zero threshold are not critical. 
When they decrease, it just means that your drive gets older. 
Other attributes have thresholds 
greater than 0 and are often critical.
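
To make the column description above concrete, here is a minimal Python sketch that splits a few rows from the first post into fields and applies the "at/below threshold" rule. The function names `parse_attribute` and `status` are made up for this example, and treating a zero threshold as "no failure threshold set" is an assumption based on the quoted text and on WHEN_FAILED showing "-" for attribute 180; it is not taken from the smartmontools source.

```python
ROWS = [
    "  5 Reallocate_NAND_Blk_Cnt 0x0033   100   100   000    Pre-fail  Always       -       0",
    "180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       4403",
    "194 Temperature_Celsius     0x0022   063   042   000    Old_age   Always       -       37 (Min/Max 24/58)",
    "202 Percent_Lifetime_Remain 0x0031   098   098   000    Pre-fail  Offline      -       2",
]

def parse_attribute(line):
    """Split one row of the attribute table into named fields."""
    # At most 9 splits, so a RAW_VALUE containing spaces stays in one piece.
    parts = line.split(None, 9)
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),    # normalized current value
        "worst": int(parts[4]),    # lowest normalized value seen so far
        "thresh": int(parts[5]),   # failure threshold
        "type": parts[6],
        "raw": parts[9].strip(),   # vendor-specific raw value
    }

def status(attr):
    """Classify an attribute by comparing VALUE and WORST with THRESH."""
    if attr["thresh"] == 0:
        return "no threshold"      # assumption: 0 means no failure threshold
    if attr["value"] <= attr["thresh"]:
        return "failing now"
    if attr["worst"] <= attr["thresh"]:
        return "failed in the past"
    return "ok"

for row in ROWS:
    a = parse_attribute(row)
    print(a["id"], a["name"], a["value"], a["thresh"], a["raw"], status(a))
```

Note how attribute 202 has a normalized VALUE of 98 (remaining) and a RAW_VALUE of 2 (used), which is the same number the vendor tool shows as "Percentage Lifetime Used".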

If I were to judge that SSD from the data given I would say it is perfectly healthy. Attribute 180 reports over 4400 remaining unused reserve blocks (meaning there have been few if any failed blocks) and 202 reports only 2 % lifetime used.

As long as it does not totally fail suddenly it seems to me perfectly usable.

I guess it’s because the vendor tool runs something like “sudo smartctl -t {short, long} /dev/sda” before presenting the data, while you didn’t run one of the tests in your first post.

what the values mean:
Smart - Wikipedia.

Actually the command is run by smartd (systemctl status smartd shows it was run many times).

https://en.wikipedia.org/wiki/Solid-state_drive

Device age, measured by days in use, is the main factor in
SSD reliability and not amount of data read or written,
which are measured by terabytes written or drive writes per day.
This suggests that other aging mechanisms, such as “silicon aging”, are at play.
The correlation is significant (around 0.2–0.4).

Maybe disks over a certain age just have errors.

I don’t know if they “just have errors” but I believe it is true that older SSDs have an increasing chance of failure. I have seen that both in the datacenter and in my own personal devices.

That is why I said I personally wouldn’t store any important data on a drive that is 6.5 years old.

I have lost too much data that way. I now take spinning disks out of use after 3 years and SSDs after 5. That isn’t based on data, it is just my personal comfort level.

All that being said, it isn’t true that all old devices stop working. Some can last a very long time. It comes down to how much you care about that data and how hard it is to recover from the loss/corruption.

I have noticed that often people’s backup strategy doesn’t protect them from data corruption.
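
One way to notice silent corruption (like a .jpg turning black) before the backup overwrites the good copy is to keep a checksum manifest and compare against it. Below is a minimal sketch using only the Python standard library; the function names are made up for this example, and it is not a replacement for a real backup tool. It simply reports files whose content no longer matches the recorded SHA-256 hash.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def changed_files(root, manifest):
    """List files from the manifest that are now missing or have different content."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

The idea would be to run `build_manifest` on a photo directory while the files are known to be good, store the result somewhere else, and run `changed_files` before each backup. A file you edited on purpose shows up too, so the list needs a human look.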

Thank you, very interesting. This is the other part of the solution. Note that I initially suspected there was some data loss, but it is hard to confirm exactly what was lost.

Don’t mix these up:
this just checks health (smartctl option “-H”);
a test goes somewhat deeper (option “-t short” or “-t long”), e.g.

sudo smartctl -t short /dev/sda

How long each test needs to run is visible in your first post:
=> “Short self-test routine” => 2 minutes for the short test
=> “Extended self-test routine” => 6 minutes for the long test
You can also see when the last test was done:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     54201

=> at that point “Power_On_Hours” was “54201”
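
The same comparison can be done for every row of the self-test log. This is a small Python sketch, assuming the columns are separated by two or more spaces as in the output pasted in the first post; `parse_selftest_row` is a name made up for this example.

```python
import re

POWER_ON_HOURS = 54201  # attribute 9 raw value from the first post

LOG_ROWS = [
    "# 1  Short offline       Completed without error       00%     54201         -",
    "# 2  Extended offline    Completed without error       00%     54201         -",
    "# 3  Short offline       Completed without error       00%     54200         -",
    "# 4  Vendor (0xff)       Completed without error       00%     54094         -",
]

def parse_selftest_row(line):
    """Split one self-test log row on runs of two or more spaces."""
    fields = re.split(r"\s{2,}", line.strip())
    return {
        "num": int(fields[0].lstrip("#")),
        "description": fields[1],
        "status": fields[2],
        "lifetime_hours": int(fields[4]),  # Power_On_Hours when the test ran
    }

for row in LOG_ROWS:
    entry = parse_selftest_row(row)
    hours_ago = POWER_ON_HOURS - entry["lifetime_hours"]
    print(f'#{entry["num"]} {entry["description"]}: '
          f'{entry["status"]}, {hours_ago} power-on hours ago')
```

Here "hours ago" means powered-on hours, not wall-clock time, since the log only records the Power_On_Hours counter.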