Is my external hard drive dying?

Found my old external hard drive (2.5", 1TB) with some data on it. It keeps disconnecting and reconnecting when I plug it in and sometimes gives read errors. I’m not sure if it just corrupted over time and can simply be formatted or if it’s beyond saving.

Fedora 43 Cinnamon 6.4.12

sudo smartctl -a /dev/sdb

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   064   006    Pre-fail  Always       -       45086885
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1566
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   045    Pre-fail  Always       -       4882972
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       523 (208 111 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       504
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       9
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   055   040    Old_age   Always       -       35 (Min/Max 29/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       13
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       100
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       5826
194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35 (0 10 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       54 (175 191 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4280023819
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2869891205
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

Check cables and connectors for corrosion (green scum) or weak tension between pins and sockets with a “drag” test. Run the “long” test without using suspect cables/connectors. Due to the heavy reliance on computers in modern vehicles, auto parts stores sell contact enhancement fluid that should be used after cleaning corroded connectors.

How is the HDD connected? (e.g. SATA, USB)

In smartctl’s output, “Raw_Read_Error_Rate” is high, but “Reallocated_Sector_Ct” is zero, so at least the media appears to be still good (i.e. the drive isn’t swapping out bad sectors with spares).
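A rough way to read the table without eyeballing it: an attribute only counts as failing when its normalized VALUE drops to or below THRESH, whatever the raw number looks like. Below is a minimal Python sketch that parses rows copied from the smartctl output above and applies that rule, plus a check of the three sector counters (IDs 5, 197, 198). The function names and the regular expression are my own illustration, not part of smartmontools.

```python
import re

# Matches one row of the attribute table printed by `smartctl -A`.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return {id: dict} for every attribute row found in smartctl output."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = {
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            }
    return attrs

def summarize(attrs):
    """List attributes at or below threshold, plus nonzero sector counters."""
    failing = [a["name"] for a in attrs.values()
               if a["thresh"] > 0 and a["value"] <= a["thresh"]]
    # IDs 5, 197, 198: reallocated, pending, offline-uncorrectable sectors.
    bad_sectors = {attrs[i]["name"]: attrs[i]["raw"]
                   for i in (5, 197, 198)
                   if i in attrs and attrs[i]["raw"] > 0}
    return failing, bad_sectors

# A few rows copied from the output posted above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   077   064   006    Pre-fail  Always       -       45086885
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   045    Pre-fail  Always       -       4882972
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(summarize(parse_attributes(SAMPLE)))
```

For the posted rows both results come back empty: nothing is at or below its threshold, and none of the sector counters is nonzero.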

As @gnwiii already recommended, check/swap cable connections.

Another cause of random disconnects and reconnects is power, especially via USB.

For a read-only media test, do a badblocks -s /dev/sdb to see if any errors pop up.

The output from dmesg showing the drive connecting/disconnecting would also be very helpful.

The drive is connected via USB. I had my suspicions that it could be the cable even though I saw no mold or anything like that. Today I bought a new one. The drive no longer disconnects every 5-10 seconds, so that’s good.

But now there’s a different problem. The drive’s read speed seems fine and consistent. badblocks -s /dev/sdb took 3 hours and showed no errors. But its write speed quickly drops down and keeps jumping between 0 and 100%. I ran f3write --show-progress=1 --end-at=20 /run/media/[username]/[drive_id] and for a couple of minutes it was like solid 80-100 MB/s. Then it started jumping and a single 1GiB file would take ages. f3read showed no read errors and completed quickly.


Screenshot from Mission Center
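For anyone who wants numbers instead of a graph, here is a minimal Python sketch that writes fixed-size blocks to a file on the drive, forces each one to disk with fsync, and records the speed per block. It is not what f3write does internally; the function names, the 64 MiB block size and the "slower than 25% of the median" rule for calling something a stall are all arbitrary choices of mine.

```python
import os
import time

def sample_write_speed(path, chunks=20, chunk_mib=64):
    """Write `chunks` blocks to `path`, fsync each, return MiB/s per block."""
    block = os.urandom(1024 * 1024) * chunk_mib  # one random MiB, repeated
    speeds = []
    with open(path, "wb") as f:
        for _ in range(chunks):
            start = time.monotonic()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data out of the page cache
            elapsed = time.monotonic() - start
            speeds.append(chunk_mib / max(elapsed, 1e-9))
    return speeds

def stalls(speeds, fraction=0.25):
    """Indices of blocks slower than `fraction` of the median speed."""
    median = sorted(speeds)[len(speeds) // 2]
    return [i for i, s in enumerate(speeds) if s < fraction * median]
```

Calling sample_write_speed() with a path under the drive's mount point and passing the result to stalls() shows which blocks dropped out. Remember to delete the test file afterwards.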

I also simply ran sudo dmesg, if that’s what you asked for, and there are about a hundred lines spammed with the same error message:

x86/split lock detection: #DB: CHTTPClientThre/43732 took a bus_lock trap at address: 0xf31e5ad4

You can determine which process is responsible using sudo ps -eL -o pid,ppid,tid,comm,args | grep -F 'CHTTPClientThre' while the messages are being generated. Also use smartctl to check “drive health” and to run a long test.

Edit: I don’t see any mention of the drive model and firmware version. If it is an SMR drive, see https://blog.thefix.it.com/what-are-the-disadvantages-of-smr-drives-for-data-integrity/

See also https://github.com/ValveSoftware/steam-for-linux/issues/13037

The CHTTPClientThre error is certainly caused by Steam which I do have installed and it was running during previous tests. Yesterday I quit Steam and wrote 121GiB to the disk at a constant speed with very little jitter which only appeared after 70% through and disappeared after 90%. Reading was fine all the way through with no errors.

Sadly, when I tried to replicate the results this morning, after writing about 13GiB to the disk it started jumping from 0 to 100% as previously and the process took ages to finish (the limit was 20GiB). Steam was not running in the background.

There were no errors output by dmesg during spikes, just normal stuff I suppose:

usb 2-5: new SuperSpeed USB device number 2 using xhci_hcd
usb 2-5: New USB device found, idVendor=8564, idProduct=7000, bcdDevice=80.00
usb 2-5: New USB device strings: Mfr=2, Product=3, SerialNumber=1
usb 2-5: Product: StoreJet Transcend
usb 2-5: Manufacturer: StoreJet Transcend
usb 2-5: SerialNumber:             [REDACTED]
usb-storage 2-5:1.0: USB Mass Storage device detected
scsi host12: usb-storage 2-5:1.0
usbcore: registered new interface driver usb-storage
usbcore: registered new interface driver uas
scsi 12:0:0:0: Direct-Access     StoreJet Transcend        0    PQ: 0 ANSI: 6
sd 12:0:0:0: Attached scsi generic sg0 type 0
sd 12:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
sd 12:0:0:0: [sda] Write Protect is off
sd 12:0:0:0: [sda] Mode Sense: 43 00 00 00
sd 12:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1
sd 12:0:0:0: [sda] Attached SCSI disk

What’s the make, model and year of manufacture?

Depending on the vintage, and based on the symptoms above, the slowdown in write speed might be due to thermal recalibration. Like all metals, as the HDD platters heat up, they expand. That in turn shifts the rings of ferrite bits, requiring the arm holding the drive heads to adapt.

The higher the storage density, the worse the problem is. Some newer high-density drives are hermetically sealed and filled with helium to help the drive run cooler by lowering air drag as the platters spin.

Enterprise-class HDDs use more expensive materials and construction to help reduce vibration and thermal recalibration so they tend to maintain consistent write speeds under load.

(It’s one of the reasons why consumer-grade desktop HDDs aren’t generally recommended for use in NAS servers.)

If you haven’t already, definitely do a smartctl -t long /dev/sda before relying on the drive for important data.

I suspect that this drive may be one of the infamous drives that use Shingled Magnetic Recording (SMR). The symptoms fit perfectly.

For the first period of use, and until the drive reaches about 1/4 capacity, the tracks do not yet overlap, so it writes quickly. After reaching a certain percentage of capacity the tracks begin to be ‘shingled’, in that each track now partly overlaps its neighbor. To rewrite a track that is partly covered, the drive must read the overlapping track and save it, write the covered track, then rewrite the overlapping track on top of it.

As the disk adds more and more data the writes become slower and slower due to the overhead of the shingling. The drive itself manages this and it all happens within the drive cache.
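To put a rough number on that overhead, here is a toy Python model of a single shingled band. It assumes the textbook picture: writing a track clobbers part of the next one, so updating a track in the middle of a band means rewriting every later track in that band, while appending at the end of the written area costs nothing extra. Real drive-managed SMR firmware is not public and also uses an on-disk cache area, so treat the function names and the band size as illustration only.

```python
def write_cost(filled, index):
    """Physical track writes needed to write track `index` in a band
    whose first `filled` tracks already hold data."""
    if index < 0 or index > filled:
        raise ValueError("track must be inside the written area or next to it")
    if index == filled:
        return 1               # append: no later track to damage
    return filled - index      # rewrite this track and every later one

def simulate(band_size, writes):
    """Average physical writes per logical write for a sequence of tracks."""
    filled = 0
    physical = 0
    for index in writes:
        if index >= band_size:
            raise ValueError("track outside band")
        physical += write_cost(filled, index)
        if index == filled:
            filled += 1
    return physical / len(writes)

if __name__ == "__main__":
    print(simulate(8, list(range(8))))           # fill the band in order
    print(simulate(8, list(range(8)) + [0, 0]))  # then update track 0 twice
```

In this model, filling an 8-track band in order gives a ratio of 1.0, and adding two updates of the first track raises it to 2.4, because each update drags the other seven tracks along.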

The earlier request for brand and model of the drive would allow a bit of research to determine if this is an SMR drive or if it is one of the enterprise-grade drives that use the much older (and better) CMR technology, which does not overlap the tracks.

Examples of drives with the SMR technology include the Seagate Barracuda series and some of the older Western Digital drives for home PCs.

It does seem like it might be due to SMR, but the f3write tests would have likely been on fresh tracks without any need to rewrite adjacent ones, so it is curious.

A while back, I was asked to copy 4.7TB (millions of small files + NTFS) to a brand new portable 5TB 2.5” HDD from Seagate (IIRC, one of the “Expansion” series). Although I’d recommended a CMR-based drive, it ended up being SMR-based. The initial write speed was pretty good until the drive’s RAM buffer filled up, then it slowed down a lot, but maintained a consistent speed (nothing like the pattern @daniel-8371 described during his speed tests).

WD still sells recent 3.5” SMR-based HDDs, but they are not nearly as common as they were before the lawsuit about 5 years ago over misleading consumers. As part of the settlement, WD’s “Red” series of consumer drives intended for NAS was relabeled: the original CMR-based ones went from “Red” to “Red Plus”, while the SMR-based ones kept the “Red” label.

Not necessarily. Partly due to the amount of data on the drive and partly due to the drive firmware itself, the shingling may occur at any time and at any location.

The shingling process uses the drive cache to perform its task, so throughput is greatly hampered once the drive begins shingling the writes. Even more so when there may be 3 layers to manipulate.

My suggestion for any drive that has frequent writes is to avoid SMR devices like the plague. It seems they would be deadly for use with a btrfs file system that does copy on write since it rewrites the entire file and not just the part that may have been altered.

@daniel-8371 if you were to provide the make and model of the drive (the full model number as seen with sudo fdisk -l) we can verify if that drive is SMR or CMR technology.

The drive in question is Transcend StoreJet 25M3 1TB. Exact model: TS1TSJ25M3G.

sudo fdisk -l output:

Disk /dev/sda: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: Transcend       
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: [REDACTED]

Device     Start       End   Sectors   Size Type
/dev/sda1   2048 253960191 253958144 121.1G Microsoft basic data
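As a sanity check, the sizes in that output follow directly from the sector counts. A small Python sketch of the arithmetic (the helper name size_report is made up for illustration):

```python
SECTOR = 512  # logical sector size reported by fdisk, in bytes

def size_report(sectors, sector_size=SECTOR):
    """Return (bytes, decimal GB, binary GiB) for a sector count."""
    nbytes = sectors * sector_size
    return nbytes, nbytes / 10**9, nbytes / 2**30

disk = size_report(1953525168)
part = size_report(253960191 - 2048 + 1)  # End - Start + 1 sectors

if __name__ == "__main__":
    print(disk)
    print(part)
    print(part[0] / disk[0])  # share of the disk covered by sda1
```

The whole disk works out to 1000204886016 bytes, or about 931.51 GiB, and sda1 to about 121.1 GiB, so the single partition covers only around 13% of the disk.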

I also tried to run sudo smartctl -t long /dev/sda, but it always aborts about 10% in with the message “Aborted by host”. I didn’t purposefully stop the process, but I suspect that after a certain period with no activity the drive just goes to sleep and the test stops.

The drive should not “go to sleep” while running a test. You can run S.M.A.R.T tests on an active system drive, so you could do something like playing an audio file from the drive in loop mode to keep the drive busy. https://ca.transcend-info.com/Products/No-284 is about the case and doesn’t say what drive is used. Some vendors use an assortment of drives under the same model name. The smartctl output may tell you the drive model. I have, however, seen external USB drives that used a model for which I could not find a data sheet.
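If looping an audio file feels clumsy, the same idea can be sketched in Python: read one small block from a random offset every few seconds so the enclosure keeps seeing activity. Whether this actually stops the “Aborted by host” message on this enclosure is an assumption, and the extra reads may slow the self-test a little. The function name and defaults are my own.

```python
import os
import random
import time

def keep_awake(path, rounds, interval=5.0, block=4096):
    """Read one block at a random offset of `path` every `interval` seconds.

    `path` may be a large file on the drive or the block device itself
    (reading a device node needs root). Returns the offsets that were read.
    """
    offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        for n in range(rounds):
            offset = random.randrange(0, max(size - block, 1))
            offset -= offset % block         # keep reads block-aligned
            os.pread(fd, block, offset)
            offsets.append(offset)
            if n < rounds - 1:
                time.sleep(interval)
    finally:
        os.close(fd)
    return offsets
```

Random offsets are used because repeated reads of the same spot would mostly be served from the page cache and never reach the drive.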

Thanks, it’s a helpful clue…

To clear up a bit of confusion on my part, are we still talking about the same drive?

I’d been under the impression that you’d purchased a new USB cable or adapter for your old HDD, but since the TS1TSJ25M3G is a prepackaged USB enclosure with a 1TB HDD – either the OEM HDD used by Transcend was replaced with your old HDD, or your old HDD is still around but no longer being used. If it’s the latter, what make/model was the old HDD and USB enclosure/adapter?

The reason I ask is because if there are two different drives/enclosures in play, both exhibiting similar write-performance issues, the common denominator is the USB controller in your computer.

When using an external drive via USB, there are at least three pieces of hardware in play – the HDD/SSD, the USB adapter and the USB controller in the computer. Some USB adapters are better than others, and the same goes for USB controllers.

Some additional useful debugging commands:

  • lsusb --tree lists the various USB devices detected by the Linux kernel.
  • lspci -vv dumps some info about the various PCI devices including USB controllers.
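The negotiated link speed can also be read straight from sysfs, where each USB device directory has a speed file giving the connection speed in Mbps (5000 for a SuperSpeed link such as the one in the dmesg output above, 480 for USB 2.0). A minimal Python sketch, with a function name of my own choosing:

```python
import os

def usb_devices(root="/sys/bus/usb/devices"):
    """Return a list of (name, product, speed) for USB devices under `root`."""
    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return None

    found = []
    for name in sorted(os.listdir(root)):
        speed = read(os.path.join(root, name, "speed"))
        if speed is None:
            continue  # interface entries such as 2-5:1.0 have no speed file
        product = read(os.path.join(root, name, "product")) or "?"
        found.append((name, product, speed))
    return found

if __name__ == "__main__":
    for name, product, speed in usb_devices():
        print(f"{name:10} {speed:>6} Mbps  {product}")
```

If the drive ever shows up at 480 here while plugged into a blue port, the cable or port is limiting it to USB 2.0.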

The drive is the same. I only swapped the USB cable.

So, I disassembled the drive and the label on the actual HDD says Seagate Mobile HDD 1TB. Part number: 1RK172-570. Model: SDC001

So it’s a ST1000LM035 per Seagate’s model labeling scheme.

The spec sheet and user manual don’t mention the recording technology used, but a 3rd-party list labels it as SMR.

I’d use SMR HDDs for long-term storage that needs to be available 24/7 (e.g. audio/video media, raw data), but not in an everyday portable – especially one that sees updates to existing files – because the odds of data loss are greater, and data recovery more difficult, for SMR compared to CMR.

Also, with SMR, the choice of filesystem is important for performance and data integrity.

That is what I expected and confirms that a significant part of the problem when writing is the SMR tech used.

I guess that answers the question then. Thank you everyone who was involved in the discussion. I’ll try to make sure the next drive I buy isn’t an SMR drive.