Found my old external hard drive (2.5", 1TB) with some data on it. It keeps disconnecting and reconnecting when I plug it in, and it sometimes gives read errors. I’m not sure whether the data just got corrupted over time and the drive can simply be reformatted, or whether it’s beyond saving.
Check cables and connectors for corrosion (green scum), and use a “drag” test to check for weak tension between pins and sockets. Run the “long” test using cables/connectors you know are good, not suspect ones. Because modern vehicles rely so heavily on computers, auto parts stores sell contact enhancement fluid, which should be applied after cleaning corroded connectors.
In smartctl’s output, “Raw_Read_Error_Rate” is high but “Reallocated_Sector_Ct” is zero, so at least the media still appears to be good (i.e. the drive hasn’t had to swap out bad sectors for spares).
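To pull just those attributes out of the full report, a small filter can help. This is a sketch that assumes the usual `smartctl -A` attribute table, where the attribute name is column 2 and the raw value starts in column 10; `smart_summary` is a hypothetical helper name. Keep in mind that on some drives (Seagate is the usual example) the raw Raw_Read_Error_Rate value is vendor-encoded, so a large number on its own is not necessarily alarming.

```shell
#!/bin/sh
# smart_summary (hypothetical helper): read `smartctl -A` output on stdin and
# print NAME=RAW_VALUE for a few attributes that relate to media health.
# Assumes the standard table layout: column 2 = attribute name,
# column 10 = first token of the raw value.
smart_summary() {
  awk '
    $2 == "Raw_Read_Error_Rate"    ||
    $2 == "Reallocated_Sector_Ct"  ||
    $2 == "Current_Pending_Sector" ||
    $2 == "Offline_Uncorrectable" { print $2 "=" $10 }
  '
}
```

Usage: sudo smartctl -A /dev/sda | smart_summary (some USB enclosures only pass SMART data through if you add -d sat).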
As @gnwiii already recommended, check/swap cable connections.
Another cause of random disconnects and reconnects is insufficient or unstable power, especially when the drive is powered over USB.
For a read-only media test, run badblocks -s /dev/sdb and see whether any errors show up.
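A slightly more careful way to launch that scan is sketched below. It refuses to run unless the argument is a block device with none of its partitions mounted (a design choice so the scan isn’t competing with other I/O), then calls badblocks in its default read-only mode with progress (-s) and verbose (-v) output, saving any bad block numbers to a file (-o). The helper name `readonly_scan` and the default output file name are made up for illustration.

```shell
#!/bin/sh
# readonly_scan DEV [OUTFILE] (hypothetical helper): read-only badblocks pass.
readonly_scan() {
  local dev="$1" out="${2:-badblocks.txt}"
  # Only block devices make sense here (e.g. /dev/sdb).
  if [ ! -b "$dev" ]; then
    echo "readonly_scan: $dev is not a block device" >&2
    return 1
  fi
  # Refuse to scan while any partition of DEV is mounted.
  if findmnt -rn -o SOURCE | grep -q "^${dev}"; then
    echo "readonly_scan: unmount ${dev}* first" >&2
    return 1
  fi
  # No -n or -w flag, so badblocks performs its default read-only test.
  # -s shows progress, -v is verbose, -o writes bad block numbers to a file.
  sudo badblocks -sv -o "$out" "$dev"
}
```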
The output from dmesg showing the drive connecting/disconnecting would also be very helpful.
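While reproducing the problem, the live kernel log can be narrowed down to the events that matter here. The sketch below assumes the usual kernel message wording (“USB disconnect”, “new ... USB device number”, “reset ... USB device”, “I/O error”); `usb_events` is a hypothetical name, and the pattern list is a starting point rather than a complete one.

```shell
#!/bin/sh
# usb_events (hypothetical helper): filter kernel log lines on stdin down to
# USB (re)connects, disconnects, link resets and block-layer I/O errors.
usb_events() {
  grep -E 'USB disconnect|new [A-Za-z0-9+ -]*USB device number|reset [A-Za-z0-9+ -]*USB device|I/O error'
}
```

Usage: sudo dmesg -wT | usb_events (-w follows new messages, -T prints human-readable timestamps).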
The drive is connected via USB. I suspected the cable even though I saw no corrosion or anything like that, so today I bought a new one. The drive no longer disconnects every 5-10 seconds, so that’s good.
But now there’s a different problem. The drive’s read speed seems fine and consistent: badblocks -s /dev/sdb took 3 hours and showed no errors. Its write speed, however, quickly drops and keeps jumping between 0 and 100%. I ran f3write --show-progress=1 --end-at=20 /run/media/[username]/[drive_id], and for a couple of minutes it wrote at a solid 80-100 MB/s. Then the speed started jumping around, and a single 1 GiB file took ages. f3read showed no read errors and completed quickly.
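To see exactly when the write speed collapses, a rough probe can write fixed-size chunks, force each one to disk with dd’s conv=fdatasync, and print the throughput per chunk. This is only a sketch: `write_probe` is a hypothetical name, it assumes GNU date and dd, it writes zeros rather than f3’s test patterns, and it leaves the probe files behind for you to delete.

```shell
#!/bin/sh
# write_probe DIR CHUNKS MIB (hypothetical helper): write CHUNKS files of
# MIB MiB each into DIR and print how long each one took to reach the disk.
write_probe() {
  local dir="$1" chunks="$2" mib="$3" i=1 start end ms
  while [ "$i" -le "$chunks" ]; do
    start=$(date +%s%N)   # nanoseconds since the epoch (GNU date)
    # conv=fdatasync makes dd flush the file before exiting, so the timing
    # reflects the drive rather than the page cache.
    dd if=/dev/zero of="$dir/probe_$i.bin" bs=1M count="$mib" \
       conv=fdatasync status=none || return 1
    end=$(date +%s%N)
    ms=$(( (end - start) / 1000000 ))
    [ "$ms" -ge 1 ] || ms=1   # avoid dividing by zero on tiny chunks
    printf 'chunk %d: %d MiB in %d ms (%d MiB/s)\n' \
      "$i" "$mib" "$ms" $(( mib * 1000 / ms ))
    i=$(( i + 1 ))
  done
}
```

For example, write_probe /run/media/[username]/[drive_id] 20 1024 writes twenty 1 GiB chunks, so a stall shows up as one or more chunks with a much lower MiB/s figure.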
You can determine which process is responsible using sudo ps -eL -o pid,ppid,tid,comm,args | grep -F 'CHTTPClientThre' while the messages are being generated. Also use smartctl to check “drive health” and to run a long test.
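Linux truncates thread names to 15 characters, which is why the name shows up as CHTTPClientThre. As an alternative to the ps pipeline, the sketch below scans /proc directly for threads with a given name and prints the owning PID, the thread ID and the command line. `find_thread` is a hypothetical helper, and on systems that restrict /proc it may miss processes you are not allowed to inspect.

```shell
#!/bin/sh
# find_thread NAME (hypothetical helper): print "PID TID CMDLINE" for every
# thread whose name (as stored in /proc/PID/task/TID/comm) equals NAME.
find_thread() {
  local name="$1" f comm pid tid cmd
  for f in /proc/[0-9]*/task/[0-9]*/comm; do
    comm=$(cat "$f" 2>/dev/null) || continue   # thread may have exited
    [ "$comm" = "$name" ] || continue
    tid=${f%/comm}; tid=${tid##*/}              # .../task/TID/comm -> TID
    pid=${f#/proc/}; pid=${pid%%/*}             # /proc/PID/...     -> PID
    # cmdline is NUL-separated; turn it into a readable string.
    cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")
    printf '%s %s %s\n' "$pid" "$tid" "$cmd"
  done
}
```

Usage: find_thread CHTTPClientThre while the messages are being generated.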
The CHTTPClientThre errors are definitely caused by Steam, which I have installed and which was running during the previous tests. Yesterday I quit Steam and wrote 121 GiB to the disk at a constant speed with very little jitter, which only appeared at around 70% and disappeared after 90%. Reading was fine all the way through, with no errors.
Sadly, when I tried to reproduce the result this morning, after writing about 13 GiB the speed started jumping between 0 and 100% as before, and the process took ages to finish (the limit was 20 GiB). Steam was not running in the background.
dmesg showed no errors during the spikes, just what I assume is normal output:
usb 2-5: new SuperSpeed USB device number 2 using xhci_hcd
usb 2-5: New USB device found, idVendor=8564, idProduct=7000, bcdDevice=80.00
usb 2-5: New USB device strings: Mfr=2, Product=3, SerialNumber=1
usb 2-5: Product: StoreJet Transcend
usb 2-5: Manufacturer: StoreJet Transcend
usb 2-5: SerialNumber: [REDACTED]
usb-storage 2-5:1.0: USB Mass Storage device detected
scsi host12: usb-storage 2-5:1.0
usbcore: registered new interface driver usb-storage
usbcore: registered new interface driver uas
scsi 12:0:0:0: Direct-Access StoreJet Transcend 0 PQ: 0 ANSI: 6
sd 12:0:0:0: Attached scsi generic sg0 type 0
sd 12:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
sd 12:0:0:0: [sda] Write Protect is off
sd 12:0:0:0: [sda] Mode Sense: 43 00 00 00
sd 12:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1
sd 12:0:0:0: [sda] Attached SCSI disk
Depending on the drive’s vintage, and based on the symptoms above, the drop in write speed might be due to thermal recalibration. As the HDD platters heat up, they expand, which shifts the positions of the recorded tracks and requires the actuator arm holding the drive heads to recalibrate.
The higher the storage density, the worse the problem is. Some newer high-density drives are hermetically sealed and filled with helium to help the drive run cooler by lowering air drag as the platters spin.
Enterprise-class HDDs use more expensive materials and construction to help reduce vibration and thermal recalibration so they tend to maintain consistent write speeds under load.
(It’s one of the reasons why consumer-grade desktop HDDs aren’t generally recommended for use in NAS servers.)
If you haven’t already, definitely do a smartctl -t long /dev/sda before relying on the drive for important data.
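A long test on a 1TB laptop drive can take hours. While it runs, smartctl -c reports its progress with a line such as “90% of test remaining”. The filter below extracts that number; it is a sketch that assumes that wording in smartctl’s output, and `selftest_remaining` is a hypothetical name.

```shell
#!/bin/sh
# selftest_remaining (hypothetical helper): read `smartctl -c` (or -a) output
# on stdin and print the percentage of the running self-test that remains.
# Prints nothing if no self-test is in progress.
selftest_remaining() {
  grep -oE '[0-9]+% of test remaining' | head -n 1 | cut -d% -f1
}
```

Typical use: start the test with sudo smartctl -t long /dev/sda, poll it with sudo smartctl -c /dev/sda | selftest_remaining, and read the result log with sudo smartctl -l selftest /dev/sda once it finishes.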
I suspect this drive may be one of the infamous Shingled Magnetic Recording (SMR) drives. The symptoms fit perfectly.
During the first period of use, until the drive reaches about 1/4 of its capacity, there is only one layer of data, so it writes quickly. Past a certain percentage of capacity, the tracks begin to be ‘shingled’, i.e. each track partially overlaps its neighbor. To rewrite a lower track, the drive must read the overlapping upper track and save it, write the lower track, then rewrite the upper track so it overlaps the lower one again.
As more data is added, writes become slower and slower because of the shingling overhead. The drive manages this itself, and it all happens within the drive’s cache.
Knowing the brand and model of the drive, as requested earlier, would allow a bit of research to determine whether this is an SMR drive or one of the enterprise-grade drives that use the much older (and better) CMR technology, which does not overlap the tracks.
Examples of drives using SMR include some models in the Seagate Barracuda series and some of the older Western Digital drives for home PCs.
It does seem like it might be due to SMR, but the f3write tests would have likely been on fresh tracks without any need to rewrite adjacent ones, so it is curious.
A while back, I was asked to copy 4.7TB (millions of small files + NTFS) to a brand new portable 5TB 2.5” HDD from Seagate (IIRC, one of the “Expansion” series). Although I’d recommended a CMR-based drive, it ended up being SMR-based. The initial write speed was pretty good until the drive’s RAM buffer filled up, then it slowed down a lot, but maintained a consistent speed (nothing like the pattern @daniel-8371 described during his speed tests).
WD still sells recent 3.5” SMR-based HDDs, but they are not nearly as common as they were before the lawsuit over misleading consumers about 5 years ago. As part of the settlement, WD’s “Red” series of consumer drives intended for NAS was relabeled: the original CMR-based ones went from “Red” to “Red Plus”, while the SMR-based ones kept the “Red” label.
Not necessarily. Partly due to the amount of data on the drive and partly due to the drive firmware itself, the shingling may occur at any time and at any location.
The shingling process uses the drive cache to do its work, so throughput is greatly reduced once the drive begins shingling the write, and even more so when there may be 3 layers to manipulate.
My suggestion for any drive with frequent writes is to avoid SMR devices like the plague. It seems they would be deadly for use with a copy-on-write filesystem like btrfs, since every modification is written to a new location on the disk instead of being overwritten in place.
@daniel-8371 if you were to provide the make and model of the drive (the full model number as seen with sudo fdisk -l) we can verify if that drive is SMR or CMR technology.
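Because the USB bridge reports the enclosure’s name (here “StoreJet Transcend”, as in the dmesg output above), fdisk -l may show only that. smartctl -i, often with -d sat for USB enclosures, can reveal the model of the drive inside. The filter below keeps only the identifying lines; it assumes smartctl’s “Model Family:” and “Device Model:” labels, and `drive_model` is a hypothetical name.

```shell
#!/bin/sh
# drive_model (hypothetical helper): read `smartctl -i` output on stdin and
# print the identity lines that are useful for looking up SMR vs CMR.
drive_model() {
  awk -F': *' '/^(Model Family|Device Model):/ { print $1 "=" $2 }'
}
```

Usage: sudo smartctl -i -d sat /dev/sda | drive_model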
I also tried running sudo smartctl -t long /dev/sda, but it always aborted at around 10% with the message “Aborted by host”. I didn’t stop it on purpose, but I suspect that after a certain period with no activity the drive just goes to sleep and the test stops.
The drive should not “go to sleep” while running a test. You can run S.M.A.R.T. tests on an active system drive, so you could do something like playing an audio file from the drive in loop mode to keep it busy. https://ca.transcend-info.com/Products/No-284 describes the enclosure and doesn’t say which drive is used inside. Some vendors use an assortment of drives under the same model name. The smartctl output may tell you the drive model. I have, however, seen external USB drives that used a model for which I could not find a data sheet.
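Along the same lines as looping an audio file, the sketch below keeps the drive from looking idle by reading one 4 KiB block from a random offset every few seconds while the self-test runs. Whether this actually prevents the “Aborted by host” result on this particular enclosure is an untested assumption; `keep_busy` is a hypothetical helper, and reading a block device needs root.

```shell
#!/bin/sh
# keep_busy TARGET [INTERVAL_S] [COUNT] (hypothetical helper): every
# INTERVAL_S seconds (default 60), read one 4 KiB block from a random offset
# of TARGET; stop after COUNT reads (default 0 = run until interrupted).
keep_busy() {
  local target="$1" interval="${2:-60}" count="${3:-0}" n=0 blocks flag="" r
  if [ -b "$target" ]; then
    flag="iflag=direct"   # bypass the page cache so each read reaches the drive
    blocks=$(( $(blockdev --getsize64 "$target") / 4096 ))
  else
    blocks=$(( $(stat -c %s "$target") / 4096 ))   # regular file: dry run
  fi
  [ "$blocks" -ge 1 ] || return 1
  while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
    # 32-bit random number, enough to cover every 4 KiB block of a 1TB disk.
    r=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
    dd if="$target" of=/dev/null bs=4096 count=1 \
       skip=$(( r % blocks )) $flag status=none || return 1
    n=$(( n + 1 ))
    sleep "$interval"
  done
}
```

In a root shell, start the test with smartctl -t long /dev/sda, run keep_busy /dev/sda 60, and stop it with Ctrl+C once smartctl -c no longer reports a test in progress.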
To clear up a bit of confusion on my part, are we still talking about the same drive?
I’d been under the impression that you’d purchased a new USB cable or adapter for your old HDD, but since the TS1TSJ25M3G is a prepackaged USB enclosure with a 1TB HDD – either the OEM HDD used by Transcend was replaced with your old HDD, or your old HDD is still around but no longer being used. If it’s the latter, what make/model was the old HDD and USB enclosure/adapter?
I ask because if there are two different drives/enclosures in play, both showing similar write-performance issues, the common denominator is the USB controller in your computer.
When using an external drive via USB, there are at least three pieces of hardware in play – the HDD/SSD, the USB adapter and the USB controller in the computer. Some USB adapters are better than others, and the same goes for USB controllers.
Some additional useful debugging commands:
lsusb --tree lists the various USB devices detected by the Linux kernel.
lspci -vv dumps some info about the various PCI devices including USB controllers.
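For this question, the interesting part of lsusb -t is which driver each mass-storage interface is bound to and at what link speed; the dmesg output earlier in the thread shows the StoreJet attached through usb-storage rather than uas. The filter below pulls those fields out. It assumes lsusb’s “Port N: Dev N, ..., Class=Mass Storage, Driver=..., SPEED” line format, and `usb_storage_links` is a hypothetical name.

```shell
#!/bin/sh
# usb_storage_links (hypothetical helper): read `lsusb -t` output on stdin and
# print port, device number, bound driver and link speed for every
# mass-storage interface.
usb_storage_links() {
  grep 'Class=Mass Storage' |
    sed -E 's/.*Port ([0-9]+): Dev ([0-9]+).*Driver=([^,]+), *([0-9.]+M).*/port=\1 dev=\2 driver=\3 speed=\4/'
}
```

Usage: lsusb -t | usb_storage_links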
I’d use SMR HDDs for long-term storage that needs to be available 24/7 (e.g. audio/video media, raw data), but not as an everyday portable drive, especially one that sees updates to existing files, because the odds of data loss are greater, and data recovery more difficult, for SMR compared to CMR.
Also, with SMR, the choice of filesystem is important for performance and data integrity.
I guess that answers the question then. Thank you everyone who was involved in the discussion. I’ll try to make sure the next drive I buy isn’t an SMR drive.