Erratic File Transfer Performance

Background: Made a backup of a corrupt NTFS partition to an external USB HDD. Deleted the NTFS partition and created a new ext4 partition as a replacement.

Problem: While using rsync (or even cp) to restore backup, observe very erratic transfer performance. There are thousands of files of similar size (100 - 200 MiB). For no obvious reason, some files transfer in seconds, while others hang for several minutes (or hours).

SMART tests on both drives report no problems. Why the erratic performance?

UPDATE: Problem seems to be getting worse. Even a restart and resume does not improve performance.

That could be a symptom of the file cache filling up: once it is full, the cached file data must be physically written to the disk before the cache can accept new data.
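One way to check the cache theory is to watch how much written data is still sitting in RAM while the copy runs. This is only a rough sketch: dirty_kib is a made-up helper name, and it simply reads the Dirty and Writeback lines that Linux exposes in /proc/meminfo.

```shell
# Print how much written data is still waiting in RAM (values are in KiB).
# dirty_kib is a hypothetical helper name. The optional argument exists only
# so it can be pointed at a saved copy of /proc/meminfo; by default it reads
# the live file.
dirty_kib() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "${1:-/proc/meminfo}"
}

# Usage while the copy is running (refreshes every 2 seconds):
#   watch -n 2 "grep -E '^(Dirty|Writeback):' /proc/meminfo"
```

If Dirty climbs to a large value and the transfer stalls while it slowly drains, the stalls are probably the cache being flushed to a slow target. If it stays small during a hang, the cache is unlikely to be the cause.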

If so, any way to correct the issue?

I don’t know of a fix, but I do have a piece of advice:
NEVER unplug the drive before you choose “eject” (or “safely remove”, depending on your desktop) and get the message that the drive can now be safely unplugged.

It has happened to me several times: I copied files back and forth, some time passed, and I assumed the cache had already been flushed, so I unplugged the drive. But NOPE, it was still writing, and I ended up with corrupted directories and files.

My guess is that it depends on the state of the CPU and RAM at the moment you are copying. I have little of either.

As @vekruse pointed out, this could be a cache issue (and also an RPM issue), and it depends entirely on what type of disk your external USB HDD is. First, you’re limited by the bandwidth of the USB link (a transfer often starts at one rate and then drops to a lower one, sometimes more than once). Then, if the cache fills up, it has to be written to disk before it’s available for more data.

This is a big difference between HDDs that use CMR (conventional magnetic recording) and SMR (shingled magnetic recording). SMR drives offer much more capacity at the expense of write performance once their cache is full and work best for cold storage, while CMR drives often have a lower capacity but larger caches and are better for hot usage. Every 2.5" laptop HDD you see over 4TB will be SMR, because until some new technology comes along that is about as much as fits in that form factor, and SMR is the technology needed to make it work.

A big stink occurred several years ago when (I think it was) Western Digital was selling Red NAS drives that were SMR without labelling them as such. Their performance was absolutely terrible in a NAS environment (where data is constantly being written and read). WD then started to explicitly label its drives as either CMR or SMR in the tech specs, which a lot of other manufacturers had already been doing for years.

I use Toshiba HDDs in my own NAS, and you can see the difference between the standard surveillance drives and the pro surveillance drives. The standard drives use SMR, spin at up to 5400rpm, and have as little as 128MB of cache on the 2TB drives and 256MB on the 4TB and 6TB drives. The pro drives spin at 7200rpm and have 512MB of cache.

OK. Beginning to suspect target drive is failing.

clayton@voron:/run/media/clayton/10B0-16C2/TV$ mount | grep sda1
/dev/sda1 on /mnt/Data type ext4 (rw,nosuid,nodev,relatime,emergency_ro,x-gvfs-show,x-gvfs-name=Data)

Note emergency_ro. When I run

sudo fsck.ext4 -f /dev/sda1

Errors are detected and ‘corrected’. I can resume the restore and performance is good for a few files, then it hangs again.

Running fsck finds more errors. Looks like this cycle could continue forever.
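To catch the moment the file system drops into the emergency_ro state noted above, the mount options can be checked from a script between files. A minimal sketch follows; mount_opts and is_read_only are made-up helper names, and it assumes the kernel lists emergency_ro in /proc/mounts the same way it appears in the mount output above.

```shell
# Print the mount options of a device as the kernel reports them.
# $1 = device (for example /dev/sda1)
# $2 = mounts table to read (optional, defaults to /proc/mounts)
# Both function names are hypothetical helpers, not standard tools.
mount_opts() {
    awk -v dev="$1" '$1 == dev { print $4 }' "${2:-/proc/mounts}"
}

# Succeed if the device is mounted read-only, either normally ("ro")
# or because ext4 gave up after an error ("emergency_ro").
is_read_only() {
    mount_opts "$1" "$2" | tr ',' '\n' | grep -qxE 'ro|emergency_ro'
}

# Usage:
#   is_read_only /dev/sda1 && echo "sda1 has gone read-only"
# fsck.ext4 is meant to run on an unmounted file system, so unmount first:
#   sudo umount /dev/sda1 && sudo fsck.ext4 -f /dev/sda1
```

If the file system keeps flipping to read-only shortly after every fsck, new errors are being introduced underneath it (drive, cable, or controller) rather than left over from an earlier crash.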

Anyone have a conclusion OTHER THAN failing disk?

If you have data on that disk which is important to you, back it up now. Then run a full SMART test on the drive using smartmontools.

To install smartmontools…

sudo dnf install smartmontools

Get information on the drive with…

sudo smartctl -i /dev/[device]

That will tell you whether SMART is available on the drive and whether it’s enabled. The output will include lines like…

SMART support is: Available - device has SMART capability.
SMART support is: Disabled

If SMART support is disabled you can enable it with…

sudo smartctl -s on /dev/[device] 

You can get the current SMART data with…

sudo smartctl -a /dev/[device]

What you’re looking for is anything listed with error rate and reallocated sector count. If the drive finds a sector it cannot read or write reliably, it remaps that sector to a spare one that has been set aside for just such an occurrence. That’s what the “correction” is. But there are a limited number of spare sectors available for this.

If error rates and reallocated sector counts are 0, that is a good sign, although a clean SMART report does not guarantee a healthy drive.
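The attribute table is long, so it can help to filter it down to the counters that matter most here. The sketch below is an assumption-laden example: smart_nonzero is a made-up helper name, it expects the usual ATA table layout printed by smartctl -A (raw value in the tenth column), and it only looks at IDs 5, 197, 198 and 199. ID 199 (UDMA_CRC_Error_Count) counts transfer errors between drive and controller, which tends to point at the cable rather than the platters.

```shell
# Read "smartctl -A" output on stdin and print the name and raw value of
# selected attributes whose raw value is above zero:
#   5   Reallocated_Sector_Ct   sectors already remapped to spares
#   197 Current_Pending_Sector  sectors waiting to be remapped
#   198 Offline_Uncorrectable   sectors that could not be read offline
#   199 UDMA_CRC_Error_Count    link errors (often cable or connector)
# smart_nonzero is a hypothetical helper name.
smart_nonzero() {
    awk '$1 ~ /^(5|197|198|199)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Usage:
#   sudo smartctl -A /dev/sda | smart_nonzero
```

No output means all four counters are zero. Attribute names and IDs vary a little between vendors, so treat an empty result as a hint, not a clean bill of health.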

You can perform a short test with…

sudo smartctl -t short /dev/sda

Or you can perform a long test with…

sudo smartctl -t long /dev/sda

There are two other types of test: conveyance, which checks for damage that may have occurred during transport, and select, which lets you test specific ranges of logical block addresses.

This will tell you whether your drive is failing or not.
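Note that the -t commands only start the test in the background and return immediately; the result has to be read afterwards from the drive’s self-test log with smartctl -l selftest. As a rough sketch (selftest_ok is a made-up helper name, and it assumes the ATA log layout where the newest entry is numbered "# 1"), the newest result can be checked like this:

```shell
# Read "smartctl -l selftest" output on stdin and succeed only if the most
# recent entry (the line numbered "# 1") completed without error.
# selftest_ok is a hypothetical helper name.
selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Usage, once the estimated test time printed by "smartctl -t" has passed:
#   sudo smartctl -l selftest /dev/sda | selftest_ok \
#       && echo "latest self-test passed" \
#       || echo "latest self-test failed, is still running, or is missing"
```

A short test only samples the drive, so a pass there is weaker evidence than a pass on the long test.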

I ran this test last week and nothing was reported. I ran the short test today and there are multiple prefailure warnings reported.

Time to get a new drive (and NOT another WD).

Fortunately, this was my restore drive. My backup drive reports no problems (yet).

Thank you for the detailed help!

UPDATE: Replaced WD drive with Seagate IronWolf Drive. 2 TB restore completed quickly and with no errors.

Follow Up: Subsequently experienced erratic behavior with the new drive as well. Replaced the SATA cable and, so far, problems have disappeared. Just an FYI, something else worth testing.

Worth slapping that original drive back in with the replaced SATA cable and seeing if it still throws hissy fits? I’ve had one SATA cable go bad on me, and it also showed up as lots of failures in the Windows event log and hanging or slow transfers to that specific disk. It threw me for a bit, as I had several partitions on there and couldn’t work out why sometimes E: would have issues, then Z:, then G:, and so on, until I realised they all had the same disk in common. Nothing in SMART and nothing when I ran disk test software, but it was the cable.

If memory serves, it was the connection post which was loose and would move fractionally with heat, just enough to cause a fault and then it would cool off and make a decent connection again. Spent weeks chasing that.
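On Linux, the rough equivalent of the Windows event log for this kind of fault is the kernel log. A flaky SATA link often shows up there as ATA exceptions and link resets before SMART reports anything. The filter below is only a sketch: link_errors is a made-up helper name, and the patterns are a small selection of typical libata, block-layer and ext4 messages, not a complete list.

```shell
# Read kernel log text on stdin and keep only lines that suggest trouble on
# the SATA link or the disk behind it. link_errors is a hypothetical helper
# name, and the pattern list is a partial selection of common messages.
link_errors() {
    grep -iE 'ata[0-9]+(\.[0-9]+)?: (exception|failed command|hard resetting link)|I/O error, dev|EXT4-fs error'
}

# Usage:
#   sudo dmesg | link_errors
#   sudo journalctl -k | link_errors
```

Repeated "hard resetting link" lines on the same ata port, together with a rising UDMA CRC error count in SMART, would make the cable or connector a stronger suspect than the drive itself.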

In this case, I was hearing chattering noises from the WD drive (part of why I decided to replace it). Sure enough, when I disassembled the drive, I found several scratches on the platters. Who knows, perhaps the faulty cable somehow caused the damage. At any rate, it was FUBAR. And yes, SMART didn’t really tell me anything until the very end (although, in all fairness, I only ever ran short tests, so perhaps it could have if long tests were run).

Use the ‘sync’ mount option for the drive the data is being copied to. That avoids write caching, so the data is written directly to the drive.