A question about a faulty disk

this question might seem silly, but I’m a computer newbie, so :slightly_smiling_face:
I purchased my laptop (and the hdd in it) back in 2018, and I guess that can be enough to wear out an hdd.

the issue is, just 2 weeks after installing Fedora, the system and the apps get noticeably slower, lags now happen a lot more often, sometimes some apps crash. At the same time, first few days after install, everything is very fast, no issues.

someone said it’s probably because, some time after install, I reach my hdd’s faulty sectors, after I told them that my hdd has tons of read and/or write errors (and some other errors too). This was found out through a SMART scan (which somehow still gave it a “Disk is OK” overall assessment). Sending the results of it here.

I know you’d advise me to replace it with a good ssd. I would totally do that, if there was an opportunity right now. Yes, my situation is that difficult. Perhaps someday I can. We’ll see.

so while it hasn’t happened yet, my question is:
is there a way to determine which exact sectors of my disk are faulty, where they are, and then somehow reinstall my system and lay it out while avoiding them? My hdd is around 900 GB, and my system uses from 20 to 50 GB on average, so… perhaps it’s somehow possible?

Hi Blind,

Looking at the report from SMART, your HDD is fine … so, let’s see if we can figure out the actual problem :slight_smile:

Here are a couple of things to look at:

  1. open a terminal / command line
  2. type “top” at the prompt
    How much MiB Mem do you have? (RAM)
    How much MiB Swap (total, free,used)
  3. when things seem to be slowing down, hit 1 on the keyboard in the top window
    Here is what to look for:
  4. look at the CPU utilization per CPU — are any pegged at 100%?
  5. are there any processes hitting 100% or greater?
  6. in the column with “n.n wa,” what values are you seeing there?

Post the results here and I’ll have a look to see if I can help :slight_smile:
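If interactive top is awkward to copy from, the same snapshot can be captured non-interactively for pasting (a quick sketch; `-b` is top’s batch mode):

```shell
# One batch-mode snapshot of top: the summary lines show the load average,
# tasks, the %Cpu(s) line (including "wa", I/O wait), memory, and swap.
top -b -n 1 | head -n 5
```

Inside interactive top, pressing 1 toggles the single %Cpu(s) summary line into one line per core, which makes a single pegged core easy to spot.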


The drive will attempt to replace bad sectors itself, but only in response to a failure to read data (not sure a write will work).

I am not sure what the best way is to force the reallocation.
Maybe try reading the whole disk with dd?
Others may have better suggestions.
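For what it’s worth, a sketch of that idea (the device name /dev/sda below is an assumption, check yours with lsblk; the runnable demo at the end reads an ordinary file instead, so no real disk is touched):

```shell
# On a real drive you would read every sector, ignoring errors, e.g.:
#   sudo dd if=/dev/sda of=/dev/null bs=1M conv=noerror status=progress
# or log unreadable blocks with badblocks and hand the list to ext4
# (N is your partition number):
#   sudo badblocks -b 4096 -sv /dev/sda > bad-blocks.txt
#   sudo e2fsck -f -l bad-blocks.txt /dev/sdaN
# Harmless demo of the same read pattern on a plain file:
f=$(mktemp)
dd if=/dev/urandom of="$f" bs=4096 count=256 status=none   # 1 MiB sample
dd if="$f" of=/dev/null bs=4096 conv=noerror status=none && echo "read scan completed"
rm -f "$f"
```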

Hi Barry,

Most HDD/SSD will relocate data automatically when a sector is marked bad by the on-board disk controller … IF there are any good sectors available … Per the report, the Reallocated Sector Count is 0 … so, not likely bad sectors

hmm, I’m surprised, considering that it does show that at least 1 sector is unfixable

just in case I misunderstood the questions, here are the results:
(also, keep in mind, this is results with Firefox playing a video, having this forum open, and the Discord app too)

top - 23:44:08 up  4:18,  2 users,  load average: 3,58, 3,25, 3,16
Tasks: 329 total,   1 running, 327 sleeping,   0 stopped,   1 zombie
%Cpu(s): 15,2 us,  4,9 sy,  0,0 ni, 77,5 id,  0,0 wa,  1,8 hi,  0,6 si,  0,0 st 
MiB Mem :   6877,2 total,    570,7 free,   4148,8 used,   2449,0 buff/cache     
MiB Swap:   6877,0 total,   6013,0 free,    864,0 used.   2728,4 avail Mem 

hit 1 in the top window?.. I’m sorry, I don’t quite get that
also, now that you say “when things seem to be slowing down.”
it does (obviously) happen when opening a lot of tabs. But there were also several times when the system was very slow right after launch. I would only open the terminal, and even that would take minutes to launch. Also, every first-launch-per-session of Firefox is really slow, the app becomes unresponsive a lot, until a few minutes later.

is that visible from the result I sent, or…?

not any that I could see, no

… to be honest with you, I can’t find that column anywhere lol

you mean, to just run dd (this exact command, with nothing added to it)?

oh right, I forgot about another issue that I suppose I should mention.
I use ethernet, and sometimes, when opening a lot of tabs at the same time (usually with at least one video playing), my internet connection stops loading anything, and its tray icon goes “?”. My connections go up to 5000 or 10000 (I know this through an app that shows active connections). It is fixed easily by closing Firefox and waiting a minute. Then everything loads again, the tray icon looks normal, and I can reopen Firefox. This happens at least once or twice a day.
I don’t experience any of this on any other devices connected to the same network.

Hi Blind,

%Cpu(s): 15,2 us,  4,9 sy,  0,0 ni, 77,5 id,  **0,0 wa**,  1,8 hi,  0,6 si,  0,0 st 
MiB Mem :   6877,2 total,    570,7 free,   4148,8 used,   2449,0 buff/cache     
MiB Swap:   6877,0 total,   6013,0 free,    864,0 used.   2728,4 avail Mem 

Ok, you have 6.8GB of RAM installed with 4GB consumed
and
You are starting to use a little swap (0.86GB)
You are not waiting on disk I/O (0,0 wa) at the moment

Ok, next small test, let’s see how fast the disk is …

  1. open a terminal
  2. type cd – let’s make certain we are in your home directory
  3. dd if=/dev/zero of=TEST.DD bs=4096 count=1M — this will create a file called TEST.DD and fill it with zeros just as fast as the disk can take it
  4. do #3 about 10 times in fast succession — we want to fill the buffer so that we can get the buffer-to-platter transfer rate
  5. now do dd if=TEST.DD of=/dev/null — this gives us an idea of how fast we can read from the same disk
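Scaled down so it finishes quickly, steps 2–5 look roughly like this (bump count back up to 1M for the real measurement):

```shell
cd "$HOME" || cd /tmp
# write: 100 MiB of zeros, as fast as the disk will take them
dd if=/dev/zero of=TEST.DD bs=4096 count=25600
# read it back from the same disk
dd if=TEST.DD of=/dev/null bs=4096
ls -l TEST.DD      # 4096 * 25600 = 104857600 bytes
rm -f TEST.DD
```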

What are your results?

… Hmmm …
Model and brand of your computer?

1048576+0 records in
1048576+0 records out
4294967296 bytes (4,3 GB, 4,0 GiB) copied, 29,2285 s, 147 MB/s

… I hope I understood this correctly, and didn’t make a bad mistake.
I opened a lot of terminal tabs, running this same command in each one (simultaneously). I thought I should do that, since you said “fast succession”, but I can’t get fast because each command takes like a minute to finish.
some of the tabs would take a few seconds to actually make the blind@linux: prompt, so it wasn’t a perfect fast succession. But well.
I stopped the video, thinking it would just freeze. My computer was responsive at first. After a few minutes, it all froze, the connection went “?” again. A few minutes later, it mostly unfroze. The commands started finishing gradually, took about 10-15 minutes for all of them to finish. Here are the results of the last one (the rest of them are pretty similar):

1048576+0 records in
1048576+0 records out
4294967296 bytes (4,3 GB, 4,0 GiB) copied, 1215,94 s, 3,5 MB/s

I really hope that doing all those commands at the same time didn’t break anything about my system. Please let me know if that’s the case.

8388608+0 records in
8388608+0 records out
4294967296 bytes (4,3 GB, 4,0 GiB) copied, 199,936 s, 21,5 MB/s

ASUS X550IK
more info just in case:
CPU: AMD FX-9830P RADEON R7, 12 COMPUTE CORES 4C+8G (4) @ 3.00 GHz
GPU 1: AMD Radeon RX 560 Series [Discrete]
GPU 2: AMD Radeon R7 Graphics [Integrated]

Hi Blind,

1048576+0 records in
1048576+0 records out
4294967296 bytes (4,3 GB, 4,0 GiB) copied, 29,2285 s, 147 MB/s

Ok, this tells me that the maximum transfer rate is likely about 1.2 Gbps (147 MB/s × 8 bits) bus to buffer

Yeah, you can’t do a good test running the writes in parallel … but we have enough info to give at least a good guess …
So, the disk is not very fast: about 21 MB/s read and roughly 5 MB/s ~ 10 MB/s write
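As a sanity check on the units (megabytes per second to megabits per second):

```shell
# 147 MB/s * 8 bits per byte = megabits per second
echo $((147 * 8))   # prints 1176, i.e. roughly 1.2 Gbps
```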

The “stalls” are likely being caused by a combination of things:

  1. slow disk — this manifests as a system-wide stall when severe – everything is blocked waiting for I/O

  2. not much RAM — this makes things even worse – writing to swap

  3. CPU not exactly a rocket-ship … and because of #1 and #2 CPU is consumed with spin-locks / blocked waiting for the I/O to disk …

I am willing to bet that while you were running these small tests that the CPU was maxed and the I/O Wait (wa) was also quite high …

Oddly enough … I have the same laptop (the one I usually lend to friends when theirs go kaput for some reason) … It’s not very powerful to begin with :slight_smile:
But, it’s reliable and does not overheat when loaded up with work … on the down side, like yours, it stalls a lot when loaded up with a lot of open browser windows … or multiple applications running concurrently …

What can you do without buying a new machine?

  1. put an SSD in it – this will help with the slow program loads
  2. See if you have one open memory slot … mine has just one so … you could add 8GB~16GB of RAM — this will keep you from using swap when you have a lot of concurrent apps running

This will help “some” but it is still going to be a relatively slow machine as compared to something a couple of years newer …

Good news though … your HDD is healthy :slight_smile:


Your CPU is definitely under high load.

(It could be caused by processes waiting for I/O, disk or network).
Best would be to identify which process is taking up that much CPU queue time.
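One quick way to do that, as a sketch (iotop is an assumption here; it may need installing first, e.g. with dnf):

```shell
# top 5 CPU consumers right now, sorted by %CPU
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6
# per-process disk I/O, showing only processes actually doing I/O:
#   sudo iotop -o
```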


hmm… shouldn’t we count the first result though (before all the simultaneous commands)?
because it said 29,2285 s, 147 MB/s

well, perhaps? I did mention that I had Discord and Firefox open, with a video playing, and well, this forum open. If all this counts, then yeah.

yeah, I guess so. I wasn’t expecting A LOT from it. Just sometimes the slowness really seemed out of place. Sometimes it would take the terminal a few minutes to start (after a fresh computer start), and a regular sudo dnf upgrade --refresh would take several minutes to just list the packages to update and prompt me “Is this ok? [yes/no]”. Again, both those things would work much faster first few days after Fedora install. Everything would, in fact, be much more responsive and fast. But after 2 weeks, all this. It just feels weird. And sort of impacts my day-to-day computer usage sometimes

yeah… perhaps someday, if I ever get an option to do that. Same about more RAM.

are you completely sure? Taking a look at every parameter from the SMART scan?
(sorry, I guess I just got tired from reinstalling my system, and don’t wanna do that again for a while)

also… you still didn’t say. Could anything about my system break, after running those commands simultaneously?

what would be the correct way to do that?

Hi Blind,

No, this will not break/damage your machine … UNLESS it overheats … and that machine you have is pretty good about staying cool under heavy load unless you plug/cover the fan vents …


good to know, thanks
and thank you a lot for helping me in general


@einer sorry, I also forgot one question. Someone was suggesting defragmentation of the disk. Would that help? Especially considering I still have a Windows 10 partition on this drive, which idk if I ever defragmented (but I don’t use it anyway).
or does a newly installed Fedora need no defragmentation at all?

Fedora/Linux don’t generally have disk fragmentation issues … and defragging Windows will have no effect on Linux/Fedora UNLESS you are running Fedora/Linux as a virtual machine within Windows, with a virtual disk on a Windows filesystem … :slight_smile:

Maybe the drive’s firmware can map the bad sectors away if you invoke the sanitize operation.

BTW in my experience, /dev/zero is not going to produce a useful test, try /dev/urandom (or some other source of more varied data).

Hi Stephen,

The idea behind using /dev/zero is that it takes minimal CPU, where /dev/random or /dev/urandom has to generate pseudo-random data in the kernel … either will work and produce about the same results IF the CPU and RAM are pushing data to the disk at about the same rate :slight_smile:
Where /dev/urandom and /dev/random are most fun is when you are pushing the data through a compressor … you get a bit more real world results vs /dev/zero :slight_smile:

@einer,

With RLL encoding, on-disk cache, and HDD firmware optimizations, writing zeros is not a valid test. Then add in OS buffering and optimizations and it becomes an even less valid test. Some of this can be worked around: a common technique is to dd from one storage device populated with representative data to the device being tested, as that puts no load on the CPU, and having the source and destination storage devices on different busses pretty much eliminates contention as well. If you only have the one storage device, /dev/urandom (some implementations have little impact on CPU utilization) is better than /dev/zero. Using /dev/zero can provide a little information, but in this case, where many seek errors are being counted by the drive firmware, a better test is needed. Adding seeks to different parts of the drive could be helpful as well, but this additional complexity can be added later.

There are some good articles on the topic which will take some time to track down again. Rereading them will likely cause me to have to adjust my current understanding somewhat as well. Nevertheless the testing I have done showed quite starkly that /dev/zero test results were generally unhelpful.
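A rough way to compare the cost of the two sources on a given machine (writing to a temp file at small sizes, not a raw device; note /dev/urandom has to generate its data in the kernel, so on a slow CPU the source itself can become the bottleneck):

```shell
f=$(mktemp)
echo "from /dev/zero:"
dd if=/dev/zero of="$f" bs=1M count=64 2>&1 | tail -n 1     # dd reports the rate on stderr
echo "from /dev/urandom:"
dd if=/dev/urandom of="$f" bs=1M count=64 2>&1 | tail -n 1
rm -f "$f"
```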

Hi Stephen,

All good points … that’s why I try to remember to qualify this type of test as “quick-and-dirty” … the object is to get a general idea of what the device can do. Also, going from one storage device to another adds the limitation that the device being read may not be capable of saturating the writing device … so, not such a good test … :slight_smile:
Are there better more comprehensive tests that can be done? … you bet!
Are those tests as available as using dd? … not usually :slight_smile:
