Making fstab more robust

I had this line in my /etc/fstab:

UUID=72c035e6-fa0a-4a43-b511-035cdf5ce7e0 /data           ext4    errors=remount-ro 0       1

It seems that there is a problem with that disk, which rendered my system unbootable (the emergency shell it tries to give me during boot fails because I haven’t given root a password). I had to boot a live medium to comment out that line.

How can I make this process more robust? The /data partition isn’t actually necessary to boot the system, so if there is a problem with it, I don’t want my system rendered unbootable.

Is such an entry in fstab still the way to go to add additional partitions to the system? I’m not wedded to fstab; if additional partitions ought nowadays to be added via systemd units or whatever, I’m eager to learn about it.

-

The second part of this question: how should I proceed with this hard drive? I’ve just mounted the partition manually and it seems to be working fine.

You can use the noauto and/or nofail mount options.

nofail: Prevents the system from halting the boot process if the mount operation fails. The system will continue to boot, but the device will not be mounted.

noauto: Prevents the filesystem from being mounted automatically when the system boots or when the mount -a command is run.

You can also use the combination of both: noauto,nofail. Options like this are especially useful for external drives or network shares.
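As an illustration, the entry from the original post could be changed like this (same UUID and mount point; nofail is the addition). Note also that the last field, the fsck pass number, is conventionally 1 only for the root filesystem; 2 is typical for other partitions:

```
UUID=72c035e6-fa0a-4a43-b511-035cdf5ce7e0 /data  ext4  errors=remount-ro,nofail  0  2
```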


I’ve added nofail to fstab. I take from this that I should always do this for non-essential partitions.

Afterwards I ran fsck on the partition, which found and fixed a bunch of issues (like making some inode narrower), but I also lost some data.

How worried should I be about this? Is this a strong signal that the harddrive is failing and I should replace it immediately? Or is this part of normal operation?

I’d be worried. Make sure you have proper backups of everything that is important.

You can assess your drive’s health using smartmontools:

sudo smartctl -H /dev/sd<X>
sudo smartctl -a /dev/sd<X>

The output can be long and confusing. Look for:

  • Reallocated_Sector_Ct (anything > 0 is bad; the drive is dying),
  • Reported_Uncorrect (anything > 0 is bad),
  • Current_Pending_Sector (anything > 0 is bad),
  • and maybe UDMA_CRC_Error_Count (this points more to the transfer path, i.e. a SATA cabling issue, than to the drive itself).
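As a sketch of what to look for, here is a hypothetical excerpt of smartctl -a output (the values are made up) filtered down to those attributes with grep; on a real system you would pipe sudo smartctl -a /dev/sd<X> into the same grep:

```shell
# Filter a (made-up) SMART attribute table down to the critical counters.
# On a real drive: sudo smartctl -a /dev/sdX | grep -E '...'
cat <<'EOF' | grep -E 'Reallocated_Sector_Ct|Reported_Uncorrect|Current_Pending_Sector|UDMA_CRC_Error_Count'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always  -  0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always  -  0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always  -  0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always  -  0
EOF
```

The raw value is the last column; anything above 0 in the first three attributes is a red flag.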

If you search for “SMART attributes”, you will find the meaning of each attribute; there are many sites explaining them.

You can initiate a self-test of the drive that runs in the background using

sudo smartctl -t short /dev/sd<X>

or

sudo smartctl -t long /dev/sd<X>

(the long test can take an hour or longer)

You can check the results later using sudo smartctl -l selftest /dev/sd<X>


Thank you for all the explanations. I’ve run the commands you mentioned.

-H reports PASSED

-a: all the values you highlighted are 0

-t short: Completed without error

The long test is still running and will continue for some time. I will report back once it finishes.

In the meantime, can I be cautiously optimistic that I can keep using this drive a while longer?

It looks as if your drive is currently healthy. File system corruption can have many causes; it’s not necessarily a failing drive.
Keep an eye on those S.M.A.R.T. values and make sure you always have a good backup.

You might also want to run a test on your RAM with something like MemTest86+ (https://www.memtest.org), as bad memory can cause all kinds of problems.