Advanced hard disk tools

From Restart Wiki

Advanced tools for diagnosing and recovering hard disk problems.

Summary

The pressure to deliver ever greater storage capacities means that hard disk vendors push the storage density to such a level that they can only just read the data back. This is shown dramatically in a [YouTube video] which demonstrates that you can give a hard disk a hard time simply by shouting at it!

Most of the techniques and utilities described below can be used equally on SSDs (Solid State Disks), but before doing so you should understand their unique features and problems by reading SSD Migration and Troubleshooting.

The S.M.A.R.T. data returned by a hard disk can give a useful indication of its health but its interpretation is vendor-specific and not fully documented. There are other tools that can be used which may be much more informative.

Safety

Some of these tools can be DANGEROUS and should only be used in a kill-or-cure situation or if you are certain that a full system backup is available and you are sure you know what you are doing.

Hard Disk Error Handling

Before proceeding, it's important to understand how hard disks handle errors.

The space on a hard disk is made up of sectors, each 512 bytes, or for large disks, 4096 bytes long. Each has Error Correction Code (ECC) bits appended, and using these the disk firmware can correct read errors up to a certain number of consecutive bits. Such errors are corrected by the disk without bothering the user or the host computer, retrying the read operation multiple times if necessary. Furthermore, if the disk finds a sector which is becoming marginal (i.e. it only just managed to correct an error) it may automatically allocate a spare sector and rewrite the data to the spare, marking the original as bad.

The strategy for doing this may vary, but is roughly as follows:

  1. If a read is (eventually) successful, provided the number of retries and/or the level of error correction required (if any) were below a certain threshold, then do nothing.
  2. If the retries and/or error correction were above that threshold, but the data was still recovered, remap the sector (i.e. rewrite the data to a spare sector and mark the original as bad).
  3. If the data couldn't be recovered, mark the sector as "unstable", increment the count of "pending" sectors and return an error to the host computer.
  4. If the sector had previously been marked unstable but is now read correctly, remap it and decrement the "pending sectors" count.
  5. If a write occurs to a sector marked as "unstable", remap it by diverting the write to a spare sector and decrement the "pending sectors" count.
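The five steps above can be sketched as a small decision function. This is only an illustrative model: the function name sector_action and its arguments are hypothetical, and real firmware is vendor-specific. Two branches (an ordinary write, and a repeated failure on a sector already marked unstable) are not covered by the list and are assumptions.

```shell
# Hypothetical model of the strategy listed above; real firmware differs by vendor.
# Usage: sector_action OP RESULT EFFORT STATE
#   OP     = read | write
#   RESULT = ok | fail          (ignored for writes)
#   EFFORT = low | high         (retries/ECC below or above the threshold)
#   STATE  = stable | unstable  (whether the sector is already pending)
sector_action() {
    op=$1; result=$2; effort=$3; state=$4
    if [ "$op" = write ]; then
        if [ "$state" = unstable ]; then
            echo "remap to spare, pending count -1"                # step 5
        else
            echo "write in place"                                  # assumption: not in the list
        fi
    elif [ "$result" = fail ]; then
        if [ "$state" = unstable ]; then
            echo "return error"                                    # assumption: already counted
        else
            echo "mark unstable, pending count +1, return error"   # step 3
        fi
    elif [ "$state" = unstable ]; then
        echo "remap to spare, pending count -1"                    # step 4
    elif [ "$effort" = high ]; then
        echo "remap to spare"                                      # step 2
    else
        echo "do nothing"                                          # step 1
    fi
}
```

For example, sector_action read ok high stable reports a remap, which corresponds to step 2.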

SSDs operate a similar but often vendor-specific strategy.

The number of pending sectors is reported directly in the S.M.A.R.T. data.
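On Linux, one way to read this figure is the smartctl utility from the smartmontools package, which is not otherwise covered on this page. The sketch below assumes the usual "smartctl -A" table layout, in which attribute 197 (Current_Pending_Sector) carries the pending count in the tenth column. Some drives, NVMe devices in particular, report their health data differently, so treat it as a starting point. The helper name pending_sectors is hypothetical.

```shell
# Print the raw value of SMART attribute 197 (Current_Pending_Sector)
# from "smartctl -A" output supplied on standard input.
# Assumption: the drive uses the common ATA attribute table layout.
pending_sectors() {
    awk '$1 == 197 { print $10 }'
}

# Typical use (needs root and a real disk; /dev/sda is only an example):
#   sudo smartctl -A /dev/sda | pending_sectors
```

A non-zero result means the disk has sectors it could not read and has not yet been able to remap.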

Speedfan

Speedfan is the easiest of the hard disk tools to use. It runs under Windows and gives vital statistics of a system, such as temperatures. A very useful additional function is that it reports S.M.A.R.T. data and, more than that, can compare the state of your hard disk with an online crowd-sourced database to show how your hard disk is ageing compared to others of the same model. Select the S.M.A.R.T. tab and click the "Hard disk" box to select yours. Review the results, then click "Perform an in-depth online analysis" to get a customised online report, which will appear in a browser window.

ddrescue

ddrescue is a Linux tool for copying disks sector by sector. As such, it resembles the venerable Unix and Linux utility dd, but that is as far as the resemblance goes. Unlike dd, it will persist if it gets a read or write failure, and it keeps a log of the blocks it has successfully processed. This makes it useful for recovering data from a disk which you fear might fail completely at any moment. On a first run, it recovers all the data it easily can, placing the least possible extra strain on the disk. If the disk is still functional you can run it again as often as you like, and each time it will only try to copy the disk sectors it previously failed to read. You can also intersperse the runs with another tool such as Spinrite.

If not already installed, you can install ddrescue with the shell command

sudo apt-get install gddrescue

The command man ddrescue gives full details of the options. There are many of them, and it is worth familiarising yourself with them in advance.

You may have several disks plugged into the computer: the Linux system disk (possibly a bootable memory stick), the disk to be copied from, and the disk to be copied to. Be sure to double check which is which, or you may be heading for a disaster!
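One way to double check is to list the attached disks with their size, model and serial number, and to run a quick sanity check on the names you intend to use. The sketch below uses lsblk (part of util-linux). The function names list_disks and check_pair are hypothetical, and the checks only catch the most obvious mistakes; they are no substitute for reading the labels on the drives.

```shell
# List whole disks (not partitions) so each one can be identified by
# size, model and serial number.
list_disks() {
    lsblk -d -o NAME,SIZE,MODEL,SERIAL
}

# Hypothetical guard: refuse obviously wrong source/destination pairs.
# Returns 0 if the pair looks sane, 1 otherwise.
check_pair() {
    src=$1; dst=$2
    if [ ! -e "$src" ] || [ ! -e "$dst" ]; then
        echo "source or destination does not exist" >&2
        return 1
    fi
    if [ "$src" = "$dst" ]; then
        echo "source and destination are the same" >&2
        return 1
    fi
    # Refuse if the destination, or a partition on it, is currently mounted.
    if awk -v d="$dst" 'index($1, d) == 1 { found = 1 } END { exit !found }' \
            /proc/mounts 2>/dev/null; then
        echo "destination is mounted" >&2
        return 1
    fi
    return 0
}
```

For example, run list_disks first, then check_pair /dev/sdb /dev/sdc before starting a copy between those two devices.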

As a simple example, to clone a failing disk /dev/sdb to a replacement /dev/sdc of equal or larger size, use the following commands:

ddrescue -f /dev/sdb /dev/sdc logfile
ddrescue -f -r3 /dev/sdb /dev/sdc logfile

(If you have used ddrescue before, be sure to use a different logfile name, or delete the old logfile first, or it will assume you are continuing a previous rescue.)

The first command will hopefully complete the job, recording in logfile what it actually achieved. It concentrates on copying as much as it can without lingering on bad blocks, on the assumption that the disk could fail completely at any time. The -f flag is required to force overwriting of the destination if it's an existing file or disk.

The second command retries up to 3 times any blocks recorded in the logfile as having failed. The logfile is updated accordingly, to show what has now been achieved.

The command:

ddrescuelog -t logfile

reports the content of the logfile.
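The logfile itself is plain text, so it can also be inspected with ordinary shell tools. The sketch below assumes the layout written by GNU ddrescue: comment lines beginning with #, one line giving the current position, then one line per block with a hexadecimal position, a hexadecimal size and a status character, where + means rescued. The function name summarise_map is hypothetical. Check the layout against your own logfile before relying on it.

```shell
# Sum the bytes marked as rescued ("+") and everything else in a
# ddrescue logfile supplied on standard input.
# Assumption: block lines have the form "pos size status" with hex values.
summarise_map() {
    rescued=0; other=0; seen_status_line=0
    while read -r pos size status _; do
        case $pos in
            '#'*|'') continue ;;        # skip comments and blank lines
        esac
        if [ "$seen_status_line" = 0 ]; then
            seen_status_line=1          # first data line is the current position
            continue
        fi
        case $status in
            +) rescued=$((rescued + $size)) ;;
            *) other=$((other + $size)) ;;
        esac
    done
    echo "rescued=$rescued not_rescued=$other"
}

# Typical use:
#   summarise_map < logfile
```

The totals are in bytes, so dividing not_rescued by the sector size gives a rough count of sectors still to be recovered.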

gdisk

Gdisk is a Linux command line tool for partitioning hard disks, included as standard in some distros such as SystemRescueCD but installable on any other. Conceptually, it is similar to the DOS fdisk command but is much more flexible, can cope with large or GPT-partitioned disks, and may be able to partition or repartition a disk that other tools refuse to on account of bad sectors.

On Debian-based Linux systems such as Ubuntu you should be able to install gdisk, if not already present, by typing

sudo apt-get install gdisk

at a command prompt.

MHDD

MHDD screenshot

MHDD is a low-level diagnostic and maintenance tool that runs under MS-DOS or FreeDOS. It's best run from a DOS bootable USB memory stick. If you have Spinrite on a memory stick, add this to it too.

If you are faced with a slow-running computer, MHDD will show very clearly whether the problem is a failing hard disk which is performing many retries in order to read data.

WARNING: MHDD will destroy data if not used with care.

MHDD is basically a user interface to the ATA command set and more. A key feature is that it accesses the disk direct rather than going through the BIOS and hence gets a more accurate and uncensored view. For example it can read the SMART data even if the BIOS hides it. The following is a very brief survival guide to the most useful functions.

Config and Command line flags

Normally MHDD disables access to the primary disk as a precaution, on the assumption that this may well be what DOS is running from. This won't be the case if you're booting from a USB memory stick.

On running for the first time, MHDD will create a folder CFG in the current folder, containing a file MHDD.CFG.

To enable the primary disk, edit MHDD.CFG to contain the line:

#PRIMARY_ENABLED=TRUE

(By default, this is set to FALSE.) Alternatively, you can launch MHDD with the command:

MHDD /ENABLEPRIMARY

Commands

MHDD issues a prompt, to which you can type a range of (case-insensitive) commands. To get started, try the following in order.

PORT - Issue this command first to get a list of disks, then select the one you want by number.

Note that if your disk is not shown, you may need to go into the BIOS settings and set the SATA controller mode to ATA or Legacy. Don't forget to set it back again afterwards or the computer may not boot.

EID - Report extended ID information from the disk. Double-check that this is the disk you intended to select with the PORT command.

SMART ATT - Report values of SMART attributes. (The F8 key is a synonym for this command.) Pay special attention to:

  • Read error rate
  • Relocated sectors count
  • Relocate event count
  • Current pending sectors

See the Wikipedia S.M.A.R.T. article for further details, and also below.

SCAN - Scan the disk

The disk is scanned, giving a graphic display of its state and showing access times, hence revealing sectors which take an excessive time to read. (The screenshot above shows a very sickly disk with many bad sectors.)

Several options are offered in a pop-up menu, which should initially be left as their defaults. In particular:

  • Start, End - Start and end sectors for the scan, defaulting to the entire disk.
  • Remap - Attempt to remap bad sectors, provided they can be read correctly (even with difficulty).
  • Erase delays - Erase sectors which take a long time to read, whether or not correctly. The data will be lost but this should cause them to be remapped.

To make sense of the Remap and Erase delays options, see Hard Disk Error Handling above.

Blocks of 255 sectors are each represented in the graphic display by a single blob. A brighter greyscale or a coloured blob indicates a slow read, suggesting the disk had trouble reading a sector. An "x" indicates a sector was unreadable.
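If a blob looks suspicious, a little arithmetic gives the sector range it covers, which can then be entered as the Start and End values for a narrower scan. The helper below is a hypothetical sketch. It assumes blobs are counted from 0 and that the scan began at the given start sector (0 for a whole-disk scan).

```shell
SECTORS_PER_BLOCK=255   # one blob in the display, as described above

# Print the first and last sector covered by blob number N (counting from 0).
# Usage: block_range N [START_SECTOR]
block_range() {
    n=$1; start=${2:-0}
    first=$((start + n * SECTORS_PER_BLOCK))
    echo "$first $((first + SECTORS_PER_BLOCK - 1))"
}
```

For example, block_range 10 prints 2550 2804, the sectors behind the eleventh blob of a whole-disk scan.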

CX - Perform random seeks and reads, and report the average access time.

HELP - Gives a list of all commands with brief descriptions.

MAN <command> - Gives a fuller description of the named command.

Log files

Log files of a session are recorded in text format in an automatically created sub-folder LOG. Each new session appends to them. To start fresh logs you can delete or rename the whole folder, optionally copying it to another disk first.

hdparm

Many of the facilities provided by hdparm are flagged in its documentation as DANGEROUS, or even EXCEPTIONALLY DANGEROUS, with the advice "DO NOT USE THIS OPTION!!" You have been warned!

hdparm is a command line utility which runs under Linux, providing direct access to many features and options of a disk drive only available at the hardware level. It may be useful as a last resort, after reading the documentation carefully and weighing up the risks and possible benefits.

In particular, a hard disk from a video recorder which refuses to respond to other tools may have been set to power up in standby mode. The command

hdparm -S 0 /dev/sda

may succeed in setting it into a more cooperative mode.

If the disk is in a very bad way the computer could fail to boot with it plugged in. Boot with the power cable to the troublesome disk disconnected and reconnect it once booted. If the disk isn't automatically recognised, at a root prompt type:

ls /sys/class/scsi_host

You should get a list of hosts, e.g. host0 host1 host2. Pick one (here host0 is taken) where the drive might reside and type:

echo "- - -" > /sys/class/scsi_host/host0/scan

(The 3 minus signs are separated by spaces.) If you chose the right host the missing disk should appear in Disk Manager.
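If you are unsure which host the disk is attached to, a variation on the command above is to rescan every host in turn. The function name rescan_all_hosts is hypothetical. It must be run as root, and the optional argument exists only so that the loop can be tried against a test directory.

```shell
# Ask every SCSI/SATA host to rescan for newly attached devices.
# Usage: rescan_all_hosts [BASE_DIR]   (BASE_DIR defaults to the sysfs path)
rescan_all_hosts() {
    base=${1:-/sys/class/scsi_host}
    for h in "$base"/host*; do
        # Skip anything without a writable scan file (including an unmatched glob).
        [ -w "$h/scan" ] || continue
        echo "- - -" > "$h/scan"
        echo "rescanned $h"
    done
}
```

Rescanning all hosts is harmless for disks that are already recognised, at the cost of a short delay.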

Spinrite

Introduction

Spinrite is a sophisticated hard disk maintenance utility, which works much harder than most others in order to recover data from failing disks. In many cases, Spinrite can use the erroneous results of many failed reads, together with error correction bits, to deduce what the data must have been. This allows Spinrite to rewrite the recovered data, causing the disk to write it to alternate spare sectors.

An important distinction needs to be understood between the function of Spinrite and the DOS or Windows CHKDSK utility. The latter operates on the logical structure of the file system, i.e. how the disk space is organised as folders and files and how free space is managed, simply regarding the disk as a large pool of numbered blocks. It therefore cannot correct faults in the disk itself. Spinrite, on the other hand, works on the disk as a pool of blocks with no concern about how they might be used. It is therefore equally applicable to any hard disk however it's formatted, even if taken from another device such as a PVR or hard disk iPod, just so long as it can be connected to a machine that Spinrite can run on.

Spinrite has been around for many years, and the current version 6.0 was released in 2004. An Internet search may reveal criticisms of it, some of them ill-informed and vituperative. Nevertheless, many unsolicited testimonials indicate that it remains a very valuable tool, whether or not it will fix the disk you are faced with today.

Limitations

Version 6.0 has been known to crash when run against some modern drives, broadly in the class of 250GB upwards.

Spinrite exercises a disk fairly hard, so if it's already in a very poor state there's a risk that it may fail completely before Spinrite can complete, potentially with the loss of all your data. In such a case, consider first using a tool such as ddrescue to recover as much data as is still readable without error. You might then run Spinrite, perhaps just on the troublesome area, before re-running ddrescue to try and recover the remaining data.

A drawback of Spinrite is that it can take many hours to run to completion, especially if it has to work hard to recover data. However, a computer which fails to boot may be suffering from a bad sector in the first few hundred MB, which Spinrite may be able to repair in much less time.

A new version of Spinrite is awaited which will use very much larger buffers in order to achieve a very considerable increase in operating speed. It should also include enhancements to free it from its dependence on DOS, allowing it to be run on a Mac, and to improve compatibility with the most modern drives. A suspend mode, too, is expected, which should allow it to be started during a Restart Party and then put into a low power state to give the owner time to take it home and plug it in to complete, even on a weak battery.

How to use

On an SSD, you should only use Spinrite on Level 2, as Level 4 will cause excessive ageing.


Spinrite runs under MS-DOS or FreeDOS and comes as a bootable CD image or .exe file, but is quite easily installed on a bootable memory stick. It costs $89 for a personal licence but comes with a no-quibble money-back guarantee. You may find a Restarter at a Restart Party who has a copy. (Strictly, the personal licence only allows you to use it on computers you personally own, but the author Steve Gibson has repeatedly said that he's comfortable with its non-commercial use by licensees to help someone out. If it digs them out of a hole, you might suggest they purchase their own licence.)

Several modes of operation are used, but levels 2 and 4 serve virtually all purposes. In level 2, it does its utmost to recover data from any bad sectors, remapping it to a spare. Level 4 takes much longer, performing an in-depth analysis of the entire disk surface. This reads the disk with error correction disabled, causing it to remap any sectors which are becoming marginal long before they start to present a risk. Conversely, it will revert a remapped sector, copying the data back, if a transient event such as a shock or electrical noise had caused a good sector to be remapped.

Spinrite is designed for use with conventional (spinning) hard disks but it can be used on an SSD on Level 2 (read-only data recovery, which doesn't cause wear) and can force it to reallocate marginal cells since it reads with error correction disabled. There have been multiple reports of this resulting in an SSD regaining as-new performance after having become slow.

While Spinrite is running it's best not to move the computer and to protect it from vibration and shocks, as these can cause soft errors which may result in disk sectors being unnecessarily remapped.

Under the Hood

To understand how Spinrite achieves its magic, first read the section Hard Disk Error Handling above.

Problems arise if a read error is beyond what the ECC can correct, even retrying the read a number of times. In this case, the disk returns an error to the host computer and Windows may retry a few more times before giving up. Should the file be overwritten, the disk will recognise that this was a troublesome sector and heave a sigh of relief that nobody wanted the data after all. It will then mark the old sector as bad and write the new data to a spare sector.

But if you really did want the data in the failing sector, you're out of luck unless you have a copy of Spinrite (or a deep pocket for a commercial data recovery service). Spinrite tries very much harder to read the data than either the disk itself or Windows, using all the tricks in the book and some more. If it eventually persuades the disk to perform a good (or at least an error-corrected) read, the disk will itself reallocate the data. However, Spinrite's magic is that it can often use the partial data received from many failed reads to reconstruct what the original data must actually have been, in which case it writes that back, once again causing the disk to reallocate the data to a spare sector. In the worst case, Spinrite will write back to the disk as much data as it managed to recover from the sector, as it may be that not all of the data in the sector was needed anyway.

Spinrite has several operating levels, but is almost always used in either Level 2 or Level 4. Level 2 does its utmost to recover data and in the process will cause bad sectors to be swapped out with spares. Level 4 additionally gives each sector a thorough work-out with error correction disabled, having temporarily saved the contents of the sector to a spare. This provokes a remapping of any sectors which are becoming weak, even though still serviceable by the disk's own criteria. Level 4 operates under an ultra-cautious strategy, ensuring that the data in any sector under test has been successfully written to a spare before starting the test. You can therefore safely abort a run at any point without losing data.