Smart

Intro

All hard drives will eventually fail. They are electro-mechanical devices with bearings, motors and an armature that moves around A LOT. The amazing thing is they last as long as they do.

You should ALWAYS back up your server's sites and critical information. This is extremely easy to do.

SMART (Self-Monitoring, Analysis and Reporting Technology) gives you a reasonable prediction of when a drive is approaching its end of life. Using it just may save you from a catastrophe.

Drive Salvage

It is far easier to copy a drive or transfer information from it before a total drive failure. After a total failure you may be looking at $1,500 to $3,000 for data recovery. If you can predict the failure before it occurs, so much the better.

SMART Drive Analysis

Hard drives allow you to monitor various critical attributes which can indicate future drive failure. There is a lot of information available on this subject, and some of it is rather difficult to decipher and apply.

smartctl is the primary command to analyze overall drive health. It is part of the smartmontools package, which is installed with:

yum install smartmontools

The official site for this tool is: http://smartmontools.sourceforge.net/ which has information and documentation.

In order to activate SMART on a device, use:

smartctl -s on /dev/sda

Note that even if SMART is disabled in BIOS, this will activate it on the drive - it is not necessary to also have it on in BIOS.

Usage is pretty straightforward, for example:

smartctl -a /dev/sda 
(or /dev/hda, etc. as applicable)

Which would return results similar to the following:

== START OF INFORMATION SECTION ==
Device Model:     WDC WD800BB-00CAA1
Serial Number:    WD-WMA8E4070296
Firmware Version: 17.07W17
User Capacity:    80,026,361,856 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Dec 18 14:01:28 2007 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

== START OF READ SMART DATA SECTION ==
SMART overall-health self-assessment test result: PASSED

Here you can see general information about the drive characteristics and the overall SMART health assessment: PASSED

HOWEVER: This PASSED value is based on analysis of the SMART attributes described below. If the normalized value of one or more of these attributes falls to or below its manufacturer-determined threshold, this changes to FAIL.

You should be able to predict this failure WELL ahead of this value changing to "FAIL" - that is what this is all about.
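As a minimal sketch of checking this verdict from a script, the helper below pulls the result out of smartctl -H output. The function name health_result is made up for illustration, and the parsing assumes the ATA-style "overall-health" line shown above; other drive types may word it differently.

```shell
#!/bin/sh
# health_result (hypothetical helper): read `smartctl -H` output on stdin and
# print the overall-health verdict, or UNKNOWN if the expected line is missing.
health_result() {
    awk -F': *' '
        /SMART overall-health self-assessment test result/ { print $2; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Typical use (needs root and smartmontools installed):
#   smartctl -H /dev/sda | health_result
```

Remember that a PASSED result only means no attribute has crossed its threshold yet, so it should be read together with the attributes themselves.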

Analyzing SMART Attributes

This is where the intelligent interpretation comes in. smartctl -a will return the values of all the criteria monitored by that hard drive.

Manufacturers vary in the specific attributes they support and how they are calculated. The primary attributes tend to be fairly universal, though.

Some are more important than others. Some are general information; some are directly related to drive health and life expectancy.

For example, smartctl -a displays:


ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   112   095   021    Pre-fail  Always       -       3491
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       177
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36913
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       1
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       156
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

Now, of the above the truly vital attributes to look at are ones such as:

9 Power_On_Hours    0x0032   050   050   000    Old_age   Always  -  36913

Drive Life:

There are 8,760 hours in a year, so five years is 43,800 hours. Most drive warranties are good for 2 to 5 years. In the real world, drives with over 30,000 to 40,000 power-on hours are approaching end of life. If you see a drive with 30,000+ hours AND it has other issues, it is time to replace it. However, this is very much a matter of opinion; the drive could run to 50,000 hours. You have to look at the whole picture.
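The arithmetic above is easy to script. This is only a sketch: hours_to_years is a made-up helper name, and it assumes the Power_On_Hours raw value really is in hours, as it is for the sample drive.

```shell
#!/bin/sh
# hours_to_years (hypothetical helper): convert a Power_On_Hours raw value to
# years of continuous running, at 8,760 hours per year.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 8760 }'
}

hours_to_years 36913    # the sample drive: prints 4.2
hours_to_years 43800    # five years: prints 5.0
```

At 4.2 years of running time the sample drive is already inside the 30,000 to 40,000 hour band described above.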

MTBF:

One often hears about Mean Time Between Failures (MTBF) ratings. Manufacturers may quote figures of 200,000 to 500,000 to over a million hours. These figures are essentially meaningless. Let's see ONE IDE drive that has run for 57 years, just one. (That's 500,000 hours.)

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always  -  0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always -  0

These indicate sectors which have been designated bad and have been (Reallocated_Sector_Ct) or are waiting to be (Current_Pending_Sector) remapped to good spare sectors. It is not uncommon for a drive to develop some bad sectors over its life. Note that the threshold of 140 applies to the normalized VALUE column (currently 200), not to the raw sector count: if VALUE drops to 140 or below, SMART is going to say failure is imminent.
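Since smartctl compares the normalized VALUE column against THRESH, one way to see how much headroom each attribute has is to print the difference. The following is a rough sketch rather than a finished tool: attribute_margins is a made-up name, and it assumes the ten-column ATA attribute layout shown in the table above.

```shell
#!/bin/sh
# attribute_margins (hypothetical helper): read the `smartctl -A` attribute
# table on stdin and print how far each normalized VALUE sits above THRESH.
# An attribute counts as failing when VALUE is less than or equal to THRESH.
attribute_margins() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            margin = $4 - $6            # column 4 is VALUE, column 6 is THRESH
            state = (margin <= 0) ? "FAILING" : "ok"
            printf "%-26s value=%d thresh=%d margin=%d %s\n", $2, $4, $6, margin, state
        }
    '
}

# Typical use (needs root and smartmontools installed):
#   smartctl -A /dev/sda | attribute_margins
```

For the sample drive, Reallocated_Sector_Ct would show a margin of 60 (200 minus 140). Attributes with a threshold of 000 can never fail this comparison, so their raw values still need to be read by eye.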

Analyzing SMART Attributes - SCSI

Smartctl documentation for SCSI drives can be found at http://smartmontools.sourceforge.net/smartmontools_scsi.html

SCSI error logs are quite different from IDE & SATA. The "Total uncorrected errors" column is the one to watch: errors here can indicate poor drive health. The non-medium error count can also be a guide, although this number can be a bit nebulous. See the smartmontools SCSI page above for more info.

Note: Some newer SCSI drives support background scanning. Currently the SCSI drives in the IBMs and Dell 2850s do not support this. I do not know about the SAS drives in the newer servers. If supported, the background scan results log can be displayed with:

smartctl --log=background /dev/sda

SCSI sense codes are reported in drive error messages such as:

[+6708 72410001 002a9858 0:7] scsi disk: CHECK CONDITION on disk 0:6:5:0
       Read of logical block 509856, count 128
       disk sd45a, block 254920, 65536 bytes
       Valid = 1, Error code = 0x70
       Segment number = 0x00, Filemark = 0, EOM = 0, ILI = 0
       Sense key = 0x1, "RECOVERED ERROR"
       Information = 0x00 0x07 0xc7 0xe4

Additional info on SCSI sense codes can be found here

SMART Test

You can run a SMART test on the drive without taking the server off-line. While these tests are not the end-all diagnostic, they are another layer of analysis, and as they don't involve downtime they are a good place to start:

smartctl -t long /dev/sda

When the test completes, the results appear in the self-test log section of the SMART information (smartctl -a, or smartctl -l selftest):

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       137         -
# 2  Conveyance offline  Completed without error       00%       738         -
# 3  Conveyance offline  Completed without error       00%       964         -
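A small sketch for scanning that log from a script is shown here. The helper name selftest_problems is made up, and it assumes the ATA self-test log layout shown above. Anything not marked "Completed without error" is printed, which includes tests that are still running or were aborted.

```shell
#!/bin/sh
# selftest_problems (hypothetical helper): read `smartctl -l selftest` output
# on stdin and print every logged test whose status is anything other than
# "Completed without error".
selftest_problems() {
    awk '/^# *[0-9]+ / && !/Completed without error/ { print }'
}

# Typical use (needs root and smartmontools installed):
#   smartctl -t long /dev/sda                         # start the test
#   smartctl -l selftest /dev/sda | selftest_problems # check once it is done
```

No output means every logged test finished cleanly.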

Cpanel SMART Error Reports

Cpanel will notify you of SMART errors. However, it is just reporting; it can't really tell what is going on. Thus, you can get notifications such as:

The command that cpanel ran polls the drive for any errors which have occurred
on the drive at any time. The result show:

[root@server admin]# /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/sda
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
203 Run_Out_Cancel          0x000b   253   139   180    Pre-fail  Always   In_the_past 0

As stated, these are errors that have occurred AT ANY TIME. These errors may have been from 2 years ago, etc. You can execute the same command Cpanel used in the shell, for example:

/usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/sda

In this case three values are given: 253 139 180 (VALUE, WORST and THRESH).

At one point this attribute, Run_Out_Cancel (which is the number of ECC errors), dipped to 139 (the worst value); however, its current value is 253 (this is as good as that value gets). If it were to drop to 180 (the threshold) or below and stay there, the drive would be considered bad.
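The way these three figures combine can be sketched in a few lines of shell. The function name classify_attribute is made up for illustration; the labels it prints are the ones smartctl uses in its WHEN_FAILED column.

```shell
#!/bin/sh
# classify_attribute (hypothetical helper): given the normalized VALUE, WORST
# and THRESH figures of one attribute, say whether it is failing now, has
# failed at some point in the past, or has never crossed its threshold.
classify_attribute() {
    value=$1 worst=$2 thresh=$3
    if [ "$value" -le "$thresh" ]; then
        echo "FAILING_NOW"
    elif [ "$worst" -le "$thresh" ]; then
        echo "In_the_past"
    else
        echo "ok"
    fi
}

classify_attribute 253 139 180   # the Run_Out_Cancel figures: prints In_the_past
```

An In_the_past result is a reason to look more closely at the drive, not proof that it is failing today.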

Scripts

/usr/local/etc/smartd.conf (or /etc/smartd.conf, depending on how smartmontools was installed) contains the settings for smartd, the daemon which controls monitoring.

There are example scripts included with smartmontools. These are usually located in a location like:

/usr/share/doc/smartmontools-5.33/examplescripts

See the README for more information. There are many other scripts online that people have developed to run as cron jobs, watch attributes and send a notification when given criteria are met.
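If you write your own cron script, smartctl's exit status is a convenient thing to test, since it is a bit mask describing what was found. The sketch below decodes it; decode_smartctl_status is a made-up name, and the messages are paraphrased from the smartctl man page, so check them against the version you have installed.

```shell
#!/bin/sh
# decode_smartctl_status (hypothetical helper): print one line for each bit
# set in a smartctl exit status. Bit meanings paraphrased from the man page.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=1
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART command to the disk failed" \
        "SMART status check returned DISK FAILING" \
        "prefail attribute at or below threshold" \
        "attribute was at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        if [ $((status & bit)) -ne 0 ]; then
            echo "$msg"
        fi
        bit=$((bit * 2))
    done
}

# Typical use in a cron job (needs root and smartmontools installed):
#   smartctl -q silent -H -A -l selftest -l error /dev/sda
#   decode_smartctl_status $?
```

A cron job could mail this output to an administrator whenever the status is non-zero.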

Resources

Here are some good sites with additional information:

http://www.linuxjournal.com/article/6983 Good introductory article by Bruce Allen, the developer of smartctl (Professor of Physics at the University of Wisconsin - Milwaukee)

http://en.wikipedia.org/wiki/S.M.A.R.T. Definitions for the SMART attributes

http://smartlinux.sourceforge.net/smart/attributes.php Definitions for the SMART attributes (more)

http://www.almico.com/sfarticle.php?id=2 Good article on what the threshold values mean and how these three sets of figures work.

White Papers:

http://www.deepspar.com/pdf/DeepSparDiskImagingWhitepaper3.pdf White paper on disk imaging, SMART, etc.

http://members.ozemail.com.au/~steven.mcleod/SMART_Anti_Forensics.pdf White paper on drive duplication forensics

http://www.calce.umd.edu/whats_new/2003/1203.pdf Paper on reliability of hard drive


http://research.google.com/archive/disk_failures.pdf

Retrieved from "http://kb.sagonet.com/Smart"