SMARTCTL - CHECK YOUR DISK

We are going to be looking at another one of the very cool tools Linux has for your disks here, specifically smartctl. This command makes use of SMART error detection, which is built into the drive's own firmware rather than the BIOS. SMART stands for Self-Monitoring, Analysis and Reporting Technology, and it is something that all current and recent ATA and SCSI disks should have. As always, you use anything here at your own risk, and results may differ depending on your hardware and setup.

What is it?
The smartctl command allows you to monitor and check your disks for a wide range of possible errors. It will give you a ton of information regarding your disks, and can help pre-empt costly disk failures and the data loss that follows; forewarned is forearmed, after all. We will be going through some of the uses of this command. Also remember that smartctl must be run as root.

What do we have...?
The first thing we'll do is see what our drive has to say about the SMART feature set that it uses. We'll start with...

Command smartctl -i /dev/hda
Result === START OF INFORMATION SECTION ===
Device Model: WDC WD100EB-00CGH0
Serial Number: WD-WMA9N5043718
Firmware Version: 24.A4G24
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 5
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Jan 15 21:44:06 2005 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
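
If you would rather check this from a script than by eye, the two "SMART support is:" lines are easy to pick out. What follows is only a minimal sketch in Python: the names read_info and smart_support are made up for this example, and it assumes the output looks like the sample above. read_info needs root and a real device, so it is defined here but not called.

```python
import subprocess

# Sample lines copied from the output above.
SAMPLE_INFO = """\
Device Model: WDC WD100EB-00CGH0
Serial Number: WD-WMA9N5043718
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""


def read_info(device):
    """Run `smartctl -i` on a device (needs root) and return its text output."""
    result = subprocess.run(
        ["smartctl", "-i", device], capture_output=True, text=True
    )
    return result.stdout


def smart_support(info_text):
    """Return (available, enabled) based on the 'SMART support is:' lines."""
    available = enabled = False
    for line in info_text.splitlines():
        if not line.startswith("SMART support is:"):
            continue
        value = line.split(":", 1)[1].strip()
        if value.startswith("Available"):
            available = True
        elif value.startswith("Enabled"):
            enabled = True
    return available, enabled


print(smart_support(SAMPLE_INFO))
```

On a live system you would pass read_info("/dev/hda") to smart_support instead of the sample text.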

We can see that SMART support is enabled. We probably also want to see which parts of the SMART feature set our disk supports. We can find this, along with other information, by using...

Command smartctl -c /dev/hda
Result Offline data collection capabilities:
(0x3b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities:
(0x0003) Saves SMART data before entering power-saving mode.
Supports SMART auto save timer.
Error logging capability:
(0x01) Error logging supported.
No General Purpose Logging support.

Pretty nifty, is it not?

Let's test...
Now we are going to test our disk. There are a couple of different ways of doing this. The big thing here is to realise the difference between offline and captive tests. Offline means that the test runs while the machine is functioning normally, whereas a captive test (requested with the -C switch) can busy out the drive for its whole duration and should only be run when the disk is not in use and is not mounted. The offline test itself mostly does data collection, updating the drive's SMART values, whereas the short and long self-tests do the actual disk diagnosis. So with that out of the way, let's test...

Command smartctl -t offline /dev/hda
Result === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Please wait 742 seconds for test to complete.
Test will complete after Sat Jan 15 22:26:05 2005

We also use the -t switch to run the disk's self-diagnosis tests; for this we will use the short and the long options. The difference between the two is basically how long each runs and how thoroughly it tests. It is pretty self-explanatory which option does a quick test and which one takes a bit longer.

Command smartctl -t short /dev/hda
Result === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sat Jan 15 22:18:39 2005

Command smartctl -t long /dev/hda
Result === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 9 minutes for test to complete.
Test will complete after Sat Jan 15 22:35:27 2005
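
Each of these tests tells you how long to wait before the results are ready. If a script starts the test, it can read that figure from the "Please wait ..." line. Here is a rough sketch in Python; wait_seconds is a hypothetical helper name, and it only understands the seconds and minutes wording seen in the outputs above.

```python
import re

_UNIT_SECONDS = {"second": 1, "minute": 60}


def wait_seconds(output):
    """Return the advertised wait in seconds, or None if no such line is found."""
    match = re.search(
        r"Please wait (\d+) (second|minute)s? for test to complete", output
    )
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


# The three "Please wait" lines from the test runs above.
for line in (
    "Please wait 742 seconds for test to complete.",
    "Please wait 2 minutes for test to complete.",
    "Please wait 9 minutes for test to complete.",
):
    print(line, "->", wait_seconds(line))
```

A script could sleep for that long and then read the self-test log, rather than polling the drive while the test is still running.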

If you want offline checks to run automatically at regular intervals, smartctl can help you with that as well, provided your drive supports that particular SMART feature.

Command smartctl -o on /dev/hda
Result === START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Automatic Offline Testing Enabled every four hours.


Let's check the results...
Well, we have done our testing, but now we want to see the results. Do we panic or not? Do we initiate our disaster recovery plan (DRP) or breathe a sigh of contentment? Let's find out by starting with a basic disk health query...

Command smartctl -H /dev/hda
Result === START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
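
The verdict is the last word on the "self-assessment test result" line, so a monitoring script can pull it out directly. A minimal sketch in Python, assuming the output format shown above (overall_health is a name invented for this example):

```python
import re

# The health line from the output above.
SAMPLE_HEALTH = "SMART overall-health self-assessment test result: PASSED"


def overall_health(output):
    """Return the reported verdict (for example 'PASSED'), or None if absent."""
    match = re.search(r"self-assessment test result:\s*(\S+)", output)
    return match.group(1) if match else None


print(overall_health(SAMPLE_HEALTH))
```

Anything other than PASSED, including a missing line, is worth treating as a reason to look at the drive more closely.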

Looks good. Next, let's take a look at the logs generated by our tests...

Command smartctl -l selftest /dev/hda
Result
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)
# 1  Extended offline    Completed without error       00%       281
# 2  Short offline       Completed without error       00%       281
# 3  Short offline       Completed without error       00%       278

Command smartctl -l error /dev/hda
Result === START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
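
The self-test log shown earlier in this section can also be read by a script. The sketch below is only an illustration in Python: the function names are invented, and it assumes the columns are separated by two or more spaces, as they are when smartctl prints the log with aligned columns.

```python
import re

# Rows from the self-test log above, laid out with aligned columns.
SAMPLE_LOG = """\
Num  Test_Description    Status                   Remaining  LifeTime(hours)
# 1  Extended offline    Completed without error       00%       281
# 2  Short offline       Completed without error       00%       281
# 3  Short offline       Completed without error       00%       278
"""


def parse_selftest_log(text):
    """Turn each '#' row into a dict; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        if not line.startswith("#"):
            continue
        # Fields contain single spaces, so split only on runs of 2+ spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 5:
            continue
        num, description, status, remaining, lifetime = fields[:5]
        entries.append({
            "num": int(num.lstrip("#").strip()),
            "description": description,
            "status": status,
            "remaining_percent": int(remaining.rstrip("%")),
            "lifetime_hours": int(lifetime),
        })
    return entries


def all_tests_clean(entries):
    """True when every logged test reports 'Completed without error'."""
    return all(e["status"] == "Completed without error" for e in entries)


entries = parse_selftest_log(SAMPLE_LOG)
print(len(entries), all_tests_clean(entries))
```

Note that an empty log also counts as clean here, so a script that needs proof a test actually ran should check the number of entries as well.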

Good, good. All our tests passed without any errors. Let's do one final check and query the actual attributes of our drive...

Command smartctl -A /dev/hda
Result
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0007   109   101   021    Pre-fail  Always   -           1933
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always   -           400
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always   -           0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always   -           3558
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always   -           0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           396
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always   -           0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline  -           0

When looking at this table we can see that there are no entries under the WHEN_FAILED column, which means that none of the attributes have failed yet. The attributes to keep an eye on are the ones of type "Pre-fail"; when these fail, it means serious problems for your disk. We also need to understand some of the other columns: VALUE is the current normalised measurement, WORST is the worst value that has ever been measured for that attribute, and THRESH is a threshold value set by the manufacturer. If VALUE ever equals or falls below THRESH, that attribute is deemed to have failed. Remember to keep a sharp look-out for any problems so you can take corrective action.
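
The VALUE versus THRESH rule is simple enough to check automatically. The following Python sketch shows one way it might be done, using three rows taken from the table; parse_attributes and failed_attributes are hypothetical names, and the parsing assumes the ten-column layout shown above.

```python
# Header and three rows copied from the attribute table above.
SAMPLE_ATTRIBUTES = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 200 051 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3558
"""


def parse_attributes(text):
    """Parse attribute rows into dicts; the header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "when_failed": fields[8],
            "raw": " ".join(fields[9:]),
        })
    return rows


def failed_attributes(rows):
    """Names of attributes whose VALUE has reached or dropped below THRESH."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


rows = parse_attributes(SAMPLE_ATTRIBUTES)
print(failed_attributes(rows))
```

For the healthy drive in this walkthrough the list comes back empty. Filtering the rows on type "Pre-fail" first would narrow the check to the attributes that matter most.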

That's it for a quick run through smartctl; I hope it's been helpful. Remember that it can do a lot more, so look at the rest of the options, and happy fiddling.


-The homepage for smartctl - http://smartmontools.sourceforge.net/