# Smartmon "best practices"

## depontius

I've been running smartmon for years, but with a simple "/dev/hda -a" for a configuration. Other than spewing into my logs, which I do glance/grep at more than occasionally, this doesn't really do spit. Recently I tried RTFM, so I've moved to "/dev/hda -a -o on -S on -s (S/../.././14|L/../../6/15) -m addr@work,addr@home"

I suspect this is better, but it looks as if it will really only email me about failing self-tests. I suspect I also want email when prefail attributes approach their threshold values, or some such. I've noticed over time that the attributes tend to oscillate by a few LSBs, but tend not to really move, or at least move only very slowly.

Is there a "best practices" suggested smartd.conf that would be recommended for ordinary usage on home desktops and servers?
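For reference, here's how I decode the -s regex in my new line, if I'm reading the smartd.conf man page right - the fields are T/MM/DD/d/HH (test type, month, day of month, day of week, hour):

```
# S/../.././14  ->  short self-test every day at 14:00
# L/../../6/15  ->  long self-test every Saturday (day-of-week 6) at 15:00
/dev/hda -a -o on -S on -s (S/../.././14|L/../../6/15) -m addr@work,addr@home
```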

----------

## danja

Hi there, 

After we witnessed the death of 4 out of 5 RAID0 drives at the place I work last week, I rushed to install smartd on my home server, which holds nearly a terabyte on JBOD. I was wandering around with a question similar to yours and found this nice blog entry - http://scottstuff.net/blog/articles/tag/smart - nothing noble about it, but it pushed me towards watching and organizing my logging system, especially the precious smartd messages.

My configuration is a pure mix of disks and hardware, where:

/dev/hdg, /dev/hde are sitting on Promise PDC20269 (PCI card)

/dev/hda, /dev/hdc on VIA vt8233 (mobo built on)

/dev/sda, /dev/sdb on SiI 3112 (PCI card)

further,

/dev/md0 consists of /dev/hdg /dev/hde

/dev/md1 of /dev/sda /dev/sdb /dev/hdc

and /dev/hda is the system disk.

mostly WD drives plus one Maxtor and one Seagate

After all the tinkering I came up with a quite paranoid /etc/smartd.conf, consisting of mostly the same parameters for all the drives, which I took from the example entries in /etc/smartd.conf itself.

```
# The word DEVICESCAN will cause any remaining lines in this
# configuration file to be ignored: it tells smartd to scan for all
# ATA and SCSI devices.  DEVICESCAN may be followed by any of the
# Directives listed below, which will be applied to all devices that
# are found.  Most users should comment out DEVICESCAN and explicitly
# list the devices that they wish to monitor.
# DEVICESCAN -H -a -m root

/dev/hde -H -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/hdg -H -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/hda -H -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/hdc -H -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sda -d ata -H -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdb -d ata -H -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
```

As you see, at first I had been using DEVICESCAN, which appeared not to do the right thing with SATA drives, since they explicitly require the -d ata parameter. Perhaps I could have just added '-d ata' to the DEVICESCAN line, but I'd rather do it for each SATA disk I have. So far it works fine, and I've gotten some "bad sectors" messages via email about both of my SATA disks, which didn't happen before.
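If you'd rather keep the single DEVICESCAN line, something like the following ought to work - untested on my box, so treat it as a sketch and sanity-check each disk with 'smartctl -d ata -i /dev/sda' first:

```
# Apply the same directives, including -d ata, to every disk smartd finds
DEVICESCAN -d ata -H -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
```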

would love to hear some other experiences!

----------

## depontius

Wow, this is the first word I've heard on this topic, and I've posted it in a number of places.

Your configuration was pretty close to what I moved to recently, so I've just added the '-H' flag to mine, to match it. Just 2 systems left to update, once they're turned on for the day.

As for the article, I suspect the Hardware_ECC_Recovered messages were really a red herring. I get lots of "bouncing LSB" messages across various systems on various attributes. I suspect the first real problem was the "Self-Test Log error count increased from 0 to 1", and the rest was just chaff. Presumably, were this to happen to you or me, we'd now get an email about the event? I'm also under the impression we'd get an email if any attribute hit its threshold?

As a result of the scare I mentioned, I beefed up my SMART configuration. I also set up some RAID-1 space on my server. I've moved the mail for my IMAP server onto it, as well as /home, and have begun serving that over NFS. (This is a home network.)

Thanks

Edit... There have been a number of articles recently on hard disk reliability, and some of those articles have cast doubt on how useful SMART really is for predicting failure. But then again, aside from SMART and raid/mirroring, what else can one do?

----------

## Cyker

SMART is pretty accurate - If it says something failed, usually the disk is dying and should be replaced ASAP.

However, a disk can still suddenly fail without SMART even noticing something's up, so don't rely on it as a silver bullet for HD monitoring.

----------

## depontius

 *Cyker wrote:*   

> SMART is pretty accurate - If it says something failed, usually the disk is dying and should be replaced ASAP.

 

I wasn't referring to what SMART says has failed, but rather to attributes having a habit of jittering by a few LSBs, even for years. Having read the SMART docs a bit more, I see that this jitter isn't necessarily bad, but you have to watch for the value to hit a threshold. (It would even be nice to know when it hits a "new low" beyond what the jitter has been.)
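To scratch that "new low" itch, here's a quick Python sketch that parses 'smartctl -A' output and flags any attribute whose normalized value has drifted to within a few points of its threshold. The 10-point margin is an arbitrary guess on my part, and the column positions are just what my drives print:

```python
#!/usr/bin/env python
# Rough sketch: flag SMART attributes whose normalized VALUE has come
# close to THRESH.  Feed it the output of 'smartctl -A /dev/hda'.
# The 10-point margin is an arbitrary guess, not a recommendation.
import sys

MARGIN = 10

def near_threshold(smartctl_output, margin=MARGIN):
    """Return (name, value, thresh) for attributes within margin of threshold."""
    warnings = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry at least
        # ID NAME FLAG VALUE WORST THRESH columns.
        if len(fields) >= 6 and fields[0].isdigit():
            name, value, thresh = fields[1], int(fields[3]), int(fields[5])
            if thresh > 0 and value - thresh <= margin:
                warnings.append((name, value, thresh))
    return warnings

if __name__ == "__main__":
    for name, value, thresh in near_threshold(sys.stdin.read()):
        print("%s: value %d is within %d of threshold %d"
              % (name, value, value - thresh, thresh))
```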

 *Cyker wrote:*   

> However, a disk can still suddenly fail without SMART even noticing something's up, so don't rely on it as a silver bullet for HD monitoring.

 

I've read the recent articles too, but as I said, beyond SMART and raid/mirroring, what else can you do?

----------

## quantumsummers

Another thing to consider is hddtemp.  It's in portage, & there is a gkrellm2 plugin.  I have found it nice to view my HD temps in addition to SMART monitoring.  Alarms can be set, & there is a script on gentoo-wiki that will shut your machine down if you have an HD failure.  Though be careful with this, as it can cause problems (immediate shutdown during boot) if you have previous errors on any drive.

Good Luck,

QuantumSummers

----------

