# [Solved] Smart monitor messing with hard drive?

## AllanS

My system has been nearly frozen by a problem associated with my second hard drive. The second drive goes into a state where nearly all disk activity is blocked. The first drive is OK, but it cannot do much because of the blockage created by the second drive. I thought temperature was the problem, so I installed smartd to detect problems. I also wrote a small script that uses smartctl to monitor the disk drive temperature and send an email if it gets too hot.

Smartd does not indicate any problems, even after both short and long self-tests. Neither do the manufacturer's diagnostics. But lately it seems that the common event before the disk locks up is my script, which uses "smartctl --all" to dump all the information and extract the current temperature. The relevant section from my log file is:

```

Feb  9 22:15:01 galactica cron[6230]: (root) CMD (/root/monDisks.sh)

Feb  9 22:15:09 galactica [205096.924881] hdb: drive_cmd: status=0x7f { DriveReady DeviceFault SeekComplete DataRequest CorrectedError Index Error }

Feb  9 22:15:09 galactica [205096.924895] hdb: drive_cmd: error=0x7f { DriveStatusError UncorrectableError SectorIdNotFound TrackZeroNotFound AddrMarkNotFound }, LBAsect=9343692930943, high=556927, low=8355711, sector=0

Feb  9 22:15:09 galactica [205096.924915] ide: failed opcode was: 0xb0

Feb  9 22:15:29 galactica [205116.918654] hdb: dma_timer_expiry: dma status == 0x61

Feb  9 22:15:39 galactica [205126.917308] hdb: DMA timeout error

Feb  9 22:15:39 galactica [205126.917322] hdb: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }

Feb  9 22:15:39 galactica [205126.917330] ide: failed opcode was: unknown

Feb  9 22:15:41 galactica de: failed opcode was: unknown

Feb  9 22:15:41 galactica [205128.831104] hdb: task_out_intr: status=0x51 { DriveReady SeekComplete Error }

Feb  9 22:15:41 galactica [205128.831110] hdb: task_out_intr: error=0x04 { DriveStatusError }

Feb  9 22:15:41 galactica [205128.831115] ide: failed opcode was: unknown

Feb  9 22:15:41 galactica [205128.832433] hdb: task_out_intr: status=0x51 { DriveReady SeekComplete Error }

Feb  9 22:15:41 galactica [205128.832439] hdb: task_out_intr: error=0x04 { DriveStatusError }

Feb  9 22:15:41 galactica [205128.832443] ide: failed opcode was: unknown

```

The last three lines then repeated forever until I did a SYSRQ emergency sync and remount. I waited for a while to give the first disk a chance to sync, and then I could do a SYSRQ reset. I then booted from a Gentoo install CD and checked the disks for errors (none this time) and for SMART errors (none at all).

What I found weird is the sector number identified in the error message - it is way past the end of my disk drive. So my question is: is this a smart monitor utility bug, a kernel bug, or a bad disk drive?
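An editorial aside on that sector number: it can be checked with plain shell arithmetic. The sketch below uses only values copied from the log and the hdparm output in this post; the helper name `to_hex` is made up for illustration. It shows that the logged LBAsect is exactly the `high` and `low` fields joined together, that both fields consist almost entirely of the repeating byte 0x7F (the same value as the status and error registers), and that the result is roughly 16,000 times the size of the drive. The failed opcode 0xb0 is the ATA SMART command, which does not address a data sector, so one plausible reading (an assumption, not a diagnosis) is that the kernel printed whatever the drive's registers happened to contain rather than a real sector address.

```bash
#!/bin/bash
# Values copied from the kernel log and the hdparm output in this post.
lba=9343692930943        # LBAsect from the drive_cmd error line
high=556927              # "high" field from the same line
low=8355711              # "low" field from the same line
disk_sectors=586072368   # "sectors" reported by hdparm /dev/hdb

# to_hex is a made-up helper: print a decimal number as hexadecimal.
to_hex() { printf '0x%X\n' "$1"; }

# Join the two 24-bit halves and compare the result with the logged LBAsect.
combined=$(( (high << 24) | low ))

echo "high     = $(to_hex "$high")"
echo "low      = $(to_hex "$low")"
echo "combined = $(to_hex "$combined")"
echo "matches LBAsect (1 = yes): $(( combined == lba ))"
echo "LBAsect / disk size: $(( lba / disk_sectors ))"
```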

I have none of the power management software enabled (disabled in BIOS, not in the kernel), so it is not directly related to that (maybe indirectly if I follow some of the other posts).

The drive setup is:

IDE0:

MAXTOR 120GB DiamondMax 6Y120P ATA/133 jumpered as master on the end of the 80-wire IDE cable

Western Digital Caviar WD3000JB 300 GB jumpered as slave in the middle of the 80-wire IDE cable (this is the offending disk)

IDE1: (does not seem to be involved, but included for completeness)

AOpen DVD/CDRW jumpered as master at the end of the 80-wire IDE cable

Seagate STT320000A tape drive jumpered as slave in the middle of the 80-wire IDE cable.

Some basic info:

```
 uname -a

Linux galactica 2.6.18-gentoo-r6 #1 PREEMPT Tue Jan 2 12:22:58 EST 2007 i686 AMD Athlon(TM) XP1600+ AuthenticAMD GNU/Linux

```

This is on an ASUS A7V266C motherboard with 1.5 GB of SDRAM.

```

 hdparm /dev/hdb

/dev/hdb:

 multcount    = 16 (on)

 IO_support   =  1 (32-bit)

 unmaskirq    =  1 (on)

 using_dma    =  1 (on)

 keepsettings =  0 (off)

 readonly     =  0 (off)

 readahead    = 256 (on)

 geometry     = 36481/255/63, sectors = 586072368, start = 0

```

```

hdparm -i /dev/hdb

/dev/hdb:

 Model=WDC WD3000JB-00KFA0, FwRev=08.05J08, SerialNo=WD-WMAMR1153985

 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }

 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=65

 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=16

 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455

 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}

 PIO modes:  pio0 pio3 pio4 

 DMA modes:  mdma0 mdma1 mdma2 

 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 

 AdvancedPM=no WriteCache=enabled

 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6

 * signifies the current active mode

```

```

emerge --info

Portage 2.1.1-r2 (default-linux/x86/2006.1, gcc-4.1.1, glibc-2.5-r0, 2.6.18-gentoo-r6 i686)

=================================================================

System uname: 2.6.18-gentoo-r6 i686 AMD Athlon(TM) XP1600+

Gentoo Base System version 1.12.6

Last Sync: Fri, 09 Feb 2007 16:50:01 +0000

app-admin/eselect-compiler: [Not Present]

dev-java/java-config: 1.3.7, 2.0.31

dev-lang/python:     2.3.5-r2, 2.4.3-r4

dev-python/pycrypto: 2.0.1-r5

dev-util/ccache:     [Not Present]

dev-util/confcache:  [Not Present]

sys-apps/sandbox:    1.2.17

sys-devel/autoconf:  2.13, 2.61

sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10

sys-devel/binutils:  2.16.1-r3

sys-devel/gcc-config: 1.3.14

sys-devel/libtool:   1.5.22

virtual/os-headers:  2.6.17-r2

ACCEPT_KEYWORDS="x86"

AUTOCLEAN="yes"

CBUILD="i686-pc-linux-gnu"

CFLAGS="-march=athlon-xp -O2 -pipe -fomit-frame-pointer"

CHOST="i686-pc-linux-gnu"

CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/X11/xkb /usr/share/config"

CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/java-config/vms/ /etc/revdep-rebuild /etc/splash /etc/terminfo /etc/texmf/web2c"

CXXFLAGS="-march=athlon-xp -O2 -pipe -fomit-frame-pointer"

DISTDIR="/usr/portage/distfiles"

FEATURES="autoconfig distlocks metadata-transfer sandbox sfperms strict"

GENTOO_MIRRORS="http://gentoo.osuosl.org/"

LINGUAS="en"

MAKEOPTS="-j2"

PKGDIR="/usr/portage/packages"

PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"

PORTAGE_TMPDIR="/var/tmp"

PORTDIR="/usr/portage"

SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"

USE="x86 3dnow X Xaw3d a52 aac alsa alsa_cards_ali5451 alsa_cards_als4000 alsa_cards_atiixp alsa_cards_atiixp-modem alsa_cards_bt87x alsa_cards_ca0106 alsa_cards_cmipci alsa_cards_emu10k1 alsa_cards_emu10k1x alsa_cards_ens1370 alsa_cards_ens1371 alsa_cards_es1938 alsa_cards_es1968 alsa_cards_fm801 alsa_cards_hda-intel alsa_cards_intel8x0 alsa_cards_intel8x0m alsa_cards_maestro3 alsa_cards_trident alsa_cards_usb-audio alsa_cards_via82xx alsa_cards_via82xx-modem alsa_cards_ymfpci alsa_pcm_plugins_adpcm alsa_pcm_plugins_alaw alsa_pcm_plugins_asym alsa_pcm_plugins_copy alsa_pcm_plugins_dmix alsa_pcm_plugins_dshare alsa_pcm_plugins_dsnoop alsa_pcm_plugins_empty alsa_pcm_plugins_extplug alsa_pcm_plugins_file alsa_pcm_plugins_hooks alsa_pcm_plugins_iec958 alsa_pcm_plugins_ioplug alsa_pcm_plugins_ladspa alsa_pcm_plugins_lfloat alsa_pcm_plugins_linear alsa_pcm_plugins_meter alsa_pcm_plugins_mulaw alsa_pcm_plugins_multi alsa_pcm_plugins_null alsa_pcm_plugins_plug alsa_pcm_plugins_rate alsa_pcm_plugins_route alsa_pcm_plugins_share alsa_pcm_plugins_shm alsa_pcm_plugins_softvol apache2 arts bash-completion bcmath berkdb bitmap-fonts bzip2 cddb cdr clamav cli cracklib crypt cscope ctype cups curl curlwrappers dbus dlloader doc dri dts dv dvb dvd dvdr dvdread elibc_glibc encode esd exif fbcon ffmpeg firefox flac foomaticdb fortran freetds ftp gdbm gif gimp gnome gnutls gphoto2 gpm gps gtk gtk2 gtkhtml hal iconv imagemagick imap imlib input_devices_evdev input_devices_keyboard input_devices_mouse ipv6 isdnlog java javascript jbig jpeg jpeg2k kde kernel_linux lcd_devices_bayrad lcd_devices_cfontz lcd_devices_cfontz633 lcd_devices_glk lcd_devices_hd44780 lcd_devices_lb216 lcd_devices_lcdm001 lcd_devices_mtxorb lcd_devices_ncurses lcd_devices_text lcms libg++ linguas_en midi mikmod mime mmap mmx mng motif mp3 mpeg mssql ncurses nls nptl nptlonly nsplugin nvidia oci8 odbc offensive ogg opengl oracle oss pam pcntl pcre pdf perl plotutils png ppds pppd prelude python qt3 qt4 quicktime 
readline reflection ruby samba sasl scanner sdl session sharedmem slp soap spell spl sse ssl svg tcl tcltk tcpd theora threads tidy tiff tk truetype truetype-fonts type1-fonts udev unicode usb userland_GNU v4l vcd vhosts video_cards_nvidia video_cards_vesa vim-syntax vorbis win32codecs wmf wxwindows xinetd xml xmlrpc xorg xosd xpm xprint xscreensaver xsl xv xvid zlib"

Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, PORTAGE_RSYNC_EXTRA_OPTS, PORTDIR_OVERLAY

```

```

smartctl --version

smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

smartctl comes with ABSOLUTELY NO WARRANTY. This

is free software, and you are welcome to redistribute it

under the terms of the GNU General Public License Version 2.

See http://www.gnu.org for further details.

CVS version IDs of files used to build this code are:

Module: atacmdnames.c    revision: 1.13  date: 2006/04/12     

  uses: atacmdnames.h    revision: 1.5   date: 2006/04/12     

Module: atacmds.c        revision: 1.168 date: 2006/04/12     

  uses: atacmds.h        revision: 1.81  date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: extern.h         revision: 1.41  date: 2006/04/12     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

Module: ataprint.c       revision: 1.164 date: 2006/04/12     

  uses: atacmdnames.h    revision: 1.5   date: 2006/04/12     

  uses: atacmds.h        revision: 1.81  date: 2006/04/12     

  uses: ataprint.h       revision: 1.28  date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: extern.h         revision: 1.41  date: 2006/04/12     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: knowndrives.h    revision: 1.16  date: 2006/04/05     

  uses: smartctl.h       revision: 1.23  date: 2006/04/12     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

Module: knowndrives.c    revision: 1.139 date: 2006/04/05     

  uses: atacmds.h        revision: 1.81  date: 2006/04/12     

  uses: ataprint.h       revision: 1.28  date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: extern.h         revision: 1.41  date: 2006/04/12     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: knowndrives.h    revision: 1.16  date: 2006/04/05     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

Module: os_linux.c       revision: 1.82  date: 2006/04/12     

  uses: atacmds.h        revision: 1.81  date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: os_linux.h       revision: 1.24  date: 2006/04/12     

  uses: scsicmds.h       revision: 1.57  date: 2006/04/12     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

Module: scsicmds.c       revision: 1.85  date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: extern.h         revision: 1.41  date: 2006/04/12     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: scsicmds.h       revision: 1.57  date: 2006/04/12     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

Module: scsiprint.c      revision: 1.107 date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: extern.h         revision: 1.41  date: 2006/04/12     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: scsicmds.h       revision: 1.57  date: 2006/04/12     

  uses: scsiprint.h      revision: 1.20  date: 2006/04/12     

  uses: smartctl.h       revision: 1.23  date: 2006/04/12     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

Module: smartctl.c       revision: 1.143 date: 2006/04/12     

  uses: atacmds.h        revision: 1.81  date: 2006/04/12     

  uses: ataprint.h       revision: 1.28  date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: extern.h         revision: 1.41  date: 2006/04/12     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: knowndrives.h    revision: 1.16  date: 2006/04/05     

  uses: scsicmds.h       revision: 1.57  date: 2006/04/12     

  uses: scsiprint.h      revision: 1.20  date: 2006/04/12     

  uses: smartctl.h       revision: 1.23  date: 2006/04/12     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

Module: utility.c        revision: 1.61  date: 2006/04/12     

  uses: configure.in     revision: 1.113 date: 2005/11/27     

  uses: int64.h          revision: 1.13  date: 2006/04/12     

  uses: utility.h        revision: 1.43  date: 2006/04/12     

smartmontools release 5.36 dated 2006/04/12 at 17:39:01 UTC

smartmontools build host: i686-pc-linux-gnu

smartmontools build configured: 2006/12/29 22:07:18 UTC

smartctl compile dated Dec 29 2006 at 17:07:35

smartmontools configure arguments: '--prefix=/usr' '--host=i686-pc-linux-gnu' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--datadir=/usr/share' '--sysconfdir=/etc' '--localstatedir=/var/lib' '--build=i686-pc-linux-gnu' 'CFLAGS=-march=athlon-xp -O2 -pipe -fomit-frame-pointer' 'build_alias=i686-pc-linux-gnu' 'host_alias=i686-pc-linux-gnu'

```

The smartd daemon is running as well.

Please let me know what other information I may provide to help determine the real cause.

Thanks,

-Allan

Last edited by AllanS on Wed May 21, 2008 11:18 pm; edited 1 time in total

----------

## SoylentGreen

well, if you say the problem happens when you execute your script.. what is the code of that script?

btw.. you don't need a script to check the temperature, just emerge hddtemp  :Wink: 

```

hddtemp /dev/sda /dev/sdb

/dev/sda: SAMSUNG SP2504C: 37 C

/dev/sdb: SAMSUNG HD400LJ: 39 C

```

 *Quote:*   

> What I found weird is the sector number identified in the error message - it is way past the end of my disk drive

did you resize the partition(s) on that drive? or did you create partitions using windows?

what do you get if you fsck all partitions of that drive?

----------

## AllanS

The disk (hdb) has one partition and was created using Linux utils. The one partition (hdb1) is formatted with reiserfs 3.6. The disk has never been repartitioned or resized. The disk is my "backup" disk for other computers (using BackupPC). The script uses "smartctl --all" and is:

```
#!/bin/bash
export PATH=$PATH:/usr/sbin:/bin

# Name of file to contain collected data for future analysis
# Format: "date time-TZ disk tempC"
DB=/root/disktempDB
DATE=$(date --rfc-3339=seconds)

#set -x
for i in hda hdb
do
        # SMART attribute 194 is the temperature; field 10 is its raw value
        temp=$(smartctl --all "/dev/$i" | grep ^194 | awk '{print $10}')
        echo "$DATE /dev/$i $temp" >> "$DB.$i.txt"
        # Skip the comparison if smartctl returned no temperature
        if [ -n "$temp" ] && [ "$temp" -gt 40 ]
        then
                tempF=$(( temp * 9 / 5 + 32 ))
                echo "/dev/$i is too hot at $temp C or $tempF F" | mail -s "Disk Temp $i" me@example.com
        fi
done

```

Didn't know about hddtemp - thanks!

I'm wondering if there is an occasional bug in smartmontools where it doesn't fill in a parameter (because it is supposed to be unused) but the disk reacts badly to it. I don't know - just grasping at straws.

----------

## SoylentGreen

dunno much about awk, sorry. but:

if smartmontools behaves fine when run on the cmdline (smartctl that is) then it must be something in your script?   :Shocked: 

hmm.. just guessing..

btw.. "mail -s" ? have a look at "man smartd", it is able to send emails by itself  :Wink: 

//btw: is your smartd conf OK?

----------

## AllanS

I do not know whether smartmontools runs fine from the command line, only that the failure occurs when smartctl --all is run frequently.

The output of smartctl --all is very complete, but I only wanted the temperature. The "grep" only looks for the line that starts with "194" which contains the temperature. The "awk" part prints out the tenth token (the actual temperature). The net result is that the temperature for the disk is captured in the variable "temp" for use by the script to see if it exceeded my temperature limit (40 degrees C). I send it to me using "mail" so that I can see it.

I do know that smartd can send email and it is currently configured to do so. However, that only happens if there is an error or a parameter is exceeded. Since the operating temperature of this disk is supposed to be up to 55 C, I will not get an error message from smartd until the disk is at that point. I wanted notification at a lower point because I had some concern about my case. However, it looks like the airflow and power supply are fine. So I am looking at other causes for this problem.

The script is fine. What is different with this script vs. the command line is the frequency of use (every 10 minutes). So I believe there is some race condition or other side effect where running this script frequently causes the disk to lock up.

Later today or tomorrow I may be brave enough to run a script in an infinite loop that just does smartctl over and over again to see if it can provoke this problem. Does anyone know of what else should be recorded during a test like this that will help isolate the cause of the problem?

Or is this an issue where smartd and smartctl are both sending commands and stepping on each other?
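For anyone wanting to try the stress loop described above, here is a minimal sketch. The function name `run_repeated` and its arguments are made up for illustration; it runs a bounded number of iterations rather than a true infinite loop so that the test ends on its own, and it records a timestamp and the exit status of every run so the last successful call can be matched against the kernel log.

```bash
#!/bin/bash
# run_repeated COUNT DELAY CMD [ARGS...]
# Hypothetical helper: run CMD COUNT times, DELAY seconds apart, and print
# one timestamped line per run with the command's exit status.
run_repeated() {
    local count=$1 delay=$2
    shift 2
    local n status
    for (( n = 1; n <= count; n++ )); do
        "$@" > /dev/null 2>&1
        status=$?
        echo "$(date --rfc-3339=seconds) run=$n exit=$status"
        sleep "$delay"
    done
}
```

A call such as `run_repeated 500 2 smartctl --all /dev/hdb >> /root/smartctl-stress.log` would do 500 runs two seconds apart; the count, delay and log path are arbitrary choices. The log should live on a disk other than hdb, since anything written to hdb may never reach the platters once it locks up.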

----------

## AllanS

Sorry, meant to post my smartd.conf:

```

/dev/hda -a -d ata -o on -S on -s (S/../.././06|L/../../0/07) -m me@example.com

/dev/hdb -a -d ata -o on -S on -s (S/../.././05|L/../../0/06) -m me@example.com

```

----------

## AllanS

I do not think smartd was interfering with smartctl. Looking at the logs, the smartd daemon checks the disks every 1800 seconds (30 minutes). The timestamp is consistently at xx:18:39 or xx:48:39 when a message is logged (the initial log message during boot where it fork()'ed is at 13:18:39). The last failure I had was at 22:15:09 which is when smartd would not be accessing the disk. Also, no self-tests were scheduled at this time.

It looks like it is solely associated with smartctl. Of course, the "--all" parameter means that several different ioctl() calls are made back-to-back. This normally should not be a problem, but maybe my Western Digital disk does not like such activity?
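The timing argument above can be checked with a few lines of shell. This is only a sketch: the helper names `to_secs` and `secs_to_next_poll` are made up, and it assumes smartd really polls at a fixed 1800-second interval anchored at the 13:18:39 start time quoted from the log. For the failure at 22:15:09 it gives 210 seconds, meaning the next smartd poll was still three and a half minutes away.

```bash
#!/bin/bash
# to_secs HH:MM:SS -> seconds since midnight.
# The 10# prefix stops bash from reading "08" or "09" as invalid octal.
to_secs() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# secs_to_next_poll EVENT POLL INTERVAL
# Seconds from EVENT until the next poll, given one known poll time POLL
# and a fixed INTERVAL in seconds.
secs_to_next_poll() {
    local event poll interval=$3
    event=$(to_secs "$1")
    poll=$(to_secs "$2")
    # The double modulo keeps the result non-negative.
    echo $(( ( (poll - event) % interval + interval ) % interval ))
}

secs_to_next_poll 22:15:09 13:18:39 1800
```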

----------

## SoylentGreen

just again to make this clear:

if you run smartctl from the commandline everything is fine,

and if you run it from your bash script it is not

is this really the case?

----------

## AllanS

I seldom if ever run smartctl from the command line. I mostly run the smartctl from the batch file via CRON. Something like 0.01% command line and 99.99% batch file via CRON. In the batch file via CRON the smartctl program works 99.99% of the time. When it fails the disk locks up.

At least, those are the symptoms. The actual cause is what we are chasing.

I do not think this has to do with command line vs. batch file unless there are some side effects of environment variables that I do not know about. CRON typically runs in a minimal environment (stripped down PATH, minimal environment variables, etc.). If smartctl depends on some environment variables or changes its behavior because of them, then that might affect the whole thing.
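One way to test the environment theory is to run the script by hand with a stripped-down environment. A rough sketch follows; the helper name `run_like_cron` is made up, and the variable values are assumptions about what cron sets, so check crontab(5) and /etc/crontab on the system for the real ones.

```bash
#!/bin/bash
# run_like_cron CMD [ARGS...]
# Hypothetical helper: run CMD with an almost empty environment, similar to
# what a cron job sees. env -i discards every inherited variable, and only
# the ones listed here are passed on.
run_like_cron() {
    env -i HOME=/root LOGNAME=root SHELL=/bin/sh PATH=/usr/bin:/bin "$@"
}
```

Running `run_like_cron /root/monDisks.sh` from an interactive root shell would then show whether the script behaves any differently from a normal command-line run.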

----------

## SoylentGreen

 *AllanS wrote:*   

> I seldom if ever run smartctl from the command line. I mostly run the smartctl from the batch file via CRON.

 

that was not my question.   :Shocked: 

you are (obviously) running smartd as a daemon, and additionally you are running a bash script via cron, right?

that would be all wrong, because smartd (depending on your parameters) does all that by itself.

i fear it's called twice. maybe even at the same time?

----------

## AllanS

I agree that smartd is running and that I am calling smartctl from CRON.

I also agree that it could be "doing the same thing twice" and "at the same time". I think that could be a problem (with the program, not its use).

The smartd is monitoring the system for conditions that exceed specific failure levels. If you remember, I started doing this to monitor the disk temperatures because I thought there was a problem with excessive heat. While that is not the root problem, until you suggested hddtemp I had no other way to monitor the disks' temperature.

So I needed both uses (smartd for failure / pre-failure monitoring and smartctl for temperature monitoring in general) to make sure the system was OK.

So, my current thesis is that smartmontools is causing a problem with the disk that makes it lock up. The only justification is the coincidence that the disk locked up 8 seconds after "smartctl --all" was called. What makes this more convincing is that the previous time this happened, my script called "smartctl --all" and the disk locked up 12 seconds later. While I admit it is a coincidence, twice in a row the same action preceded the disk lockup.

It tends to make one wonder what's going on   :Wink: 

The smartd and smartctl programs use ioctl() to access information from the disk. There should be nothing that these ioctl() should do that places the disk in a failure condition (or else why are we running these programs?). From the logs that I have I do not believe both smartd and smartctl were trying ioctl() functions at the same time. So either there is an unknown problem with smartctl, or my disk is truly failing but just hasn't gone all the way, or there is something else completely different going on here.

Any ideas?

BTW, I do appreciate your assistance. Hopefully we can figure this out.

----------

## devsk

 *Quote:*   

> If you remember, I started doing this to monitor the disk temperatures

 please use hddtemp and report back if the problem stays. It uses a single ioctl and asks for the vendor-specific attribute (usually 194, but it varies), which you need to configure in its database. If you are lucky, your drive will already be in the db; otherwise do a 'smartctl -a' once and put that information in the database. One ioctl by hddtemp is much lighter than '-a' with smartctl. It is highly likely that your problem will not be seen again.

nice thing with hddtemp is that gkrellm2 automatically picks up hddtemp temps and shows them.
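As a sketch of that suggestion, the temperature logger from earlier in the thread could call hddtemp instead of smartctl. The function names are made up for illustration, the log path and mail address are the ones from the original script, and `hddtemp -n` is used because it prints only the number. An empty or non-numeric reading is logged but never treated as "too hot".

```bash
#!/bin/bash
# c_to_f CELSIUS -> Fahrenheit, integer arithmetic as in the original script.
c_to_f() { echo $(( $1 * 9 / 5 + 32 )); }

# is_too_hot TEMP LIMIT: succeed only if TEMP is a whole number above LIMIT.
# An empty or non-numeric reading does not count as hot.
is_too_hot() {
    [[ $1 =~ ^[0-9]+$ ]] && [ "$1" -gt "$2" ]
}

# log_temp DISK LIMIT: append one reading to the per-disk log and send mail
# if the reading is over LIMIT.
log_temp() {
    local disk=$1 limit=$2 temp
    temp=$(hddtemp -n "/dev/$disk" 2>/dev/null)
    echo "$(date --rfc-3339=seconds) /dev/$disk $temp" >> "/root/disktempDB.$disk.txt"
    if is_too_hot "$temp" "$limit"; then
        echo "/dev/$disk is too hot at $temp C or $(c_to_f "$temp") F" |
            mail -s "Disk Temp $disk" me@example.com
    fi
}
```

A cron job would then only need `for d in hda hdb; do log_temp "$d" 40; done`.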

----------

## AllanS

I have switched to hddtemp and also cut back the monitoring to once per hour. If it happens again, I shall let you know.

Thanks,

-Allan

----------

## SoylentGreen

 *AllanS wrote:*   

> 
> 
> So I needed both uses (smartd for failure / pre-failure monitoring and smartctl for temperature monitoring in general) to make sure the system was OK.
> 
> 

 

no. that script you wrote is redundant, sorry. what about having a peek at "man smartd.conf"?

```

To track temperature changes of at least 2 degrees, use:

    -W 2

To log informal messages on temperatures of at least 40 degrees, use:

    -W 0,40

For warning messages/mails on temperatures of at least 45 degrees, use:

    -W 0,0,45

To combine all of the above reports, use:

    -W 2,40,45

```

 :Razz: 

----------

## AllanS

Slick!

 :Smile: 

----------

## SoylentGreen

 *AllanS wrote:*   

> Slick!
> 
> 

 

ROFL  :Smile: 

yeah, as i understand it, smartctl is just a commandline tool that doesn't need smartd at all. so to speak: it doesn't need smartd to be running.

so there actually could be a problem running both at the same time (well, it probably depends on what smartd is doing just as your smartctl accesses the drive as well).

but again - i can only guess that. OTOH:

it leaves a bad smell that smartctl is displaying errors on sectors your drive doesn't actually have (beyond the end of your drive).

if i were you that would puzzle me most. hmm.. if you fire up cfdisk, does that one display the size of disk and partitions correctly?

and, you did fsck the drive from a bootcd and no errors are shown with your filesystem?

anyway, while we are at this topic, something what is puzzleing me is:

```

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline       Completed without error       00%      3636         -

# 2  Extended offline    Interrupted (host reset)      10%      3587         -

# 3  Extended offline    Completed without error       00%      3545         -

# 4  Extended offline    Interrupted (host reset)      10%      3543         -

# 5  Extended offline    Interrupted (host reset)      10%      3543         -

# 6  Extended offline    Completed without error       00%      3542         -

# 7  Short offline       Completed without error       00%      2551         -

# 8  Short offline       Completed without error       00%      1848         -

# 9  Short offline       Completed without error       00%      1830         -

#10  Extended offline    Completed without error       00%      1443         -

#11  Short offline       Completed without error       00%       582         -

#12  Extended offline    Completed without error       00%       245         -

#13  Extended offline    Completed without error       00%       159         -

#14  Short offline       Completed without error       00%       138         -

#15  Short offline       Completed without error       00%        32         -

#16  Extended offline    Interrupted (host reset)      70%        32         -

```

where is the above data *stored*   :Shocked: 

yes, i looked hard, but it's not in a file somewhere on the disk. is it somewhere in the data area of the hdd?

how large will the above information grow?

well, i am planning to run a short selftest on a daily basis, but wondering if i will run out of space in the data area. or if i waste space that could better be used by the drive management itself, storing bad blocks or whatever?

just wondering. the docs of smartd/smartctl tell nothing about this.

----------

## eccerr0r

Not sure if this will help, but are there other issues, like RAM corruption at work?

Does the disk work fine when you change disk controllers?

I have this one disk (actually, more than one, but not nearly as bad) that when put as slave on a Promise Ultra 66 would emit a lot of IDE UDMA errors but works perfectly fine as single/master.  Probably not exactly the same problem, but, might help debugging.

----------

## SoylentGreen

 *eccerr0r wrote:*   

> 
> 
> I have this one disk (actually, more than one, but not nearly as bad) that when put as slave on a Promise Ultra 66 would emit a lot of IDE UDMA errors but works perfectly fine as single/master.  Probably not exactly the same problem, but, might help debugging.

 

good shot eccerr0r!

yes, there are IDE/ATA devices that don't work well as slave, i only know this from CD-ROMs, though..

maybe his maxtor 120GB is causing the trouble on the IDE bus.. that one seems to be his oldest drive.

OTOH, that still doesn't explain the errors beyond the end of the device, does it?

i am almost sure the partition table is not correct.

----------

## AllanS

Much is in my original post.

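The data is stored by the drive itself, I believe, in non-volatile storage managed by its firmware (it survives power cycles, as the old entries in the log show). None of it is in the area of the disk that holds the filesystem.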

The partition table is fine.

```
Command (m for help): p

Disk /dev/hdb: 300.0 GB, 300069052416 bytes

255 heads, 63 sectors/track, 36481 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

/dev/hdb1               1       36481   293033601   83  Linux

```

The Maxtor is the master and is older.

I wanted to eliminate RAM as a problem so I ran extensive self-tests and they all passed. Not that it rules out memory as a problem   :Wink: 

The disk controller is on the MOBO, so it cannot be changed out.

I originally had this disk cabled with the DVD/CD as master, and it was not working very well, so I made it slave with the MAXTOR as master.

I agree, smartctl should not be reporting sectors / blocks past the end of my drive. It is possible that the disk would normally accept the commands that smartctl sends to it but punts when it sees a bad sector number even if the command doesn't involve a sector (e.g. seek).

As I said in the original post I fsck'd the disk from a Linux install CD (2006.1). The CD also has smartctl so I could check for SMART errors and there were none.

The last thing is that while smartd can be setup to send me warning messages, I also wanted to collect the data to see what trends there are. In the mornings (when backups are run) the disks' temperatures rise noticeably (but usually not above 40 C). Having a script tracking the temperature let me see when the "hot" times were. Smartd cannot do that (or did I miss that section in the manual?).

Bottom line: it seems smartctl can sometimes cause a problem with my disk. It shouldn't. Either it is a disk problem or a smartctl problem. I posted to their lists but they have not responded (yet?), so I thought people here might.
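To illustrate the trend tracking described in this post, here is a small sketch that summarizes one of the per-disk log files by hour of the day. The function name `max_temp_by_hour` is made up, and it assumes the line format written by the script earlier in the thread (date, time with timezone, device, temperature). Lines without a numeric temperature are skipped.

```bash
#!/bin/bash
# max_temp_by_hour FILE
# Print the highest logged temperature for each hour of the day.
# Assumed line format, as written by the logging script:
#   YYYY-MM-DD HH:MM:SS+TZ /dev/hdX TEMP
max_temp_by_hour() {
    awk '$4 ~ /^[0-9]+$/ {
             hour = substr($2, 1, 2)
             if (!(hour in max) || $4 + 0 > max[hour]) max[hour] = $4 + 0
         }
         END { for (h in max) print h, max[h] }' "$1" | sort
}
```

For example, `max_temp_by_hour /root/disktempDB.hdb.txt` prints one line per hour with the peak reading, which makes the morning backup window easy to spot.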

----------

## SoylentGreen

excuse me, but are you on dope or what? <lol>

 *AllanS wrote:*   

> 
> 
> Having a script tracking the temperature let me see when the "hot" times were. Smartd cannot do that (or did I miss that section in the manual?).
> 
> 

 

how often do i have to repost this:

```

To track temperature changes of at least 2 degrees, use:

    -W 2

To log informal messages on temperatures of at least 40 degrees, use:

    -W 0,40

For warning messages/mails on temperatures of at least 45 degrees, use:

    -W 0,0,45

To combine all of the above reports, use:

    -W 2,40,45

```

 :Shocked:   :Shocked:   :Shocked:   :Shocked: 

above is an excerpt of "man smartd.conf" and contains everything you need!

it will *email* you every temperature change of your drive! *every time* it changes!!

you even called that "slick!" in your very own reply   :Shocked: 

what are you missing? what am i missing?

you get an email as soon as temperature changes, what else do you need?

sorry, this is beyond my scope.

OTOH: the temperature has absolutely *nothing* to do with smartctl finding bad sectors out of range. <- this is the problem where you should have a look at.

----------

## AllanS

I understood your message, but I wanted a recording of temperature every X minutes, not a message when it exceeded a temperature at a point in time.

It is "slick" that smartd will email things that cross boundaries that are not failures. But parsing email messages about temperatures that exceed particular set points is not the same as recording temperatures every X minutes. I did understand, but it did not match my needs.

I do agree that the most obvious problem is smartctl. It is doing something wrong when it can provoke a disk failure as it seems to do. A piece of data is that the sector is outside the legal range. I find this suspect and a point where the investigation should continue. Unfortunately, I am not a smartd expert, only an informed novice. So while I can point out obvious potential problems, I cannot (yet) find the solution. I don't have that type of bandwidth.

I am hoping that the smartmontools people are watching and can suggest a direction to investigate this problem and possibly pose a solution. Otherwise I must decode the kernel code for IDE ioctl() to understand what the kernel is doing, then try to understand what the IDE Controller (firmware) is doing, and match up the symptoms to make sense. I know I can do this, I just don't have the time (kids and mortgages take precedence).

So, can you directly help with smartmontools utilities?

Please understand, I am not trying to be critical of your assistance. But you yourself have said that the problem is smartctl, not how I monitor the temperatures  :Smile: 

----------

## danja

 *SoylentGreen wrote:*   

> excuse me, but are you on dope or what? <lol>
> 
> how often do i have to repost this:
> 
> ```
> ...

 

hi there,

i'm staring at man smartd.conf (5.36) and see NO -W option whatsoever.

Not that I need it, just wondering.

----------

## SoylentGreen

using 5.37 here (~arch)
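Putting the two pieces of this thread together, here is a sketch of what the hdb entry might look like with temperature tracking added. It simply appends the `-W 2,40,45` example from the quoted man page excerpt to the /dev/hdb line posted earlier. It is untested, and it assumes a smartmontools version whose smartd.conf man page documents `-W` (reported in this thread as present in 5.37 and absent in 5.36).

```
/dev/hdb -a -d ata -o on -S on -s (S/../.././05|L/../../0/06) -W 2,40,45 -m me@example.com
```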

----------

## AllanS

I believe I have found the solution to this problem and it was not smart monitoring. My Ethernet card (PCI 3COM 3C905C) was in slot 5 of my chassis. That slot shares the interrupt with my IDE-based disk drives. It is interfering with my disks.

I was copying a friend's computer (disk to my disk over the network using dd & ssh and then back again) and could reliably crash my system. I had previously thought it might be my graphics card, but it is on a different slot and IRQ. I didn't think the Ethernet card did enough to cause any damage. I was wrong.

I have moved the card to slot 2 which shares the interrupt with the sound card. Since then my system has been solid (knock on wood   :Smile:  )

So, to share the knowledge: on PCI-based systems, keep the IRQs for IDE disk drives unique. Do not share them with any other hardware if you can avoid it.
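A quick way to look for this kind of sharing is to read /proc/interrupts, where devices that share an IRQ appear on the same line separated by a comma and a space. The sketch below is only an illustration: the function name `shared_irqs` is made up, and it relies on that comma-separated layout.

```bash
#!/bin/bash
# shared_irqs [FILE]
# Print the lines of an interrupts table where more than one device is
# listed on the same numbered IRQ. Reads /proc/interrupts by default.
shared_irqs() {
    grep -E '^ *[0-9]+:.*, ' "${1:-/proc/interrupts}"
}
```

Calling `shared_irqs` with no argument lists every shared IRQ on the running system; a line that names both an IDE channel and another card would be the situation described in this post.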

----------

