# Several Drives - One of them Making Noise (replaced)

## Bigun

I believe I have a drive that has been on the road to failure for quite a while.  Lately the noises (loud clicking) have become more and more frequent.  I haven't seen anything in the dmesg output that would tell me which drive it is or what is going wrong.

The problem is that I have 4 drives in this machine: three in a software RAID array, plus one OS drive.  I think it may be the OS drive, but I have no way to prove which one it is.  As I said, dmesg doesn't show anything, and /proc/mdstat shows all is well.  Is there anything else I can check?

----------

## eccerr0r

Backup first.

Can you shut down the machine?  That would let you do some single-drive tests...

If not, post the SMART data.  If the drive really is having issues, they should show up in its SMART logs even if nothing fatal has been reported to the OS.

----------

## Bigun

I have several spares.  My only fear is that I want to move the OS drive from PATA to SATA, and I'm afraid that will change the order of my SATA drives and break my array.

And yes, I can shut it down, I don't think it's bad off enough to fail just yet.

----------

## idella4

Bigun,

emerge a package called smartmontools.  I only know about it.  I don't know how to use it.  It's very technical.

It installs smartd and smartctl.  You can use these to check the state of a drive (SMART data is per drive, not per partition).

Read man smartctl.  It will at least isolate a troublesome hardware device, though the output may be quite technical.

----------

## Bigun

Looks like my guess is right:

```
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax 80 family
Device Model:     Maxtor 98196H8
Serial Number:    V805FRAC
Firmware Version: ZAH814Y0
User Capacity:    81,964,302,336 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 0
Local Time is:    Sun Jul 25 08:47:49 2010 Local time zone must be set--see zic m
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  30) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   253   252   000    Old_age   Always       -       563
  3 Spin_Up_Time            0x0027   193   186   063    Pre-fail  Always       -       15996
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       1572
  5 Reallocated_Sector_Ct   0x0033   116   116   063    Pre-fail  Always       -       345
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   251   250   000    Old_age   Always       -       4
  8 Seek_Time_Performance   0x0027   252   244   187    Pre-fail  Always       -       48448
  9 Power_On_Minutes        0x0032   229   229   000    Old_age   Always       -       687h+49m
 10 Spin_Retry_Count        0x002b   246   219   223    Pre-fail  Always   In_the_past 7
 11 Calibration_Retry_Count 0x002b   253   249   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   247   247   000    Old_age   Always       -       2423
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   251   250   000    Old_age   Offline      -       2
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   140   000    Old_age   Offline      -       58
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   204   000    Old_age   Always       -       193
202 TA_Increase_Count       0x000a   001   001   000    Old_age   Always       -       280
203 Run_Out_Cancel          0x000b   235   235   180    Pre-fail  Always       -       9208
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   248   230   000    Old_age   Always       -       5
208 Spin_Buzz               0x002a   252   242   000    Old_age   Always       -       1
209 Offline_Seek_Performnce 0x0024   253   253   000    Old_age   Offline      -       0
 96 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
 97 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
 98 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
Warning: ATA error count 11856 inconsistent with error log pointer 5

ATA Error Count: 11856 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 11856 occurred at disk power-on lifetime: 7662 hours (319 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 05 01 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 05 01 00 00 a0 08      03:13:58.560  IDENTIFY PACKET DEVICE
  ca 00 08 81 a7 92 e7 08      03:13:50.464  WRITE DMA
  ca 00 08 09 c9 81 e7 08      03:13:50.464  WRITE DMA
  ca 00 10 19 39 7e e7 08      03:13:50.448  WRITE DMA
  ca 00 01 12 39 7e e7 08      03:13:50.432  WRITE DMA

Error 11855 occurred at disk power-on lifetime: 7662 hours (319 days + 6 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 05 01 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 05 01 00 00 a0 08      03:02:56.560  IDENTIFY PACKET DEVICE
  ca 00 10 49 64 59 e1 08      03:02:48.448  WRITE DMA
  ca 00 02 42 64 59 e1 08      03:02:48.432  WRITE DMA
  ca 00 10 19 39 7e e7 08      03:02:48.432  WRITE DMA
  ca 00 01 12 39 7e e7 08      03:02:48.416  WRITE DMA

Error 11854 occurred at disk power-on lifetime: 7661 hours (319 days + 5 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 01 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 01 01 00 00 a0 00      00:06:41.376  IDENTIFY PACKET DEVICE

Error 11853 occurred at disk power-on lifetime: 7661 hours (319 days + 5 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 05 01 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 05 01 00 00 a0 08  17d+11:18:23.344  IDENTIFY PACKET DEVICE
  ca 00 01 41 64 59 e1 08  17d+11:18:15.296  WRITE DMA
  ca 00 01 41 64 59 e1 08  17d+11:18:15.296  WRITE DMA
  c8 00 80 41 64 59 e1 08  17d+11:18:15.280  READ DMA
  c8 00 80 c1 64 59 e1 08  17d+11:18:15.280  READ DMA

Error 11852 occurred at disk power-on lifetime: 7513 hours (313 days + 1 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 01 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 01 01 00 00 a0 00      00:06:39.664  IDENTIFY PACKET DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       147         -
# 2  Short offline       Completed: read failure       60%        76         4296454

Device does not support Selective Self Tests/Logging
```

I'm no expert, but it looks like the spindle is failing to spin up part of the time, and this is the OS drive.
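For readers who find the attribute table hard to scan, here is a minimal sketch that picks out the rows backing that reading. It assumes the ten-column layout that `smartctl -A` prints in the version shown above. `smart_flag` is a made-up helper name, and treating any non-zero raw value of Reallocated_Sector_Ct or Current_Pending_Sector as worth a look is a rule of thumb, not something smartctl reports itself.

```shell
#!/bin/sh
# smart_flag: hypothetical helper. Reads `smartctl -A` style output on stdin
# and prints the attributes that look worrying.
# Assumed columns:
#   ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_flag() {
    awk '
        # attribute rows: numeric ID, hex FLAG, at least ten fields
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-fA-F]+$/ && NF >= 10 {
            name = $2; worst = $5 + 0; thresh = $6 + 0; failed = $9; raw = $10
            if (failed != "-")
                # the drive itself recorded a threshold crossing
                printf "%s: threshold crossed (WHEN_FAILED=%s, worst=%d, thresh=%d)\n", name, failed, worst, thresh
            else if ((name == "Reallocated_Sector_Ct" || name == "Current_Pending_Sector") && raw + 0 > 0)
                # rule of thumb: any remapped or pending sectors deserve attention
                printf "%s: raw count %s\n", name, raw
        }
    '
}
```

Run as something like `smartctl -A /dev/hda | smart_flag`. On the table in this post it would list Spin_Retry_Count as having crossed its threshold in the past, plus the raw counts for Reallocated_Sector_Ct (345) and Current_Pending_Sector (2).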

Now, on to the next issue: can drives be added to a software array by UUID rather than by device name?  I'm afraid that when I replace this PATA drive with a SATA drive, it will knock the 3 drives I have in an array out of order.

----------

## idella4

I see smartctl did the job.  I'm not sure, but I think selection by UUID is a grub-directed issue.  Read the grub man pages.

----------

## dmpogo

If it is just spin-up, it may be a power supply issue.

----------

## eccerr0r

It should be OK to refer to them by device name; the array's superblock is recorded on the RAID partitions themselves...

Since you now know which drive is failing, and you have an (ugh) mixed system, what you should do is:

1. Shut down the machine (if the disks are not hot-pluggable).
2. Attach the new disk to the system.
3. Restart the machine (if you shut it down).
4. Copy the partition table to the new disk, or create it there.
5. mdadm --fail each of the bad disk's partitions (you have to specify the array/uuid).
6. mdadm --remove the bad disk's partitions (same).
7. mdadm --add the new disk's partitions to their respective arrays (at this point the uuid of the array will be marked on the new disk).

*At no time should you use --build, --create, --force, or --update. If you think you need to, step back and check the docs again; you should not need any of these options!*

At this point it should start rebuilding the array (which will take a while).  After the rebuild completes, remove the old disk from the system (once again, if it is not hot-pluggable, you need to shut down).

The question now is if the array will still autoassemble.  If they're all the same driver (like libata versus legacy ide) it should... but I'd be kind of wary that a mixed system may not reassemble automatically unless you start up the array via uuid and all disks are detected when the assembly occurs.  This is only an issue at boot time with kernel autodetect; if you assemble "manually" (via mdadm versus letting kernel autodetect) it shouldn't be an issue.
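The fail/remove/add sequence described in this post can be sketched as a small script. This is only an illustration under assumptions: `replace_member` and `run` are hypothetical helper names, the device names in the usage line are placeholders, and by default the script only prints the commands instead of running them. It relies on mdadm's manage-mode form, `mdadm <array> --fail|--remove|--add <partition>`.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

# run: hypothetical wrapper that either echoes or executes a command
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# replace_member ARRAY OLD_PARTITION NEW_PARTITION
# Hypothetical helper: swaps one member partition of an md array.
replace_member() {
    md=$1
    old=$2
    new=$3
    run mdadm "$md" --fail "$old" || return 1     # mark the old partition faulty
    run mdadm "$md" --remove "$old" || return 1   # detach it from the array
    run mdadm "$md" --add "$new"                  # add the replacement; the rebuild starts
}
```

For example, `replace_member /dev/md0 /dev/sdb1 /dev/sde1` prints the three mdadm commands for review. Running it for real would need `DRY_RUN=0` and root, and the rebuild progress can then be followed in /proc/mdstat.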

----------

## devsk

Make backup first. Like NOW!

Do you have all four drives on the same rail from the power supply? Make sure to distribute the drives across different rails (or at least put them on separate leads). A drive can draw an order of magnitude more current at spin-up than in normal operation, and this is particularly true for older drives.

If the clicking persists after that, I would boot the system from a LiveCD and check each drive for clicking by powering them up one at a time and putting a light random read load on them (like 'dd' reads from random places on the disk in a couple of windows), while also looking at the SMART data.
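A rough sketch of the "light random read load" idea follows. `random_reads` is a hypothetical helper, not an existing tool. It only reads and discards data, takes the target size in 1 MiB blocks as an argument (for a real disk that could be derived from `blockdev --getsize64`), and assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# random_reads TARGET TARGET_BLOCKS COUNT
# Hypothetical helper: reads COUNT blocks of 1 MiB from random offsets of
# TARGET and throws the data away. TARGET_BLOCKS is the size of TARGET in
# 1 MiB blocks.
random_reads() {
    target=$1
    target_blocks=$2
    count=$3
    [ "$target_blocks" -gt 0 ] || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # four random bytes as an unsigned integer, reduced to a block index
        r=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
        skip=$((r % target_blocks))
        dd if="$target" of=/dev/null bs=1048576 skip="$skip" count=1 2>/dev/null || return 1
        i=$((i + 1))
    done
    echo "read $count blocks from $target"
}
```

With one drive powered at a time, something like `random_reads /dev/sda 70000 200` (both numbers are placeholders) keeps the heads seeking while you listen for the clicking.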

----------

## Bigun

 *eccerr0r wrote:*   

> ..Said stuff about rebuilding array..

 

This is the OS drive, it isn't in an array.

----------

## eccerr0r

 *Bigun wrote:*   

>  *eccerr0r wrote:*   ..Said stuff about rebuilding array.. 
> 
> This is the OS drive, it isn't in an array.

 

Make a new post about it.

Once again, I dislike non-identical drive setups, even though they are supported.  And yes, mdadm should support handling by UUID, but you'd be dealing with the /dev/mdX device anyway.  It should handle growing to a non-identical subsystem just fine.

And just in case it wasn't clear from the many people who have said this (regardless of whether it was about a failing disk or array growth):

MAKE A BACKUP NOW.

----------

## Bigun

I got the message loud and clear.  Doing this right now:

```
dd bs=1M if=/dev/hda of=/dev/sda
```

It's 80 GB, so I'll probably update tomorrow.
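One caveat when cloning a drive that already shows read failures: plain dd stops at the first unreadable block. A hedged variant uses dd's standard `conv=noerror,sync` operands so the copy keeps going. `clone_disk` is a hypothetical wrapper name, and the 1 MiB block size is just carried over from the command above.

```shell
#!/bin/sh
# clone_disk SRC DST
# Hypothetical wrapper around dd for copying a possibly failing disk.
#   noerror: continue after a read error instead of aborting
#   sync:    pad short or failed reads with zeros, so data after a bad spot
#            stays at the same offset on the destination
# Trade-off: with a 1 MiB block size one bad sector zero-fills the whole
# block, and a final partial block is padded to the full block size.
clone_disk() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=1048576 conv=noerror,sync
}
```

Usage would be `clone_disk /dev/hda /dev/sda`, with the same source and destination as in the post. A smaller block size loses less data around each bad sector, at the cost of speed.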

----------

## paul138

It's called The Tick of Death, commonly associated here with Seagate drives  :Wink:  Regardless, I hope your recovery went well. I hate when I get a tick.

----------

## Bigun

Drive is copied and I have the new one in.

It looks like the new drive pushed all drive letters up one.  Luckily my fstab looks filesystems up by UUID, and when dd finished the copy it looks like it took the UUIDs with it.
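For anyone unsure whether their own fstab would survive this kind of renumbering, here is a small sketch that lists the entries still tied to /dev/sdX or /dev/hdX names. `fstab_by_name` is a hypothetical helper, and it assumes the usual whitespace-separated fstab layout with the device in the first field.

```shell
#!/bin/sh
# fstab_by_name: hypothetical helper. Reads an fstab on stdin and prints
# "device mountpoint" for entries whose first field is a /dev/sdX or /dev/hdX
# style name, i.e. the ones that can break if the kernel renumbers the disks.
# UUID=, LABEL= and /dev/mdX entries are left alone.
fstab_by_name() {
    awk '
        $1 ~ /^#/ { next }    # skip comment lines
        NF >= 2 && $1 ~ /^\/dev\/[sh]d[a-z]+[0-9]*$/ { print $1, $2 }
    '
}
```

Run it as `fstab_by_name < /etc/fstab`. Empty output means no entry depends on the drive letters.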

Now my biggest fear is getting the array back up.

The three drives involved were:

```
/dev/sda
/dev/sdb
/dev/sdc
```

Now they are:

```
/dev/sdb
/dev/sdc
/dev/sdd
```

Will this be an issue?

----------

## Bigun

**FULL UPDATE**

All I had to do was install grub on the new drive and it just worked.  Having the fstab reference the UUIDs and having the RAID drives autodetected seemed to have worked beautifully.  No data loss, worked first try.
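To double-check that an autodetected array really came back with all of its members, /proc/mdstat can be scanned for a missing-member marker. This is a sketch under assumptions: `md_degraded` is a hypothetical helper, and it relies on the usual /proc/mdstat layout where each array's status line ends in something like `[3/3] [UUU]`, with `_` standing for a missing member.

```shell
#!/bin/sh
# md_degraded: hypothetical helper. Reads /proc/mdstat style text on stdin
# and prints "array status" for every array with a missing member.
md_degraded() {
    awk '
        # remember the array named on lines like "md0 : active raid5 ..."
        /^md[0-9]+ :/ { name = $1 }
        # the member status looks like [UUU]; an underscore marks a missing disk
        match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print name, status
        }
    '
}
```

Run it as `md_degraded < /proc/mdstat`. No output means every array lists all members as up.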

Now the only question I have is.... trash the original?  Or keep just in case?

----------

## devsk

 *Bigun wrote:*   

> **FULL UPDATE**
> 
> All I had to do was install grub on the new drive and it just worked.  Having the fstab reference the UUIDs and having the RAID drives autodetected seemed to have worked beautifully.  No data loss, worked first try.
> 
> Now the only question I have is.... trash the original?  Or keep just in case?

 RMA it?... :Very Happy: 

----------

## Bigun

 *devsk wrote:*   

> > *Bigun wrote:*
> >
> > **FULL UPDATE**
> >
> > All I had to do was install grub on the new drive and it just worked.  Having the fstab reference the UUIDs and having the RAID drives autodetected seemed to have worked beautifully.  No data loss, worked first try.
> >
> > Now the only question I have is.... trash the original?  Or keep just in case?
>
> RMA it?...

 

80Gb IDE?  Really?    :Laughing: 

----------

## devsk

 *Bigun wrote:*   

> > *devsk wrote:*
> >
> > > *Bigun wrote:*
> > >
> > > **FULL UPDATE**
> > >
> > > All I had to do was install grub on the new drive and it just worked.  Having the fstab reference the UUIDs and having the RAID drives autodetected seemed to have worked beautifully.  No data loss, worked first try.
> > >
> > > Now the only question I have is.... trash the original?  Or keep just in case?
> >
> > RMA it?...
>
> ...

 Why not? They will likely upgrade it to 160GB SATA...if it's still under warranty though... :Wink: 

----------

## lost+found

 *Bigun wrote:*   

> I believe I have a drive that has been on the road to failing for quite a while.  Here lately noises (loud clicking) have been becoming more and more frequent. ...

 

I recently upgraded to libata/pata_sis (kernel driver) plus some updates, and noticed my drive is reset at the end of every bootup * now. It looks like this in the logs: 

```
Jul 29 17:00:52 [kernel] ata1.00: configured for UDMA/100
Jul 29 17:00:52 [kernel] ata1.01: configured for UDMA/33
Jul 29 17:00:52 [kernel] ata1: EH complete
```

...besides DMA, the reset changes the drive's acoustic management setting back to the "loud and fast" default, which makes louder clicking noises. Are you sure your drive is failing? Maybe check this using hdparm...

```
# hdparm -M /dev/sda

/dev/sda:
 acoustic      =  0 (128=quiet ... 254=fast)

# hdparm -M128 /dev/sda

/dev/sda:
 setting acoustic management to 128
 acoustic      = 128 (128=quiet ... 254=fast)
```

"acoustic =  0" means acoustic management is off, which on my drive makes the same noise as the fast setting. 128 is quiet, which is what I was used to when using libide....

* UPDATE 31/07/10 13:53: Did some testing. It's KDM resetting my drive...  :Evil or Very Mad: 

----------

