# ouch, bad block in /var/tmp partition, recompile all?

## phsdv

 :Sad:  I decided to check my disk status and found that one disk has a bad sector.

```
# smartctl -l error /dev/hdg
Error 7 occurred at disk power-on lifetime: 4597 hours (191 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 60 1f 8b 63 e3  Error: UNC 96 sectors at LBA = 0x03638b1f = 56855327

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 da 70 0f 8b 63 e3 00      01:54:05.300  READ DMA
  c8 da 70 0f 8b 63 e3 00      01:54:00.200  READ DMA
  c8 da 08 07 8b 63 e3 00      01:54:00.200  READ DMA
  ca da 08 b7 28 b8 e2 00      01:54:00.200  WRITE DMA
  ca da 20 1f fa ca e1 00      01:54:00.200  WRITE DMA
```

Question: does this mean I have 96 bad sectors?
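Reading the register dump suggests the "96 sectors" is the SC (sector count) register of the failed command, not a count of bad sectors. A small Python sketch, assuming 28-bit LBA addressing (the low 4 bits of DH carry LBA bits 24-27), rebuilds the LBAs from the register columns above. It shows that the READ DMA just before the error asked for 0x70 = 112 sectors starting 16 sectors before the failing LBA, and 112 - 16 = 96. On that reading, 96 is how much of the read was still outstanding when the drive hit the first unreadable sector; how many of those sectors are actually bad cannot be told from this entry.

```python
def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA from the SN, CL, CH and DH task-file registers.

    Assumes LBA mode: the low 4 bits of DH hold LBA bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Error entry from the first post:  ER ST SC SN CL CH DH = 40 51 60 1f 8b 63 e3
ER, SC = 0x40, 0x60
failed_lba = lba28(0x1F, 0x8B, 0x63, 0xE3)

# Last READ DMA before the error: CR FR SC SN CL CH DH = c8 da 70 0f 8b 63 e3
read_count = 0x70
read_start = lba28(0x0F, 0x8B, 0x63, 0xE3)

done = failed_lba - read_start  # sectors transferred before the failure
print("failing LBA:", failed_lba)                              # 56855327
print("READ DMA:", read_count, "sectors from", read_start)     # 112 ... 56855311
print("read OK before error:", done)                           # 16
print("still outstanding:", read_count - done, "== SC", SC)    # 96 == SC 96
print("UNC (uncorrectable) bit set:", bool(ER & 0x40))         # True
```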

Disk /dev/hdg is part of a software RAID0 array. Checking its partition table:

```
# fdisk -lu /dev/hdg
Disk /dev/hdg: 30.7 GB, 30738677760 bytes
255 heads, 63 sectors/track, 3737 cylinders, total 60036480 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdg1              63    48837599    24418768+  fd  Linux raid autodetect
/dev/hdg2        48837600    58621184     4891792+  fd  Linux raid autodetect
/dev/hdg3        58621185    60034904      706860   82  Linux swap / Solaris
```

I found that the failing LBA (56855327) is located on hdg2.
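Finding the partition is plain interval arithmetic on the `fdisk -lu` numbers, which are in 512-byte sectors. A minimal Python sketch using the table above; the function name `locate` is just for illustration:

```python
# Partition table from `fdisk -lu /dev/hdg` (start/end in 512-byte sectors).
PARTITIONS = [
    ("/dev/hdg1", 63, 48837599),
    ("/dev/hdg2", 48837600, 58621184),
    ("/dev/hdg3", 58621185, 60034904),
]


def locate(lba: int):
    """Return (partition, sector offset inside it) for an absolute LBA, or None."""
    for name, start, end in PARTITIONS:
        if start <= lba <= end:
            return name, lba - start
    return None  # e.g. sectors 0-62 (MBR and gap) belong to no partition


print(locate(56855327))  # ('/dev/hdg2', 8017727)
```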

Checking with mdadm, I could see which md device uses it:

```
# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Sat Aug 20 08:54:45 2005
     Raid Level : raid0
     Array Size : 9783424 (9.33 GiB 10.02 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Aug 20 08:54:45 2005
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : 02883ddf:9cd95cab:38ba4952:33d755e9
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0      33        2        0      active sync   /dev/hde2
       1      34        2        1      active sync   /dev/hdg2
```

md1 is mounted on /var/tmp (output of mount):

```
/dev/md/1 on /var/tmp type reiserfs (rw,noatime)
```
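The sector offset inside hdg2 can be taken one step further, to a sector on /dev/md1 and a filesystem block. A Python sketch under stated assumptions: the 0.90 superblock sits at the end of each member, so array data starts at sector 0 of hdg2; both members have the same usable size, so raid0 uses a single zone in which 64K chunks simply alternate hde2, hdg2, hde2, ...; and the reiserfs block size is 4 KiB (an assumption, check it before relying on the number). The offset 8017727 is 56855327 minus hdg2's start sector 48837600.

```python
CHUNK_SECTORS = 64 * 1024 // 512   # "Chunk Size : 64K" -> 128 sectors
N_DEVICES = 2                      # "Raid Devices : 2"
FS_BLOCK_SECTORS = 4096 // 512     # ASSUMED reiserfs block size of 4 KiB


def member_to_array(member_sector: int, raid_device: int) -> int:
    """Map a sector on one raid0 member to a sector on the md device.

    Assumes one striping zone (equal-sized members) and member data starting
    at sector 0, as with 0.90 superblocks stored at the end of the member.
    """
    stripe, within = divmod(member_sector, CHUNK_SECTORS)
    array_chunk = stripe * N_DEVICES + raid_device
    return array_chunk * CHUNK_SECTORS + within


md_sector = member_to_array(8017727, raid_device=1)  # hdg2 is RaidDevice 1
fs_block = md_sector // FS_BLOCK_SECTORS
print(f"/dev/md1 sector {md_sector}, filesystem block {fs_block}")
# /dev/md1 sector 16035519, filesystem block 2004439
```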

So there is a bad sector (or even 96?) on /var/tmp. Does that mean that every package I have compiled since I set up /var/tmp is potentially corrupt?

To correct this, am I supposed to write zeros to these sectors? And will the disk then automatically reallocate them?

Or can I just add this block to the bad block list using reiserfstune?

Paul

Last edited by phsdv on Fri Oct 07, 2005 10:35 pm; edited 1 time in total

----------

## Marctraider

You don't need to recompile anything because something happened to /var/tmp, only if it happened while a compilation was in progress.

A bad sector can't just be overwritten with zeroes and then it's all fixed... bad sectors are normally physical damage on the platters of the drive...

Your only option is to try a low-level format (BE SURE THAT IT'S POSSIBLE AND THAT IT WON'T DESTROY YOUR DISK), but there is always a risk...

Once you get bad sectors, their number usually only increases over time... Your disk is just getting old or isn't in good health anymore; perhaps you should consider replacing the drive if the number of bad sectors keeps increasing.

Edit: a program called 'badblocks' (you probably already have it, since it most likely came with the Gentoo base system) can check again for bad blocks and report which sectors are bad   :Mad:  so just run badblocks /dev/hda (or whatever your drive is) and wait   :Laughing: 
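What a read-only pass (badblocks' default mode) boils down to can be sketched in Python: read the device front to back and record which 512-byte sectors fail with EIO. This is an illustration, not a replacement for badblocks. It reads through the page cache and only uses an advisory `posix_fadvise` to drop cached pages, whereas badblocks uses direct I/O unless `-B` is given. Sector numbers are relative to whatever path you pass (whole disk or partition), and the function name `read_scan` is hypothetical.

```python
import errno
import os
import sys

SECTOR = 512


def read_scan(path: str, chunk: int = 64 * 1024) -> list[int]:
    """Read `path` from start to end; return the 512-byte sector numbers
    (relative to the start of `path`) whose read failed with EIO."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        if hasattr(os, "posix_fadvise"):
            # Advisory only: drop cached pages so reads are more likely to hit the disk.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        for offset in range(0, size, chunk):
            length = min(chunk, size - offset)
            try:
                os.pread(fd, length, offset)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise
                # A large read failed: narrow it down one sector at a time.
                for pos in range(offset, offset + length, SECTOR):
                    try:
                        os.pread(fd, SECTOR, pos)
                    except OSError as sub:
                        if sub.errno != errno.EIO:
                            raise
                        bad.append(pos // SECTOR)
    finally:
        os.close(fd)
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    for sector in read_scan(sys.argv[1]):
        print(sector)
```

Run as root against a device, e.g. `python3 read_scan.py /dev/hdg`; no output means every read succeeded.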

----------

## nixnut

Moved from Gentoo Chat to Kernel & Hardware.

I'll assume this is on a Gentoo system. It's a hardware problem, so I moved it here.

----------

## phsdv

 *Marctraider wrote:*   

> You dont need to recompile your things if anything happens to /var/tmp, only when a compilation is in process.

 Yes, but the problem is that /var/tmp is used during compilation, and I have no way to verify whether the bad block was used during any of the compiles. If so, the data could be corrupted, right?

 *Marctraider wrote:*   

> A bad sector cannot just be overwritten by zeroes and its all fixed... bad sectors are normally physical damage problems on the platters of the drive...

  Yes, correct. But I understood that the drive's controller would know about a bad block, and when zeros are written to it, it can reallocate it to a spare block.

 *Marctraider wrote:*   

> Your only option is to try a low-level format (BE SURE THAT ITS POSSIBLE AND THAT IT WONT DESTROY YOUR DISK) but there is always a risk...

  I do not believe so. Low-level formats are only for very old drives, at least that is how I understand it.

 *Marctraider wrote:*   

> Edit: a program called 'badblocks' (you probably already have that because it most likely came with the base system of gentoo) can check again to see if there are any bad blocks, and it will report what sectors are bad   so just run badblocks /dev/hda or whatever your drive is and wait  

 Thanks for that tip. I did run it, and it did not find anything   :Question:   while smartctl says there are errors...

But then I noticed a message in the smartctl output saying that my disk drive (a Deskstar) should probably get a firmware upgrade. Well, I did the upgrade, and since then I do not see any errors anymore. I also ran the long self-test, and it is still fine!!

----------

## Marctraider

Well seems there was no real problem at all   :Laughing: 

When there is a bad block somewhere, it gets recorded in some list of bad blocks (can't remember where), probably somewhere in the FS. I remember seeing this in DOS when I formatted a disk and it recorded the bad blocks it found...

I believe the same happens when making Linux filesystems...

If you compiled something and there really was a bad sector where /var/tmp is located, it would most probably have stopped the emerge process and reported the bad sector in dmesg... If there was a bad sector in /var/tmp and emerge wanted to write to that sector, it couldn't, so the emerge would stop   :Razz: 

You wouldn't want to see all the S.M.A.R.T. errors smartctl reports for my drives... there are quite a few... but they are not really critical, and the drives' health is still reported as perfect.

----------

## bonbons

 *phsdv wrote:*   

> Yes, but the problem is that during compile /var/tmp is used, and I have no way to verify if the bad block was used during any of the compiles done. If so, the data could be corrupted, right?

 Bad blocks on disk are far less annoying than bad bits in memory. When the drive finds a bad block, you either get an I/O error on reading, or the disk simply relocates the block on writing. So there is no real risk of data corruption; at most the data in the faulty blocks is lost (and the program accessing the block gets an I/O error).

 *phsdv wrote:*   

> *Marctraider wrote:*
> > A bad sector cannot just be overwritten by zeroes and its all fixed... bad sectors are normally physical damage problems on the platters of the drive...
>
> yes, correct. But I understood that the controller in the drive would know a badblock and when all 0 are written, it can reallocte it with a spare block.

 It does not matter AT ALL what you write to the faulty block. The disk will just relocate it if it can't write it properly. A read/write sequence may be required to trigger the relocation (the disk may not detect a write problem if it only writes the block without ever reading it back).

The information that SMART provides is a summary of what has happened since the disk was manufactured. Depending on the brand of the disk, individual SMART fields are assigned completely different meanings (e.g. Seagate). What's important is the generic message stating how healthy the disk is. Bad sectors are not a problem unless there are no spare blocks left for relocation.

----------

## flybynite

Looks like all good info here. I'll just add a little more based on my reading.

The problem is that bad blocks are probably a physical failure on the platter. Now that the platter has been damaged, the gunk (dirt particles, scratched-off coating, etc.) is flying around inside the disk and may scratch other parts of it, leading to a more rapid failure rate in the future. That may still be weeks, months, or even years away. You have to do more digging to know.

Modern disks do their own bad-block replacement, and when the failure occurs while writing data, this is hidden from the operating system. The disk silently remaps the bad block and only notes the failure in the SMART data. When the disk runs out of spare sectors, writes will fail, and the operating system will notice and complain in the logs. By then lots of sectors have failed and all the spares are gone, which means the next failure is data loss.

Remapping at the filesystem level is possible but probably not required since the drive should remap the blocks itself.  If the drive is out of spare blocks you now have to remap in the filesystem, but why chance it with such an old disk with so many failed blocks that it ran out of spares?

If the failure occurred during reading, the data is already gone.  Since your block is in tmp no big deal if you force the remapping.

You just need to find out how many spare sectors remain and check the logs for drive error messages. This will help you determine the level of failure and the rate at which bad blocks are appearing.

Show the whole smartctl -a /dev/hdx output so we can look and see the failed sectors remaining etc.  Also, do you have logs showing disk errors?
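SMART does not report "spare sectors remaining" as such. The closest indicators are usually the raw counts and the normalized VALUE of attributes such as 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector), where VALUE sinking toward THRESH suggests the spare pool is being used up. A Python sketch that pulls those rows out of saved `smartctl -A` (or `-a`) output; the column layout is assumed to be the ten-column attribute table printed by smartmontools, and the IDs in the hypothetical `WATCH` set are common conventions that not every vendor follows.

```python
import sys

# Attribute IDs commonly tied to bad/reallocated sectors (vendor-dependent).
WATCH = {5, 196, 197, 198}


def parse_attributes(text: str) -> dict[int, dict]:
    """Parse the attribute table printed by `smartctl -A` into a dict keyed by ID."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows: numeric ID, name, hex FLAG, VALUE, WORST, THRESH, ..., RAW_VALUE
        if len(parts) < 10 or not parts[0].isdigit() or not parts[2].startswith("0x"):
            continue  # header, blank line, error-log line, etc.
        try:
            value, worst, thresh = (int(p) for p in parts[3:6])
        except ValueError:
            continue
        raw = parts[9]
        attrs[int(parts[0])] = {
            "name": parts[1], "value": value, "worst": worst,
            "thresh": thresh, "raw": int(raw) if raw.isdigit() else raw,
        }
    return attrs


if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "-":
        text = sys.stdin.read()
    else:
        with open(sys.argv[1]) as f:
            text = f.read()
    for attr_id, a in sorted(parse_attributes(text).items()):
        if attr_id in WATCH:
            print(f'{a["name"]}: raw={a["raw"]} value={a["value"]} '
                  f'thresh={a["thresh"]} margin={a["value"] - a["thresh"]}')
```

Usage: `smartctl -A /dev/hdg > attrs.txt`, then `python3 smart_attrs.py attrs.txt` (or pipe into it with `-`). Re-running it over time gives the rate flybynite mentions.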

I don't have conclusive info, but a failing SMART health status means the maker will replace the drive under warranty. I don't believe the maker will set a failing health status until the drive is about to blow up! I wouldn't wait for it to fail; by that time it may already be too late!

Here is some info from the smartmontools people to answer your question about forcing the remapping.

http://smartmontools.sourceforge.net/BadBlockHowTo.txt

```
To force the disk to reallocate this bad block we'll write zeros to
the bad block, and sync the disk:
-----------------------------------------------------------------------------------------------
root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
root]# sync
-----------------------------------------------------------------------------------------------
NOTE: THIS LAST STEP HAS PERMANENTLY AND IRRETREVIABLY DESTROYED SOME
OF THE DATA THAT WAS IN THIS FILE.  DON'T DO THIS UNLESS YOU DON'T
NEED THE FILE OR YOU CAN REPLACE IT WITH A FRESH OR CORRECT VERSION.
```
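Applied to the numbers from the first post, the dd invocation can be generated instead of typed by hand, which avoids mixing up sector and block units (dd's `seek=` counts in units of `bs`). A Python sketch that only prints commands and runs nothing; `zero_sector_cmd` is a hypothetical helper, and which device you write to (whole disk or member partition) is your call. As the HowTo warns, whatever was in that block is destroyed, so follow a real run with `sync`.

```python
import shlex

SECTOR = 512


def zero_sector_cmd(device: str, sector: int, bs: int = SECTOR) -> str:
    """Return (but do not run) a dd command line that overwrites with zeros
    the bs-sized block containing 512-byte `sector` of `device`."""
    if bs % SECTOR:
        raise ValueError("bs must be a multiple of 512")
    seek = sector * SECTOR // bs  # dd's seek= counts in units of bs
    return shlex.join(["dd", "if=/dev/zero", f"of={device}",
                       f"bs={bs}", "count=1", f"seek={seek}"])


# Whole disk: the absolute LBA from the SMART log.
print(zero_sector_cmd("/dev/hdg", 56855327))
# dd if=/dev/zero of=/dev/hdg bs=512 count=1 seek=56855327

# Partition: offset inside hdg2 = 56855327 - 48837600 = 8017727 sectors,
# rounded out to the surrounding 4 KiB block as in the HowTo.
print(zero_sector_cmd("/dev/hdg2", 56855327 - 48837600, bs=4096))
# dd if=/dev/zero of=/dev/hdg2 bs=4096 count=1 seek=1002215
```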

----------

## Marctraider

Yeah, dd is a good program; you can also fill your disks with zeroes with it   :Razz: 

Strange how some people think that writing your disk with zeroes is a low-level format hehe.

I once had a boot sector problem on my old laptop: the partitions kept being deleted.

I thought the drive was just bad or something, but a program called killdisk filled the disk with zeroes and the problem went away. Of course I could achieve the same with Linux  :Smile: 

----------

## NeddySeagoon

Marctraider,

It's been a long time since you could do a low-level format on a hard drive. One surface of one platter is read-only and has the head servo tracks and sector address marks written at manufacture.

There was a brief period, when voice coil head actuators were introduced, where attempting a low-level format would remove the head servo tracks, which was a bit unfortunate because it rendered the drive useless.

----------

## Marctraider

Yeah, I guess so. I'm just still thinking about old drives, hehe. I know a guy who had bad sectors on an old drive; he did a low-level format and it was all fixed  :Smile:  and not at the filesystem level.

----------

## kadajawi

Hmm... what counts as an old drive? I have one that is about 5 years old with some serious trouble, but I need the space (and am on a low budget  :Wink:  ). It's only for absolutely non-critical temporary data, where losing it would be annoying at most. Is there any way to repair it? I connected it as a SATA drive and can low-level format it from the SATA BIOS, but it took 12 hours to get 2 percent done, and I still want to use my system. Is there a way to do it under Linux? The drive won't even let me change partitions, but the data on it still seems to be there, so I think some parts of the drive are still usable...

----------

## NeddySeagoon

kadajawi,

Run badblocks on the drive. It can do many different platter surface tests, from plain reads to assorted pattern writes and read-backs.

Read

```
man badblocks
```
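The pattern write test is badblocks' `-w` mode, which the man page describes as writing the patterns 0xaa, 0x55, 0xff and 0x00 to every block, reading every block back and comparing. A Python sketch of the same idea, for illustration only: it is DESTRUCTIVE, must never be pointed at a mounted filesystem or a disk holding data you want, and relies on `fsync` plus an advisory `posix_fadvise` to make the read-back come from the disk rather than the page cache, where badblocks itself uses direct I/O. Write errors are raised rather than recorded; `pattern_test` is a hypothetical name.

```python
import os

# Patterns listed in `man badblocks` for its destructive -w mode.
PATTERNS = (0xAA, 0x55, 0xFF, 0x00)


def pattern_test(path: str, block_size: int = 4096) -> list[int]:
    """DESTRUCTIVE: fill `path` with each pattern in turn, read it back and
    compare. Returns the block numbers (block_size units) that mismatched
    or failed to read."""
    bad = set()
    fd = os.open(path, os.O_RDWR)
    try:
        nblocks = os.lseek(fd, 0, os.SEEK_END) // block_size
        for pattern in PATTERNS:
            expected = bytes([pattern]) * block_size
            for blk in range(nblocks):
                os.pwrite(fd, expected, blk * block_size)
            os.fsync(fd)  # push the pattern out to the device
            if hasattr(os, "posix_fadvise"):
                # Advisory: drop cached pages so the read-back comes from disk.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            for blk in range(nblocks):
                try:
                    got = os.pread(fd, block_size, blk * block_size)
                except OSError:
                    bad.add(blk)
                    continue
                if got != expected:
                    bad.add(blk)
    finally:
        os.close(fd)
    return sorted(bad)
```

An empty list means every block held all four patterns; afterwards the target is left filled with zeros (the last pattern).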

It's also worth getting smartmontools. That will allow you to read the drive's internal health log.

Despite what's advertised in your BIOS, most low-level formats only do a surface read and zero the MBR, so the drive looks blank.

12 hours for 2% hints that the surface was readable but that you got a lot of retries, hence it was very slow. That's a bad sign.

----------

## kadajawi

Ah, ok  :Smile:  I did badblocks -v /dev/sda, but it found nothing.

I hoped that it could just mark the bad blocks as bad and let me use whatever is left (even if its only 6 of the 13 gb).

I'll give smartmontools a try. Thx  :Smile: 
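Marking blocks as bad is done by the filesystem tools, not by badblocks itself: for ext2/ext3, `mke2fs -l file` and `e2fsck -l file` read a bad-block list, and the mke2fs man page warns that the numbers must be generated with the same block size the filesystem uses. The simplest route is to run badblocks with `-b` set to that block size (and `-o` to save the list). If you already have a list made with badblocks' default 1024-byte blocks, a Python sketch like this converts it; the 4096-byte filesystem block size and the helper names are assumptions for illustration, and the list must come from scanning the same partition the filesystem lives on.

```python
import sys


def to_fs_blocks(bad, bb_block_size: int = 1024, fs_block_size: int = 4096) -> list[int]:
    """Map block numbers reported by badblocks (bb_block_size units; 1024 is
    badblocks' default) onto filesystem blocks of fs_block_size bytes.
    A filesystem block is listed if any byte of it lies in a bad block."""
    result = set()
    for b in bad:
        first_byte = b * bb_block_size
        last_byte = first_byte + bb_block_size - 1
        result.update(range(first_byte // fs_block_size, last_byte // fs_block_size + 1))
    return sorted(result)


def read_list(path: str) -> list[int]:
    """Read a badblocks output file (one block number per line)."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]


if __name__ == "__main__" and len(sys.argv) > 1:
    for blk in to_fs_blocks(read_list(sys.argv[1])):
        print(blk)
```

For example, 1 KiB blocks 4 to 7 all fall inside 4 KiB block 1, and 1 KiB block 13 falls inside 4 KiB block 3, so `to_fs_blocks([4, 5, 6, 7, 13])` gives `[1, 3]`.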

----------

## Marctraider

I think there are no bad sectors at all. Secondly, even if there were bad sectors, only the filesystem would mark them as bad; AFAIK badblocks should always see them, because it just reads all the raw data on the drive from beginning to end.

You dont have to worry   :Shocked: 

----------

## kadajawi

Haha, funny thing to say on a drive where fdisk fails to save changes...  :Laughing: 

 :Wink:  But, yeah, it doesn't seem to find anything... strange...

----------

## NeddySeagoon

kadajawi,

In your BIOS, check that boot block virus protection is off.

With it on, nothing (including fdisk) can write to the MBR.

----------

## kadajawi

But I don't have any trouble changing the MBR on my other hard drives. It's a PATA drive connected to SATA using an adapter from Abit.

----------

## Marctraider

 *kadajawi wrote:*   

> But I don't have any troubles changing the MBR on my other hard drives. Its a PATA drive connected to SATA using an adapter from Abit.

 

I also thought about a virus or something. My old laptop had a 4 GB hard disk, and I thought the disk was bad because every time, my MBR was gone and my partitions were gone... It turned out to be a virus. You know how I got rid of it? I zeroed out the whole drive  :Wink: 

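```
# Zeroes the WHOLE disk, MBR included, destroying all data on it.
# Use the whole-disk device (e.g. /dev/hdX); a partition like /dev/hdXY leaves the MBR untouched.
dd if=/dev/zero of=/dev/hdX bs=1M
```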

or

```
# Same effect; again use the whole-disk device to reach the MBR.
cat /dev/zero > /dev/hdX
```

----------

## kadajawi

Ok, thanks. Will give that a try.

When I try to run fdisk I get this:

 *Quote:*   

> fdisk /dev/sda
> 
> Unable to read /dev/sda

 

What the...? Doing it on other drives makes no problem...

Hm. Doing the cat thing gave me 100% CPU usage, but no drive activity?
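fdisk's "Unable to read /dev/sda" suggests that even the first sector could not be read. A minimal Python sketch that reads only sector 0, distinguishing an open or read error from a sector that is readable but carries no MBR (the 0x55AA boot signature at bytes 510-511 is the standard MBR marker, and a zeroed drive will lack it). `check_mbr` is a hypothetical name, and the read goes through the page cache.

```python
import os

SECTOR = 512


def check_mbr(path: str) -> str:
    """Read sector 0 of `path` and report whether it holds an MBR signature."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return f"cannot open: {exc.strerror}"
    try:
        data = os.pread(fd, SECTOR, 0)
    except OSError as exc:
        return f"read error: {exc.strerror}"  # e.g. an I/O error on sector 0
    finally:
        os.close(fd)
    if len(data) < SECTOR:
        return "short read"
    if data[510:512] != b"\x55\xaa":
        return "readable, no 0x55AA boot signature"
    return "readable, MBR signature present"
```

Calling `check_mbr("/dev/sda")` as root on the problem drive and on a working one should show whether the failure is at the read level or just an empty partition table.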

----------

## Marctraider

just use the dd command   :Rolling Eyes: 

----------

## kadajawi

Been there, done that. Makes no difference. 

When I tried to restart my system I found this:

 *Quote:*   

> ata1: command 0xca timeout, stat 0xd8 host_stat 0x61
> 
> ata1: status=0xd8 { Busy }
> 
> SCSI error : <0 0 0 0> return code 0x0000002
> ...

 

Lots of such errors...
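Those kernel messages can be decoded with the standard ATA register bits: 0xca is the WRITE DMA opcode (the same code appears in the SMART log in the first post), and status 0xd8 has the BSY bit set, which is why the kernel prints only { Busy }. On that reading, the drive was still busy when a write timed out, which would fit fdisk failing to save changes. A small Python sketch; bit names follow the ATA specs, some are obsolete in later revisions, and the command table only covers the two opcodes seen in this thread.

```python
# ATA status register bits (DSC, CORR and IDX are obsolete in later ATA revisions).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
# ATA error register bits (meaningful when ERR is set; names vary by revision).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF"}
# Only the two command opcodes that appear in this thread.
COMMANDS = {0xC8: "READ DMA", 0xCA: "WRITE DMA"}


def decode_status(value: int) -> list[str]:
    """Name the bits set in an ATA status byte."""
    if value & 0x80:
        return ["BSY"]  # while BSY is set, the other status bits are not valid
    return [name for bit, name in STATUS_BITS.items() if value & bit]


def decode_error(value: int) -> list[str]:
    """Name the bits set in an ATA error byte."""
    return [name for bit, name in ERROR_BITS.items() if value & bit]


print(COMMANDS[0xCA], decode_status(0xD8))      # WRITE DMA ['BSY']
print(decode_status(0x51), decode_error(0x40))  # ['DRDY', 'DSC', 'ERR'] ['UNC']
```

The second line decodes the ST 51 / ER 40 pair from the first post's SMART error log: the drive was ready and flagged an error, and the error was an uncorrectable read.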

----------

## Marctraider

Well, even if there are no bad sectors, it can still be a hardware problem with the drive, or a software problem (Linux).

----------

