# Strange problem with badblocks

## dopey

My server started getting strange kernel oopses, usually when it starts to hit swap, so I ran a badblocks check on swap. It found a single bad block. Being pretty worried, I turned swap off and started running badblocks on all my partitions. Almost every partition turned up bad blocks. The odd thing is that some didn't, and I can't quite figure out why. All the bad blocks are at the end of the partition: the last few blocks at most.

The kernel reports an attempt to read past the end of the device.

What's even stranger is that every single one of my machines exhibits this behavior on the IDE drives. None of my SCSI drives do this. So, on the off chance that every single IDE drive I own (8 of them) has bad blocks, I cleared all the data off one of them, repartitioned it and did some badblocks testing. Nothing. Everything checked out clean.

Then I ran mkswap on the swap partition and used swapon to activate it. To be more thorough, I deactivated the swap partition and ran badblocks again, and guess what? It found a single bad block: the last block of the partition.

WTF? Anyone else experience this? Other partitions are randomly starting to show bad blocks, but only in the last few blocks. I think it must be related to the attempt to read past the end of the device, but why the hell is it not properly reading the partition table?

GRRR!!!

----------

## aroedl

Hello!

 *dopey wrote:*   

> GRRR!!!

 

Kernel version? IDE host controller (lspci)? Kernel configuration?

Andi

----------

## dopey

Happens on both 2.4.22 vanilla and gentoo 2.4.20-r7.

On the 2.4.22 vanilla system, lspci shows:

00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT8233/A/C/VT8235 PIPC Bus Master IDE (rev 06)

kernel messages:

Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4

ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

VP_IDE: IDE controller at PCI slot 00:07.1

VP_IDE: chipset revision 6

VP_IDE: not 100% native mode: will probe irqs later

ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1

    ide0: BM-DMA at 0xb800-0xb807, BIOS settings: hda:DMA, hdb:pio

    ide1: BM-DMA at 0xb808-0xb80f, BIOS settings: hdc:DMA, hdd:pio

hda: MAXTOR 6L040J2, ATA DISK drive

blk: queue c0390b80, I/O limit 4095Mb (mask 0xffffffff)

hdc: IC35L060AVV207-0, ATA DISK drive

blk: queue c0390fd4, I/O limit 4095Mb (mask 0xffffffff)

On the 2.4.20-r7 gentoo system it's:

00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-766 [ViperPlus] IDE (rev 01)

kernel messages:

Uniform Multi-Platform E-IDE driver Revision: 6.31

ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

AMD7411: IDE controller on PCI bus 00 dev 39

AMD7411: chipset revision 1

AMD7411: not 100% native mode: will probe irqs later

ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

AMD_IDE: Advanced Micro Devices [AMD] AMD-766 [ViperPlus] IDE (rev 01) UDMA100 controller on pci00:07.1

    ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:DMA

    ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:pio

hda: C/H/S=42473/226/193 from BIOS ignored

hdb: C/H/S=42473/226/193 from BIOS ignored

hda: WDC WD800JB-00CRA1, ATA DISK drive

hdb: MAXTOR 6L040J2, ATA DISK drive

hdc: LITE-ON LTR-40125S, ATAPI CD/DVD-ROM drive

Relevant IDE section of the kernel config:

# ATA/IDE/MFM/RLL support

#

CONFIG_IDE=y

#

# IDE, ATA and ATAPI Block devices

#

CONFIG_BLK_DEV_IDE=y

# CONFIG_BLK_DEV_HD_IDE is not set

# CONFIG_BLK_DEV_HD is not set

CONFIG_BLK_DEV_IDEDISK=y

CONFIG_IDEDISK_MULTI_MODE=y

# CONFIG_IDEDISK_STROKE is not set

CONFIG_BLK_DEV_IDECS=m

CONFIG_BLK_DEV_IDECD=m

CONFIG_BLK_DEV_IDETAPE=m

CONFIG_BLK_DEV_IDEFLOPPY=y

CONFIG_BLK_DEV_IDESCSI=m

# CONFIG_IDE_TASK_IOCTL is not set

CONFIG_BLK_DEV_CMD640=y

# CONFIG_BLK_DEV_CMD640_ENHANCED is not set

CONFIG_BLK_DEV_ISAPNP=y

CONFIG_BLK_DEV_IDEPCI=y

# CONFIG_BLK_DEV_GENERIC is not set

CONFIG_IDEPCI_SHARE_IRQ=y

CONFIG_BLK_DEV_IDEDMA_PCI=y

# CONFIG_BLK_DEV_OFFBOARD is not set

# CONFIG_BLK_DEV_IDEDMA_FORCED is not set

CONFIG_IDEDMA_PCI_AUTO=y

# CONFIG_IDEDMA_ONLYDISK is not set

CONFIG_BLK_DEV_IDEDMA=y

# CONFIG_IDEDMA_PCI_WIP is not set

# CONFIG_BLK_DEV_ADMA100 is not set

CONFIG_BLK_DEV_AEC62XX=y

CONFIG_BLK_DEV_ALI15X3=y

# CONFIG_WDC_ALI15X3 is not set

CONFIG_BLK_DEV_AMD74XX=y

# CONFIG_AMD74XX_OVERRIDE is not set

CONFIG_BLK_DEV_CMD64X=y

# CONFIG_BLK_DEV_TRIFLEX is not set

CONFIG_BLK_DEV_CY82C693=y

CONFIG_BLK_DEV_CS5530=y

CONFIG_BLK_DEV_HPT34X=y

# CONFIG_HPT34X_AUTODMA is not set

CONFIG_BLK_DEV_HPT366=y

CONFIG_BLK_DEV_PIIX=y

# CONFIG_BLK_DEV_NS87415 is not set

# CONFIG_BLK_DEV_OPTI621 is not set

# CONFIG_BLK_DEV_PDC202XX_OLD is not set

# CONFIG_PDC202XX_BURST is not set

# CONFIG_BLK_DEV_PDC202XX_NEW is not set

CONFIG_BLK_DEV_RZ1000=y

# CONFIG_BLK_DEV_SC1200 is not set

CONFIG_BLK_DEV_SVWKS=y

# CONFIG_BLK_DEV_SIIMAGE is not set

CONFIG_BLK_DEV_SIS5513=y

CONFIG_BLK_DEV_SLC90E66=y

# CONFIG_BLK_DEV_TRM290 is not set

CONFIG_BLK_DEV_VIA82CXXX=y

# CONFIG_IDE_CHIPSETS is not set

CONFIG_IDEDMA_AUTO=y

# CONFIG_IDEDMA_IVB is not set

# CONFIG_DMA_NONPCI is not set

CONFIG_BLK_DEV_IDE_MODES=y

CONFIG_BLK_DEV_ATARAID=m

CONFIG_BLK_DEV_ATARAID_PDC=m

CONFIG_BLK_DEV_ATARAID_HPT=m

# CONFIG_BLK_DEV_ATARAID_SII is not set

----------

## aroedl

Hello!

 *dopey wrote:*   

> Happens on both 2.4.22 vanilla and gentoo 2.4.20-r7.
> 
> hda: C/H/S=42473/226/193 from BIOS ignored
> 
> hdb: C/H/S=42473/226/193 from BIOS ignored
> ...

 

 *Quote:*   

> CONFIG_IDEDISK_MULTI_MODE=y

 

I don't think this is really needed.

 *Quote:*   

> # CONFIG_IDEDISK_STROKE is not set

 

Do you have an Award BIOS? The Maxtor HD seems to be larger than 32 GB, so turn this option on. In the kernel config menu it's called "Auto-Geometry Resizing support".

Make sure that you don't specify any values (heads, cylinders, sectors) in the BIOS. It should be set to auto-detection.

Andi

Last edited by aroedl on Tue Oct 14, 2003 5:10 pm; edited 1 time in total

----------

## Sargon

 *dopey wrote:*   

> What's even stranger is every single one of my machines exhibits this behavior on the IDE drives. None of my scsi drives do this.

 

AFAIK, SCSI disks remap bad sectors automatically. There are tools available to get a list of all bad blocks of a SCSI disk though. (scsiinfo?)

Sargon

----------

## dopey

Yeah. I don't think multimode is necessary either.  I didn't think it would hurt to enable it though.

I do have an Award BIOS on the vanilla system. However, it's a relatively recent motherboard and I didn't think STROKE was necessary. I'll try it though. The other system that this is occurring on does not have an Award BIOS (Phoenix server BIOS; it's an Athlon MP motherboard) and has the same problem.

 *aroedl wrote:*   

> 
> 
>  *Quote:*   
> 
> CONFIG_IDEDISK_MULTI_MODE=y
> ...

 

----------

## dopey

CONFIG_IDEDISK_STROKE doesn't make a difference.

The same thing still happens: bad blocks at the end of the partition.

What's odd is that if I repartition and use badblocks -w, it turns up no bad blocks until I create a filesystem and mount it. Then, once I unmount, the problem occurs.

This is what shows up in the kernel log during a badblocks run on the Athlon MP system (2.4.20-gentoo-r7):

hdb2: bad access: block=1060288, count=8

end_request: I/O error, cmd 0 dev 03:42 (hdb), sector 106028

On the other system the following shows up:

attempt to access beyond end of device

03:02: rw=0, want=2008128, limit=2008125
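To compare these reports across machines, it can help to pull the numbers out of the log line. The following is a minimal sketch, not anything from the kernel or from badblocks: the function name `parse_access_error` and the regular expression are made up for illustration, and it assumes the line has exactly the `major:minor: rw=..., want=..., limit=...` layout shown above.

```python
import re

# Assumed layout of the second line of the "attempt to access beyond end of
# device" report, as quoted in this thread: "03:02: rw=0, want=..., limit=...".
ACCESS_RE = re.compile(
    r"(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}): "
    r"rw=(?P<rw>\d+), want=(?P<want>\d+), limit=(?P<limit>\d+)"
)

def parse_access_error(line):
    """Return the fields of the report plus the overrun (want - limit).

    Returns None when the line does not look like such a report.
    """
    m = ACCESS_RE.search(line)
    if m is None:
        return None
    want, limit = int(m["want"]), int(m["limit"])
    return {
        "dev": m["dev"],
        "rw": int(m["rw"]),
        "want": want,
        "limit": limit,
        "overrun": want - limit,
    }

report = parse_access_error("03:02: rw=0, want=2008128, limit=2008125")
print(report["dev"], report["overrun"])
```

For the line above, the request ends 3 units past the limit. The sketch does not say what those units are; that depends on the kernel version.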

----------

## dopey

I found the following on Google Groups, with someone describing the same behavior I'm getting. Apparently, Alan Cox feels it may be a result of building with an "untrusted hand built tool chain".

I'm wondering if it might be worth starting a Gentoo poll to see how many people with IDE devices are running into this problem.

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=001001c27d02%246297fe50%244500a8c0%40cybernet.cz

----------

## marnix

Hi forum,

I have a similar (the same?) problem as dopey and I've run out of ideas for troubleshooting. I will post my findings here and hopefully someone else may have an idea of what is going on here.

Some partitions have at most 3 "badblocks" at the end of the partition. When running badblocks on /dev/hda9 for example I find the following badblocks:

2939860

2939861

2939862

badblocks indicates that it is checking From block 0 to 2939863, so these are the last 3 blocks (except the final one?)

Also these messages appear in the log:

Jan 13 10:50:28 xxxxxx attempt to access beyond end of device

Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863

Jan 13 10:50:28 xxxxxx attempt to access beyond end of device

Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863

Jan 13 10:50:28 xxxxxx attempt to access beyond end of device

Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863

Jan 13 10:50:28 xxxxxx attempt to access beyond end of device

Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863

Because I previously had other strange behaviour that might suggest a failing hard disk, I ran a test program provided by the disk manufacturer (Maxtor). This tool reported that all tests were passed and that the disk was "certified error free", whatever that may mean.

After this I ran badblocks /dev/hda and, to my surprise, there were no bad blocks on the disk at all, but running it on /dev/hda9 again gave the same output as above. Testing with dd if=/dev/hda9 of=/dev/null and cat /dev/hda9 gave the same messages in the logs. So the bad blocks reported on hda9 are probably caused by some problem other than actual bad blocks on the device. So where can it be?

When looking at the partition table in fdisk (expert mode), I found the following for /dev/hda9:

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl    Start     Size ID

 9 00   1   1  292 254  63  657       63  5879727 83

One point of interest may be that the size reported by fdisk (5879727, in 512-byte units) is twice the number of blocks reported by badblocks, plus one (2 x 2939863 + 1 = 5879727). Strangely, the number of blocks requested in the log is 2939864, and 2 x 2939864 = 5879728. So different parts of the system seem to disagree about the exact size of /dev/hda9.
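The size comparison can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the fdisk size is counted in 512-byte sectors and the badblocks figure in 1K blocks; the helper name `sectors_to_blocks` is made up for this example.

```python
SECTOR_BYTES = 512   # assumed unit of the fdisk "Size" column
BLOCK_BYTES = 1024   # assumed unit of the badblocks block numbers

def sectors_to_blocks(sectors, block_bytes=BLOCK_BYTES):
    """Split a sector count into (whole blocks, leftover sectors)."""
    sectors_per_block = block_bytes // SECTOR_BYTES
    return divmod(sectors, sectors_per_block)

whole, leftover = sectors_to_blocks(5879727)
print(whole, leftover)  # 2939863 whole 1K blocks, 1 leftover sector
```

Under those assumptions the partition holds 2939863 complete 1K blocks plus one stray 512-byte sector, which agrees with the limit shown in the log.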

Running tune2fs -l /dev/hda9 results in the following output (only showing interesting parts):

Filesystem state:          clean

Errors behavior:          Continue

Filesystem OS type:     Linux

Inode count:                368000

Block count:                734965

Reserved block count:   36748

Free blocks:                 715200

Free inodes:                 367989

First block:                             0

Block size:                        4096

Fragment size:                  4096

Blocks per group:          32768

Fragments per group:      32768

Inodes per group:         16000

Inode blocks per group:   500

Considering a blocksize of 4K the number of 1K blocks would be 734965 * 4 = 2939860.

From all these numbers I have drawn the following conclusion, but please correct me if I'm wrong:

The partition consists of 5879727 512B blocks, which means that it consists of 2939863.5 1K blocks. So the partition boundary falls halfway through a 1K block, and this might explain why the programs want to access 2939864 blocks. This does not work, because the kernel has placed a limit at 2939863 (not quite sure about this). But it puzzles me why any program would want to read beyond block 2939860 in the first place, because the filesystem reports its boundary to be there. Probably because the kernel reports the wrong boundary?

Also the fact that we see 4 lines in the logging might indicate that it is trying to read 4 1K blocks (or 1 4K block) too many.
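One way to test the "one 4K block too many" idea against the numbers is sketched below. It is a guess at the mechanism, not a confirmed diagnosis: it assumes that reads of the device are serviced in whole 4K units (for example because the mounted filesystem uses 4K blocks), that `limit` is counted in 1K blocks, and that a unit is rejected whenever its end lies past `limit`. The function name `tail_blocks` is made up for illustration.

```python
def tail_blocks(limit_1k, access_bytes=4096):
    """Return (stranded 1K blocks, end of the rejected request).

    limit_1k     -- device size in 1K blocks (the "limit" in the log)
    access_bytes -- assumed size of each read request in bytes
    """
    per_access = access_bytes // 1024
    full_units = limit_1k // per_access           # requests that fit entirely
    first_stranded = full_units * per_access      # first 1K block of the partial unit
    stranded = list(range(first_stranded, limit_1k))
    # The request covering the partial unit would end here ("want" in the log).
    want = first_stranded + per_access if stranded else limit_1k
    return stranded, want

# limit values taken from the two logs quoted in this thread
for limit in (2939863, 2008125):
    print(limit, tail_blocks(limit))
```

With limit=2939863 this gives blocks 2939860, 2939861 and 2939862 and want=2939864, which matches the three reported bad blocks and the log. With the limit=2008125 from dopey's log it gives a single block, 2008124, and want=2008128. With a 1K access size nothing is stranded, so the sketch only matches the observations if the 4K unit is imposed somewhere below badblocks.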

After this short analysis I tried the following:

badblocks -sv -b 1024 /dev/hda9 (which should treat the block size as 1K), but the results are still the same (three "bad blocks" and 4 lines in the log).

Having read the above post and the thread on googlegroups mentioned above I decided to try this under another kernel version.

First I tried the 2.4.24 vanilla kernel (compiled on the same gentoo machine):

I made a new filesystem on /dev/hda9 and ran badblocks on it. To my surprise the problem was not occurring anymore. BUT, after I mounted the filesystem and ran badblocks again, the problem reappeared. Finally, after unmounting the filesystem, the problem was still there.

Next I booted off the Gentoo LiveCD (Gentoo kernel) and followed the same procedure, with exactly the same results. I assume that the toolchain used to compile the Gentoo LiveCD kernel is fully tested, so the suggestion by Alan Cox probably does not apply here.

cdspecs of the LiveCD used:

Sun Sep 14 02:53:41 UTC 2003

cvs.gentoo.org

x86-livecd-20030801

x86-doc-20030911-drob

x86-distsnap-20030911

x86-x86-20030911-cd1

So much for my analysis of this strange problem. I am totally out of ideas and I could not find any other posts on this matter. Hopefully someone else can help in solving this problem, since right now I am not sure whether I should trust the system with my valuable data...

Thanks in advance for any help!

Last edited by marnix on Thu Apr 27, 2006 5:46 pm; edited 1 time in total

----------

## marnix

We have a thread going on this topic on the lfs.support mailing list, which is accessible through news.linuxfromscratch.org. Look for "strange badblocks problem". Please check it out. I'm beginning to suspect it has something to do with the kernel or the e2fs tools on the current LiveCD, maybe in combination with a specific brand of hard disk (Maxtor).

Maybe some more people could run badblocks -sv /dev/hdaxx on some of their partitions to find out if more people have this problem...

dopey, do you still have this problem? You might want to try some of the troubleshooting suggestions that were mentioned in the aforementioned lfs.support thread.

----------

