# RAID1 failures

## Prospero

Greetings people,

I have a computer with two 80 GB hard drives that I've put in a RAID1 configuration. Both drives are split into several partitions: one for boot, one for swap, and five others for other things, including the filesystem root.

The last few days, however, the system locks up every now and then, and after rebooting I find that one of the drives has failed. Or rather, just the five partitions that do the data storage. The boot partition is perfectly intact, though the swap partitions may be affected (they are not mirrored), which could explain the lockups.

I've had this failure in the past, but then it was always just one of the drives, which I had replaced. Now the drives seem to be taking turns failing.

```
Personalities : [raid1]

md1 : active raid1 sda3[0]

      19542976 blocks [2/1] [U_]

md3 : active raid1 sda6[0]

      9775424 blocks [2/1] [U_]

md2 : active raid1 sda5[0]

      1959808 blocks [2/1] [U_]

md4 : active raid1 sda7[0]

      19542976 blocks [2/1] [U_]

md5 : active raid1 sda8[0]

      28169856 blocks [2/1] [U_]

md0 : active raid1 sdb1[1] sda1[0]

      40064 blocks [2/2] [UU]

unused devices: <none>

```

Does anyone have a clue what's going wrong, or which logs to check for errors?

----------

## NeddySeagoon

Prospero,

Check dmesg for drive errors and emerge smartmontools to check the drives' internal logs.

That will tell you if the drives are recording problems internally.
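For example, something along these lines (device names assumed from your first post) would pull out the relevant information:

```shell
# Scan the kernel ring buffer for disk / controller complaints
dmesg | grep -iE 'ata|sd[ab]|error|fail'

# Install smartmontools, then query each drive's internal SMART
# health status, attributes and error log
emerge smartmontools
smartctl -a /dev/sda
smartctl -a /dev/sdb
```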

----------

## Prospero

```

md: Autodetecting RAID arrays.

md: invalid superblock checksum on sdb3

md: sdb3 has invalid sb, not importing!

md: invalid superblock checksum on sdb6

md: sdb6 has invalid sb, not importing!

md: autorun ...

md: considering sdb8 ...

md:  adding sdb8 ...

md: sdb7 has different UUID to sdb8

md: sdb5 has different UUID to sdb8

md: sdb1 has different UUID to sdb8

md:  adding sda8 ...

md: sda7 has different UUID to sdb8

md: sda6 has different UUID to sdb8

md: sda5 has different UUID to sdb8

md: sda3 has different UUID to sdb8

md: sda1 has different UUID to sdb8

md: created md5

md: bind<sda8>

md: bind<sdb8>

md: running: <sdb8><sda8>

md: kicking non-fresh sdb8 from array!

md: unbind<sdb8>

md: export_rdev(sdb8)

raid1: raid set md5 active with 1 out of 2 mirrors

md: considering sdb7 ...

md:  adding sdb7 ...

md: sdb5 has different UUID to sdb7

md: sdb1 has different UUID to sdb7

md:  adding sda7 ...

md: sda6 has different UUID to sdb7

md: sda5 has different UUID to sdb7

md: sda3 has different UUID to sdb7

md: sda1 has different UUID to sdb7

md: created md4

md: bind<sda7>

md: bind<sdb7>

md: running: <sdb7><sda7>

md: kicking non-fresh sdb7 from array!

md: unbind<sdb7>

md: export_rdev(sdb7)

raid1: raid set md4 active with 1 out of 2 mirrors

md: considering sdb5 ...

md:  adding sdb5 ...

md: sdb1 has different UUID to sdb5

md: sda6 has different UUID to sdb5

md:  adding sda5 ...

md: sda3 has different UUID to sdb5

md: sda1 has different UUID to sdb5

md: created md2

md: bind<sda5>

md: bind<sdb5>

md: running: <sdb5><sda5>

md: kicking non-fresh sdb5 from array!

md: unbind<sdb5>

md: export_rdev(sdb5)

raid1: raid set md2 active with 1 out of 2 mirrors

md: considering sdb1 ...

md:  adding sdb1 ...

md: sda6 has different UUID to sdb1

md: sda3 has different UUID to sdb1

md:  adding sda1 ...

md: created md0

md: bind<sda1>

md: bind<sdb1>

md: running: <sdb1><sda1>

raid1: raid set md0 active with 2 out of 2 mirrors

md: considering sda6 ...

md:  adding sda6 ...

md: sda3 has different UUID to sda6

md: created md3

md: bind<sda6>

md: running: <sda6>

raid1: raid set md3 active with 1 out of 2 mirrors

md: considering sda3 ...

md:  adding sda3 ...

md: created md1

md: bind<sda3>

md: running: <sda3>

md: md1: raid array is not clean -- starting background reconstruction

raid1: raid set md1 active with 1 out of 2 mirrors

md: ... autorun DONE.

md: Loading md1: /dev/sda3

md: couldn't update array info. -22

md: could not bd_claim sda3.

md: md_import_device returned -16

md: invalid superblock checksum on sdb3

md: sdb3 has invalid sb, not importing!

md: md_import_device returned -22

md: starting md1 failed

```

That pretty much confirms my suspicion that it's not working ;)

Thanks for pointing me to the right log; I'm a noob when it comes to logs.

As for smartmontools: my drives are SATA drives, and I've heard SATA doesn't work very well with smartmontools (I know, I should have mentioned this earlier).

----------

## NeddySeagoon

Prospero,

That log shows a mix of problems relating to damaged data, but nothing about problems with the underlying drive, as would be indicated by seek errors. It appears you have invalid superblocks on most partitions and one that's out of sync. Such errors can be caused by unclean shutdowns.

Smartmontools needs a libata with SMART pass-through. It's been available as a patch for a while, but I don't know if it's in the mainline kernel yet, so it may not work for you.

It's worth trying to reconstruct the mirror from the good drive.
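With mdadm, that's roughly the following (the partition-to-array pairing is taken from your mdstat output - double-check it against your own setup before running, and if an array refuses to start at all it may need assembling first):

```shell
# Hot-add the dropped sdb partitions back into their arrays;
# md will resync each one from the surviving sda member
mdadm /dev/md1 --add /dev/sdb3
mdadm /dev/md2 --add /dev/sdb5
mdadm /dev/md3 --add /dev/sdb6
mdadm /dev/md4 --add /dev/sdb7
mdadm /dev/md5 --add /dev/sdb8

# Watch the rebuild progress
cat /proc/mdstat
```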

----------

## Prospero

Thank you for all your help. Reconstructing the failed partitions fixed the problem for a few days, but now the error has happened again.

I'm pretty clueless as to how to solve it, so I've taken the liberty of posting /var/log/dmesg, full content this time:

```

Linux version 2.6.11-hardened-r15 (something@somewhere) (gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)) #7 SMP Tue Nov 22 17:54:20 CET 2005

BIOS-provided physical RAM map:

 BIOS-e820: 0000000000000000 - 000000000009d800 (usable)

 BIOS-e820: 000000000009d800 - 00000000000a0000 (reserved)

 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)

 BIOS-e820: 0000000000100000 - 000000001fffb000 (usable)

 BIOS-e820: 000000001fffb000 - 000000001ffff000 (ACPI data)

 BIOS-e820: 000000001ffff000 - 0000000020000000 (ACPI NVS)

 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)

 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)

 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)

511MB LOWMEM available.

On node 0 totalpages: 131067

  DMA zone: 4096 pages, LIFO batch:1

  Normal zone: 126971 pages, LIFO batch:16

  HighMem zone: 0 pages, LIFO batch:1

DMI 2.3 present.

ACPI: RSDP (v000 ASUS                                  ) @ 0x000f5e20

ACPI: RSDT (v001 ASUS   A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x1fffb000

ACPI: FADT (v001 ASUS   A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x1fffb0b2

ACPI: BOOT (v001 ASUS   A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x1fffb030

ACPI: MADT (v001 ASUS   A7V600-X 0x42302e31 MSFT 0x31313031) @ 0x1fffb058

ACPI: DSDT (v001   ASUS A7V600-X 0x00001000 MSFT 0x0100000b) @ 0x00000000

ACPI: Local APIC address 0xfee00000

ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)

Processor #0 6:8 APIC version 16

ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])

ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])

IOAPIC[0]: apic_id 2, version 3, address 0xfec00000, GSI 0-23

ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl edge)

ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)

ACPI: IRQ0 used by override.

ACPI: IRQ2 used by override.

ACPI: IRQ9 used by override.

Enabling APIC mode:  Flat.  Using 1 I/O APICs

Using ACPI (MADT) for SMP configuration information

Allocating PCI resources starting at 20000000 (gap: 20000000:dec00000)

Built 1 zonelists

Kernel command line: root=/dev/md1 md=1,/dev/sda3,/dev/sdb3

md: Will configure md1 (super-block) from /dev/sda3,/dev/sdb3, below.

mapped APIC to ffffd000 (fee00000)

mapped IOAPIC to ffffc000 (fec00000)

Initializing CPU#0

CPU 0 irqstacks, hard=c0462000 soft=c045a000

PID hash table entries: 2048 (order: 11, 32768 bytes)

Detected 1393.461 MHz processor.

Using tsc for high-res timesource

Console: colour VGA+ 80x25

Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)

Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)

Memory: 514676k/524268k available (2412k kernel code, 8948k reserved, 471k data, 204k init, 0k highmem)

Checking if this processor honours the WP bit even in supervisor mode... Ok.

Calibrating delay loop... 2744.32 BogoMIPS (lpj=1372160)

Mount-cache hash table entries: 512 (order: 0, 4096 bytes)

CPU: After generic identify, caps: 0383fbff c1cbfbff 00000000 00000000 00000000 00000000 00000000

CPU: After vendor identify, caps: 0383fbff c1cbfbff 00000000 00000000 00000000 00000000 00000000

CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)

CPU: L2 Cache: 256K (64 bytes/line)

CPU: After all inits, caps: 0383fbff c1cbfbff 00000000 00000020 00000000 00000000 00000000

Intel machine check architecture supported.

Intel machine check reporting enabled on CPU#0.

Enabling fast FPU save and restore... done.

Enabling unmasked SIMD FPU exception support... done.

Checking 'hlt' instruction... OK.

CPU0: AMD Athlon(TM) XP 1600+ stepping 01

per-CPU timeslice cutoff: 731.23 usecs.

task migration cache decay timeout: 1 msecs.

Total of 1 processors activated (2744.32 BogoMIPS).

ENABLING IO-APIC IRQs

..TIMER: vector=0x31 pin1=2 pin2=-1

Brought up 1 CPUs

CPU0 attaching sched-domain:

 domain 0: span 01

  groups: 01

  domain 1: span 01

   groups: 01

NET: Registered protocol family 16

PCI: PCI BIOS revision 2.10 entry at 0xf1970, last bus=1

PCI: Using configuration type 1

mtrr: v2.0 (20020519)

ACPI: Subsystem revision 20050211

ACPI: Interpreter enabled

ACPI: Using IOAPIC for interrupt routing

ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12)

ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 11 12) *0, disabled.

ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 11 12) *0, disabled.

ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 11 12) *0, disabled.

ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 *10 11 12)

ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 11 *12)

ACPI: PCI Interrupt Link [LNKG] (IRQs *3 4 5 6 7 9 10 11 12)

ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 11 12) *15, disabled.

ACPI: PCI Root Bridge [PCI0] (00:00)

PCI: Probing PCI hardware (bus 00)

PCI: Via IRQ fixup

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]

SCSI subsystem initialized

PCI: Using ACPI for IRQ routing

** PCI interrupts are no longer routed automatically.  If this

** causes a device to stop working, it is probably because the

** driver failed to call pci_enable_device().  As a temporary

** workaround, the "pci=routeirq" argument restores the old

** behavior.  If this argument makes the device work again,

** please email the output of "lspci" to bjorn.helgaas@hp.com

** so I can fix the driver.

Simple Boot Flag at 0x3a set to 0x1

Machine check exception polling timer started.

audit: initializing netlink socket (disabled)

audit(1133019455.721:0): initialized

inotify device minor=63

VFS: Disk quotas dquot_6.5.1

Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)

Initializing Cryptographic API

ACPI: Power Button (FF) [PWRF]

ibm_acpi: ec object not found

Real Time Clock Driver v1.12

Linux agpgart interface v0.100 (c) Dave Jones

[drm] Initialized drm 1.0.0 20040925

serio: i8042 AUX port at 0x60,0x64 irq 12

serio: i8042 KBD port at 0x60,0x64 irq 1

Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled

ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A

io scheduler noop registered

io scheduler anticipatory registered

io scheduler deadline registered

io scheduler cfq registered

floppy0: no floppy controllers found

via-rhine.c:v1.10-LK1.2.0-2.6 June-10-2004 Written by Donald Becker

ACPI: PCI interrupt 0000:00:12.0[A] -> GSI 23 (level, low) -> IRQ 23

eth0: VIA Rhine II at 0x18000, 00:11:2f:76:d8:23, IRQ 23.

eth0: MII PHY found at address 1, status 0x786d advertising 01e1 Link 45e1.

PPP generic driver version 2.4.2

Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2

ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

Probing IDE interface ide0...

hda: ATAPI-CD ROM-DRIVE-52MAX, ATAPI CD/DVD-ROM drive

Probing IDE interface ide1...

Probing IDE interface ide2...

Probing IDE interface ide3...

Probing IDE interface ide4...

Probing IDE interface ide5...

ide0 at 0x1f0-0x1f7,0x3f6 on irq 14

hda: ATAPI 52X CD-ROM drive, 128kB Cache

Uniform CD-ROM driver Revision: 3.20

libata version 1.10 loaded.

sata_via version 1.1

ACPI: PCI interrupt 0000:00:0f.0[B] -> GSI 20 (level, low) -> IRQ 20

sata_via(0000:00:0f.0): routed to hard irq line 4

ata1: SATA max UDMA/133 cmd 0xB800 ctl 0xB402 bmdma 0xA400 irq 20

ata2: SATA max UDMA/133 cmd 0xB000 ctl 0xA802 bmdma 0xA408 irq 20

ata1: dev 0 cfg 49:2f00 82:7c6b 83:7b09 84:4003 85:7c69 86:3a01 87:4003 88:407f

ata1: dev 0 ATA, max UDMA/133, 160086528 sectors:

ata1: dev 0 configured for UDMA/133

scsi0 : sata_via

ata2: dev 0 cfg 49:2f00 82:7c6b 83:7b09 84:4003 85:7c69 86:3a01 87:4003 88:407f

ata2: dev 0 ATA, max UDMA/133, 160086528 sectors:

ata2: dev 0 configured for UDMA/133

scsi1 : sata_via

  Vendor: ATA       Model: Maxtor 6Y080M0    Rev: YAR5

  Type:   Direct-Access                      ANSI SCSI revision: 05

  Vendor: ATA       Model: Maxtor 6Y080M0    Rev: YAR5

  Type:   Direct-Access                      ANSI SCSI revision: 05

SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)

SCSI device sda: drive cache: write back

SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)

SCSI device sda: drive cache: write back

 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 >

Attached scsi disk sda at scsi0, channel 0, id 0, lun 0

SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB)

SCSI device sdb: drive cache: write back

SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB)

SCSI device sdb: drive cache: write back

 sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 sdb7 sdb8 >

Attached scsi disk sdb at scsi1, channel 0, id 0, lun 0

Attached scsi generic sg0 at scsi0, channel 0, id 0, lun 0,  type 0

Attached scsi generic sg1 at scsi1, channel 0, id 0, lun 0,  type 0

ieee1394: raw1394: /dev/raw1394 device initialized

mice: PS/2 mouse device common for all mice

md: raid1 personality registered as nr 3

md: md driver 0.90.1 MAX_MD_DEVS=256, MD_SB_DISKS=27

oprofile: using NMI interrupt.

NET: Registered protocol family 2

IP: routing cache hash table of 2048 buckets, 32Kbytes

TCP established hash table entries: 32768 (order: 7, 524288 bytes)

TCP bind hash table entries: 32768 (order: 6, 393216 bytes)

TCP: Hash tables configured (established 32768 bind 32768)

ip_conntrack version 2.1 (4095 buckets, 32760 max) - 220 bytes per conntrack

ip_tables: (C) 2000-2002 Netfilter core team

ipt_recent v0.3.1: Stephen Frost <sfrost@snowman.net>.  http://snowman.net/projects/ipt_recent/

arp_tables: (C) 2002 David S. Miller

NET: Registered protocol family 1

NET: Registered protocol family 17

md: Autodetecting RAID arrays.

md: invalid superblock checksum on sdb3

md: sdb3 has invalid sb, not importing!

md: autorun ...

md: considering sdb8 ...

md:  adding sdb8 ...

md: sdb7 has different UUID to sdb8

md: sdb6 has different UUID to sdb8

md: sdb5 has different UUID to sdb8

md: sdb1 has different UUID to sdb8

md:  adding sda8 ...

md: sda7 has different UUID to sdb8

md: sda6 has different UUID to sdb8

md: sda5 has different UUID to sdb8

md: sda3 has different UUID to sdb8

md: sda1 has different UUID to sdb8

md: created md5

md: bind<sda8>

md: bind<sdb8>

md: running: <sdb8><sda8>

md: kicking non-fresh sdb8 from array!

md: unbind<sdb8>

md: export_rdev(sdb8)

raid1: raid set md5 active with 1 out of 2 mirrors

md: considering sdb7 ...

md:  adding sdb7 ...

md: sdb6 has different UUID to sdb7

md: sdb5 has different UUID to sdb7

md: sdb1 has different UUID to sdb7

md:  adding sda7 ...

md: sda6 has different UUID to sdb7

md: sda5 has different UUID to sdb7

md: sda3 has different UUID to sdb7

md: sda1 has different UUID to sdb7

md: created md4

md: bind<sda7>

md: bind<sdb7>

md: running: <sdb7><sda7>

md: kicking non-fresh sdb7 from array!

md: unbind<sdb7>

md: export_rdev(sdb7)

raid1: raid set md4 active with 1 out of 2 mirrors

md: considering sdb6 ...

md:  adding sdb6 ...

md: sdb5 has different UUID to sdb6

md: sdb1 has different UUID to sdb6

md:  adding sda6 ...

md: sda5 has different UUID to sdb6

md: sda3 has different UUID to sdb6

md: sda1 has different UUID to sdb6

md: created md3

md: bind<sda6>

md: bind<sdb6>

md: running: <sdb6><sda6>

md: kicking non-fresh sdb6 from array!

md: unbind<sdb6>

md: export_rdev(sdb6)

raid1: raid set md3 active with 1 out of 2 mirrors

md: considering sdb5 ...

md:  adding sdb5 ...

md: sdb1 has different UUID to sdb5

md:  adding sda5 ...

md: sda3 has different UUID to sdb5

md: sda1 has different UUID to sdb5

md: created md2

md: bind<sda5>

md: bind<sdb5>

md: running: <sdb5><sda5>

md: kicking non-fresh sdb5 from array!

md: unbind<sdb5>

md: export_rdev(sdb5)

raid1: raid set md2 active with 1 out of 2 mirrors

md: considering sdb1 ...

md:  adding sdb1 ...

md: sda3 has different UUID to sdb1

md:  adding sda1 ...

md: created md0

md: bind<sda1>

md: bind<sdb1>

md: running: <sdb1><sda1>

raid1: raid set md0 active with 2 out of 2 mirrors

md: considering sda3 ...

md:  adding sda3 ...

md: created md1

md: bind<sda3>

md: running: <sda3>

md: md1: raid array is not clean -- starting background reconstruction

raid1: raid set md1 active with 1 out of 2 mirrors

md: ... autorun DONE.

md: Loading md1: /dev/sda3

md: couldn't update array info. -22

md: could not bd_claim sda3.

md: md_import_device returned -16

md: invalid superblock checksum on sdb3

md: sdb3 has invalid sb, not importing!

md: md_import_device returned -22

md: starting md1 failed

EXT3-fs: INFO: recovery required on readonly filesystem.

EXT3-fs: write access will be enabled during recovery.

EXT3-fs: recovery complete.

kjournald starting.  Commit interval 5 seconds

EXT3-fs: mounted filesystem with ordered data mode.

VFS: Mounted root (ext3 filesystem) readonly.

Freeing unused kernel memory: 204k freed

grsec: mount of proc to /proc by /bin/mount[mount:19837] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:347] uid/euid:0/0 gid/egid:0/0

grsec: mount of sysfs to /sys by /bin/mount[mount:23547] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:5108] uid/euid:0/0 gid/egid:0/0

grsec: mount of udev to /dev by /bin/mount[mount:22254] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:12540] uid/euid:0/0 gid/egid:0/0

grsec: mount of devpts to /dev/pts by /bin/mount[mount:15005] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:29882] uid/euid:0/0 gid/egid:0/0

Adding 1004052k swap on /dev/sda2.  Priority:-1 extents:1

Adding 1004052k swap on /dev/sdb2.  Priority:-2 extents:1

EXT3 FS on md1, internal journal

grsec: mount of /dev/md1 to / by /bin/mount[mount:26286] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:17762] uid/euid:0/0 gid/egid:0/0

EXT3-fs warning: maximal mount count reached, running e2fsck is recommended

kjournald starting.  Commit interval 5 seconds

EXT3 FS on md2, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

grsec: mount of /dev/md2 to /tmp by /bin/mount[mount:24110] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:23375] uid/euid:0/0 gid/egid:0/0

EXT3-fs warning: maximal mount count reached, running e2fsck is recommended

kjournald starting.  Commit interval 5 seconds

EXT3 FS on md3, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

grsec: mount of /dev/md3 to /var by /bin/mount[mount:24110] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:23375] uid/euid:0/0 gid/egid:0/0

EXT3-fs warning: maximal mount count reached, running e2fsck is recommended

kjournald starting.  Commit interval 5 seconds

EXT3 FS on md4, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

grsec: mount of /dev/md4 to /home by /bin/mount[mount:24110] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:23375] uid/euid:0/0 gid/egid:0/0

kjournald starting.  Commit interval 5 seconds

EXT3 FS on md5, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

grsec: mount of /dev/md5 to /usr by /bin/mount[mount:24110] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:23375] uid/euid:0/0 gid/egid:0/0

grsec: mount of shm to /dev/shm by /bin/mount[mount:24110] uid/euid:0/0 gid/egid:0/0, parent /sbin/rc[rc:23375] uid/euid:0/0 gid/egid:0/0

```

It might be important to note that things usually crash whenever I use the drives intensively, like when downloading a large file.

----------

## NeddySeagoon

Prospero,

Your 6Y080M0 drives are not in the kernel blacklist, which is good.

Do you need to run with grsec ?

It may be getting in the way of something; I must admit I'm guessing. I run a pair of Maxtor 6B300S0 drives in a mix of RAID1 and RAID0 with no problems on a SIL3112A SATA chip, whereas you have the VIA SATA controller.

Does your motherboard have another SATA channel you can try?

Can you run memtest for a few hours?

Why are you running an SMP kernel on an AMD Athlon(TM) XP 1600+?

It clutters up the kernel and a few drivers are broken on SMP. It may be worth taking that out of the kernel and seeing if things improve.

Other things to try are not using the APIC and turning off ACPI (power management) if it's on.

----------

## Prospero

Well, it seems my kernel needs an overhaul... I have no idea how SMP got turned on, but I'm disabling it ASAP. As for grsec, I guess I don't need it - a few people besides myself use the computer, but they're not exactly malicious, and the system is quite secure without it - I'll see if turning that off has any effect.

As for memtest, I'll give it a try. As for extra SATA channels, I don't think there are any, but I'll have to look it up.

ACPI is not enabled, so it's unlikely that's causing any problems.

Once again, thanks for your help.

----------

## Prospero

I made an "interesting" discovery.

I shut down the system last night, and when I restarted it this morning and ran cat /proc/mdstat, it had reversed the drive names. Previously it told me that /dev/sdb was the faulty drive; now it's telling me /dev/sda is faulty.

Not sure if this means anything, but it seems a bit weird.

----------

## MrUlterior

 *Prospero wrote:*   

> I made an "interesting" discovery
> 
> I shut down the system last night, and when I restarted it this morning and ran cat /proc/mdstat, it had reversed the drive names. Previously it told me that /dev/sdb was the faulty drive; now it's telling me /dev/sda is faulty.
> 
> Not sure if this means anything, but it seems a bit weird.

 

I had this issue with my fileserver running 2.6.11.* -- I recently rebuilt the server completely with 2.6.13 and a more recent LiveCD, ran "emerge mdadm && emerge --sync && emerge -uND world" before re-creating the RAID arrays, and the problem has been resolved. I suspect it was the kernel in conjunction with the mdadm tools version, but I've nothing to prove that ...

----------

## Prospero

Well, I tried out your suggestion, MrUlterior, and so far the RAID drives seem to be remaining stable. I'll try doing something intensive (like a 4 GB file download) to see if they stay that way.
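For a repeatable stress test, something crude like this (the path and size are just examples) hammers a mirrored partition harder than a download would:

```shell
# Write a 4 GB file to a mirrored filesystem, flush it to disk,
# read it back, then check whether any array dropped a member
dd if=/dev/zero of=/home/stresstest.bin bs=1M count=4096
sync
md5sum /home/stresstest.bin
cat /proc/mdstat
rm /home/stresstest.bin
```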

----------

## Prospero

OK, the RAID partitions just gave way again - they just went through the ultimate stress test: cleaning a mail queue that hadn't been cleaned in two months while uploading a 3 GB file to another computer on the LAN over SCP.

 *Quote:*   

> # cat /proc/mdstat
> 
> Personalities : [raid1]
> 
> md1 : active raid1 sdb3[1] sda3[2](F)
> ...

 

I just pulled this out of kern.log:

 *Quote:*   

> 
> 
> Nov 28 23:36:15 Saidin ata1: status=0x51 { DriveReady SeekComplete Error }
> 
> Nov 28 23:36:15 Saidin ata1: error=0x84 { DriveStatusError BadCRC }
> ...

 

Looks pretty bad

----------

## NeddySeagoon

Prospero,

Yes, that's a bad sign. Time to go to the Maxtor site and get their diagnostic software.

If you run that and it fails, it will offer to print you a Return Material Authorization, if the drive is still in warranty.
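For what it's worth, BadCRC errors usually point at the link (cable, connector or controller) rather than the platters. If SMART pass-through happens to work on your controller, most drives keep a counter for exactly this (attribute 199, UDMA_CRC_Error_Count):

```shell
# A rising CRC error count implicates the SATA link,
# not the disk surface itself
smartctl -A /dev/sda | grep -i crc
smartctl -A /dev/sdb | grep -i crc
```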

----------

## Prospero

Their diagnostics software doesn't support my VIA KT600 controller...

I think I'm still under warranty, but this isn't the first time I've sent a Maxtor drive back - though it sure as hell is the last time I buy one.

----------

## MrUlterior

 *Prospero wrote:*   

> Their diagnostics software doesn't support my VIA KT600 controller...
> 
> I think I still have warranty, but this isn't the first time I send a Maxtor drive back - though it sure as hell is the last time I buy one

 

Before returning them, I'd recommend changing the cables and testing the drives with another motherboard. It could equally be a cable or controller problem.

----------

## MrUlterior

Actually, scratch the cable suggestion - I'd forgotten those are SATA ..

----------

## Prospero

I've just tested both drives in another computer with the Maxtor diagnostics software.

Result: Both drives pass every test

---

So I'm back at square one.

----------

## Prospero

Some new insight.

I did some digging around in my motherboard manual and found out there's an entire BIOS menu for setting up RAID drives.

(insert smiley of me smashing my head against the wall)

So I figure this is what happened:

1) Lacking experience with RAID, I read that RAID1 is a software solution, so I never think about setting up the hardware beyond plugging the drives in.

2) I load up the Gentoo LiveCD and install stuff, following the instructions on setting up software raid carefully.

3) I assemble the array with mdadm and copy data from one drive to the other.

4) The system goes into use, and I install stuff.

5) Then I start using the drives intensively - since I never set RAID up on the mobo, it doesn't properly handle the writing of data, so inconsistencies emerge. Then it does an integrity check, finds out the drives are no longer mirrored, and BOOM! One drive is ejected from the array.

6) I rebuild the drive, and it functions again until step 5 happens again. The funny part about my whole problem is that it wasn't always the same drive failing, and this would explain that.

Any thoughts on this?

----------

## NeddySeagoon

Prospero,

You chose to use kernel software RAID over BIOS software RAID (a good choice).

Those are two different ways of implementing software RAID, and they are mutually incompatible.

The BIOS setting for RAID is not important - you don't have real hardware RAID.
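You can confirm the arrays are plain kernel md software RAID by inspecting the md superblocks, e.g.:

```shell
# Show the md superblock written on a member partition
mdadm --examine /dev/sda3

# Show the state and members of an assembled array
mdadm --detail /dev/md1
```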

----------

## Prospero

Well, you can't blame me for coming to this conclusion :D

1) Stuff goes wrong with the hard drives.

2) I check the hard drives for errors using the diagnostics tool, which finds nothing (i.e. the drives are OK).

3) I recheck with different SATA cables; the result is the same (i.e. the cables are OK too).

4) Only two things are left that can be wrong - the motherboard and the software - so I check the mobo manual and find RAID settings I never touched.

Since the problem was likely to come from the mobo, and I found something "wrong" with the mobo settings, it seemed logical to try changing that.

So I did, and so far nothing has gone wrong - both drives are up, and I put them under a little stress test (downloading large files while doing an emerge --sync and an emerge -uD world afterwards).

The only thing that changed after altering the motherboard setting (or rather the SATA controller setting, since it is separate from the main motherboard BIOS) is that it now reports both drives as being in a RAID1 set. The operating system still sees them as separate drives (which only supports my theory that the drivers for my controller require prior setup when using RAID - but I'm no expert, so I might just as well be talking nonsense).

Anyway, I'm not celebrating yet - things can still go wrong, in which case I'll pop up in this thread again. (I'm not confident enough to put [Solved] in the subject title.)

In any case, thanks for all your help NeddySeagoon, and of course MrUlterior.

----------

