# random, unexplained hard lockups?

## number_nine

My system has been experiencing random, hard (must physically reboot) lockups over the last year or so.  The lockups are thus far completely unpredictable, and they always occur when I'm not at my computer (during the night, while I'm at work, etc.).  When the computer goes into this hard-lockup state, the monitor is blank (but not in power-save mode), the computer still responds to pings, but I cannot ssh into it.

I just ran 14 hours of memtest86+ and found no errors.

I also checked the logs---nothing unusual there (I can't even pinpoint exactly when the lockups occur).
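Since the logs don't show when a lockup happens, one option is a small heartbeat logger: it appends a timestamp to a file at a fixed interval and forces it to disk, so after a reboot the last line gives an approximate time for the freeze. This is only a sketch; the function names, the file location and the 60-second interval are assumptions made for illustration.

```python
import os
import tempfile
import time
from datetime import datetime


def write_heartbeat(path, when=None):
    """Append one ISO timestamp line and force it out to disk."""
    when = when or datetime.now()
    with open(path, "a") as log:
        log.write(when.isoformat(timespec="seconds") + "\n")
        log.flush()
        os.fsync(log.fileno())  # make sure the line survives a hard reset


def last_heartbeat(path):
    """Return the newest timestamp in the log, or None if the log is empty."""
    with open(path) as log:
        lines = [line.strip() for line in log if line.strip()]
    return datetime.fromisoformat(lines[-1]) if lines else None


def run(path, interval_seconds=60, beats=None):
    """Write heartbeats forever, or only `beats` times when a count is given."""
    count = 0
    while beats is None or count < beats:
        write_heartbeat(path)
        count += 1
        time.sleep(interval_seconds)


# Short demonstration in a temporary directory (three beats, no waiting).
demo_path = os.path.join(tempfile.mkdtemp(), "heartbeat.log")
run(demo_path, interval_seconds=0, beats=3)
print("last heartbeat:", last_heartbeat(demo_path))
```

In real use, `run()` would be left going in the background with a path on a local disk; after the next forced reboot, `last_heartbeat()` narrows the lockup down to within one interval.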

Even worse, my computer may be fine for weeks or even months (i.e. completely stable), then suddenly start locking up about once a day.

Does anyone have any idea what the problem may be?  For what it's worth, I have a very high ERR count in /proc/interrupts:

```
# uptime
08:58:35 up  1:29, 12 users,  load average: 1.22, 1.28, 1.20
# cat /proc/interrupts
           CPU0
  0:    5391962          XT-PIC  timer
  1:       3486          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  5:     481356          XT-PIC  sym53c8xx, NVidia nForce2, ohci1394
  8:          2          XT-PIC  rtc
  9:          0          XT-PIC  acpi
 10:          0          XT-PIC  ohci_hcd
 11:     534284          XT-PIC  sym53c8xx, ohci_hcd, ehci_hcd, eth0, nvidia
 12:     115771          XT-PIC  i8042
 14:        473          XT-PIC  ide0
 15:         11          XT-PIC  ide1
NMI:          0
LOC:    5391944
ERR:      33336
MIS:          0
```

Note that the machine has only been up for 90 minutes and it has already logged 33k ERRs (though I don't know exactly what that means; my other two nforce2 boards have a zero ERR count).
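To put a number on that, the ERR count can be turned into a rate. The sketch below parses the two outputs pasted above; the helper names are made up for illustration, and the uptime parser only handles the `up H:MM` form shown here (no "days" field).

```python
import re


def summary_counter(interrupts_text, name="ERR"):
    """Return the integer on the `NAME:` summary line, or None if it is absent."""
    match = re.search(
        rf"^\s*{re.escape(name)}:\s+(\d+)", interrupts_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def uptime_minutes(uptime_line):
    """Parse the 'up H:MM' part of `uptime` output into minutes."""
    match = re.search(r"up\s+(\d+):(\d+)", uptime_line)
    if match is None:
        raise ValueError("unrecognised uptime format")
    return int(match.group(1)) * 60 + int(match.group(2))


# Values copied from the output above.
SAMPLE_INTERRUPTS = """\
NMI:          0
LOC:    5391944
ERR:      33336
MIS:          0
"""
SAMPLE_UPTIME = "08:58:35 up  1:29, 12 users,  load average: 1.22, 1.28, 1.20"

err = summary_counter(SAMPLE_INTERRUPTS)
minutes = uptime_minutes(SAMPLE_UPTIME)
err_per_minute = err / minutes
print(f"{err} ERRs in {minutes} minutes, about {err_per_minute:.0f} per minute")
```

Running the same calculation on a second sample taken later would show whether the rate is steady or comes in bursts.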

For what it's worth, this computer has the following hardware: Asus A7N8X Deluxe, AMD Athlon XP 2500 (Barton core), 2x512 MB RAM, GeForce4 ti4200 AGP 8x video card, LSI Logic SCSI controller, Fujitsu SCSI Drive, Samsung IDE drive.

Another idea, I see the following in my dmesg:

```
PCI: Using ACPI for IRQ routing
** PCI interrupts are no longer routed automatically.  If this
** causes a device to stop working, it is probably because the
** driver failed to call pci_enable_device().  As a temporary
** workaround, the "pci=routeirq" argument restores the old
** behavior.  If this argument makes the device work again,
** please email the output of "lspci" to bjorn.helgaas@hp.com
** so I can fix the driver.
```

In my kernel config, I have Processor Type and Features -> Local APIC support on uniprocessors and IO-APIC support on uniprocessors both enabled.  However, as you can see above, the kernel is still using XT-PIC.  My other two nforce2 boards (with the same kernel config) use IO-APIC.  I'm not sure exactly what all this means, but it may mean something to somebody.   :Smile: 
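For comparing several machines quickly, the controller column of `/proc/interrupts` can be read mechanically. This is a rough sketch that assumes a single-CPU layout like the one above and that IO-APIC rows carry a controller name starting with `IO-APIC`; the function names are invented for the example.

```python
import re

# One IRQ row on a single-CPU box: "<irq>: <count> <controller> <devices>"
IRQ_ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\S+)\s+(.+?)\s*$")


def controllers_in_use(interrupts_text):
    """Return the set of controller names found in the numbered IRQ rows."""
    found = set()
    for line in interrupts_text.splitlines():
        row = IRQ_ROW.match(line)
        if row:
            found.add(row.group(3))
    return found


def routing_mode(interrupts_text):
    """Classify the output as 'IO-APIC', 'XT-PIC' or 'unknown'."""
    names = controllers_in_use(interrupts_text)
    if any(name.startswith("IO-APIC") for name in names):
        return "IO-APIC"
    if "XT-PIC" in names:
        return "XT-PIC"
    return "unknown"


# A few rows copied from the output earlier in this post.
SAMPLE = """\
  0:    5391962          XT-PIC  timer
  5:     481356          XT-PIC  sym53c8xx, NVidia nForce2, ohci1394
 11:     534284          XT-PIC  sym53c8xx, ohci_hcd, ehci_hcd, eth0, nvidia
ERR:      33336
"""
print(routing_mode(SAMPLE))
```

Summary lines such as `ERR:` are skipped because they do not start with an IRQ number.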

Thanks for any help or suggestions!

The following is my complete dmesg:

```

ter vendor identify, caps: 0383fbff c1c3fbff 00000000 00000000 00000000 00000000 00000000

CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)

CPU: L2 Cache: 512K (64 bytes/line)

CPU: After all inits, caps: 0383fbff c1c3fbff 00000000 00000020 00000000 00000000 00000000

Intel machine check architecture supported.

Intel machine check reporting enabled on CPU#0.

CPU: AMD Athlon(tm) XP 2500+ stepping 00

Enabling fast FPU save and restore... done.

Enabling unmasked SIMD FPU exception support... done.

Checking 'hlt' instruction... OK.

ACPI: setting ELCR to 0200 (from 0820)

NET: Registered protocol family 16

PCI: PCI BIOS revision 2.10 entry at 0xfb490, last bus=3

PCI: Using configuration type 1

mtrr: v2.0 (20020519)

ACPI: Subsystem revision 20050211

ACPI: Interpreter enabled

ACPI: Using PIC for interrupt routing

ACPI: PCI Root Bridge [PCI0] (00:00)

PCI: Probing PCI hardware (bus 00)

PCI: nForce2 C1 Halt Disconnect fixup

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB0._PRT]

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGPB._PRT]

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB1._PRT]

ACPI: PCI Interrupt Link [LNK1] (IRQs 3 4 5 6 7 10 *11 12 14 15)

ACPI: PCI Interrupt Link [LNK2] (IRQs 3 4 *5 6 7 10 11 12 14 15)

ACPI: PCI Interrupt Link [LNK3] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.

ACPI: PCI Interrupt Link [LNK4] (IRQs 3 4 5 6 7 10 *11 12 14 15)

ACPI: PCI Interrupt Link [LNK5] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.

ACPI: PCI Interrupt Link [LUBA] (IRQs 3 4 5 6 7 10 *11 12 14 15)

ACPI: PCI Interrupt Link [LUBB] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.

ACPI: PCI Interrupt Link [LMAC] (IRQs 3 4 *5 6 7 10 11 12 14 15)

ACPI: PCI Interrupt Link [LAPU] (IRQs 3 4 5 6 7 10 *11 12 14 15)

ACPI: PCI Interrupt Link [LACI] (IRQs 3 4 *5 6 7 10 11 12 14 15)

ACPI: PCI Interrupt Link [LMCI] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.

ACPI: PCI Interrupt Link [LSMB] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.

ACPI: PCI Interrupt Link [LUB2] (IRQs 3 4 5 6 7 10 *11 12 14 15)

ACPI: PCI Interrupt Link [LFIR] (IRQs 3 4 *5 6 7 10 11 12 14 15)

ACPI: PCI Interrupt Link [L3CM] (IRQs 3 4 5 6 7 10 *11 12 14 15)

ACPI: PCI Interrupt Link [LIDE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.

ACPI: PCI Interrupt Link [APC1] (IRQs *16), disabled.

ACPI: PCI Interrupt Link [APC2] (IRQs *17), disabled.

ACPI: PCI Interrupt Link [APC3] (IRQs *18), disabled.

ACPI: PCI Interrupt Link [APC4] (IRQs *19), disabled.

ACPI: PCI Interrupt Link [APC5] (IRQs *16), disabled.

ACPI: PCI Interrupt Link [APCF] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCG] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCH] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCI] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCJ] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCK] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCS] (IRQs *23), disabled.

ACPI: PCI Interrupt Link [APCL] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCM] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [AP3C] (IRQs 20 21 22) *0, disabled.

ACPI: PCI Interrupt Link [APCZ] (IRQs 20 21 22) *0, disabled.

SCSI subsystem initialized

usbcore: registered new driver usbfs

usbcore: registered new driver hub

PCI: Using ACPI for IRQ routing

** PCI interrupts are no longer routed automatically.  If this

** causes a device to stop working, it is probably because the

** driver failed to call pci_enable_device().  As a temporary

** workaround, the "pci=routeirq" argument restores the old

** behavior.  If this argument makes the device work again,

** please email the output of "lspci" to bjorn.helgaas@hp.com

** so I can fix the driver.

spurious 8259A interrupt: IRQ7.

Machine check exception polling timer started.

highmem bounce pool size: 64 pages

inotify device minor=63

devfs: 2004-01-31 Richard Gooch (rgooch@atnf.csiro.au)

devfs: boot_options: 0x0

Installing knfsd (copyright (C) 1996 okir@monad.swb.de).

SGI XFS with ACLs, security attributes, no debug enabled

SGI XFS Quota Management subsystem

Initializing Cryptographic API

Real Time Clock Driver v1.12

Hangcheck: starting hangcheck timer 0.5.0 (tick is 180 seconds, margin is 60 seconds).

vesafb: NVIDIA Corporation, NV28 Board, Chip Rev A2 (OEM: NVIDIA)

vesafb: VBE version: 3.0

vesafb: protected mode interface info at c000:ea60

vesafb: pmi: set display start = c00cea96, set palette = c00ceb00

vesafb: pmi: ports = 3b4 3b5 3ba 3c0 3c1 3c4 3c5 3c6 3c7 3c8 3c9 3cc 3ce 3cf 3d0 3d1 3d2 3d3 3d4 3d5 3da 

vesafb: hardware doesn't support DCC transfers

vesafb: monitor limits: vf = 0 Hz, hf = 0 kHz, clk = 0 MHz

vesafb: scrolling: redraw

Console: switching to colour frame buffer device 128x48

vesafb: framebuffer at 0xd0000000, mapped to 0xf8880000, using 1536k, total 131072k

fb0: VESA VGA frame buffer device

vga16fb: initializing

vga16fb: mapped to 0xc00a0000

fb1: VGA16 VGA frame buffer device

serio: i8042 AUX port at 0x60,0x64 irq 12

serio: i8042 KBD port at 0x60,0x64 irq 1

Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled

ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A

ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A

mice: PS/2 mouse device common for all mice

input: AT Translated Set 2 keyboard on isa0060/serio0

input: ImExPS/2 Logitech Explorer Mouse on isa0060/serio1

io scheduler noop registered

io scheduler anticipatory registered

io scheduler deadline registered

io scheduler cfq registered

RAMDISK driver initialized: 16 RAM disks of 8192K size 1024 blocksize

ACPI: PCI Interrupt Link [L3CM] enabled at IRQ 11

PCI: setting IRQ 11 as level-triggered

ACPI: PCI interrupt 0000:02:01.0[A] -> GSI 11 (level, low) -> IRQ 11

3c59x: Donald Becker and others. www.scyld.com/network/vortex.html

0000:02:01.0: 3Com PCI 3c920 Tornado at 0xa000. Vers LK1.1.19

forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.31.

ACPI: PCI Interrupt Link [LMAC] enabled at IRQ 5

PCI: setting IRQ 5 as level-triggered

ACPI: PCI interrupt 0000:00:04.0[A] -> GSI 5 (level, low) -> IRQ 5

PCI: Setting latency timer of device 0000:00:04.0 to 64

eth1: forcedeth.c: subsystem: 01043:80a7 bound to 0000:00:04.0

Equalizer2002: Simon Janes (simon@ncm.com) and David S. Miller (davem@redhat.com)

Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2

ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx

NFORCE2: IDE controller at PCI slot 0000:00:09.0

NFORCE2: chipset revision 162

NFORCE2: not 100% native mode: will probe irqs later

NFORCE2: BIOS didn't set cable bits correctly. Enabling workaround.

NFORCE2: 0000:00:09.0 (rev a2) UDMA133 controller

    ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:DMA

    ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:DMA

Probing IDE interface ide0...

hda: SAMSUNG SP1614N, ATA DISK drive

ide0 at 0x1f0-0x1f7,0x3f6 on irq 14

Probing IDE interface ide1...

hdd: _NEC DVD_RW ND-2510A, ATAPI CD/DVD-ROM drive

ide1 at 0x170-0x177,0x376 on irq 15

Probing IDE interface ide2...

Probing IDE interface ide3...

Probing IDE interface ide4...

Probing IDE interface ide5...

hda: max request size: 1024KiB

hda: 312581808 sectors (160041 MB) w/8192KiB Cache, CHS=19457/255/63, UDMA(100)

hda: cache flushes supported

 /dev/ide/host0/bus0/target0/lun0: p1 p2 p3

hdd: ATAPI 40X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)

Uniform CD-ROM driver Revision: 3.20

ACPI: PCI Interrupt Link [LNK1] enabled at IRQ 11

ACPI: PCI interrupt 0000:01:0a.0[A] -> GSI 11 (level, low) -> IRQ 11

sym0: <1010-33> rev 0x1 at pci 0000:01:0a.0 irq 11

sym0: Symbios NVRAM, ID 7, Fast-80, LVD, parity checking

sym0: open drain IRQ line driver, using on-chip SRAM

sym0: using LOAD/STORE-based firmware.

sym0: handling phase mismatch from SCRIPTS.

sym0: SCSI BUS has been reset.

scsi0 : sym-2.1.18n

  Vendor: FUJITSU   Model: MAP3367NP         Rev: 0106

  Type:   Direct-Access                      ANSI SCSI revision: 03

sym0:15:0: tagged command queuing enabled, command queue depth 16.

 target0:0:15: Beginning Domain Validation

sym0:15: wide asynchronous.

sym0:15: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 62)

 target0:0:15: Ending Domain Validation

ACPI: PCI Interrupt Link [LNK2] enabled at IRQ 5

ACPI: PCI interrupt 0000:01:0a.1[B] -> GSI 5 (level, low) -> IRQ 5

sym1: <1010-33> rev 0x1 at pci 0000:01:0a.1 irq 5

sym1: Symbios NVRAM, ID 7, Fast-80, SE, parity checking

sym1: open drain IRQ line driver, using on-chip SRAM

sym1: using LOAD/STORE-based firmware.

sym1: handling phase mismatch from SCRIPTS.

sym1: SCSI BUS has been reset.

scsi1 : sym-2.1.18n

SCSI device sda: 71775284 512-byte hdwr sectors (36749 MB)

SCSI device sda: drive cache: write back

SCSI device sda: 71775284 512-byte hdwr sectors (36749 MB)

SCSI device sda: drive cache: write back

 /dev/scsi/host0/bus0/target15/lun0: p1 p2 p3 p4 < p5 p6 p7 p8 >

Attached scsi disk sda at scsi0, channel 0, id 15, lun 0

Attached scsi generic sg0 at scsi0, channel 0, id 15, lun 0,  type 0

usbcore: registered new driver hiddev

usbcore: registered new driver usbhid

drivers/usb/input/hid-core.c: v2.0:USB HID core driver

NET: Registered protocol family 2

IP: routing cache hash table of 8192 buckets, 64Kbytes

TCP established hash table entries: 262144 (order: 9, 2097152 bytes)

TCP bind hash table entries: 65536 (order: 6, 262144 bytes)

TCP: Hash tables configured (established 262144 bind 65536)

NET: Registered protocol family 1

NET: Registered protocol family 10

IPv6 over IPv4 tunneling driver

NET: Registered protocol family 17

EXT3-fs: INFO: recovery required on readonly filesystem.

EXT3-fs: write access will be enabled during recovery.

EXT3-fs: recovery complete.

kjournald starting.  Commit interval 5 seconds

EXT3-fs: mounted filesystem with ordered data mode.

VFS: Mounted root (ext3 filesystem) readonly.

Freeing unused kernel memory: 192k freed

Adding 1501176k swap on /dev/sda2.  Priority:-1 extents:1

EXT3 FS on sda3, internal journal

Linux agpgart interface v0.100 (c) Dave Jones

nvidia: module license 'NVIDIA' taints kernel.

ACPI: PCI Interrupt Link [LNK4] enabled at IRQ 11

ACPI: PCI interrupt 0000:03:00.0[A] -> GSI 11 (level, low) -> IRQ 11

NVRM: loading NVIDIA Linux x86 NVIDIA Kernel Module  1.0-7664  Wed May 25 10:47:55 PDT 2005

i2c /dev entries driver

i2c_adapter i2c-1: nForce2 SMBus adapter at 0x5000

i2c_adapter i2c-2: nForce2 SMBus adapter at 0x5500

ACPI: PCI Interrupt Link [LACI] enabled at IRQ 5

ACPI: PCI interrupt 0000:00:06.0[A] -> GSI 5 (level, low) -> IRQ 5

PCI: Setting latency timer of device 0000:00:06.0 to 64

intel8x0_measure_ac97_clock: measured 49140 usecs

intel8x0: clocking to 47491

Realtime LSM initialized (group 18, mlock=1)

EXT3-fs warning: maximal mount count reached, running e2fsck is recommended

kjournald starting.  Commit interval 5 seconds

EXT3 FS on sda5, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

EXT3-fs warning: maximal mount count reached, running e2fsck is recommended

kjournald starting.  Commit interval 5 seconds

EXT3 FS on sda6, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

EXT3-fs warning: maximal mount count reached, running e2fsck is recommended

kjournald starting.  Commit interval 5 seconds

EXT3 FS on sda7, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

EXT3-fs warning: maximal mount count reached, running e2fsck is recommended

kjournald starting.  Commit interval 5 seconds

EXT3 FS on sda8, internal journal

EXT3-fs: recovery complete.

EXT3-fs: mounted filesystem with ordered data mode.

EXT3-fs: mounted filesystem with ordered data mode.

kjournald starting.  Commit interval 5 seconds

EXT3-fs: mounted filesystem with ordered data mode.

kjournald starting.  Commit interval 5 seconds

EXT3-fs: INFO: recovery required on readonly filesystem.

EXT3-fs: write access will be enabled during recovery.

EXT3-fs: recovery complete.

kjournald starting.  Commit interval 5 seconds

EXT3-fs: mounted filesystem with ordered data mode.

Disabled Privacy Extensions on device c04eacc0(lo)

agpgart: Detected NVIDIA nForce2 chipset

agpgart: Maximum main memory to use for agp memory: 941M

agpgart: AGP aperture is 64M @ 0xd8000000

ohci_hcd: 2004 Nov 08 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)

ACPI: PCI Interrupt Link [LUBA] enabled at IRQ 11

ACPI: PCI interrupt 0000:00:02.0[A] -> GSI 11 (level, low) -> IRQ 11

ohci_hcd 0000:00:02.0: nVidia Corporation nForce2 USB Controller

PCI: Setting latency timer of device 0000:00:02.0 to 64

ohci_hcd 0000:00:02.0: irq 11, pci mem 0xe2087000

ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 1

hub 1-0:1.0: USB hub found

hub 1-0:1.0: 3 ports detected

ACPI: PCI Interrupt Link [LUBB] enabled at IRQ 10

PCI: setting IRQ 10 as level-triggered

ACPI: PCI interrupt 0000:00:02.1[B] -> GSI 10 (level, low) -> IRQ 10

ohci_hcd 0000:00:02.1: nVidia Corporation nForce2 USB Controller (#2)

PCI: Setting latency timer of device 0000:00:02.1 to 64

ohci_hcd 0000:00:02.1: irq 10, pci mem 0xe2082000

ohci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 2

hub 2-0:1.0: USB hub found

hub 2-0:1.0: 3 ports detected

usb 1-2: new full speed USB device using ohci_hcd and address 2

drivers/usb/class/usblp.c: usblp0: USB Bidirectional printer dev 2 if 0 alt 1 proto 2 vid 0x03F0 pid 0x1D17

usbcore: registered new driver usblp

drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver

ACPI: PCI Interrupt Link [LUB2] enabled at IRQ 11

ACPI: PCI interrupt 0000:00:02.2[C] -> GSI 11 (level, low) -> IRQ 11

ehci_hcd 0000:00:02.2: nVidia Corporation nForce2 USB Controller

PCI: Setting latency timer of device 0000:00:02.2 to 64

ehci_hcd 0000:00:02.2: irq 11, pci mem 0xe2083000

ehci_hcd 0000:00:02.2: new USB bus registered, assigned bus number 3

PCI: cache line size of 64 is not supported by device 0000:00:02.2

ehci_hcd 0000:00:02.2: park 0

ehci_hcd 0000:00:02.2: USB 2.0 initialized, EHCI 1.00, driver 10 Dec 2004

usb 1-2: USB disconnect, address 2

drivers/usb/class/usblp.c: usblp0: removed

hub 3-0:1.0: USB hub found

hub 3-0:1.0: 6 ports detected

usb 1-2: new full speed USB device using ohci_hcd and address 3

drivers/usb/class/usblp.c: usblp0: USB Bidirectional printer dev 3 if 0 alt 1 proto 2 vid 0x03F0 pid 0x1D17

ieee1394: Initialized config rom entry `ip1394'

ohci1394: $Rev: 1223 $ Ben Collins <bcollins@debian.org>

ACPI: PCI Interrupt Link [LFIR] enabled at IRQ 5

ACPI: PCI interrupt 0000:00:0d.0[A] -> GSI 5 (level, low) -> IRQ 5

PCI: Setting latency timer of device 0000:00:0d.0 to 64

ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[5]  MMIO=[e2084000-e20847ff]  Max Packet=[2048]

USB Universal Host Controller Interface driver v2.2

ACPI: PCI interrupt 0000:02:01.0[A] -> GSI 11 (level, low) -> IRQ 11

parport0: PC-style at 0x378 (0x778) [PCSPP(,...)]

parport0: irq 7 detected

lp0: using parport0 (polling).

program hddtemp is using a deprecated SCSI ioctl, please convert it to SG_IO

nfs warning: mount version older than kernel

nfs warning: mount version older than kernel

eth0: no IPv6 routers present

agpgart: Found an AGP 3.0 compliant device at 0000:00:00.0.

agpgart: Putting AGP V3 device at 0000:00:00.0 into 8x mode

agpgart: Putting AGP V3 device at 0000:03:00.0 into 8x mode

agpgart: Found an AGP 3.0 compliant device at 0000:00:00.0.

agpgart: Putting AGP V3 device at 0000:00:00.0 into 8x mode

agpgart: Putting AGP V3 device at 0000:03:00.0 into 8x mode

```

----------

## RayDude

I'm using an NForce2 mobo with a plain athlon XP 2000+

```
server brian # cat /proc/interrupts
           CPU0
  0:    8263984          XT-PIC  timer
  1:          8          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  3:       1252          XT-PIC  ohci_hcd
  4:          0          XT-PIC  ohci_hcd
  5:          2          XT-PIC  ehci_hcd
  7:      15100          XT-PIC  parport0
  9:          0          XT-PIC  acpi
 11:     856411          XT-PIC  eth0
 12:          1          XT-PIC  NVidia nForce2
 14:      21154          XT-PIC  ide0
 15:         24          XT-PIC  ide1
NMI:          0
LOC:    5227263
ERR:         23
server brian # uptime
 15:21:26 up  2:17,  1 user,  load average: 0.21, 0.10, 0.03
```

I have errors... Not as many as you though...

I get no errors from my P4 at work though...

I don't know what ERR: means. It may have nothing to do with your lock up problem.

Answer some questions for me:

1. What kernel are you running?

2. Is X running? If so what video drivers are you using? What GUI are you using?

3. When the machine hard locks, does pressing the num-lock key cause the num-lock led to turn on and off?

If it does, then it's not completely dead (being able to ping seems to imply it's not completely dead as well, but I'm not positive about that).

4. Does your NFORCE 2 Mobo have a fan on the chipset? If so, is it spinning? If it is spinning, does it appear to be making good contact with the north bridge chip?

5. How many watts is your power supply?

6. Have you tried reseating the RAM (and maybe even the CPU?)

7. IF this is a homebuilt system, have you made sure to have good thermal compound between the CPU and heatsink?

8. Have you checked to make sure your scsi controller is plugged all the way in?

9. Have you tried changing the scsi card to another slot? (it could be a shared interrupt problem)

10. Have you tried running from IDE only to see if the problem goes away? (implication is the scsi card or scsi driver is at fault)

11. Have you tried running prime95 in windows to see if it has stability problems? (If so, then it's a hardware problem for sure.)

12. Did you ever try to overclock this system? (if so you could have damaged something)

13. Try disconnecting the floppy (if you have one) and the CD and/or DVD drives to see if it's more stable (it could be a bad mobo, and being more stable after disconnecting the floppy and CDs would support that theory).

14. What kind of RAM do you have? If it doesn't have a brand name, consider buying something better. (although memtest should catch a memory issue).

15. Are you tweaking the memory timings in the bios? If so go set it to "set by SPD" to make sure timing is correct.

16. Try reseating the ATX power connector (and the little four-pin extension, if the mobo has one).
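On question 9, a quick way to see which devices share an interrupt line is to pull the multi-device rows out of `/proc/interrupts`. The sketch below uses rows copied from the output in the first post; the function name is made up, and the parsing assumes the single-CPU column layout shown there.

```python
import re

# "<irq>: <count> <controller> <comma-separated devices>"
IRQ_ROW = re.compile(r"^\s*(\d+):\s+\d+\s+\S+\s+(.+?)\s*$")


def shared_irqs(interrupts_text):
    """Map IRQ number -> device list, for lines with more than one device."""
    shared = {}
    for line in interrupts_text.splitlines():
        row = IRQ_ROW.match(line)
        if not row:
            continue  # header or summary line such as "ERR:"
        devices = [name.strip() for name in row.group(2).split(",")]
        if len(devices) > 1:
            shared[int(row.group(1))] = devices
    return shared


SAMPLE = """\
  0:    5391962          XT-PIC  timer
  5:     481356          XT-PIC  sym53c8xx, NVidia nForce2, ohci1394
 10:          0          XT-PIC  ohci_hcd
 11:     534284          XT-PIC  sym53c8xx, ohci_hcd, ehci_hcd, eth0, nvidia
 14:        473          XT-PIC  ide0
"""
for irq, devices in sorted(shared_irqs(SAMPLE).items()):
    print(f"IRQ {irq}: {', '.join(devices)}")
```

With those rows, both channels of the SCSI controller (`sym53c8xx`) sit on lines shared with other devices, which is why moving the card is worth a try.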

This is off the top of my head. It does sound like you have flaky hardware there. If you can isolate it, you'll be able to fix it.

Raydude

----------

## solszew

Hello--

In the 'Desktop Environments' forum, there is a long thread that discusses possible reasons why some people using both xorg and the proprietary Nvidia drivers get random, hard lockups.  You might have a look over there and see if they might have some info for you.  And if you haven't already looked, you might take a peek at your Xorg or XFree logs, and see if you find anything interesting there.  

Good luck!

----------

## number_nine

Well, for what it's worth, I did a BIOS upgrade, and now my /proc/interrupts looks like my other two nforce2 boards:

```
$ cat /proc/interrupts
           CPU0
  0:     924466    IO-APIC-edge  timer
  1:       1163    IO-APIC-edge  i8042
  8:          2    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 12:      46175    IO-APIC-edge  i8042
 14:       5725    IO-APIC-edge  ide0
 15:         11    IO-APIC-edge  ide1
 16:      22884   IO-APIC-level  sym53c8xx
 17:         31   IO-APIC-level  sym53c8xx
 19:     112201   IO-APIC-level  nvidia
 20:         48   IO-APIC-level  ohci_hcd
 21:      68920   IO-APIC-level  NVidia nForce2, ehci_hcd
 22:      59874   IO-APIC-level  ohci_hcd, eth0
NMI:          0
LOC:     919764
ERR:          0
MIS:          0
```

 *RayDude wrote:*   

> 1. What kernel are you running?

 

A custom configured and compiled 2.6.11-gentoo-r4.

 *RayDude wrote:*   

> 2. Is X running? If so what video drivers are you using? What GUI are you using?

 

Yes.  I'm using the binary-only (proprietary) nVidia drivers (version 1.0.7667) with xorg-x11 (6.8.2-r2) and Fluxbox (0.9.13-r1) as my window manager.

Note that I don't believe this is a software version-specific problem, as these lockups have been occurring sporadically over the last year or so (i.e. through many software upgrades and revisions).

 *RayDude wrote:*   

> 3. When the machine hard locks, does pressing the num-lock key cause the num-lock led to turn on and off?  If it does, then it's not completely dead (being able to ping seems to imply it's not completely dead as well, but I'm not positive about that).

 

I just had another lockup tonight.  I was logged into the machine via SSH from my work computer.  The ssh session completely froze.  If I tried to re-login through ssh, it just hung (as described in my original post).  But I could still ping the computer.

When I came home, I could toggle the num lock key on and off.  But the keyboard and mouse were otherwise unresponsive (CTRL-ALT-Backspace and CTRL-ALT-F1 did nothing; no pointer showed up on the screen).

However, in every other previous lockup, the screen was blank; tonight it was frozen on a still image from the screensaver (xscreensaver, pinion demo).
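Because ping keeps answering while ssh hangs, a probe run from a second machine could record when the ssh daemon stops responding. The sketch below opens a TCP connection and waits for the server's identification line (an SSH server normally sends a line starting with `SSH-` as soon as a client connects). A plain "is the port open" test would probably not be enough here, since the connection itself may still be accepted while nothing answers behind it. The function name, the timeout and the address in the comment are assumptions for illustration.

```python
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Return the server's identification line, or None if none arrives in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(256)  # times out if the daemon never answers
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None
    text = data.decode("ascii", errors="replace").strip()
    return text if text.startswith("SSH-") else None


# Example call (hypothetical address of the machine that locks up):
# print(ssh_banner("192.168.0.10"))
```

Calling this once a minute from another box and logging the first `None` would give a timestamp for the moment userspace stopped responding.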

 *RayDude wrote:*   

> 4. Does your NFORCE 2 Mobo have a fan on the chipset? If so, is it spinning? If it is spinning, does it appear to be making good contact with the north bridge chip?

 

Nope, it has a giant passive heatsink.

 *RayDude wrote:*   

> 5. How many watts is your power supply?

 

300 Watts.  It's a Seasonic Super Tornado.  (Note that this is (at least in theory) my "high end" machine that I tried to stock only with premium components.)

 *RayDude wrote:*   

> 6. Have you tried reseating the RAM (and maybe even the CPU?)

 

Yup.  Actually, I didn't say this in my first post, but I **did** find bad RAM via memtest; however, I sent it back and got new replacement RAM (on warranty).  The new stuff ran for 14 hours without a single error in memtest86+.

 *RayDude wrote:*   

> 7. IF this is a homebuilt system, have you made sure to have good thermal compound between the CPU and heatsink?

 

Absolutely.  I used Arctic Silver 5 plus the Zalman CNPS7000 heatsink.  I used to watch the CPU and case temps pretty closely; nothing ever came close to being "questionable" (even in the summer).

 *RayDude wrote:*   

> 8. Have you checked to make sure your scsi controller is plugged all the way in?

 

Yup.

 *RayDude wrote:*   

> 9. Have you tried changing the scsi card to another slot? (it could be a shared interrupt problem)

 

Not yet.

 *RayDude wrote:*   

> 10. Have you tried running from IDE only to see if the problem goes away? (implication is the scsi card or scsi driver is at fault)

 

Not yet.

 *RayDude wrote:*   

> 11. Have you tried running prime95 in windows to see if it has stability problems (if so then its a hardware problem for sure).

 

No, I don't have access to any kind of Windows installation.  Is there something equally good for Linux?

 *RayDude wrote:*   

> 12. Did you ever try to overclock this system? (if so you could have damaged something)

 

Nope.  I prefer stability to speed (at least in theory!   :Smile:  )

 *RayDude wrote:*   

> 13. Try disconnecting the floppy (if you have one) and the cd and or dvd drives to see if its more stable (it could be a bad mobo and being more stable after disconnecting the floppy and cds would support that theory).

 

Good call, that's another thing to try.

 *RayDude wrote:*   

> 14. What kind of RAM do you have? If it doesn't have a brand name, consider buying something better. (although memtest should catch a memory issue).

 

Corsair XMS (again, I went for the "premium" components when I built this system... ironic that it has all the problems, eh?)

 *RayDude wrote:*   

> 15. Are you tweaking the memory timings in the bios? If so go set it to "set by SPD" to make sure timing is correct.

 

Nope, I'm using all default memory timings in the BIOS.

 *RayDude wrote:*   

> 16. Try reseating the ATX power connector (and the little four-pin extension, if the mobo has one).

 

Okay, I'll give it a try.

 *RayDude wrote:*   

> This is off the top of my head. It does sound like you have flaky hardware there. If you can isolate it, you'll be able to fix it.

 

I agree, it's just so hard to track down subtle bugs like this.  My next step is to get a bunch of rss-glx (really slick screensavers) running to see if pushing the video card is the problem.

Thank you very much for all the suggestions and ideas!

----------

## RayDude

What video card do you have?

300 Watts is not really enough for a kick ass Athlon with a kick ass Nvidia leaf blower.

One way to check is to go into the bios and find the power / temp monitor page.

Then check whether +5 is at least +5, +3.3 is at least +3.3, core = 1.65 (or whatever you set it to), etc. If a rail is even just under or right at its correct value, it could be that under heavy load the power drops low enough to cause timing violations in the mobo chipset or CPU.
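That rule of thumb (a rail sitting at or below its nominal value is suspect) is easy to apply to whatever the BIOS monitor page shows. In the sketch below the nominal values are the ones mentioned above, while the measured readings are invented purely for illustration, as are the names.

```python
# Nominal values from the post above; Vcore is whatever the BIOS is set to.
NOMINAL = {"+5V": 5.0, "+3.3V": 3.3, "Vcore": 1.65}


def check_rails(readings, nominal=NOMINAL):
    """Return {rail: (percent_deviation, suspect)} for each nominal rail.

    A rail counts as suspect when the reading is at or below nominal,
    following the rule of thumb described above.
    """
    report = {}
    for rail, expected in nominal.items():
        measured = readings[rail]
        deviation = (measured - expected) / expected * 100.0
        report[rail] = (deviation, measured <= expected)
    return report


# Hypothetical readings, not taken from any real machine.
readings = {"+5V": 5.08, "+3.3V": 3.28, "Vcore": 1.65}
report = check_rails(readings)
for rail, (deviation, suspect) in report.items():
    flag = "suspect" if suspect else "ok"
    print(f"{rail}: {deviation:+.1f}% ({flag})")
```

BIOS readings are taken at idle, so a rail that only just passes here could still sag under load.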

The Num-lock working after the hang means the system is still alive. At least the CPU is running; it's just having problems communicating with the outside world. That implies motherboard or software in my opinion.

I don't know of anything like prime95 for Linux. I'll have to look though, it's a good idea...

Another idea is to disable the nvidia drivers and use the xorg stock nv drivers. (No 3d I know) but if you don't have problems, then it could be the nvidia drivers.

I'm using those drivers here at work and I have no problems at all. This is an intel machine though, very different.

I really suspect your power supply. Most 300 Watt supplies lie about how much power they have and really aren't much good above 250 watts. Try a new power supply, I bet that will help.

As for which one: I use antec and only antec. I've been burned so many times by other companies, I'll probably never use anything else.

Let me know how it goes,

Raydude

----------

## number_nine

Another thing I thought I'd throw out there (or at least repeat if I've mentioned it already) is that these lockups always occur when the computer is idle (well, idle in the sense that I'm not explicitly using it for something).  I've never had the computer lock-up while working on it.  It always happens when I'm at work or sleeping.

This makes the problem extremely hard to diagnose, as I haven't found a way to reliably duplicate the problem.  In fact, this problem seems to come and go---I've gone months without a lockup, then all of a sudden it will start doing it again, almost daily (or at least every other day) for 10 days or so.

What can I do to "simulate" me being away from the keyboard for a long period of time?

What kinds of things run in the background while the computer isn't in use?

I mean, I know what things are always running, such as seti@home, fetchmail daemon, samba daemon, etc., but I'm not so sure what kind of things (if anything) happen during "off peak" hours.

Thanks again!

Matt

----------

## RayDude

 *number_nine wrote:*   

> Another thing I thought I'd throw out there (or at least repeat if I've mentioned it already) is that these lockups always occur when the computer is idle (well, idle in the sense that I'm not explicitly using it for something).  I've never had the computer lock-up while working on it.  It always happens when I'm at work or sleeping.
> 
> This makes the problem extremely hard to diagnose, as I haven't found a way to reliably duplicate the problem.  In fact, this problem seems to come and go---I've gone months without a lockup, then all the sudden it will start doing it again, almost daily (or at least every other day) for 10 days or so.
> 
> What can I do to "simulate" me being away from the keyboard for a long period of time?
> ...

 

What screen saver are you using? Is it openGL?

If you are using your screen saver, try disabling it, just blank the screen.

That's the only idea that came to mind.

Oh, Seti@home uses mondo cpu power, try disabling it...

Raydude

----------

