# Wrongly configured Kernel or dying HDD?

## Clad in Sky

Hello all,

I've had trouble with my system recently. I get numerous strange events related to (probably) my HDD.

I freshly installed Gentoo (several times since 30th April) with some trouble. The last time I even followed the handbook as closely as possible. It boots, so no problem there, but I do get errors quite often:

Stale NFS file handles

I/O errors on several occasions during boot: fonts not being found (not that much of a problem), hal not being able to touch some files (more of a problem), Gnome not starting for one user (while it does work for another) because of I/O errors.

Wrong number of inodes 

e2fsck claiming that there was a filesystem with errors mounted (even though the system shut down normally before) - checking the fs after that produces no errors

Volume Control in Gnome shutting down unexpectedly

Graphical glitches (even though I backed up and reused my old xorg.conf, which worked flawlessly, on xorg 1.5.3 too; I need an xorg.conf since X won't start without one)

Problem emerging openoffice (unexpected end of file in the archive. This could have been an error on the mirror portage downloaded the file from, but it fits well with the other errors I'm encountering)

Hm.. that's about it.

The question is whether this is really my HDD or rather an inexpertly put together kernel. I was quite proud of that one, as it boots quite a bit faster than the one I had before (which I also compiled myself). In the kernel I use now I tried to leave out everything that seemed unnecessary to me... perhaps I missed something. Dunno if this would result in such... results (er..). I'd have expected a badly configured kernel to refuse to boot altogether.

The reason why I suspect it might rather be the kernel than the HDD is that I do not encounter the problems booting the gentoo minimal cd or the system rescue cd.

I ran e2fsck -ccv on my root partition after booting up systemrescue cd and it didn't find any bad blocks on the drive.

I also noticed that systemrescue cd used some different kernel drivers, namely pata_atiixp and pata_jmicron for IDE interfaces.

BUT my root partition is on a SATA disk, so that'd hardly be the cause, would it? I do have an IDE drive in my box, but this one works very well with the kernel I put together.

So can anyone make anything out of this?

Thank you.

Christian

----------

## ronmon

Every hard drive maker has a diagnostic program. I would suggest that you download the one for your particular drive or Ultimate Boot CD and run it.

----------

## booleandomain

check out smartmontools (http://en.gentoo-wiki.com/wiki/Smartmontools)

----------

## ronmon

That's a handy tool, but if your drive is truly dying the manufacturer's diagnostic tools give way more definitive results. They also give you a failure code for an RMA if it is still under warranty.

----------

## Clad in Sky

Thank you for the replies. I used SeaTools for my Seagate drive, but it only found one bad block, which was corrected/write-protected afterwards. A second check didn't find anything suspicious.

I asked a friend of mine (the one who introduced me to Gentoo and who's been using it himself for years) and he said it probably was the HDD. 

Since I've got another one in my computer I just tried installing Gentoo on that one. You'll never guess - everything went wahooney-shaped AGAIN. The installation works quite flawlessly, but as soon as I leave the chrooted environment and reboot, I get the strange stuff going.

I'd've loved to try smartmontools, but I now get segfaults each time I enter emerge foo. Apart from that I don't think BOTH my HDDs, which were healthy before, suddenly kicked the bucket.

So, here's what I've been doing ever since that doomed Thursday, 30th of April:

fdisked HDD to fit my needs

booted up Gentoo Minimal CD (the same one I used to get my system up and running when I first got the computer and which then worked well)

installed a current stage3 (20090422)

installed latest portage

set up the config files

rebooted ← everything went fine until here.

The first reboot normally works well. I then emerge xorg-server and Gnome, which both work as well. But after a reboot or more, those errors suddenly occur.

The most common one is a plain "the disk contains a filesystem with errors, check forced"

The filesystem is then checked and no further errors are reported. So apparently nothing is wrong with the FS.

Sometimes I get /var/lib... could not be read or could not touch var...

I also get random I/O errors at boot up

Sometimes e2fsck tells me to run it manually and then finds corrupt directories and such.

So, all in all this hints at an HDD failure. The thing is, I don't believe it (see above). Of course two HDDs can break in a short period of time, but somehow I find it unlikely, given that one of them is one year old and the other one two years (and not heavily used). Add to that that the errors don't seem to occur when running a live CD.

Since I have had these problems ever since I built a new kernel, perhaps something is wrong in the config, even though I think this wouldn't result in such strange behaviour.

I did a memtest and memory seems to be OK. 

The only other cause I can see is that my Mobo could somehow be broken (HDD controller).

But perhaps it's the kernel.

Could anyone have a look at it?

lspci -n

lspci -k

kernel config

Since I needed to reboot the machine anyway, I got the chance to write down some of the fsck errors:

Directory inode 1799726 block 0, offset 0 directory corrupted. Salvage (I pressed y)

Missing '.' in directory inode 1799726 (fixed it)

Missing '..' in (see above)

'..' in /var/lib/init.d/started is <The NULL inode> (0) should be /var/lib/init.d

inode 2 ref count is 18, should be 19 (+ one more of that for another inode)

If anyone could help me I'd be very happy. I need my computer for my job. And while I can use my gf's when she's not at home and I am, this is not a perfect solution.

----------

## MaximeG

Hi,

When facing (seemingly) random issues, it's 90% of the time due to memory problems. But if you say you checked and it's alright, it must be OK memory-wise. (To be 100% sure, try emerging memtester and run it on a large amount of memory.)

Another source of apparently random issues may be the CPU (rarely because of wrong CFLAGS) and/or other mobo chipsets.

HDD failures don't appear to be random (although they might appear at different moments on different installations).

However, I believe first thing to do would be to make sure it's not related to your software by installing another distribution.

If you have the same issues, then it must be something hardware.

Regards,

Maxime

----------

## EzInKy

How old is your motherboard?

----------

## Clad in Sky

My motherboard is slightly over one year old.

It's a Gigabyte GA-MA790FX-DS5.

I mean, sure, it all looks like HW failure but such a coincidence:

I messed up my system with depclean on the 30th of April and have been trying to reinstall Gentoo ever since.

I just can't and don't want to believe that not only my installation is broken but some of my hardware as well. Everything worked perfectly before, so this hardware failure must've come all of a sudden (well, of course they do).

dmesging and grepping brought up something else:

[    0.301963] libata version 3.00 loaded.

[    1.017915] ata1: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f100 irq 22

[    1.018067] ata2: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f180 irq 22

[    1.018229] ata3: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f200 irq 22

[    1.018384] ata4: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f280 irq 22

[    1.476047] ata1: softreset failed (device not ready) ← 

[    1.476148] ata1: failed due to HW bug, retry pmp=0 ← 

[    1.629046] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

[    1.673958] ata1.00: HPA detected: current 156299375, native 156301488

[    1.674060] ata1.00: ATA-7: ST380215AS, 3.AAD, max UDMA/133

[    1.674159] ata1.00: 156299375 sectors, multi 16: LBA48 NCQ (depth 31/32)

[    1.674273] ata1.00: SB600 AHCI: limiting to 255 sectors per cmd

[    1.732284] ata1.00: SB600 AHCI: limiting to 255 sectors per cmd

[    1.732387] ata1.00: configured for UDMA/133

[    2.048024] ata2: SATA link down (SStatus 0 SControl 300)

[    2.517012] ata3: softreset failed (device not ready) ← 

[    2.517111] ata3: failed due to HW bug, retry pmp=0 ← 

[    2.670050] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
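For anyone wanting to pull the same lines out of a longer dmesg, here is a minimal sketch in Python. The pattern labels and the `scan_dmesg` name are made up for illustration, and the 512-byte sector size used for the HPA arithmetic is an assumption, not something the log states.

```python
import re

# Message patterns copied from the excerpt above; the labels are illustrative.
PATTERNS = {
    "softreset": re.compile(r"softreset failed"),
    "hw_bug": re.compile(r"failed due to HW bug"),
    "hpa": re.compile(r"HPA detected: current (\d+), native (\d+)"),
}
# Matches "[   1.476047] ata1: message" and "[   1.673958] ata1.00: message".
LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<port>ata\d+(?:\.\d+)?): (?P<msg>.*)$")


def scan_dmesg(text):
    """Return (timestamp, port, label, captured groups) for each flagged line."""
    findings = []
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        for label, pattern in PATTERNS.items():
            hit = pattern.search(m["msg"])
            if hit:
                findings.append((float(m["ts"]), m["port"], label, hit.groups()))
    return findings


SAMPLE = """\
[    1.476047] ata1: softreset failed (device not ready)
[    1.476148] ata1: failed due to HW bug, retry pmp=0
[    1.629046] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.673958] ata1.00: HPA detected: current 156299375, native 156301488
"""

if __name__ == "__main__":
    for ts, port, label, groups in scan_dmesg(SAMPLE):
        print(ts, port, label, groups)
    # Sectors hidden by the HPA, and their size assuming 512-byte sectors.
    hidden = 156301488 - 156299375
    print(hidden, "sectors,", hidden * 512, "bytes")
```

The HPA line only says that 2113 sectors at the end of the disk are hidden from the OS; whether that has anything to do with the corruption is not established anywhere in this thread.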

----------

## overkll

I've read in the past on these forums that the Jmicron SATA controller is a piece of garbage.  I've never had one so I can't confirm that fact.  That may be the cause of your woes, and then again, it may not.

If you're not confident about your kernel tuning skills, you may want to try one of pappy's kernel seeds.

----------

## Clad in Sky

Yes, I sent pappy a pm and asked him if he'd check my config.

Thing is, the parts of my config that I do not fully trust are the parts about the device drivers, which I have to fill in myself. So it'd be kinda pointless if I used a seed.

I might get rid of the JMicron driver... as far as I know I didn't have it in my old config.

----------

## overkll

I'm not talking about the driver, I'm talking about the jmicron controller chip (hardware).

----------

## danomac

FYI

I have had several computers here at work show intermittent errors (albeit with Windows).

After a lot of troubleshooting, random guesses, and cursing, I discovered that the hard drives themselves were fine, but we had a run of faulty motherboards in our machines that would randomly trip weird-ass errors with the NIC and the SATA controller.

Replacing the motherboard with a different model fixed the problems. Of course, they all started acting up once they were out of warranty.   :Rolling Eyes: 

Before you try replacing the drive, try it on a controller card, or better yet, test it in another machine if you can.

----------

## Clad in Sky

Thank you for the tip. I'll try that, then I'll try another kernel, then a new install and then I'll get a new MoBo, I think.

I really want my gentoo back, it was so nice....

----------

## EzInKy

 *danomac wrote:*   

> FYI
> 
> I have had several computers here at work show intermittent errors (albeit with Windows).
> 
> After a lot of troubleshooting, random guesses, and cursing, I discovered that the hard drives themselves were fine, but we had a run of faulty motherboards in our machines that would randomly trip weird-ass errors with the NIC and the SATA controller.
> ...

 

Yes, my four year old AMD server started showing the same errors a little over a month ago. I stubbornly refused to accept that it could be the motherboard. Lost a lot of data and one of my hard drives, the thing almost scorched my fingers when I touched it. The rest were fine when I stuck them in a new box.

----------

## pappy_mcfae

Clad in Sky,

My initial thought is that it's a drive or controller problem, not a driver issue, as in kernel. 

That said, yes, your kernel did have some issues. I didn't go through and whip you up a new one. I used yours as a base and worked from it. Since the present idea is to determine the cause of your issues, the kernel I set up should suffice until we're sure what happens after you install it. If this tames your savage drive, then I'll go through once we're sure of that and set you up with a full-fledged Pappy kernel. If it doesn't, then you're almost surely looking at a hardware issue.

Is the jmicron card removable? If so, pull it if your problems remain, and retry. If it's internal, shut it off in BIOS. If you remove the jmicron, and the issues remain, then it is a serious hardware issue (probably drive).

Click here for your new .config. Compile as is.

For the best results, please do the following:

1) Move your .config file out of your kernel source directory ( /usr/src/linux-2.6.28-gentoo-r5 ).

2) Issue the command make mrproper. This is a destructive step. It returns the source to pristine condition. Unmoved .config files will be deleted!

3) Copy my .config into your source directory.

4) Issue the command make && make modules_install.

5) Install the kernel as you normally would, and reboot.

6) Once it boots, please post /var/log/dmesg so I can see how things loaded.
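Steps 1 to 4 above can be sketched as a dry run that only prints the commands for review and executes nothing. The source directory is the one named in step 1; the `rebuild_plan` helper and the two `/root/...` file names are hypothetical placeholders. Steps 5 and 6 are left out because they depend on how the kernel is normally installed.

```python
import shlex
from pathlib import PurePosixPath


def rebuild_plan(src_dir, saved_config, new_config):
    """Return the commands for steps 1-4 as argv lists, without running them."""
    src = PurePosixPath(src_dir)
    return [
        ["mv", str(src / ".config"), saved_config],   # 1) move the old .config out of the tree
        ["make", "-C", str(src), "mrproper"],         # 2) destructive: resets the source tree
        ["cp", new_config, str(src / ".config")],     # 3) copy the new .config in
        ["make", "-C", str(src)],                     # 4a) build kernel and modules
        ["make", "-C", str(src), "modules_install"],  # 4b) install the modules
    ]


if __name__ == "__main__":
    # The two /root paths are made-up example locations.
    plan = rebuild_plan(
        "/usr/src/linux-2.6.28-gentoo-r5",
        "/root/old.config",
        "/root/pappy.config",
    )
    for argv in plan:
        print(shlex.join(argv))
```

Printing the plan first makes it easy to check that the old .config really is moved out before `make mrproper` runs.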

Good luck.

Blessed be!

Pappy

----------

## Clad in Sky

Thank you very much, pappy.

I'll test it as soon as I'm home again. I still hope the kernel was the culprit.

Unfortunately the JMicron is onboard, so I can't remove it. Gotta look it up in the BIOS and deactivate it, if I find it.

Again, thanks for your help.

----------

## Clad in Sky

So, I made the kernel and booted it up. Dmesg showed no softreset failure anymore.

Here it is:

dmesg

Funny (or rather not): when I copied the file to my home dir using cat /var/log/dmesg >> /home

I got an I/O error.

```
[  990.349814] EXT3-fs error (device sda4): ext3_new_inode: reserved inode or inode > inodes count - block_group = 0, inode=8
[  990.349983] EXT3-fs error (device sda4) in ext3_new_inode: IO failure
```

Dunno what that's supposed to mean. Probably that it IS a hardware issue. Damn.

Another try worked, though.

Thank you for your time.

----------

## pappy_mcfae

You definitely want to keep an eye on it. Recheck the tightness of the SATA connectors, just to be on the safe side. If the issue remains, get a different drive and try it in place of the one there.

And you're welcome.

Blessed be!

Pappy

----------

## Clad in Sky

Checking the cables was actually the first thing I did.

I replaced the SATA cable now and plugged it into another socket on my mainboard. We'll see if the problem persists.

Additionally I'll get a new drive today. Drives are not that expensive and one can always do with some additional disk space.

I hope this solves it. I really don't want to get a new mainboard since I bought the one I have (and which was not cheap) intending to keep it for some years.

Edit: Changing the cable apparently didn't help.

Trying to emerge OOo, the ebuild failed because for some reason or other portage was denied access to the unpacked files.

```
 * Found db version 4.5
ACCESS DENIED  open_rd:      /usr/portage/profiles/base/profile.bashrc
/usr/lib/portage/bin/ebuild.sh: line 36: /usr/portage/profiles/base/profile.bashrc: Permission denied
ACCESS DENIED  execve:       /usr/bin/install
ACCESS DENIED  open_rd:      /usr/bin/install
/usr/lib/portage/bin/ebuild.sh: line 680: /usr/bin/install: Permission denied
ACCESS DENIED  open_wr:      /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack
/usr/lib/portage/bin/isolated-functions.sh: line 185: /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack: Permission denied
 * 
ACCESS DENIED  open_wr:      /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack
/usr/lib/portage/bin/isolated-functions.sh: line 185: /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack: Permission denied
 * ERROR: app-office/openoffice-3.0.0 failed.
ACCESS DENIED  open_wr:      /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack
/usr/lib/portage/bin/isolated-functions.sh: line 185: /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack: Permission denied
 * Call stack:
ACCESS DENIED  execve:       /bin/basename
ACCESS DENIED  open_rd:      /bin/basename
/usr/lib/portage/bin/isolated-functions.sh: line 40: /bin/basename: Permission denied
ACCESS DENIED  open_wr:      /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack
/usr/lib/portage/bin/isolated-functions.sh: line 185: /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack: Permission denied
 *                        , line 2104:  Called ebuild_main
ACCESS DENIED  execve:       /bin/basename
ACCESS DENIED  open_rd:      /bin/basename
/usr/lib/portage/bin/isolated-functions.sh: line 40: /bin/basename: Permission denied
```

 and so on...
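To see at a glance which operations and paths were refused in a long log like the one above, the ACCESS DENIED lines can be tallied. This is only a sketch in Python: the `tally_denials` name is made up, and the regular expression assumes every denial line has the same "ACCESS DENIED  op:  path" shape as in the excerpt.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt: "ACCESS DENIED  <op>:   <path>"
DENIED = re.compile(r"^ACCESS DENIED\s+(?P<op>\w+):\s+(?P<path>\S+)$")


def tally_denials(log_text):
    """Count denied operations and denied paths in an emerge log excerpt."""
    ops = Counter()
    paths = Counter()
    for line in log_text.splitlines():
        m = DENIED.match(line.strip())
        if m:
            ops[m["op"]] += 1
            paths[m["path"]] += 1
    return ops, paths


SAMPLE = """\
ACCESS DENIED  open_rd:      /usr/portage/profiles/base/profile.bashrc
ACCESS DENIED  execve:       /usr/bin/install
ACCESS DENIED  open_rd:      /usr/bin/install
ACCESS DENIED  open_wr:      /var/tmp/portage/app-office/openoffice-3.0.0/temp/logging/unpack
"""

if __name__ == "__main__":
    ops, paths = tally_denials(SAMPLE)
    for op, count in ops.most_common():
        print(op, count)
    for path, count in paths.most_common():
        print(count, path)
```

What stands out in the excerpt is that even reads of /usr/bin/install and /bin/basename were refused, and those are not files an ebuild would normally have trouble reading.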

So I guess it IS new hardware time *sigh*

----------

## pappy_mcfae

On the plus side, you can get some really amazing systems cheaply if you look around. Core-too only cost 200 (chip, board, 2 Gigs). It still blows me away when I watch it emerge. It gives me the geek wood like few other things.  :Smile: 

Blessed be!

Pappy

----------

## Clad in Sky

Well, I've still got a Phenom, so a Core 2 board would cause hard failures. I'm not even sure how to put an AM2+ processor into an Intel socket  :Mr. Green: 

----------

## pappy_mcfae

 :Embarassed:  I guess so...Still, that means all you have to pop for is the mobo.

Blessed be!

Pappy

----------

## energyman76b

you have tried a different PSU, haven't you?

----------

## Clad in Sky

Of course... I didn't. Hm... well, trying a new HDD now.

----------

## overkll

Don't rule out your memory.  I accidentally came across another recent post in here with the same dmesg error and it turned out to be bad memory.

Did you run memtest?

----------

## Clad in Sky

Yes, I did.

Memtest on the SystemRescueCD as well as memtester from portage. Both gave no errors.

I'm trying the new HDD now, if it doesn't help I'll run memtest for a few hours to be surer that it's not the memory and then I'll need to go and get a new motherboard. I still hope it was the HDD and my Kernel.

----------

## energyman76b

And I still don't believe it's the HD; I'd rather suspect the PSU.

----------

## Clad in Sky

So, posting this from a freshly setup system on a brand new HDD.

Seems to work. No interrupted ebuilds anymore (compiled OOO without any problems.). I'll do some further testing.

PSU? It's not due, yet. In another 10 months its period of warranty'll be over. I expect it to die then.

----------

## pappy_mcfae

Awesome. I'm glad to read that.

Blessed be!

Pappy

----------

## Clad in Sky

 *pappy_mcfae wrote:*   

> Awesome. I'm glad to read that.

 

So was I to experience that. But perhaps the error just had to get accustomed to the new HDD, because it is back.

The next reboot resulted in the keymaps start script not being accessible.

Today's boot gave me the following:

```
EXT3-fs error (device sd3): ext3_check_descriptors: Block bitmap for group 0 not in group (block 4294967295)
EXT3-fs group descriptors corrupted!
EXT3-fs error (device sd6): ext3_check_descriptors: Block bitmap for group 0 not in group (block 4294967295)
EXT3-fs group descriptors corrupted!
```
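One detail in these messages: the block number 4294967295 is exactly 2^32 - 1, i.e. 32 bits all set to one. The Python sketch below just checks for that value in the log lines; the `all_ones_blocks` name is made up. Reading the same all-ones value on two partitions might mean the descriptors came back as garbage rather than that two filesystems were damaged independently, but that is an interpretation, not something confirmed in this thread.

```python
import re

ALL_ONES_32 = 0xFFFFFFFF  # 4294967295, the largest unsigned 32-bit value

# Device name and block number as they appear in the excerpt above.
ERR = re.compile(r"EXT3-fs error \(device (?P<dev>\w+)\).*\(block (?P<block>\d+)\)")


def all_ones_blocks(log_text):
    """Return the devices whose reported block number is 0xFFFFFFFF."""
    hits = []
    for line in log_text.splitlines():
        m = ERR.search(line)
        if m and int(m["block"]) == ALL_ONES_32:
            hits.append(m["dev"])
    return hits


SAMPLE = """\
EXT3-fs error (device sd3): ext3_check_descriptors: Block bitmap for group 0 not in group (block 4294967295)
EXT3-fs group descriptors corrupted!
EXT3-fs error (device sd6): ext3_check_descriptors: Block bitmap for group 0 not in group (block 4294967295)
EXT3-fs group descriptors corrupted!
"""

if __name__ == "__main__":
    print(hex(4294967295))
    print(all_ones_blocks(SAMPLE))
```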

sda3 is the partition I keep portage on, sda6 is for the home dirs.

I called a friend and hopefully he can lend me a PSU so I can test this.

I rechecked my RAM using the memtest on the system rescue CD. It completed without errors. I'm now trying (again) memtester from portage, which already completed one run without errors before, and I expect it to do so now.

----------

