# Kernel Bug HDD Error - what kernel version would work?

## tcmsurfer

Hi,

I have a recurring bug with my Gentoo kernel (gentoo-sources). I have already tried kernel versions 2.6.17-r8, 2.6.18-r3 and now 2.6.18-r6 ...

It seems to hit when there's heavy access to the HDD. The HDD is ext2 but error free. It doesn't always happen with heavy HDD access though, only sometimes. And it seems that it happens a lot more often with the newer kernel versions.

I have already had this error with a couple of applications. Sadly, it causes the accessing program to freeze. I can still access the computer and work with it, but I can't sync, shut down or even send an S5 signal to /proc/acpi/sleep...

I saw that a couple of other people seem to have similar problems. The computer is a-ok otherwise (memory etc.).

Anybody got a good solution or a working kernel version?

Thanks,

  tCMSurfer

```
Jan 16 17:45:53 tCMServer BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000

Jan 16 17:45:53 tCMServer printing eip:

Jan 16 17:45:53 tCMServer c0146609

Jan 16 17:45:53 tCMServer *pde = 00000000

Jan 16 17:45:53 tCMServer Oops: 0000 [#1]

Jan 16 17:45:53 tCMServer Modules linked in: nfsd exportfs lockd sunrpc

Jan 16 17:45:53 tCMServer CPU:    0

Jan 16 17:45:53 tCMServer EIP:    0060:[<c0146609>]    Not tainted VLI

Jan 16 17:45:53 tCMServer EFLAGS: 00210202   (2.6.18-gentoo-r6 #1)

Jan 16 17:45:53 tCMServer EIP is at __block_prepare_write+0x111/0x3d5

Jan 16 17:45:53 tCMServer eax: 8000087d   ebx: 00000603   ecx: 00000000   edx: c11251c0

Jan 16 17:45:53 tCMServer esi: fffff603   edi: 00000000   ebp: 00001000   esp: c6c17d74

Jan 16 17:45:53 tCMServer ds: 007b   es: 007b   ss: 0068

Jan 16 17:45:53 tCMServer Process smbd (pid: 25143, ti=c6c16000 task=c7552a50 task.ti=c6c16000)

Jan 16 17:45:53 tCMServer Stack: 00000000 c02fc7f9 00000603 c11251c0 ce92cb84 00000041 00002000 cf1f5cf4

Jan 16 17:45:53 tCMServer c6c17da8 fffff000 1501a8c0 00002000 00001000 01bd5904 000001bd 00000000

Jan 16 17:45:53 tCMServer c11251c0 00000000 00000000 c11251c0 c01468e3 00000653 c0184898 c0343780

Jan 16 17:45:53 tCMServer Call Trace:

Jan 16 17:45:53 tCMServer [<c02fc7f9>] ip_local_deliver_finish+0x0/0x164

Jan 16 17:45:53 tCMServer [<c01468e3>] block_prepare_write+0x16/0x22

Jan 16 17:45:53 tCMServer [<c0184898>] ext2_get_block+0x0/0x45c

Jan 16 17:45:53 tCMServer [<c012fc8c>] generic_file_buffered_write+0x233/0x5a1

Jan 16 17:45:53 tCMServer [<c0184898>] ext2_get_block+0x0/0x45c

Jan 16 17:45:53 tCMServer [<c033a08f>] br_nf_pre_routing_finish+0x263/0x26d

Jan 16 17:45:53 tCMServer [<c02f7885>] nf_iterate+0x30/0x61

Jan 16 17:45:53 tCMServer [<c011b257>] current_fs_time+0x40/0x49

Jan 16 17:45:53 tCMServer [<c013036a>] __generic_file_aio_write_nolock+0x370/0x3bd

Jan 16 17:45:53 tCMServer [<c01304e7>] __generic_file_write_nolock+0x86/0x9a

Jan 16 17:45:53 tCMServer [<c012556b>] autoremove_wake_function+0x0/0x2d

Jan 16 17:45:53 tCMServer [<c01305cb>] generic_file_write+0x3a/0x94

Jan 16 17:45:53 tCMServer [<c0130591>] generic_file_write+0x0/0x94

Jan 16 17:45:53 tCMServer [<c0144a61>] vfs_write+0x7f/0xe1

Jan 16 17:45:53 tCMServer [<c0144eca>] sys_write+0x3c/0x63

Jan 16 17:45:53 tCMServer [<c01029c1>] sysenter_past_esp+0x56/0x79

Jan 16 17:45:53 tCMServer Code: 44 24 30 8b 5c 24 08 89 c5 2b 6c 24 30 39 5c 24 18 89 44 24 2c 76 06 3b 6c 24 54 72 21 8b 54 24 0c 8b 02 a8 08 0f 84 88 01 00 00 <8b> 01 a8 01 0f 85 7e 01 00 00 0f ba 29 00 e9 75 01 00 00 8b 01

Jan 16 17:45:53 tCMServer EIP: [<c0146609>] __block_prepare_write+0x111/0x3d5 SS:ESP 0068:c6c17d74

```

----------

## yabbadabbadont

Try using vanilla-sources.  If the problem goes away, file a bug against gentoo-sources as it is probably a problem with their patch set.  If the problem persists, then check the official kernel bug lists to see if it has already been reported.  If not, report it.  Make sure that you aren't running any closed source kernel modules, like nvidia, before you file a bug upstream as they will not accept it if the kernel is "tainted".
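
For anyone who wants to check this before filing: the running kernel exposes its taint state in `/proc/sys/kernel/tainted`. Below is a minimal Python sketch for decoding it, assuming the flag bits used by 2.6-era kernels. The function names are made up for illustration, and newer kernels define extra bits that this table simply ignores.

```python
# Taint bits as used by 2.6-era kernels (assumption: only the first six
# bits are listed; later kernels add more, which this sketch ignores).
TAINT_FLAGS = [
    (0, "P", "proprietary module loaded"),
    (1, "F", "module was force-loaded"),
    (2, "S", "SMP kernel on hardware not rated for SMP"),
    (3, "R", "module was force-unloaded"),
    (4, "M", "machine check exception occurred"),
    (5, "B", "bad page referenced"),
]


def decode_taint(value):
    """Return (letter, description) pairs for every known taint bit set."""
    return [(letter, desc) for bit, letter, desc in TAINT_FLAGS
            if value & (1 << bit)]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the taint value of the running kernel (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

An empty list from `decode_taint(read_taint())` corresponds to the `Not tainted` wording in an oops, while a `P` entry corresponds to `Tainted: P`.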

----------

## dziekan

I've got an identical issue. Did You solve the problem?

My issue happens on an encrypted fat32 partition, especially with big files (~20MB). dmesg says:

```

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000

 printing eip:

c01cf741

*pde = 00000000

Oops: 0000 [#1]

PREEMPT

Modules linked in: sha256 aes cbc blkcipher cryptomgr crypto_algapi ndiswrapper xt_limit ipt_LOG xt_state xt_tcpudp iptable_filter iptable_mangle ipt_MASQUERADE iptable_nat ip_nat ip_conntrack ip_tables x_tables snd_seq_midi snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss nls_utf8 ntfs dm_crypt dm_mod radeonfb snd_ca0106 snd_rawmidi snd_seq_device snd_ac97_codec snd_pcm snd_timer snd via686a hwmon soundcore i2c_isa i2c_viapro snd_ac97_bus i2c_core pcspkr snd_page_alloc

CPU:    0

EIP:    0060:[<c01cf741>]    Tainted: P      VLI

EFLAGS: 00010293   (2.6.19-gentoo-r4 #1)

EIP is at blk_recount_segments+0x6e/0x227

eax: 0000001d   ebx: 00000000   ecx: e018b960   edx: 00000000

esi: c1992d80   edi: 00000000   ebp: c00353c0   esp: ef727e84

ds: 007b   es: 007b   ss: 0068

Process kcryptd/0 (pid: 3241, ti=ef726000 task=efe30030 task.ti=ef726000)

Stack: 00000000 01f7dcb9 00000000 00000000 00000000 00000000 00000001 c013b74d

       00000001 00011200 00000000 efe30030 c03cc440 c01de763 e018b960 c1992d80

       c00353c0 e018b7e0 c0175617 c1739804 e018b960 00000600 c1739804 e018b960

Call Trace:

 [<c013b74d>] mempool_alloc+0x29/0xf1

 [<c01de763>] _mmx_memcpy+0x3f/0x13a

 [<c0175617>] __bio_clone+0x8f/0xb5

 [<f084e974>] kcryptd_do_work+0x1fb/0x38a [dm_crypt]

 [<c012334c>] run_workqueue+0x91/0xed

 [<f084e779>] kcryptd_do_work+0x0/0x38a [dm_crypt]

 [<c0123921>] worker_thread+0x111/0x144

 [<c0111ee4>] default_wake_function+0x0/0x15

 [<c0123810>] worker_thread+0x0/0x144

 [<c012620a>] kthread+0xc5/0xf3

 [<c0126145>] kthread+0x0/0xf3

 [<c01035df>] kernel_thread_helper+0x7/0x10

 =======================

Code: 00 00 c7 44 24 14 00 00 00 00 c7 44 24 20 01 00 00 00 c7 44 24 28 00 00 00 00 89 04 24 6b c0 0c 8d 2c 02 e9 62 01 00 00 8b 7d 00 <8b> 07 89 f9 c1 e8 1a c1 e0 02 8d 90 e0 f5 3c c0 8b 80 e0 f5 3c

EIP: [<c01cf741>] blk_recount_segments+0x6e/0x227 SS:ESP 0068:ef727e84

```

I have hit this bug(?) on:

- 2.6.19-gentoo-r4,

- 2.6.18-gentoo-r6.

I didn't hit this bug on an earlier 2.6.17 (gentoo-sources), but I'm not sure about that, as my HDD wasn't under such heavy load then.

----------

## dziekan

I checked 2.6.17.13-vanilla and got the same problem.

```

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000

 printing eip:

c01b5ac5

*pde = 00000000

Oops: 0000 [#1]

PREEMPT

Modules linked in: sha256 aes ndiswrapper xt_limit ipt_LOG xt_state xt_tcpudp iptable_filter iptable_mangle ipt_MASQUERADE iptable_nat ip_nat ip_conntrack ip_tables x_tables snd_seq_midi snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss pcspkr nls_utf8 ntfs dm_crypt dm_mod snd_ca0106 snd_rawmidi snd_seq_device radeonfb snd_ac97_codec snd_pcm snd_timer via686a snd hwmon i2c_isa soundcore i2c_viapro i2c_core snd_ac97_bus snd_page_alloc

CPU:    0

EIP:    0060:[<c01b5ac5>]    Tainted: P      VLI

EFLAGS: 00210293   (2.6.17.13 #1)

EIP is at blk_recount_segments+0x6e/0x227

eax: 00000011   ebx: 00000000   ecx: da4e5420   edx: 00000000

esi: d3cf2780   edi: 00000000   ebp: d3cf2180   esp: efe3fbac

ds: 007b   es: 007b   ss: 0068

Process pdflush (pid: 111, threadinfo=efe3e000 task=eff7a0b0)

Stack: 00000000 efe3fbd8 00000000 00000000 00000000 00000000 00000001 00011200

       00000001 efe3fbd8 00000000 00000000 c01c430f da4e5420 da4e5420 d3cf2780

       d3cf2180 eb3548a0 c014c98f c17927dc da4e5420 c17927dc da4e5420 eb3548a0

Call Trace:

 <c01c430f> _mmx_memcpy+0x3f/0x13a  <c014c98f> __bio_clone+0x83/0xa6

 <c014c9e1> bio_clone+0x2f/0x36  <f0865684> crypt_map+0xb8/0x2ea [dm_crypt]

 <f08d829e> __map_bio+0x35/0x6e [dm_mod]  <f08d857c> __split_bio+0x14d/0x325 [dm_mod]

 <f0865857> crypt_map+0x28b/0x2ea [dm_crypt]  <f08d927d> dm_request+0x111/0x126 [dm_mod]

 <c01b69d9> generic_make_request+0x131/0x143  <c016611a> mpage_end_io_read+0x0/0x65

 <c01b7f87> submit_bio+0xa7/0xad  <c014c1b6> bio_alloc+0x13/0x22

 <c016611a> mpage_end_io_read+0x0/0x65  <c01657ac> mpage_bio_submit+0x1b/0x21

 <c0165b02> __mpage_writepage+0x350/0x488  <c0199aa3> fat_get_block+0x0/0x25

 <c0130f32> find_get_pages_tag+0x32/0x7f  <c01663ad> mpage_writepages+0x1c7/0x2e1

 <c0199b76> fat_writepage+0x0/0x16  <c0199b5d> fat_writepages+0x12/0x16

 <c0199aa3> fat_get_block+0x0/0x25  <c0133e28> do_writepages+0x23/0x39

 <c0164deb> __writeback_single_inode+0x17f/0x30f  <c0134681> pdflush+0x0/0x18b

 <f08d8d26> dm_any_congested+0x34/0x3b [dm_mod]  <c0165183> sync_sb_inodes+0x172/0x21e

 <c0134681> pdflush+0x0/0x18b  <c0165657> writeback_inodes+0x6b/0xd4

 <c01340e1> background_writeout+0x6c/0x94  <c013476d> pdflush+0xec/0x18b

 <c0134075> background_writeout+0x0/0x94  <c012327f> kthread+0x95/0xc2

 <c01231ea> kthread+0x0/0xc2  <c0100b15> kernel_thread_helper+0x5/0xb

Code: 00 00 c7 44 24 14 00 00 00 00 c7 44 24 20 01 00 00 00 c7 44 24 28 00 00 00 00 89 04 24 6b c0 0c 8d 2c 02 e9 62 01 00 00 8b 7d 00 <8b> 07 89 f9 c1 e8 1a c1 e0 02 8d 90 80 7d 38 c0 8b 80 80 7d 38

EIP: [<c01b5ac5>] blk_recount_segments+0x6e/0x227 SS:ESP 0068:efe3fbac

```

So far I have hit this on:

- 2.6.19-gentoo-r4,

- 2.6.18-gentoo-r6,

- 2.6.17.13-vanilla.

Does anybody else have this issue?

----------

## yabbadabbadont

Since you got it in a vanilla kernel, you need to do as I suggested in my previous post.  Check the kernel mailing lists/bug tracker to see if anyone else has reported the problem.  Then either open a new bug or add your report to an existing one (if any).  Make sure that you do not have any non-OSS kernel modules loaded (like nvidia or ati) or the upstream devs will refuse to look at your problem.

Edit: I see you use ndiswrapper.  I don't know if that taints the kernel or not.  Don't be surprised if they tell you to remove it and try to reproduce the problem again.

Edit2: Probably they will.

> Tainted: P

 

----------

## dziekan

Thanks for Your support. That is a really annoying bug. I will follow Your advice; maybe today I will paste a link to the bug report. Right now I'm trying to come up with a recipe to reproduce the error... but without luck so far. Do You know of a Linux program that tests an HDD, especially with random writes and reads, which would help me reproduce the error easily?

----------

## yabbadabbadont

badblocks, but I don't think you can run it on a mounted filesystem...
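
For load on a mounted filesystem, a file-level test may be enough. Here is a rough Python sketch, not a proven reproducer for this bug: it writes random blocks at random offsets of one file, forces them out with `fsync`, then reads them back and compares checksums. The function name and defaults are invented for illustration (the 20 MB default just mirrors the file size mentioned above), and the read-back will probably be served from the page cache, so it is mainly the write path that gets exercised.

```python
import hashlib
import os
import random


def stress_file(path, file_size=20 * 1024 * 1024, block_size=4096,
                writes=2000, seed=0):
    """Write random blocks at random offsets, then read back and verify.

    Returns the indices of the blocks whose content did not match.
    """
    rng = random.Random(seed)          # fixed seed: the offsets are repeatable
    blocks = file_size // block_size
    expected = {}                      # block index -> digest of last write

    with open(path, "wb") as f:
        f.truncate(file_size)          # pre-size the file
        for _ in range(writes):
            index = rng.randrange(blocks)
            data = os.urandom(block_size)
            f.seek(index * block_size)
            f.write(data)
            expected[index] = hashlib.sha1(data).digest()
        f.flush()
        os.fsync(f.fileno())           # push the dirty pages to the device

    mismatches = []
    with open(path, "rb") as f:
        for index, digest in sorted(expected.items()):
            f.seek(index * block_size)
            if hashlib.sha1(f.read(block_size)).digest() != digest:
                mismatches.append(index)
    return mismatches
```

Pointing `path` at a file on the suspect partition and calling the function in a loop gives sustained random write load. An empty return value only means that this particular run read back what it wrote.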

I noticed that the oops occurred in dm_crypt in both your reports.  Can you verify that it has occurred in dm_crypt with each kernel?  If so, perhaps it is just the encrypted filesystem stuff that is causing the problem.  Can you do without it?
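
That comparison can also be done mechanically when there are many reports. The following Python sketch assumes the two call-trace layouts seen in this thread (`[<addr>] symbol+off/size [module]` and the bracket-less `<addr> symbol+off/size` form); the regular expression and function names are illustrative only and do not come from any kernel tool.

```python
import re

# One stack frame: optional brackets around the address, the symbol with its
# offset/size, and an optional trailing "[module]" tag on the same line.
FRAME_RE = re.compile(
    r"\[?<(?P<addr>[0-9a-f]{8})>\]?\s+"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?"
)


def call_trace_frames(oops_text):
    """Return (symbol, module) pairs found between 'Call Trace:' and 'Code:'."""
    start = oops_text.find("Call Trace:")
    if start == -1:
        return []
    end = oops_text.find("Code:", start)
    section = oops_text[start:] if end == -1 else oops_text[start:end]
    return [(m.group("symbol"), m.group("module"))
            for m in FRAME_RE.finditer(section)]


def modules_in_trace(oops_text):
    """Return the set of module names that appear in the call trace."""
    return {module for _, module in call_trace_frames(oops_text) if module}
```

Running `modules_in_trace` over each saved oops and intersecting the resulting sets shows which modules appear in every trace.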

----------

## dziekan

That's right. The kernel oopses I've caught in dmesg happened on an encrypted fat32 partition, so it could be some encryption bug. Right now I'm trying to reproduce the error on an unencrypted ext3 partition. If I catch a kernel oops again on ext3, I will report a kernel bug. However, the suspicious HDD access hangs happened earlier on unencrypted fat32 as well, but I didn't capture the dmesg output then. Also, the first post seems to be the same issue, and that happened on unencrypted ext2.

I will post if I get somewhere with this issue.

----------

## dziekan

Ok, I changed back to 2.6.19-gentoo-r4 and reconnected my 2 hard disks (I left the configuration as it was) and now I can't trigger the kernel bug. Is it possible that this kind of kernel bug could be caused by a bad connection between the HDD and the motherboard? Syslog and dmesg don't contain any other errors such as ide-reset or ide-seek errors, so I doubt it, and I think the bug is still there somewhere. The problem is serious: after the error shows up in dmesg I lose access to the directory in which the hung process was working, and after a reboot I have usually lost all the contents of that directory.

The author of this topic said that he got serious hardware problems and because of that his first post is not valid anymore.

----------

## dziekan

I unplugged one of my HDDs from its HDD case/box and plugged it directly into the motherboard. A week has passed since then and I haven't caught any kernel bug. That's fine. The HDD case is a simple one - IDE2IDE - so there are no chips in it. It seems that the kernel bug was related to this HDD box/case. Maybe some connection problems...

Well, in this case I won't report a bug. Arguably this should be handled gracefully and shouldn't end in a kernel bug that hangs a process, but I think no one will care about an HDD-case-related bug. Neither do I :)

However, all this is strange, as I didn't catch any other ide-* errors. Shouldn't connection problems show up in syslog? Moreover, 3 out of 4 times I caught the kernel bug while accessing the HDD that was plugged directly into the MOBO on the first IDE channel, while the problematic HDD case was plugged into the second IDE channel, which is shared with a CDR device.

Anyone with other experiences? Maybe I should worry about the health of my MOBO / HDD?

Well a week has gone and I was unable to reproduce the kernel bug again... I think that this thread can be closed.

BTW, how can I recover a file that I've deleted with the rm command on these kinds of partitions:

a) ext3

b) ext2

c) fat32 (vfat)

d) ntfs.

----------

## dziekan

Unfortunately I caught the bug again... a day after I wrote the previous post... so the bug is still there somewhere. I switched to the stable 2.6.19-gentoo-r5. Recently I found this thread http://bugzilla.kernel.org/show_bug.cgi?id=7763 about a dm-crypt bug. It's possible that this is the bug behind my kernel oopses, but I'm not sure. The common factor is that the bug happens under heavy HDD load, so it could fit the dm-crypt bug. I've applied this patch http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-merge-max_hw_sector.patch from the thread and am now testing again... I will post the results, but it may take some time... a week or two. Also have a look at the thread if You're interested. I posted my kernel bug at http://bugzilla.kernel.org/show_bug.cgi?id=7763#c21 but they haven't answered whether it could also be a consequence of the bug they've discovered.

----------

## dziekan

Unfortunately the patch doesn't work for me. I caught the bug again while unzipping a file.

----------

## dziekan

Hi!

I'm glad to say that my issue (and probably also Yours if You use dm-crypt on fat32) seems to be resolved now. There was a dm-crypt bug. The thread about it is here. Kernel 2.6.22-rc5 (www.kernel.org) is patched against that bug and it works for me. For 2 weeks I have been unable to trigger the bug again. As I don't know how gentoo kernels are patched, I can't say which versions are/will be protected against that bug (and the bug is serious, as it usually destroyed the data in the whole directory where it occurred). Can anyone point out which gentoo-sources version will contain those patches?

Please note, however, that I can't confirm 100% that the bug is fixed. I can only say that I haven't caught it for 2 weeks. Follow the above link for details.

----------

