# System instabilities - General Protection faults

## MarkCu

I've got a new home-built system based on an ASUS Mini-ITX motherboard. No other OS has been installed on it. The system is intended as a headless backup server: RAID/SSH/rsync/NFS.

The install went smoothly and things seemed to work fine, until I started using the system for its intended purpose: doing backups.  I started getting frequent crashes.

Searching these forums, I see that most folks with similar issues end up finding hardware problems.  So, to start, I focus there.

Memtest86+ runs for 36 hours.  No issues.

CpuBurn for 4 hours.  No issues.

Check cooling - looks ok - CPU temp never goes above 65 C.

Underclock the system by 5% (both CPU and memory).  Failure mode doesn't seem to change.

Pull 1 DIMM (I have two 4 GB DIMMs).  No change.

Swap - use the other DIMM.  No change.

So, I'm thinking HW looks fairly reasonable.

So I focus more on SW.  Trying to narrow my scope, I can usually get a crash just by doing a dd on the server itself:

 dd if=/dev/md127 of=/dev/null

The failures aren't identical, but they all look similar to the one below:

```

[35101.826124] general protection fault: 0000 [#1] SMP 

[35101.826153] CPU 0 

[35101.826161] Modules linked in: k10temp

[35101.826179] 

[35101.826190] Pid: 568, comm: kswapd0 Not tainted 3.4.9-gentoo #9 System manufacturer System Product Name/C60M1-I

[35101.826224] RIP: 0010:[<ffffffff81145fb8>]  [<ffffffff81145fb8>] drop_buffers+0x28/0xb0

[35101.826260] RSP: 0018:ffff880234eff9a0  EFLAGS: 00010206

[35101.826276] RAX: 0000000000000000 RBX: ffffea00048dab40 RCX: 0000000000000000

[35101.826295] RDX: 0000000000000000 RSI: ffff880234eff9d8 RDI: ffbf8801332bdf08

[35101.826314] RBP: ffff880234eff9c0 R08: dead000000200200 R09: dead000000100100

[35101.826333] R10: ffff880234effbb8 R11: ffff880234effbc0 R12: ffff8802365e55a0

[35101.826352] R13: ffff8801332bdf08 R14: ffff880234eff9d8 R15: 0000000000000001

[35101.826373] FS:  00007fb00bc26700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000

[35101.826395] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

[35101.826411] CR2: 000000000065dd00 CR3: 000000022b4fd000 CR4: 00000000000007f0

[35101.826464] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[35101.826516] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[35101.826569] Process kswapd0 (pid: 568, threadinfo ffff880234efe000, task ffff88023597be70)

[35101.826655] Stack:

[35101.826696]  ffffea00048dab40 ffff8802365e55a0 0000000000000000 ffffea00048dab40

[35101.826789]  ffff880234effa00 ffffffff81146090 ffff880234eff9f0 0000000000000000

[35101.826882]  ffff8802365e55a0 ffff880234effd90 ffff880234effba0 ffffea00048dab60

[35101.826975] Call Trace:

[35101.827023]  [<ffffffff81146090>] try_to_free_buffers+0x50/0xb0

[35101.827076]  [<ffffffff8114cd9d>] blkdev_releasepage+0x3d/0x50

[35101.827130]  [<ffffffff810cc29d>] try_to_release_page+0x2d/0x40

[35101.827185]  [<ffffffff810df2e2>] shrink_page_list+0x762/0x910

[35101.827239]  [<ffffffff810e8464>] ? __mod_zone_page_state+0x44/0x50

[35101.827293]  [<ffffffff810dd134>] ? update_isolated_counts.clone.55+0x114/0x130

[35101.827383]  [<ffffffff810df974>] shrink_inactive_list+0x244/0x4c0

[35101.827437]  [<ffffffff810e0304>] shrink_mem_cgroup_zone+0x3b4/0x4f0

[35101.827491]  [<ffffffff8111bbe2>] ? prune_super+0x192/0x1b0

[35101.827545]  [<ffffffff810e10f2>] balance_pgdat+0x542/0x730

[35101.827598]  [<ffffffff810e1449>] kswapd+0x169/0x3c0

[35101.827649]  [<ffffffff81058be0>] ? wake_up_bit+0x40/0x40

[35101.827701]  [<ffffffff810e12e0>] ? balance_pgdat+0x730/0x730

[35101.827752]  [<ffffffff81058466>] kthread+0x96/0xa0

[35101.827804]  [<ffffffff816ae2d4>] kernel_thread_helper+0x4/0x10

[35101.827856]  [<ffffffff810583d0>] ? flush_kthread_worker+0xb0/0xb0

[35101.827909]  [<ffffffff816ae2d0>] ? gs_change+0xb/0xb

[35101.827955] Code: 00 00 00 55 48 89 e5 41 56 49 89 f6 41 55 41 54 53 48 8b 07 48 89 fb f6 c4 08 0f 84 8e 00 00 00 4c 8b 6f 30 4c 89 ef 0f 1f 40 00 <48> 8b 07 f6 c4 08 74 0e 48 8b 43 08 48 85 c0 74 05 f0 80 48 7b 

[35101.828240] RIP  [<ffffffff81145fb8>] drop_buffers+0x28/0xb0

[35101.828294]  RSP <ffff880234eff9a0>

[35101.828669] ---[ end trace 7979c35d1c9be633 ]---

[35161.769980] INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies)

[35161.770288] Pid: 3339, comm: dd Tainted: G      D      3.4.9-gentoo #9

[35161.771507] Call Trace:

[35161.771588]  <IRQ>  [<ffffffff810a7d26>] __rcu_pending+0x206/0x490

[35161.771735]  [<ffffffff810a8470>] rcu_check_callbacks+0xb0/0x170

[35161.771828]  [<ffffffff810472b3>] update_process_times+0x43/0x80

[35161.771918]  [<ffffffff8107a6bf>] tick_sched_timer+0x5f/0xb0

[35161.772008]  [<ffffffff8105c7a8>] __run_hrtimer+0x78/0x1c0

[35161.772097]  [<ffffffff8107a660>] ? tick_nohz_handler+0xe0/0xe0

[35161.772187]  [<ffffffff8105cfe6>] hrtimer_interrupt+0xf6/0x240

[35161.772278]  [<ffffffff810212e4>] smp_apic_timer_interrupt+0x64/0xa0

[35161.772371]  [<ffffffff816ada87>] apic_timer_interrupt+0x67/0x70

[35161.772458]  <EOI>  [<ffffffff816ac7ca>] ? _raw_spin_lock+0x1a/0x30

[35161.772597]  [<ffffffff8114734b>] create_empty_buffers+0x4b/0xd0

[35161.772689]  [<ffffffff811485a8>] block_read_full_page+0x2c8/0x390

[35161.772780]  [<ffffffff8114c280>] ? I_BDEV+0x10/0x10

[35161.772869]  [<ffffffff810e8cae>] ? __inc_zone_page_state+0x2e/0x30

[35161.772961]  [<ffffffff810ccfeb>] ? add_to_page_cache_locked+0x8b/0xe0

[35161.773052]  [<ffffffff8114ce23>] blkdev_readpage+0x13/0x20

[35161.773142]  [<ffffffff810d81c9>] __do_page_cache_readahead+0x1d9/0x260

[35161.773234]  [<ffffffff810d857c>] ra_submit+0x1c/0x20

[35161.773321]  [<ffffffff810d868d>] ondemand_readahead+0x10d/0x230

[35161.773413]  [<ffffffff812d283d>] ? copy_user_generic_string+0x2d/0x40

[35161.773503]  [<ffffffff810d8830>] page_cache_async_readahead+0x80/0xa0

[35161.773596]  [<ffffffff810ce86b>] generic_file_aio_read+0x48b/0x780

[35161.773688]  [<ffffffff811183c2>] do_sync_read+0xe2/0x120

[35161.773778]  [<ffffffff8127db83>] ? security_file_permission+0x93/0xb0

[35161.773869]  [<ffffffff81118c93>] vfs_read+0xc3/0x170

[35161.773956]  [<ffffffff81118d8c>] sys_read+0x4c/0x90

[35161.774044]  [<ffffffff816acfca>] ? system_call_after_swapgs+0x17/0x59

[35161.774135]  [<ffffffff816ad022>] system_call_fastpath+0x16/0x1b

```

I understand folks don't want to debug processes that are "Tainted".  The log above shows one process (568) as "Not tainted" and the other (3339) as "Tainted". Is "dd" really tainted, or am I just interpreting this wrong?
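On the taint question, a note that may help: the taint state belongs to the running kernel as a whole, not to an individual process, so a trace printed after an earlier oops will normally carry the `D` flag whatever process it names. The sketch below pulls the flag letters out of a `Pid:` header line. The meaning table is a partial one written from memory of the kernel's oops documentation (an assumption, not the complete list), and `taint_flags` is a hypothetical helper name.

```python
import re

# Partial table of taint letters (assumed subset; the kernel docs have the full list).
TAINT_MEANINGS = {
    "P": "a proprietary module was loaded",
    "G": "no proprietary module loaded (placeholder printed instead of P)",
    "F": "a module was force-loaded",
    "D": "the kernel has died recently (an earlier oops or BUG)",
    "W": "a kernel warning was issued earlier",
    "O": "an out-of-tree module was loaded",
}

def taint_flags(line):
    """Return the taint letters found in a 'Pid: ...' oops header line."""
    match = re.search(r"Tainted: ([A-Z ]+)", line)
    if match is None:  # covers the "Not tainted" case
        return []
    return [c for c in match.group(1) if c != " "]

# Header lines copied (the first one shortened) from the log above.
for header in (
    "Pid: 568, comm: kswapd0 Not tainted 3.4.9-gentoo #9",
    "Pid: 3339, comm: dd Tainted: G      D      3.4.9-gentoo #9",
):
    flags = taint_flags(header)
    print(header.split(",")[0], "->", flags or "not tainted")
    for flag in flags:
        print("   ", flag, "=", TAINT_MEANINGS.get(flag, "unknown to this table"))
```

Read this way, the `dd` trace is marked tainted only because the kswapd0 oops had already happened about a minute earlier.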

Full dmesg:

http://pastebin.com/raw.php?i=04Y9Kx75

Full kernel .config:

http://pastebin.com/raw.php?i=kUqbhM7z

Any help appreciated.

Thanks,

Mark

----------

## NeddySeagoon

MarkCu,

CONFIG_HZ_1000=y is known to cause problems on some hardware.  It's not needed on a headless system either.

Try 100Hz instead.

You also have several debug options on in your kernel; I did not check them all. Debug options always cause log spam and sometimes interfere with normal operation.

Debug options should only be on if you are debugging that part of the kernel.

While you are fixing your kernel timer, turn off all the debug stuff too.

----------

## MarkCu

Thanks for the advice.

Recompiled my kernel with the suggested changes.

Still crashing.

Worth mentioning - don't know if it matters or not - but I'm running without swap.  Figured 8G of memory should be plenty for this config.

Same command:

dd if=/dev/md127 of=/dev/null bs=1024

dmesg Result: 

```

[ 2111.906328] general protection fault: 0000 [#1] PREEMPT SMP 

[ 2111.906358] CPU 0 

[ 2111.906365] Modules linked in: k10temp

[ 2111.906381] 

[ 2111.906391] Pid: 565, comm: kswapd0 Not tainted 3.4.9-gentoo #11 System manufacturer System Product Name/C60M1-I

[ 2111.906421] RIP: 0010:[<ffffffff8114a658>]  [<ffffffff8114a658>] drop_buffers+0x28/0xc0

[ 2111.906453] RSP: 0018:ffff880234fb79c0  EFLAGS: 00010206

[ 2111.906467] RAX: 0000000000000000 RBX: ffffea0004883ec0 RCX: 0000000000000000

[ 2111.906484] RDX: 0000000000000000 RSI: ffff880234fb79f8 RDI: ffbf8801326bdf08

[ 2111.906501] RBP: ffff8802365e05e8 R08: 0000000000000003 R09: ffff880234fb6000

[ 2111.906518] R10: ffff880234fb7fd8 R11: ffff880234fb7bb0 R12: ffff8801326bdf08

[ 2111.906536] R13: ffff880234fb79f8 R14: 0000000000000001 R15: ffff880234fb7af0

[ 2111.906554] FS:  00007fdcadc1b700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000

[ 2111.906574] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

[ 2111.906588] CR2: 00007f9bae9e8000 CR3: 00000002340be000 CR4: 00000000000007f0

[ 2111.906606] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 2111.906623] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[ 2111.906642] Process kswapd0 (pid: 565, threadinfo ffff880234fb6000, task ffff880235a4e200)

[ 2111.906724] Stack:

[ 2111.906762]  ffff8802365e0560 ffffea0004883ec0 ffff8802365e05e8 0000000000000000

[ 2111.906857]  ffffea0004883ec0 ffffffff8114a73f ffff880234fb7b90 0000000000000000

[ 2111.906951]  ffff880234fb7d80 ffff880234fb7b90 ffffea0004883ee0 ffffffff810e2060

[ 2111.907046] Call Trace:

[ 2111.907096]  [<ffffffff8114a73f>] ? try_to_free_buffers+0x4f/0xc0

[ 2111.907153]  [<ffffffff810e2060>] ? shrink_page_list+0x790/0x970

[ 2111.907209]  [<ffffffff810eb60f>] ? __mod_zone_page_state+0x3f/0x50

[ 2111.907265]  [<ffffffff810e080b>] ? update_isolated_counts.clone.56+0x13b/0x170

[ 2111.907356]  [<ffffffff810e2713>] ? shrink_inactive_list+0x233/0x4d0

[ 2111.907413]  [<ffffffff810e3092>] ? shrink_mem_cgroup_zone+0x392/0x4d0

[ 2111.907471]  [<ffffffff810e3e9a>] ? balance_pgdat+0x4ea/0x6b0

[ 2111.907526]  [<ffffffff810e41dc>] ? kswapd+0x17c/0x430

[ 2111.907579]  [<ffffffff816b129c>] ? __schedule+0x27c/0x5e0

[ 2111.907632]  [<ffffffff81059790>] ? wake_up_bit+0x40/0x40

[ 2111.907685]  [<ffffffff810e4060>] ? balance_pgdat+0x6b0/0x6b0

[ 2111.907738]  [<ffffffff810e4060>] ? balance_pgdat+0x6b0/0x6b0

[ 2111.907791]  [<ffffffff81058fee>] ? kthread+0x9e/0xb0

[ 2111.907844]  [<ffffffff816b42d4>] ? kernel_thread_helper+0x4/0x10

[ 2111.907900]  [<ffffffff81058f50>] ? flush_kthread_worker+0xc0/0xc0

[ 2111.907955]  [<ffffffff816b42d0>] ? gs_change+0xb/0xb

[ 2111.908003] Code: 00 00 00 41 55 49 89 f5 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 07 f6 c4 08 0f 84 99 00 00 00 4c 8b 67 30 4c 89 e7 0f 1f 44 00 00 <48> 8b 07 f6 c4 08 74 0e 48 8b 43 08 48 85 c0 74 05 f0 80 48 7b 

[ 2111.908304] RIP  [<ffffffff8114a658>] drop_buffers+0x28/0xc0

[ 2111.908360]  RSP <ffff880234fb79c0>

[ 2111.908707] ---[ end trace c23f7d9f938c612f ]---

[ 2111.908797] note: kswapd0[565] exited with preempt_count 1

```
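One thing that can be checked mechanically in the two traces above: in each of them RDI nearly matches another register (R13 in the first trace, R12 in the second). The sketch below XORs the values to show where they differ and tests whether each address is canonical, assuming 48-bit virtual addresses (which the /proc/cpuinfo output posted later in this thread reports). The helper names are made up for illustration.

```python
def differing_bits(a, b):
    """Return the bit positions at which two 64-bit values differ."""
    diff = a ^ b
    return [bit for bit in range(64) if (diff >> bit) & 1]

def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all equal, as x86-64 requires."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1

# (RDI, near-matching register) pairs copied from the two traces above.
pairs = {
    "first trace, RDI vs R13": (0xffbf8801332bdf08, 0xffff8801332bdf08),
    "second trace, RDI vs R12": (0xffbf8801326bdf08, 0xffff8801326bdf08),
}

for label, (rdi, other) in pairs.items():
    print(label,
          "| differing bits:", differing_bits(rdi, other),
          "| RDI canonical:", is_canonical(rdi),
          "| other canonical:", is_canonical(other))
```

A memory access through a non-canonical address raises a general protection fault on x86-64, which fits the fault type reported here. A single-bit difference like this is at least consistent with a hardware suspicion, though on its own it does not prove one.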

dmesg:

http://pastebin.com/raw.php?i=LxmU9QKX

.config:

http://pastebin.com/raw.php?i=jV3pQ8a5

Any other ideas?

Thanks,

Mark

----------

## NeddySeagoon

MarkCu,

```
CONFIG_SLUB_DEBUG=y

CONFIG_X86_DEBUGCTLMSR=y

CONFIG_HWMON_DEBUG_CHIP=y

CONFIG_DEBUG_FS=y

CONFIG_KEYS_DEBUG_PROC_KEYS=y
```

Use the search (press /) in

```
make menuconfig
```

 to find the above options and turn them off.
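As a rough aid for finding these, the sketch below lists every enabled symbol in a .config whose name contains DEBUG. It is only a name filter under that assumption: it does not know which options are harmless, and `enabled_debug_options` is a hypothetical helper, not part of any kernel tooling. The sample lines are copied from configs quoted in this thread.

```python
def enabled_debug_options(config_text):
    """List CONFIG_ symbols containing DEBUG that are set to y or m."""
    found = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skips blank lines and "# CONFIG_FOO is not set"
        name, sep, value = line.partition("=")
        if sep and "DEBUG" in name and value in ("y", "m"):
            found.append(name)
    return found

sample = """\
CONFIG_SLUB_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
# CONFIG_MK8 is not set
CONFIG_GENERIC_CPU=y
CONFIG_DEBUG_FS=y
"""
print(enabled_debug_options(sample))
```

To run it against a real kernel tree, pass the contents of the .config file instead of `sample`; the path depends on where the sources live (often /usr/src/linux/.config on Gentoo, but that is an assumption).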

Not having swap does not stop the kernel swapping, it just robs the kernel of the ability to move dynamically allocated RAM to disk.

The kernel will still swap by discarding from RAM data or code that has a permanent home on disk, then reloading it when it's needed again.

Unless you are running a diskless node, a small swap, say 512 MB, is a good thing.

You can make a swap file if you want to test your swap theory, but I agree, no swap is unlikely to be the problem.

----------

## MarkCu

OK, managed to get most of those other DEBUG kernel options off.

One I couldn't figure out how to disable:

CONFIG_X86_DEBUGCTLMSR=y

The help doesn't show the dependencies, nor where it is, nor much else, and I can't find it.

Anyway, similar results:

```

[58460.314388] ------------[ cut here ]------------

[58460.314414] Kernel BUG at ffffffff81116ca6 [verbose debug info unavailable]

[58460.314433] invalid opcode: 0000 [#1] PREEMPT SMP

[58460.314452] CPU 0

[58460.314458] Modules linked in: k10temp

[58460.314474]

[58460.314483] Pid: 562, comm: kswapd0 Not tainted 3.4.9-gentoo #12 System manufacturer System Product Name/C60M1-I

[58460.314513] RIP: 0010:[<ffffffff81116ca6>]  [<ffffffff81116ca6>] free_buffer_head+0x66/0x80

[58460.314543] RSP: 0018:ffff880234f6b9f0  EFLAGS: 00010287

[58460.314558] RAX: ffff880124837ce0 RBX: ffff880124837c98 RCX: 0000000000000000

[58460.314575] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff880124837c98

[58460.314592] RBP: ffff88023646d968 R08: 0000000000000003 R09: ffff880234f6a000

[58460.314609] R10: ffff880234f6bfd8 R11: ffff880234f6bbd0 R12: 0000000000000001

[58460.314626] R13: ffffea00048f8c40 R14: 0000000000000001 R15: ffff880234f6bb00

[58460.314644] FS:  00007f2118045700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000

[58460.314696] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

[58460.314742] CR2: 00007fd7644c1000 CR3: 00000002317dd000 CR4: 00000000000007f0

[58460.314792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[58460.314841] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[58460.314891] Process kswapd0 (pid: 562, threadinfo ffff880234f6a000, task ffff880235433600)

[58460.314973] Stack:

[58460.315011]  ffff88023646d968 ffff880124837c98 ffff88023646d968 ffffffff81116eec

[58460.315099]  ffff880234f6bbb0 ffff880124837c98 ffff880234f6bda0 ffff880234f6bbb0

[58460.315191]  ffffea00048f8c60 ffffffff810b8e00 0000000000010dc0 ffff880234f6bac0

[58460.315286] Call Trace:

[58460.315335]  [<ffffffff81116eec>] ? try_to_free_buffers+0x7c/0xc0

[58460.315392]  [<ffffffff810b8e00>] ? shrink_page_list+0x740/0x8c0

[58460.315447]  [<ffffffff810c01df>] ? __mod_zone_page_state+0x3f/0x50

[58460.315502]  [<ffffffff810b794b>] ? update_isolated_counts.clone.53+0x13b/0x170

[58460.315591]  [<ffffffff810b94a6>] ? shrink_inactive_list+0x286/0x470

[58460.315641]  [<ffffffff810b9d82>] ? shrink_mem_cgroup_zone+0x3a2/0x4e0

[58460.315693]  [<ffffffff810eefbe>] ? grab_super_passive+0x3e/0x90

[58460.315742]  [<ffffffff810baa9a>] ? balance_pgdat+0x4fa/0x6c0

[58460.315792]  [<ffffffff810badf6>] ? kswapd+0x196/0x300

[58460.315840]  [<ffffffff81051b20>] ? wake_up_bit+0x40/0x40

[58460.315887]  [<ffffffff810bac60>] ? balance_pgdat+0x6c0/0x6c0

[58460.315936]  [<ffffffff810bac60>] ? balance_pgdat+0x6c0/0x6c0

[58460.315983]  [<ffffffff8105144e>] ? kthread+0x9e/0xb0

[58460.316032]  [<ffffffff8163e314>] ? kernel_thread_helper+0x4/0x10

[58460.316081]  [<ffffffff810513b0>] ? flush_kthread_worker+0xc0/0xc0

[58460.316130]  [<ffffffff8163e310>] ? gs_change+0xb/0xb

[58460.316174] Code: 65 ff 0c 25 60 e2 00 00 e8 38 ff ff ff 83 6b 1c 01 48 8b 85 38 e0 ff ff a8 08 75 11 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 c3 <0f> 0b 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 e9 15 4a 52 00

[58460.316434] RIP  [<ffffffff81116ca6>] free_buffer_head+0x66/0x80

[58460.316483]  RSP <ffff880234f6b9f0>

[58460.319321] ---[ end trace fa6084efedc140cf ]---

```

dmesg:

http://pastebin.com/raw.php?i=rXvLKJ5w

.config

http://pastebin.com/raw.php?i=cHPrW85H

Also tried adding some swap - no difference.

Thanks 

Mark

----------

## NeddySeagoon

MarkCu,

[58460.314414] Kernel BUG at ffffffff81116ca6 [verbose debug info unavailable]

[58460.314433] invalid opcode: 0000 [#1] PREEMPT SMP 

Invalid opcode means the system tried to execute an instruction that the CPU does not understand.

If it's in the kernel, you have set the wrong CPU type in the kernel.

If it's in a program, your CFLAGS or USE flags do not match your CPU.

Please post your emerge --info output and your /proc/cpuinfo.

If you have anything in /etc/portage/package.use ... all of that too.

----------

## MarkCu

Kernel CPU is just x86_64.  Same for USE.

```

% emerge --info

Portage 2.1.11.9 (default/linux/amd64/10.0, gcc-4.5.4, glibc-2.15-r2, 3.4.9-gentoo x86_64)

=================================================================

System uname: Linux-3.4.9-gentoo-x86_64-AMD_C-60_APU_with_Radeon-tm-_HD_Graphics-with-gentoo-2.1

Timestamp of tree: Wed, 10 Oct 2012 00:45:01 +0000

app-shells/bash:          4.2_p37

dev-lang/python:          2.7.3-r2, 3.2.3

dev-util/cmake:           2.8.9

dev-util/pkgconfig:       0.27.1

sys-apps/baselayout:      2.1-r1

sys-apps/openrc:          0.9.8.4

sys-apps/sandbox:         2.5

sys-devel/autoconf:       2.13, 2.68

sys-devel/automake:       1.11.6

sys-devel/binutils:       2.22-r1

sys-devel/gcc:            4.5.4

sys-devel/gcc-config:     1.7.3

sys-devel/libtool:        2.4-r1

sys-devel/make:           3.82-r3

sys-kernel/linux-headers: 3.4-r2 (virtual/os-headers)

sys-libs/glibc:           2.15-r2

Repositories: gentoo

ACCEPT_KEYWORDS="amd64"

ACCEPT_LICENSE="* -@EULA"

CBUILD="x86_64-pc-linux-gnu"

CFLAGS="-O2 -march=x86-64"

CHOST="x86_64-pc-linux-gnu"

CONFIG_PROTECT="/etc"

CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"

CXXFLAGS="-O2 -march=x86-64"

DISTDIR="/usr/portage/distfiles"

FCFLAGS="-O2 -pipe"

FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles news parallel-fetch parse-eapi-ebuild-head protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"

FFLAGS="-O2 -pipe"

GENTOO_MIRRORS="ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/"

LDFLAGS="-Wl,-O1 -Wl,--as-needed"

MAKEOPTS="-j2"

PKGDIR="/usr/portage/packages"

PORTAGE_CONFIGROOT="/"

PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"

PORTAGE_TMPDIR="/var/tmp"

PORTDIR="/usr/portage"

PORTDIR_OVERLAY=""

SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"

USE="X acl amd64 apng berkdb bluray bzip2 cddb cli consolekit cracklib crypt cups cxx dbus dri embedded examples exif ffmpeg fortran gdbm gif gpm gudev hwdb iconv imap inotify ipv6 javascrip javascript jpeg jpeg2k lm_sensors lzma midi minizip mmx modules mp3 mp4 mpeg mudflap multilib ncurses nls nptl ogg openmp pam pcre perl png policykit ppds pppd python readline session sse sse2 ssl svg taglib tcpd thumbnail tiff unicode vorbis x264 zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" PHP_TARGETS="php5-3" PYTHON_TARGETS="python3_2 python2_7" RUBY_TARGETS="ruby18 ruby19" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga 
neomagic nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"

Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LINGUAS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON

```

```

% cat /etc/portage/make.conf

# These settings were set by the catalyst build script that automatically

# built this stage.

# Please consult /usr/share/portage/config/make.conf.example for a more

# detailed example.

CFLAGS="-O2 -march=x86-64"

CXXFLAGS="${CFLAGS}"

# WARNING: Changing your CHOST is not something that should be done lightly.

# Please consult http://www.gentoo.org/doc/en/change-chost.xml before changing.

CHOST="x86_64-pc-linux-gnu"

# These are the USE flags that were used in addition to what is provided by the

# profile used for building.

USE="mmx sse sse2 python png X gif jpeg mp3 mp4 mpeg jpeg2k tiff apng ppds ssl dbus gudev policykit embedded consolekit ogg vorbis hwdb midi readline imap -gnome -kde minizip examples lzma perl bluray x264 svg cddb exif ffmpeg inotify javascrip javascript taglib thumbnail lm_sensors"

GENTOO_MIRRORS="ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/"

SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"

MAKEOPTS="-j2"

```

```

% cat /proc/cpuinfo 

processor       : 0

vendor_id       : AuthenticAMD

cpu family      : 20

model           : 2

model name      : AMD C-60 APU with Radeon(tm) HD Graphics

stepping        : 0

microcode       : 0x500010d

cpu MHz         : 1000.010

cache size      : 512 KB

physical id     : 0

siblings        : 2

core id         : 0

cpu cores       : 2

apicid          : 0

initial apicid  : 0

fpu             : yes

fpu_exception   : yes

cpuid level     : 6

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt arat cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter

bogomips        : 2000.02

TLB size        : 1024 4K pages

clflush size    : 64

cache_alignment : 64

address sizes   : 36 bits physical, 48 bits virtual

power management: ts ttp tm stc 100mhzsteps hwpstate cpb

processor       : 1

vendor_id       : AuthenticAMD

cpu family      : 20

model           : 2

model name      : AMD C-60 APU with Radeon(tm) HD Graphics

stepping        : 0

microcode       : 0x500010d

cpu MHz         : 1000.010

cache size      : 512 KB

physical id     : 0

siblings        : 2

core id         : 1

cpu cores       : 2

apicid          : 1

initial apicid  : 1

fpu             : yes

fpu_exception   : yes

cpuid level     : 6

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt arat cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter

bogomips        : 2000.02

TLB size        : 1024 4K pages

clflush size    : 64

cache_alignment : 64

address sizes   : 36 bits physical, 48 bits virtual

power management: ts ttp tm stc 100mhzsteps hwpstate cpb

```

----------

## NeddySeagoon

MarkCu,

These flags

```
mmx sse sse2 mmxext ssse3 sse4a
```

from your cpuinfo indicate optional instruction set extensions that are present.

Usually, AMD CPUs have 3Dnow and 3Dnowext too.

```
-march=x86-64
```

should be safe.

```
mmx sse sse2
```

are in your USE flags but that's OK as your CPU has those flags too.
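The same comparison can be done mechanically. The sketch below checks a list of wanted flags against the `flags` line from the /proc/cpuinfo output posted above; `missing_flags` is a hypothetical helper, and which USE flags count as CPU-related (mmx, sse, sse2) is an assumption taken from the discussion here.

```python
# The "flags" line for processor 0, copied from the cpuinfo output above.
CPUINFO_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm "
    "constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor "
    "ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a "
    "misalignsse 3dnowprefetch ibs skinit wdt arat cpb hw_pstate npt lbrv "
    "svm_lock nrip_save pausefilter"
)

def missing_flags(wanted, cpu_flags):
    """Return the wanted flags that the CPU does not advertise."""
    have = set(cpu_flags.split())
    return [flag for flag in wanted if flag not in have]

print("USE flags missing from CPU:",
      missing_flags(["mmx", "sse", "sse2"], CPUINFO_FLAGS))
print("3DNow! flags missing from CPU:",
      missing_flags(["3dnow", "3dnowext"], CPUINFO_FLAGS))
```

Note that this cpuinfo lists `3dnowprefetch` but neither `3dnow` nor `3dnowext`, so anything built to use 3DNow! instructions would be a mismatch for this CPU.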

As you say, your kernel has CONFIG_GENERIC_CPU=y, so that's OK too.

So, it's all OK, it just doesn't work :(

Unfortunately, I'm out of ideas.  Your kernel is

```
 Linux version 3.4.9-gentoo
```

You could try the testing gentoo-sources in case it really is a kernel bug and it's now fixed.

You could also try rolling your own kernel with the help of kernel-seeds.org.

----------

## MarkCu

Oh, and package.use

```

% cat /etc/portage/package.use 

media-video/vlc dvd ffmpeg mpeg mad wxwindows aac dts a52 ogg flac theora oggvorbis matroska freetype bidi xv svga gnutls stream vlm httpd cdda vcd cdio live lua truetype debug

net-misc/ntp caps

app-portage/layman git subversion

app-admin/gkrellm X lm_sensors

```

----------

## NeddySeagoon

MarkCu,

package.use is all good.

----------

## MarkCu

FYI, for those following - compiled a newer "testing" kernel - gentoo-sources-3.4.11.

Still same crashes.  Going to try 3.6.8 next.

----------

## MarkCu

gentoo-sources-3.6.8 doesn't help either.  Still crashing.

Neddy indicates the fault is showing an illegal opcode.  Can I tell from the dmesg log what the opcode is that's causing the problem?  Or is there a debug switch I can turn on that's more verbose?

Any pointers are appreciated.  I can google around about kernel debugging, but it's a big subject.

Thanks,

Mark
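For what it's worth, the bytes at the faulting address are already in the log: on the `Code:` line, the byte wrapped in angle brackets is the one RIP points at. The sketch below extracts it from excerpts of two `Code:` lines posted earlier in this thread; `bytes_at_rip` is a hypothetical helper. For a full disassembly, the kernel source tree ships a `scripts/decodecode` script that takes an oops on standard input, though how well it works on a given setup is not something this sketch can vouch for.

```python
def bytes_at_rip(code_line, count=2):
    """Return `count` hex bytes starting at the one marked <..> in a Code: line."""
    tokens = code_line.split("Code:", 1)[1].split()
    for index, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            return [token[1:-1]] + tokens[index + 1:index + count]
    return []

# Excerpts (not the full lines) of the Code: lines from two earlier traces.
gpf_code = "Code: 4c 8b 6f 30 4c 89 ef 0f 1f 40 00 <48> 8b 07 f6 c4 08 74 0e"
bug_code = "Code: 48 8b 6c 24 10 48 83 c4 18 c3 <0f> 0b 48 8b 5c 24 08"

print("general protection fault at:", bytes_at_rip(gpf_code, 3))
print("Kernel BUG at:", bytes_at_rip(bug_code))
```

In the "Kernel BUG" trace the bytes at RIP are `0f 0b`, which is the x86 `ud2` instruction. The kernel's `BUG()` macro deliberately executes `ud2`, so an "invalid opcode" report that follows a "Kernel BUG at" line usually means an internal consistency check fired, not that the compiler emitted an instruction the CPU lacks. The `0000` after "invalid opcode:" is the trap's error code, not the opcode itself.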

----------

## NeddySeagoon

MarkCu,

The opcode is 0000 - that was in one of your posts but it doesn't help.

If it was really a kernel bug, lots of users would see it and it would be all over Google. It's not.

That points to your hardware somewhere.

If you have several sticks of RAM, remove them all except one. Now what happens?

Try them in turn, one at a time.

Can you try other binary distros or the Gentoo liveDVD?  That would prove it's not something gone wrong with your builds, as you would be running code built elsewhere.

----------

## toralf

Because I see kswapd in the trace, Linus delayed the current kernel due to a kswapd issue, and GKH added this patch to the stable kernel tree a few minutes ago, it's probably worth testing that particular patch: https://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=queue-3.6/mm-vmscan-fix-endless-loop-in-kswapd-balancing.patch;h=8b0b3bee5d8824bc999adf8bcc2edbc555bc173d;hb=a86ab94abf14d4c06752f5f7363fe72a97f3e372

----------

## krinn

I've never been an expert on the amd64 arch, but there's no x86-64 -march setting in gcc 4.5.4, so your -march= setup might give random results.

Switch to generic or native:

http://gcc.gnu.org/onlinedocs/gcc-4.5.4/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

----------

## MarkCu

toralf,

Thanks for the pointer.  I went and installed git-sources-3.7_rc8.

Configured/make/installed kernel.  

Thought that might have done it.  My test ran about 45 minutes and no crash.  (Previously, it'd always crash in less than 10 minutes).

Still crashed in the end.  

Similar crash report.  Whole thing's here:

http://pastebin.com/raw.php?i=PeGmLcDg

.config:

http://pastebin.com/raw.php?i=KUgSNQnf

Neddy - testing/swapping memory was one of the first things I did.  No change.  Plus memtest86+ ran for 36 hours with no reported issues.  I've got two DIMMs, and tried running with just one or the other.  No change. I've been swinging back and forth over HW vs SW problems.  I've eliminated just about all I can HW-wise - all that's left is the CPU and power supply.

I can try some sort of other live DVD - although I need to go through the hoops to make it work from a USB stick - no CD/DVD/etc drive installed.

Krinn - thanks for the pointer.  I had -march=native before.  My intent was to un-optimize it even more - just generic x86-64.  The way I understand it, native could be using SSE etc.  I was trying to remove even these usages to pare down my problem.

I'll read up more to see what the appropriate march is - probably generic?

Thanks all so far for all the pointers.  Still digging.

----------

## wcg

What is this?

```

Modules linked in: k10temp

```

----------

## MarkCu

 *Quote:*

> What is this?

```

Modules linked in: k10temp 

```

One of the lm_sensors, I think.  Used to monitor temperature. Crash was happening before I installed this, but can take it out.

----------

## wcg

Without having examined everything in detail, a bad opcode is almost always a result of an inappropriate CFLAG. (Bad assembly code that uses an opcode not supported by the CPU, a binutils bug, or a gcc bug would be possible, too, if less common.)

The kernel pretty much sets its own CFLAGS, though, so if you have the correct architecture and use a stable gcc, inappropriate CFLAGS would be pretty rare in kernel compiles. I would look for something in "Processor Type and Features" in the kernel .config. (I use K8 for K10 CPUs. Seems to work.)

----------

## kondor6c

I had an issue with my ASUS P8Z77-V: it had seemingly random segfaults. I eventually tracked it down to my RAM not being on their approved memory vendor list.

----------

## BitJam

A software bug should have been much more repeatable, so despite your hardware tests passing, I think it must be a hardware bug.  The problem is that something complicated is triggering the hardware bug, so it is not caught by the simpler tests.

You have established that it is not a heat related issue or a RAM issue.  I suspect the problem is related to the disk drive subsystem since that is being exercised during your failures but was not exercised in any of your hardware tests.  It is also possible the problem is the CPU.

Unfortunately, the next level of tests involves swapping either the CPU or the motherboard.  I suggest you report this as a defective product and try to get a refund or replacement.

----------

## Ant P.

For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf. That sets 3DNow flags which the newer CPUs don't have.

----------

## wcg

 *Quote:*   

> For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf.

 

I was wondering why someone would use CONFIG_GENERIC_CPU with a K10 architecture chip. So these AMD APUs are not K10s (perhaps some features in common, but not drop-in replacements that will necessarily run the same compiled code). While that module may not be the cause of the error, one wonders if the lm_sensors k10temp module actually works with the AMD HSA (Fusion) architectures.

dmesg:

```

CPU0: AMD C-60 APU with Radeon(tm) HD Graphics stepping 00

```

( http://en.wikipedia.org/wiki/AMD_Fusion )

kernel .config:

```

# CONFIG_MK8 is not set

CONFIG_GENERIC_CPU=y

CONFIG_X86_MINIMUM_CPU_FAMILY=64

```

Could the box need more low memory protection than 64K?

```

CONFIG_X86_RESERVE_LOW=64

```

This can be set as high as 640.

----------

## MarkCu

 *Quote:*

> For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf. That sets 3DNow flags which the newer CPUs don't have.

Things like this are the info I'm looking for.  Thanks.

I've recompiled "world" with new gentoo CFLAGS:

CFLAGS="-O2 -march=athlon64 -mtune=generic"

For the kernel, just this is sufficient, right?

```

# CONFIG_MK8 is not set

CONFIG_GENERIC_CPU=y 

```

Still no change in behavior.

I'll take the k10temp module out next time I recompile.  I only added it to check the temps when I started noticing the crashes.  It was crashing without it.

 *Quote:*

> A software bug should have been much more repeatable, so despite your hardware tests passing, I think it must be a hardware bug. The problem is that something complicated is triggering the hardware bug, so it is not caught by the simpler tests.

The test is very repeatable.  The dd command above fails EVERY time - it never completes successfully; I always get a kernel crash. Sometimes it takes a little longer, but it always crashes.

I'm trying to dig more to convince myself it's hardware.  Think it might be trouble getting an RMA for this - I'm not even convinced myself it's HW.  Gotta replace the whole motherboard and CPU - it's a combo (CPU is BGA soldered to the board).  The thing works fine for just about everything else till I start hitting it hard with the backups.

```

CONFIG_X86_RESERVE_LOW=64 

```

I'll try upping this next...

I'm also going to try and change my test to read from the raw drive(s) instead of the raw RAID device.  Those results may be informative.
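A minimal sketch of that comparison, assuming the goal is simply to read each device end to end and see which read path triggers the crash. It does roughly what `dd if=<device> of=/dev/null` does; `read_through` is a hypothetical helper, and the member device names in the comments are placeholders since the real ones are not given in this thread.

```python
def read_through(path, block_size=1024 * 1024):
    """Read `path` sequentially to the end and return the number of bytes read."""
    total = 0
    # buffering=0 gives raw, unbuffered reads from the file or block device
    with open(path, "rb", buffering=0) as source:
        while True:
            chunk = source.read(block_size)
            if not chunk:  # empty result means end of device/file
                break
            total += len(chunk)
    return total

# Intended use (as root), one device at a time so a crash can be attributed:
#   read_through("/dev/md127")   # the RAID device from the dd test above
#   read_through("/dev/sdX")     # placeholder for one member drive
```

If reading a single member drive crashes the same way, the RAID layer is probably not the trigger; if only the md device crashes, that narrows things in the other direction.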

----------

## BitJam

 *MarkCu wrote:*   

> The test is very repeatable.  The dd command above fails EVERY time - it never completes successfully; I always get a kernel crash. Sometimes it takes a little longer, but it always crashes. 

 Is the crash always in the same place in the code?   When I first installed Gentoo, I had a hardware issue where I could consistently get my machine to crash when I was doing big compiles when using ReiserFS but not ext2.  But no two crashes were identical.
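One rough way to answer that from saved logs is to tally the function named on each `RIP:` line. The sketch below does this for the three `RIP:` lines posted earlier in this thread; `crash_sites` is a hypothetical helper, and the regular expression assumes the line layout these 3.4-era kernels print.

```python
import re
from collections import Counter

# Matches e.g. "RIP: 0010:[<...>]  [<...>] drop_buffers+0x28/0xb0"
RIP_LINE = re.compile(r"RIP: .*\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def crash_sites(dmesg_text):
    """Count how often each function appears as the faulting RIP."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = RIP_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The RIP: lines from the three traces posted earlier in this thread.
rip_lines = """\
[35101.826224] RIP: 0010:[<ffffffff81145fb8>]  [<ffffffff81145fb8>] drop_buffers+0x28/0xb0
[ 2111.906421] RIP: 0010:[<ffffffff8114a658>]  [<ffffffff8114a658>] drop_buffers+0x28/0xc0
[58460.314513] RIP: 0010:[<ffffffff81116ca6>]  [<ffffffff81116ca6>] free_buffer_head+0x66/0x80
"""
print(crash_sites(rip_lines))
```

For the traces posted so far, two of the three faults are at drop_buffers+0x28 and the third is in free_buffer_head, all reached from kswapd via try_to_free_buffers.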

 *Quote:*   

> I'm trying to dig more to convince myself it's hardware.  Think it might be trouble getting an RMA for this - I'm not even convinced myself it's HW.  Gotta replace the whole motherboard and CPU - it's a combo (CPU is BGA soldered to the board).  The thing works fine for just about everything else till I start hitting it hard with the backups.

 Usually there is a time limit on an RMA and the big question is who will pay for shipping.  You don't have to ship it back immediately but I don't want you to miss out on options while you are trying to diagnose the problem.  I think you should get the RMA process in motion.  Ideally, you could discuss it with someone and they would give you more time for further testing before you have to send it back.

I really do think you have a hardware problem that is difficult to diagnose.  I've run into a few of these over the years and they can suck up a tremendous amount of time and energy.  At some point you need to treat it like it's a hardware problem even if you can't prove (even to yourself) that the problem is hardware.  It is now extremely unlikely the problem is a bad instruction in the code.  If it were, you'd be much closer to pinpointing where in the codebase the problem is.

There is no way mis-tuning CONFIG_X86_RESERVE_LOW could cause the problems you have if they are due to software.  If so, then the kernel is garbage and I know it is not garbage.   If you want to play around with things to see if you can work around the bug then you could try turning off multi-core support.    If a non-smp kernel did work then that would be further evidence of a hardware problem although it would not constitute proof.

----------

## wcg

I think I would go with the generic CPU for the kernel and -march=native in make.conf for userspace.

I was thinking the chipset might be corrupting some code that gets loaded low, hence the CONFIG_X86_RESERVE_LOW adjustment.

One would think AMD would have worked this out and supplied the kernel developers with a working set of gcc flags for the different APU versions. (I guess they expect these things to be running Windows desktops.)

----------

