# Random crashes

## Shining Arcanine

I am having issues with programs randomly crashing. It seemed to be linked to my usage of a tmpfs, so I stopped using that and things seemed to be okay, but I just encountered it again. Here is an excerpt of the kernel message log:

 *Quote:*   

> [82581.883517] conftest[25553]: segfault at 0 ip (null) sp 00007fffffa7c228 error 14 in conftest[400000+1000]
> 
> [83433.572868] chrome[32721]: segfault at 140 ip 000000000124904d sp 00007fff688d1170 error 4 in chrome[400000+282a000]
> 
> [83437.364417] chrome[9484]: segfault at 8 ip 00000000012764bc sp 00007fff688d09d0 error 4 in chrome[400000+282a000]
> ...

 

This PC used to run Windows 7 and was rock solid until I changed the memory a few months ago. After that, it would crash every few months. I suspected the memory, but I ran memtest and it did not detect any errors. Perhaps I did not run it long enough. My only other suspicion is that static electricity got into the system when I did the upgrade, and that this is what is causing these issues.

Does anyone have any idea on how I can proceed to troubleshoot?
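
For reading kernel lines like the ones quoted above, a small parser can help. This is only a sketch: the regular expression is an assumption based on the format of the quoted lines, and `parse_segfault` and `decode_error` are names made up for illustration. The x86 kernel prints the `error` field in hexadecimal, so `error 14` is 0x14, not decimal 14.

```python
import re

# Assumed layout, taken from the quoted log lines:
#   name[pid]: segfault at ADDR ip IP sp SP error CODE in OBJECT[...]
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>\(null\)|[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Describe the bits of an x86 page-fault error code."""
    parts = [
        "protection violation" if code & 0x1 else "page not present",
        "write" if code & 0x2 else "read",
        "user mode" if code & 0x4 else "kernel mode",
    ]
    if code & 0x8:
        parts.append("reserved bit set")
    if code & 0x10:
        parts.append("instruction fetch")
    return ", ".join(parts)


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),              # faulting address
        "error": decode_error(int(m.group("err"), 16)),  # field is hex
    }


if __name__ == "__main__":
    line = ("[83433.572868] chrome[32721]: segfault at 140 ip 000000000124904d "
            "sp 00007fff688d1170 error 4 in chrome[400000+282a000]")
    print(parse_segfault(line))
```

Faulting addresses as low as 0, 8 and 0x140 look like accesses through a null or nearly null pointer. That fits a corrupted pointer, but an ordinary software bug produces the same pattern, so the decoded lines narrow things down rather than prove a hardware fault.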

----------

## kimmie

Everything you're saying is consistent with a memory problem, so you're probably right. Memtest sometimes isn't able to push the memory hard enough to get a failure. It's best to leave it running at least overnight, but even then it sometimes won't show up a problem. Temperature can make a difference too: the hotter things are, the more likely you'll get a failure. Things like Chrome and especially VM host programs will really push the memory.
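
To show the kind of check a memory tester performs, here is a minimal user-space sketch. It is not a replacement for memtest: a normal process only sees whatever pages the OS hands it and cannot pick physical addresses, so a clean run says very little. The function name `pattern_test` and the choice of patterns are assumptions for illustration.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each byte pattern, read it back, count mismatched bytes."""
    buf = bytearray(size_bytes)
    errors = 0
    for p in patterns:
        expected = bytes([p]) * size_bytes
        buf[:] = expected                 # write the pattern into the buffer
        if buf != expected:               # read back and compare
            errors += sum(1 for a, b in zip(buf, expected) if a != b)
    return errors


if __name__ == "__main__":
    # 64 MiB is an arbitrary size; a larger buffer touches more memory.
    print("mismatched bytes:", pattern_test(64 * 1024 * 1024))
```

Alternating patterns such as 0x55 and 0xAA flip every bit between passes, which is the usual reason testers use them.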

If you've got more than one stick, I'd try removing one stick at a time to isolate the bad one. If the problem goes away, you've found it. Then try putting the stick back (making sure it's in a different slot than it was initially). Sometimes, if the problem is marginal, this is enough to make it go away. Otherwise, get a new stick, or try swapping with memory from another computer. I've had memory/motherboard combinations that just wouldn't get on, then swapped the memory between machines and had no problems.

Unfortunately the whole thing's pretty tedious. If anybody's got a quicker/more reliable solution, I'd like to hear it too!

----------

## Shining Arcanine

 *kimmie wrote:*   

> Everything you're saying is consistent with a memory problem, so you're probably right. Memtest sometimes isn't able to push the memory hard enough to get a failure. It's best to leave it running at least overnight, but even then it sometimes won't show up a problem. Temperature can make a difference too: the hotter things are, the more likely you'll get a failure. Things like Chrome and especially VM host programs will really push the memory.
> 
> If you've got more than one stick, I'd try removing one stick at a time to isolate the bad one. If the problem goes away, you've found it. Then try putting the stick back (making sure it's in a different slot than it was initially). Sometimes, if the problem is marginal, this is enough to make it go away. Otherwise, get a new stick, or try swapping with memory from another computer. I've had memory/motherboard combinations that just wouldn't get on, then swapped the memory between machines and had no problems.
> 
> Unfortunately the whole thing's pretty tedious. If anybody's got a quicker/more reliable solution, I'd like to hear it too!

 

I ran memtest for more than 9 hours overnight and it found no issues. Is this not long enough to detect a problem?

Also, I run my CPU undervolted, but my system ran without issue with it undervolted from November to March, so I tended to blame the memory for any crashes/instability I had, because they did not start until I changed the memory. Right now I am running a Prime95 torture test on all 4 cores. I hope it discovers the cause of the problem.

----------

## kimmie

There's no real answer to the question of "how long". Memtest can prove memory is bad, but unfortunately it can't prove it's good. It just proves it's probably good, and the longer it runs, the more probable that is. The same applies to Prime95. The Wikipedia article for Prime95 gives some rules of thumb, but that's all they are.
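
The "probably good" point can be put into numbers with a toy model. The sketch below assumes every test pass independently has the same chance of catching a real fault. That assumption is mine and real intermittent faults may not behave that way, and the 10% figure is invented for the example, not measured.

```python
import math


def miss_probability(p_detect_per_pass, passes):
    """Chance that a real fault goes unnoticed through `passes` independent passes."""
    return (1.0 - p_detect_per_pass) ** passes


def passes_needed(p_detect_per_pass, max_miss):
    """Smallest number of passes that brings the miss probability to max_miss or below."""
    if not (0.0 < p_detect_per_pass < 1.0 and 0.0 < max_miss < 1.0):
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss) / math.log(1.0 - p_detect_per_pass))


if __name__ == "__main__":
    p = 0.10  # hypothetical: each pass catches the fault 10% of the time
    for n in (1, 10, 30):
        print(n, "passes -> miss probability", round(miss_probability(p, n), 4))
    print("passes for <= 5% miss:", passes_needed(p, 0.05))
```

The miss probability shrinks with every pass but never reaches zero, which is why a clean run only ever means "probably good".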

If you're undervolting your CPU, I'd definitely return to spec settings, otherwise you're just asking for trouble. Too many variables. Another variable is the use of Windows/Linux: the two OSes have very different memory access patterns, and sometimes problems will only show up on one OS or the other.

Believe it or not, cleaning your computer can help too. Dust buildup really can cause a computer to crash.

The only way you can be sure is to find a way to make the failure repeatable. If you can't, then the only option is to do what you're doing. Run any torture tests you can, return to spec settings, make sure your computer isn't full of dust. Keep your backups up to date and keep on trucking. Maybe it'll come back and bite you again in a way that makes things clearer. Maybe you'll be lucky, and it won't. Or maybe you'll just end up with a new computer a bit earlier than you intended, not so bad really.

Man, I sound like I'm giving a lecture. I must be getting old. Who am I kidding, I AM old   :Cool: .

----------

## shazeal

Might also want to check you're giving the new memory the correct voltage. Some chips are binned down and need a higher voltage to run stable.

----------

## Mad Merlin

I'll agree with pretty much everything that kimmie said already, but in particular, do try to find a reasonably deterministic way to reproduce the problem, then you can start trying to fix it. Assuming >1 stick of RAM, removing them one at a time and testing for stability again is also a good suggestion.

----------

## Shining Arcanine

 *kimmie wrote:*   

> There's no real answer to the question of "how long". Memtest can prove memory is bad, but unfortunately it can't prove it's good. It just proves it's probably good, and the longer it runs, the more probable that is. The same applies to Prime95. The Wikipedia article for Prime95 gives some rules of thumb, but that's all they are.
> 
> If you're undervolting your CPU, I'd definitely return to spec settings, otherwise you're just asking for trouble. Too many variables. Another variable is the use of Windows/Linux: the two OSes have very different memory access patterns, and sometimes problems will only show up on one OS or the other.
> 
> Believe it or not, cleaning your computer can help too. Dust buildup really can cause a computer to crash.
> ...

 

I just realized that if this is a CPU problem, Prime95 might not show it, because all of the stuff crashing involves integer arithmetic and Prime95 tests floating point arithmetic. :/
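
One way to exercise integer arithmetic specifically is to run the same deterministic integer workload many times and check that every run agrees. The sketch below is an assumption about how such a check could look, not an established tool: `integer_workload` and `count_mismatches` are made-up names, the constants are common 64-bit LCG constants, and the round count is arbitrary.

```python
import os
from functools import partial
from multiprocessing import Pool

MASK64 = (1 << 64) - 1


def integer_workload(seed, rounds=200_000):
    """Deterministic integer-only loop: a 64-bit LCG step followed by an xor-shift."""
    x = seed & MASK64
    for _ in range(rounds):
        x = (x * 6364136223846793005 + 1442695040888963407) & MASK64
        x ^= x >> 29
    return x


def count_mismatches(repeats=8, workers=None, seed=12345, rounds=200_000):
    """Run the workload `repeats` times and count runs that disagree with the first."""
    workers = workers or os.cpu_count() or 1
    work = partial(integer_workload, rounds=rounds)
    seeds = [seed] * repeats
    if workers == 1:
        results = [work(s) for s in seeds]       # serial fallback
    else:
        with Pool(workers) as pool:              # one process per core
            results = pool.map(work, seeds)
    reference = results[0]
    return sum(1 for r in results if r != reference)


if __name__ == "__main__":
    print("mismatching runs:", count_mismatches())
```

Because the math is deterministic, any disagreement between runs would point at the machine rather than the program. A result of zero proves nothing, for the same reason a clean memtest run does not.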

Anyway, I will run memtest overnight again tonight. I also ordered more memory from Newegg. It should help with my troubleshooting when it arrives, and hopefully when all is said and done, I will have 8GB of RAM.

----------

## kimmie

Good luck!

----------

## Shining Arcanine

I replaced my RAM with the new RAM that arrived. I am running emerge --jobs -ave world and so far, 337/1060 packages have been rebuilt without a single crash. I applied for an RMA from Newegg and ordered a shipping label from them. I will be sending my RAM back to them for replacement when the label arrives.

Unfortunately, there is no physical difference between the RAM I am returning and an opened package, so I suspect that they might just resell the RAM I return to someone else who won't complain about it. I have read about this happening. :/

Edit: It seems that the crashes are not gone as I thought they were. They are certainly infrequent, but not gone. :/

My emerge just died:

 *Quote:*   

> *** glibc detected *** /usr/libexec/gcc/x86_64-pc-linux-gnu/4.4.4/cc1plus: corrupted double-linked list: 0x0000000002c87fa0 ***
> 
> ======= Backtrace: =========
> 
> /lib/libc.so.6(+0x72695)[0x2b49eef36695]
> ...

 

Here are some kernel messages:

 *Quote:*   

> [  784.874870] conftest[12638]: segfault at 80 ip 00002b27a7c7dd4d sp 00007fff86e10c70 error 4 in libc-2.11.2.so[2b27a7c53000+14f000]
> 
> [  813.827956] conftest[17782]: segfault at 40 ip 00000000555d60f3 sp 00000000ffd187e8 error 4 in libc-2.11.2.so[555b3000+140000]
> 
> [ 3128.887942] EXT4-fs (sda1): mounted filesystem with ordered data mode
> ...

 

I am hoping that the recompilation will get rid of them when it finishes, because everything in RAM right now was built with the old RAM and assuming that the old RAM was the problem, anything built with it could have issues.

----------

## RaceTM

Is it only crashing while emerging? Are you watching your temps? Maybe your cpu is overheating.

----------

## Shining Arcanine

 *RaceTM wrote:*   

> Is it only crashing while emerging? Are you watching your temps? Maybe your cpu is overheating.

 

If my CPU were overheating, Prime95 would have detected it, but Prime95 ran without crashing. As for my system temperatures, I am not sure how to monitor them on Linux.

In other news, since the new memory I purchased also exhibited crashes, I decided to put the old memory back into the system alongside it. So far, there have been no crashes. I am right now compiling OpenOffice inside a tmpfs directory. If I can go a month without a single crash, I am going to conclude that my motherboard has stability issues unless it operates in 2T mode, which is mandatory to support 4 memory modules.

Edit: I am also recompiling all KDE packages on my system with sudo emerge -1v --jobs $(eix -I --only-names kde-*/*). So far, so good.  :Smile: 

----------

## kimmie

You can monitor your temps, fan speeds, and power supply voltages (depending on your motherboard) with the lm_sensors package. That might point to other possible causes too (for example, if the 5V rail is down to 4.7 volts, your power supply isn't keeping up). But if you've gone for 2T and the problem goes away, then you've probably found it.   :Smile: 
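
For anyone who wants the raw numbers without the `sensors` tool, the kernel exposes them through hwmon files under `/sys/class/hwmon`, with temperatures in millidegrees Celsius. The sketch below assumes the relevant hwmon driver for your board is loaded. `read_temperatures` and `rail_in_spec` are illustrative names, and the 5% tolerance is the usual ATX figure for the +5 V rail.

```python
from pathlib import Path


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/<chip>/<sensor>": degrees Celsius} from hwmon sysfs files."""
    temps = {}
    for path in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor present but not readable right now
        name_file = path.parent / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        sensor = path.name[: -len("_input")]
        temps[f"{path.parent.name}/{chip}/{sensor}"] = millidegrees / 1000.0
    return temps


def rail_in_spec(measured_volts, nominal_volts, tolerance=0.05):
    """True if a supply rail is within +/- tolerance of its nominal voltage."""
    return abs(measured_volts - nominal_volts) <= nominal_volts * tolerance


if __name__ == "__main__":
    for sensor, celsius in read_temperatures().items():
        print(f"{sensor}: {celsius:.1f} C")
    print("5V rail at 4.7V in spec?", rail_in_spec(4.7, 5.0))
```

With a 5% tolerance the +5 V rail may sit anywhere from 4.75 V to 5.25 V, so a reading of 4.7 V falls just outside it. Motherboard voltage sensors are not always accurate, so a borderline reading is worth confirming with a multimeter.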

----------

