# [SOLVED] Random ICE with gcc, memtest86 says my RAM is OK?

## mr-simon

Here's a fun one...

I'm getting random internal compiler errors (ICEs) with gcc, usually when compiling files that require a lot of RAM (Webkit, I'm looking at you).

Accepted wisdom is "your ram is faulty, check it with memtest86" - I did this, left it running overnight and it completed three passes with no errors. I've also tried Intel's CPU diagnostic tool (on Windows) and it found no issues, and I've run prime95 in all of its modes for extended periods, and it seems happy enough.

I've also cleaned out all of my fans and made sure all my temps are nice and cool.

Recompiling gcc seems to alleviate the problem, but it recurs after a short while. For example, a couple of days ago I rebuilt gcc and then ran `emerge -e world` and everything rebuilt in one pass with no errors, over many hours. I figured I might be in the clear after that, but today it started happening again. (Trying to install libreoffice this time.)

I don't have any fancy use flags. I haven't noticed anything else that I can pin down, but the ICE issues are problematic. Once they've started happening, I can't get past a build of something with Webkit in it - which seems to be a regular occurrence these days. ;)

I'm starting to wonder if my ssd is succumbing to bitrot as I've checked the CPU and RAM. I've looked at it with smartctl and all the numbers are fine, but perhaps it's not telling me the whole story. What is the best way to actually /test/ the drive?

Any other ideas on how I might track this one down? I've been dealing with it for a while now and it's starting to wind me up. :)

----------

## kernelOfTruth

Yes,

you have a swap partition ?

----------

## mr-simon

 *kernelOfTruth wrote:*   

> Yes,
> 
> you have a swap partition ?

 

I do, yep...

----------

## Roman_Gruber

and the PSU is a good one? and biiig enouugh?

SMART is rather useless, as it only tells you what the firmware thinks, lol

could also be file corruption, because of an unstable kernel / file system which is not that well tested...

----------

## toralf

And have you tried with MAKEOPTS=-j1? On my older laptop I created an entry in /etc/portage/env/ specifically so that webkit and friends are compiled with just -j1.
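
A minimal sketch of such a per-package override, assuming Portage's `package.env` mechanism. The file name `serial-build.conf` is made up for illustration, and `net-libs/webkit-gtk` is only an example atom; substitute whichever packages trigger the ICE.

```shell
# File 1: /etc/portage/env/serial-build.conf (hypothetical file name)
# Limits make to a single job for every package that references this file.
MAKEOPTS="-j1"

# File 2: /etc/portage/package.env gets one line mapping a package to it:
#   net-libs/webkit-gtk serial-build.conf
```

With one job at a time, peak memory use during the build is much lower, which may be enough to sidestep the crash. It would not fix a hardware fault.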

----------

## mr-simon

 *tw04l124 wrote:*   

> and the PSU is a good one? and biiig enouugh?

 

Yeah, it's a corsair modular gold 850W. I could try pulling my second GPU and verifying though...

 *tw04l124 wrote:*   

> SMART is rather useless, as it only tells you what the firmware thinks, lol

 

That's what I thought. That's why I was looking for a way to physically test the drive. Is `badblocks` still a thing? Is it relevant for ssds?

 *tw04l124 wrote:*   

> could also be file corruption, because of an unstable kernel / file system which is not that well tested...

 

I'm running 4.0.5-gentoo with an ext4 filesystem so there's nothing very experimental there.

----------

## kernelOfTruth

Could you post the exact error message(s) once it occurs ?

Besides that: any other error messages in dmesg ?

----------

## mr-simon

 *kernelOfTruth wrote:*   

> Could you post the exact error message(s) once it occurs ?

 

Here's one from compiling Fractorium (not in portage):

```
g++ -c -include ../../../release/.obj/EmberGenome -pipe -march=native -fPIC -fpermissive -pedantic -std=c++11 -Wnon-virtual-dtor -Wshadow -Winit-self -Wredundant-decls -Wcast-align -Winline -Wunreachable-code -Wmissing-include-dirs -Wswitch-enum -Wswitch-default -Wmain -Wzero-as-null-pointer-constant -Wfatal-errors -Wall -fpermissive -Wold-style-cast -Wno-unused-parameter -Wno-unused-function -Wold-style-cast -D_M_X64 -D_CONSOLE -D_USRDLL -O2 -O2 -DNDEBUG -fomit-frame-pointer -w  -I/usr/share/qt4/mkspecs/linux-g++ -I. -I/usr/include/CL -I/usr/include/GL -I/usr/include/glm -I/usr/include/tbb -I/usr/include/libxml2 -I../../../Source/Ember -I../../../Source/EmberCL -I../../../Source/EmberCommon -o ../../../release/.obj/EmberGenome.o ../../../Source/EmberGenome/EmberGenome.cpp
../../../Source/EmberGenome/EmberGenome.cpp: In destructor ‘EmberNs::CosVariation<float>::~CosVariation()’:
../../../Source/EmberGenome/EmberGenome.cpp:806:1: internal compiler error: Segmentation fault
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.gentoo.org/> for instructions.
Makefile:203: recipe for target '../../../release/.obj/EmberGenome.o' failed
make: *** [../../../release/.obj/EmberGenome.o] Error 1
Build failed! Check output for errors.
```

Here's some output from building libmwaw, which is a libreoffice dependency:

```
libtool: compile:  x86_64-pc-linux-gnu-g++ -DHAVE_CONFIG_H -I. -I../.. -I../../inc -I/usr/include/librevenge-0.0 -DNDEBUG -march=native -O2 -pipe -fvisibility=hidden -DLIBMWAW_VISIBILITY -Wall -Wextra -pedantic -Wshadow -Wunused-variable -Weffc++ -c ClarisDrawStyleManager.cxx  -fPIC -DPIC -o .libs/ClarisDrawStyleManager.o
In file included from /usr/include/boost/smart_ptr/shared_ptr.hpp:30:0,
                 from /usr/include/boost/shared_ptr.hpp:17,
                 from libmwaw_internal.hxx:103,
                 from MWAWFontConverter.hxx:45,
                 from ClarisDrawParser.cxx:42:
/usr/include/boost/smart_ptr/detail/sp_convertible.hpp: In instantiation of ‘struct boost::detail::sp_convertible<MWAWInputStream, MWAWInputStream>’:
/usr/include/boost/smart_ptr/detail/sp_convertible.hpp:81:37:   required from ‘struct boost::detail::sp_enable_if_convertible<MWAWInputStream, MWAWInputStream>’
/usr/include/boost/smart_ptr/shared_ptr.hpp:420:5:   required by substitution of ‘template<class Y> boost::shared_ptr<T>::shared_ptr(const boost::shared_ptr<Y>&, typename boost::detail::sp_enable_if_convertible<Y, T>::type) [with Y = MWAWInputStream]’
MWAWInputStream.hxx:212:12:   required from here
/usr/include/boost/smart_ptr/detail/sp_convertible.hpp:48:10: internal compiler error: Segmentation fault
     enum _vt { value = sizeof( (f)( static_cast<Y*>(0) ) ) == sizeof(yes) };
          ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.gentoo.org/> for instructions.
Makefile:951: recipe for target 'ClarisDrawParser.lo' failed
make[3]: *** [ClarisDrawParser.lo] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory '/var/tmp/portage/app-text/libmwaw-0.3.5/work/libmwaw-0.3.5/src/lib'
Makefile:383: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/var/tmp/portage/app-text/libmwaw-0.3.5/work/libmwaw-0.3.5/src'
Makefile:501: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/var/tmp/portage/app-text/libmwaw-0.3.5/work/libmwaw-0.3.5'
Makefile:408: recipe for target 'all' failed
make: *** [all] Error 2
 * ERROR: app-text/libmwaw-0.3.5::gentoo failed (compile phase):
 *   emake failed
 *
 * If you need support, post the output of `emerge --info '=app-text/libmwaw-0.3.5::gentoo'`,
 * the complete build log and the output of `emerge -pqv '=app-text/libmwaw-0.3.5::gentoo'`.
 * The complete build log is located at '/var/tmp/portage/app-text/libmwaw-0.3.5/temp/build.log'.
 * The ebuild environment file is located at '/var/tmp/portage/app-text/libmwaw-0.3.5/temp/environment'.
 * Working directory: '/var/tmp/portage/app-text/libmwaw-0.3.5/work/libmwaw-0.3.5'
 * S: '/var/tmp/portage/app-text/libmwaw-0.3.5/work/libmwaw-0.3.5'
```

Rebuilding libmwaw a second time fails with:

```
/bin/sh ../../libtool  --tag=CXX   --mode=compile x86_64-pc-linux-gnu-g++ -DHAVE_CONFIG_H -I. -I../..    -I../../inc -I/usr/include/librevenge-0.0  -DNDEBUG -march=native -O2 -pipe -fvisibility=hidden -DLIBMWAW_VISIBILITY -Wall -Wextra -pedantic -Wshadow -Wunused-variable -Weffc++ -c -o FullWrtText.lo FullWrtText.cxx
FullWrtGraph.cxx: In static member function ‘static void __gnu_cxx::__alloc_traits<_Alloc>::deallocate(_Alloc&, __gnu_cxx::__alloc_traits<_Alloc>::pointer, __gnu_cxx::__alloc_traits<_Alloc>::size_type) [with _Alloc = std::allocator<std::_Rb_tree_node<std::pair<const int, boost::shared_ptr<FullWrtStruct::Entry> > > >; __gnu_cxx::__alloc_traits<_Alloc>::pointer = std::_Rb_tree_node<std::pair<const int, boost::shared_ptr<FullWrtStruct::Entry> > >*; __gnu_cxx::__alloc_traits<_Alloc>::size_type = long unsigned int]’:
FullWrtGraph.cxx:793:1: internal compiler error: Segmentation fault
 }
```

The crash isn't in the same place... It's in a different file.

 *kernelOfTruth wrote:*   

> Besides that: any other error messages in dmesg ?

 

Last thing dmesg had to say was a while ago:

```
[   32.332732] <6>[fglrx] Firegl kernel thread PID: 2267
[   32.332915] <6>[fglrx] Firegl kernel thread PID: 2268
[   32.333036] <6>[fglrx] Firegl kernel thread PID: 2269
[   32.333148] <6>[fglrx] IRQ 69 Enabled
```

Last thing from journalctl:

```
Sep 06 16:12:48 frey.simons-house.co.uk sudo[14978]: simon : TTY=pts/1 ; PWD=/home/simon ; USER=root ; COMMAND=/usr/bin/emerge --resume
Sep 06 16:12:48 frey.simons-house.co.uk sudo[14978]: pam_unix(sudo:session): session opened for user root by (uid=0)
Sep 06 16:13:09 frey.simons-house.co.uk sudo[14978]: pam_unix(sudo:session): session closed for user root
```

----------

## mr-simon

 *mr-simon wrote:*   

>  *tw04l124 wrote:*   
> 
> > and the PSU is a good one? and biiig enouugh?
> 
> Yeah, it's a corsair modular gold 850W. I could try pulling my second GPU and verifying though...

 

I pulled my second GPU out, and I still get exactly the same symptoms. If my PSU wasn't big enough before, it should be now.

----------

## kernelOfTruth

two things come to mind:

a botched compiler (or filesystem corruption)

or

some CFLAGS weirdness:

you tried going with utterly conservative flags ?

e.g. -O2 -pipe

or even

-Os -pipe

?

----------

## mr-simon

 *kernelOfTruth wrote:*   

> two things come to mind:
> 
> a botched compiler (or filesystem corruption)

 

I've rebuilt gcc and tried with both 4.8 and 4.9. As noted above, 'emerge -e world' worked straight after re-merging gcc (no version change) which is why I was suspecting bitrot on the ssd. I've checked with e2fsck and it seems OK otherwise.

smartctl says my drive is OK, but as noted above that's only as far as the firmware seems to know. I'd be interested in the best way to actually test the drive... Back in the day I did this with `badblocks`, but I'm guessing that there's a better way these days? (I don't think it's even in portage)

 *kernelOfTruth wrote:*   

> some CFLAGS weirdness:
> 
> you tried going with utterly conservative flags ?

 

emerge --info says

```
CFLAGS="-march=native -O2 -pipe"
```

I tried setting it to -Os -pipe in my make.conf... I can still repro the problem.

----------

## Buffoon

When memtest86 tells you your RAM is bad, then bad it is.

When memtest86 passes your RAM then it means nothing. You may still have bad RAM.

----------

## mr-simon

 *Buffoon wrote:*   

> When memtest86 tells you your RAM is bad, then bad it is.
> 
> When memtest86 passes your RAM then it means nothing. You may still have bad RAM.

 

Guess you might be right. I'll try pulling them a pair at a time and re-running emerge to see if I can make the problem go away.

I would still like to rule out SSD issues though. Can anyone please suggest a decent method of (ideally non-destructively) checking my SSD for errors?
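
One non-destructive option is a read-only surface scan with `badblocks` (part of e2fsprogs). The sketch below assumes that is acceptable; the function name `scan_disk` and the `DRY_RUN` switch are made up for this example, and `/dev/sdX` stands in for the real device. A read-only pass can only catch sectors the drive fails to read, not data that reads back silently wrong.

```shell
#!/bin/sh
# Sketch: read-only surface scan with badblocks.
# Without -n or -w, badblocks only reads, so data on the device is untouched.
# scan_disk and DRY_RUN are hypothetical names used for illustration only.
scan_disk() {
    dev=$1
    # -s: show progress, -v: verbose error reporting
    set -- badblocks -sv "$dev"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # print the command for review instead of running it
    else
        "$@"         # needs root and a real block device
    fi
}
```

Running `DRY_RUN=1 scan_disk /dev/sdX` prints the command for review; dropping `DRY_RUN` runs the scan, which can take a long time on a large drive.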

----------

## NeddySeagoon

mr-simon,

Bad caps on the Vcore PSU right next to the CPU.

They have a very hard life.

Another fun one ... The 12v power connector to the Vcore PSU. It's 4, 6 or 8 wires, in yellow/black pairs.

Check it's not been getting hot.  If it's like mine, it's well charred.

Both of these things are only problems under CPU load and even then, they are intermittent.

----------

## mr-simon

 *NeddySeagoon wrote:*   

> Bad caps on the Vcore PSU right next to the CPU.
> 
> They have a very hard life.
> 
> Another fun one ... The 12v power connector to the Vcore PSU. It's 4, 6 or 8 wires, in yellow/black pairs.
> ...

 

Thanks, Mr. Seagoon. This sounded like the most plausible explanation, but upon visual inspection everything looked fine.

I also tested the ssd with badblocks. No issues there.

However, Buffoon is correct. I had figured that memtest86 was exhaustive enough to verify everything with a good deal of confidence. I wrote a script which gradually compiled more and more things until the computer ran out of RAM (ran out of RAM == ram good, segfault == RAM bad) and then started pulling out and swapping DIMMs until I tracked it down to a faulty pair.
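
The original script was not posted, so what follows is only a rough sketch of the idea, not the script actually used. It assumes a single large C++ source file is compiled in more and more parallel jobs, and that a log line matching the ICE messages quoted earlier in the thread means the compiler crashed. The function names, the doubling schedule and the `-O2` flag are all assumptions, and the loop stops at a fixed job limit rather than waiting for the machine to run out of RAM.

```shell
#!/bin/sh
# Sketch of a RAM stress loop built on parallel compiles.
# Function names, the doubling schedule and the g++ flags are assumptions.

# Print "segfault" if a build log shows gcc crashing with a segmentation
# fault, otherwise "clean". An out-of-memory kill does not match the pattern.
classify_log() {
    if grep -q 'internal compiler error: Segmentation fault' "$1"; then
        echo segfault
    else
        echo clean
    fi
}

# stress SRC MAX: compile SRC in n parallel jobs, doubling n up to MAX.
stress() {
    src=$1
    max=$2
    tmp=$(mktemp -d) || return 2
    n=1
    while [ "$n" -le "$max" ]; do
        : > "$tmp/log"
        i=0
        while [ "$i" -lt "$n" ]; do
            g++ -O2 -c "$src" -o "$tmp/$i.o" >> "$tmp/log" 2>&1 &
            i=$((i + 1))
        done
        wait    # let every job of this round finish
        if [ "$(classify_log "$tmp/log")" = segfault ]; then
            echo "compiler segfault at $n parallel jobs: suspect RAM"
            rm -rf "$tmp"
            return 1
        fi
        n=$((n * 2))
    done
    rm -rf "$tmp"
    echo "no compiler segfault up to $max parallel jobs"
}
```

Something like `stress some-large-file.cpp 64` would then be repeated after each DIMM swap, with the file and the job limit chosen so that the last rounds fill most of the installed RAM.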

Lesson learned on that one. Thanks for your help, all.

----------

## NeddySeagoon

mr-simon,

I bet if you put them back they will be OK.  That's called wiping the contacts.

It's probably only one stick if there really is a fault, too.

----------

## mr-simon

 *NeddySeagoon wrote:*   

> mr-simon,
> 
> I bet if you put them back they will be OK.  That's called wiping the contacts.
> 
> It's probably only one stick if there really is a fault, too.

 

I already put them back to verify my findings and rule out cosmic rays, Venus in conjunction with Saturn etc. - They showed the fault when I put them back, so it's fairly safe to assume that one or both of them are at fault.

You're right, I should probably narrow it down to one and keep the other as a spare... I think I can only buy replacements in multiples of 2 though.

----------

