# Bad ram questions

## Saundersx

So I just found out my personal server has bad RAM. I don't know how long this has been the case, but I'm guessing at least a month before I caught it. Damn thing passed memtest last night, but it does fail when running memtester from the command line.

I enabled CONFIG_MEMTEST (which is a godsend), passed "memtest=17" and it promptly found the bad bits.

```
[    0.011273]   0x0000000000100000 - 0x0000000001000000 pattern 4c494e5558726c7a
[    0.017487]   0x0000000002607000 - 0x00000000ca332000 pattern 4c494e5558726c7a
[    1.411565]   0x00000000cca34000 - 0x00000000cca35000 pattern 4c494e5558726c7a
[    1.411572]   0x00000000ccc3b000 - 0x00000000cd083000 pattern 4c494e5558726c7a
[    1.413409]   0x00000000cd7f4000 - 0x00000000cd800000 pattern 4c494e5558726c7a
[    1.413433]   0x0000000100001000 - 0x000000082effa000 pattern 4c494e5558726c7a
[   14.316160]   0x0000000000100000 - 0x0000000001000000 pattern eeeeeeeeeeeeeeee
[   14.322699]   0x0000000002607000 - 0x00000000ca332000 pattern eeeeeeeeeeeeeeee
[   15.735012]   0x00000000cca34000 - 0x00000000cca35000 pattern eeeeeeeeeeeeeeee
[   15.735019]   0x00000000ccc3b000 - 0x00000000cd083000 pattern eeeeeeeeeeeeeeee
[   15.736843]   0x00000000cd7f4000 - 0x00000000cd800000 pattern eeeeeeeeeeeeeeee
[   15.736866]   0x0000000100001000 - 0x000000082effa000 pattern eeeeeeeeeeeeeeee
[   28.605853]   0x0000000000100000 - 0x0000000001000000 pattern dddddddddddddddd
[   28.612362]   0x0000000002607000 - 0x00000000ca332000 pattern dddddddddddddddd
[   30.017710]   0x00000000cca34000 - 0x00000000cca35000 pattern dddddddddddddddd
[   30.017717]   0x00000000ccc3b000 - 0x00000000cd083000 pattern dddddddddddddddd
[   30.019541]   0x00000000cd7f4000 - 0x00000000cd800000 pattern dddddddddddddddd
[   30.019563]   0x0000000100001000 - 0x000000082effa000 pattern dddddddddddddddd
[   42.934573]   dddddddddddddddd bad mem addr 0x000000078519fe98 - 0x000000078519fea0 reserved
[   42.934578]   0x000000078519fea0 - 0x000000082effa000 pattern dddddddddddddddd
[   44.115055]   0x0000000000100000 - 0x0000000001000000 pattern bbbbbbbbbbbbbbbb
[   44.121839]   0x0000000002607000 - 0x00000000ca332000 pattern bbbbbbbbbbbbbbbb
[   45.568860]   0x00000000cca34000 - 0x00000000cca35000 pattern bbbbbbbbbbbbbbbb
[   45.568866]   0x00000000ccc3b000 - 0x00000000cd083000 pattern bbbbbbbbbbbbbbbb
[   45.570706]   0x00000000cd7f4000 - 0x00000000cd800000 pattern bbbbbbbbbbbbbbbb
[   45.570730]   0x0000000100001000 - 0x000000078519fe98 pattern bbbbbbbbbbbbbbbb
[   57.283510]   0x000000078519fea0 - 0x000000082effa000 pattern bbbbbbbbbbbbbbbb
```

and now running memtester passes.
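For anyone who wants to pull the bad ranges out of a long boot log instead of eyeballing it, here is a minimal Python sketch. It assumes the "bad mem addr" line looks exactly like the one in the log above; the function name `find_bad_ranges` and the `sample` text are made up for illustration, and the end address is treated as exclusive (which is what the next tested range starting at ...fea0 suggests).

```python
import re

# Matches lines such as:
#   [   42.934573]   dddddddddddddddd bad mem addr 0x000000078519fe98 - 0x000000078519fea0 reserved
# The line format is assumed from the dmesg output quoted in this thread.
BAD_MEM = re.compile(
    r"([0-9a-f]{16}) bad mem addr (0x[0-9a-f]+) - (0x[0-9a-f]+) reserved"
)

def find_bad_ranges(dmesg_text):
    """Return (pattern, start, end) tuples for each 'bad mem addr' line."""
    found = []
    for line in dmesg_text.splitlines():
        m = BAD_MEM.search(line)
        if m:
            found.append((m.group(1), int(m.group(2), 16), int(m.group(3), 16)))
    return found

# Three lines copied from the log above.
sample = """\
[   30.019563]   0x0000000100001000 - 0x000000082effa000 pattern dddddddddddddddd
[   42.934573]   dddddddddddddddd bad mem addr 0x000000078519fe98 - 0x000000078519fea0 reserved
[   42.934578]   0x000000078519fea0 - 0x000000082effa000 pattern dddddddddddddddd
"""

for pattern, start, end in find_bad_ranges(sample):
    print(f"pattern {pattern}: {start:#x} - {end:#x} ({end - start} bytes)")
```

On a live system the same function could be fed the text of `dmesg` after a boot with `memtest=` enabled.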

So the question is how do I assess the damage done? Can things like "equery k packages" even find potential issues given that it could very well be silent corruption when installing? Any suggestions short of a full system recompile appreciated.

----------

## eccerr0r

Sorry, yes, it's SDC (silent data corruption). The best you can do at this point is see whether any programs you installed crash, and yes, also run equery check on all your packages to verify that they were at least copied correctly. But no, short of an emptytree world compile it's difficult to tell what could have been affected.

I have a similar issue on my PVR server. I received two bad DIMMs out of four and didn't know for sure until I tested them. I gave the seller the benefit of the doubt, as I could have damaged them myself (my fingers slipped on these crappy short DIMMs and I could have taken a chunk off of them). Eventually, after repairing the tracks on them (they were failing at almost all locations prior to repair), I still had a few bad addresses. The seller clearly shorted me, but the damage had been done. At first I noticed that memtest86+ would not always fail at that location, so I tried to ignore it and let it be, but unfortunately I occasionally had random hiccups. This was not acceptable, and I immediately blamed the bad RAM. So I translated the bad bytes from memtest86+ to Linux and reserved the 4K page they were in, and all is happy: I just lost 4K of RAM, and the machine is rock solid once more.

----------

## Saundersx

Here is a mini howto for anyone that comes across this in the future.

so the bad ram detected is "0x000000078519fe98 - 0x000000078519fea0", only 8 bytes (one 64-bit word). I'm not going to pretend to know why this happens at the hardware level but I want a buffer around those bad bits.

the hex address is 0x78519fe98 which in decimal is 32297844376

I want this in 4k blocks, so 32297844376 mod 4096 = 3736

32297844376 - 3736 = 32297840640

the 4k block I want to reserve runs from 32297840640 to 32297844736; in hex that is 0x78519f000 to 0x7851a0000, which covers 0x78519fe98

so add "memmap=4K\\\$0x0078519f000" to GRUB_CMDLINE_LINUX in /etc/default/grub
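The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: `memmap_for` and its `pages` argument are names made up here, and it prints the bare kernel parameter, so the `$` still needs the escaping shown above when it goes into /etc/default/grub.

```python
PAGE_SIZE = 4096  # 4 KiB pages, as assumed in the steps above

def memmap_for(bad_addr, pages=1):
    """Build a memmap=SIZE$START kernel parameter covering bad_addr."""
    offset = bad_addr % PAGE_SIZE      # position of the bad byte inside its page
    page_start = bad_addr - offset     # round down to the page boundary
    size_kib = pages * PAGE_SIZE // 1024
    return f"memmap={size_kib}K${page_start:#x}"

bad = 0x78519fe98
print(bad)                 # 32297844376
print(bad % PAGE_SIZE)     # 3736
print(memmap_for(bad))     # memmap=4K$0x78519f000
```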

going forward knowing that you for sure have bad ram I would leave "memtest=17" in there as well and watch dmesg on reboot for any new bad bits.

and this is off topic but I rebuilt all the packages. this boils down to "emerge -eav1 @system ; emerge -eav1 @world" but that is a bit slow.

so i ended up doing -- emerge -eav1 --keep-going --quiet-build --fail-clean @system --exclude portage

and followed up with -- emerge -eav1 --keep-going --quiet-build --fail-clean @world --exclude "$(emerge -pe --color n --columns @system | grep '^\[ebuild   R    ] ' | cut -c18-55 | xargs) portage"

Last edited by Saundersx on Thu Feb 03, 2022 7:33 am; edited 1 time in total

----------

## eccerr0r

It's even easier:

1. Take physical address

0x000000078519fe98

2. Chop off last three nibbles

0x000000078519f___ (this divides by decimal 4096 and keeps only the integer part; don't round)

3. Replace with 000

0x000000078519f000 (this multiplies by decimal 4096 again, giving the address of the first byte in the page)

You just found the page number without converting to decimal, done all in hex, no modulus or calculator needed!

Do this with all the bad addresses and notice that they lie in the same page.  Just use this (these) address(es) directly in your memmap as a 4K block.  Of course, if they're on sequential pages it'd make sense to map out one 8K block instead of two 4K blocks.
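A rough Python sketch of the same idea, including the merging of sequential pages into one bigger block. The helper names `page_base` and `coalesce` are invented for this example, and only the first address in the list comes from this thread; the other two are made up to show the merge.

```python
PAGE_MASK = 0xFFF  # low three hex digits = offset within a 4 KiB page

def page_base(addr):
    """Zero the last three hex digits to get the start of the page."""
    return addr & ~PAGE_MASK

def coalesce(addrs):
    """Group bad addresses into (start, size_in_bytes) runs of adjacent pages."""
    runs = []
    for base in sorted({page_base(a) for a in addrs}):
        if runs and runs[-1][0] + runs[-1][1] == base:
            # This page follows directly after the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 0x1000)
        else:
            runs.append((base, 0x1000))
    return runs

# First address is from the thread; the other two are hypothetical.
bad = [0x78519fe98, 0x7851a0010, 0x123456789]
for start, size in coalesce(bad):
    print(f"memmap={size // 1024}K${start:#x}")
# memmap=4K$0x123456000
# memmap=8K$0x78519f000
```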

--

btw, how extensive is/what algorithms are used in memtest in the kernel?  I know that I have had memtest86+ only catch errors in certain tests and certain patterns.  The first few tests rarely catch errors for me except for those badly damaged DIMMs, it's the later tests that end up catching errors.

--

I got lucky on my DIMMs.  Though they were really badly damaged I was able to repair them up until that one last page.  Machine would not POST without the solder job I did on them :(

----------

## NeddySeagoon

Saundersx,

I'm not a fan of the kernel's built-in memtest as it cannot test all of RAM.

The kernel is not relocatable, so the RAM used by the kernel cannot be tested.

memtest, loaded by your favourite boot loader, moves around in RAM, so it all gets tested.

The problem may not be RAM. If the failure is the same pattern at the same address, it probably is RAM.

If things are overclocked, it can be the first sign of pushing too hard.  

Memory testers rely on the CPU and the RAM PSU (on the motherboard) all operating correctly.

The type of error reported is important in ruling out other causes.

----------

## pjp

 *eccerr0r wrote:*   

> It's even easier:
> 
> 1. Take physical address
> 
> 0x000000078519fe98
> ...

  Would you mind elaborating / clarifying how you're doing hexadecimal division without conversion or a calculator? Doesn't taking the "integer floor" convert to integer?

----------

## Hu

The division / multiplication constant of decimal 4096 has the useful property that in hex it is 0x1000.  Thus, just as you can do quick division-by-10 in decimal by moving the decimal point, and quick floor by discarding all digits to the right of the decimal point, you can in hexadecimal do quick division/multiplication by decimal 4096 by moving the logical decimal point (which was not shown in the example).  Integers can be expressed in any reasonable base.  "Take integer floor" just meant to discard the fractional part of the result, rather than retaining it or using it to round the result.

----------

## NeddySeagoon

pjp,

4096 (decimal) is 0x1000 (hex), which is the 4 KiB page size.

So to divide by 4096, in hex, it's a right shift by three digits, much as dividing by 1000 in decimal.

So right shift and throw away the fractional part, and what is left is the 4k page number.

The method works for this special case.
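To see that dropping the last three hex digits really is the same as dividing by 4096, here is a small Python check using the address from this thread. Dropping three hex digits is a 12-bit shift toward the low end, written `>>` in Python; the variable names are just for illustration.

```python
addr = 0x78519fe98

page_number = addr >> 12          # drop three hex digits (12 bits)
offset = addr & 0xFFF             # the digits that were dropped

# Shifting is the same as floor division by 0x1000 (4096 decimal).
assert page_number == addr // 4096
assert offset == addr % 4096

print(hex(page_number))                  # 0x78519f
print(hex(offset))                       # 0xe98
print(hex(page_number << 12))            # 0x78519f000, first byte of the page
print(hex((page_number << 12) | 0xFFF))  # 0x78519ffff, last byte of the page
```

This only works so neatly because the page size is a power of 16; for a divisor like 1000 decimal you would be back to a calculator.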

----------

## pietinger

 *pjp wrote:*   

>  *eccerr0r wrote:*   It's even easier:
> 
> 1. Take physical address
> 
> 0x000000078519fe98
> ...

 

It is the same as Saundersx did in decimal:

 *Quote:*   

> the hex address is 0x78519fe98 which in decimal is 32297844376
> 
> I want this in 4k blocks, so 32297844376 mod 4096 = 3736
> 
> 32297844376 - 3736 = 32297840640

 

(the comments in parentheses are a little bit confusing)

In other words: if you want the start address of the 4K block that contains a given address, simply set the last 3 hex digits (which span 4096 bytes) to zero. The end address of this block would be: 0x000000078519fFFF

----------

## pjp

Thanks, that makes more sense, although I'd need to work with it to really understand it.

----------

## Saundersx

 *NeddySeagoon wrote:*   

> Saundersx,
> 
> I'm not a fan of the kernels built in memtest as it cannot test all of RAM.
> 
> The kernel is not relocatable, so the RAM used by the kernel cannot be tested.
> ...

 

Like I said in the post, "Damn thing passed memtest", but the kernel test hits the same address every time (so far), which makes me relatively confident in what it found.

On another side note, I used memtest86, the version that requires EFI. This is an older computer (amd8350) which, according to memtest86, has a known buggy implementation. I cannot scan in "parallel" as it makes the machine reboot instantly, and running single core appears not to stress it enough to fail. I'm not dumping more money into this system. It has served me well for many years and if it dies tomorrow I will have gotten my money's worth out of it.

----------

## NeddySeagoon

Saundersx,

Remove the RAM, then reinsert it in the same slots.

That's called wiping the contacts.

While the RAM is out, inspect the contacts on the RAM sticks and in the RAM sockets.

Both connectors should be gold or silver, it doesn't matter which. One side gold and the other side silver is a very bad thing. 

Having 'wiped the contacts' rerun the tests.

----------

## Saundersx

Already shifted sticks around between slots, testing individual sticks in different slots, blew the connectors out with compressed air etc. Definitely tried for that low hanging fruit.

----------

## NeddySeagoon

Saundersx,

Good. If the error follows the RAM stick (the fault address changes) when you move it, it's the RAM stick that's the problem.

When the error stays at the same address, it's something RAM slot related.

----------

