# kernel huge memory allocation & poor swapping

## jpsollie

I have a Linux 32-core Opteron machine with 128GB RAM and 128GB swap on an SSD, attached to an Adaptec 51245 controller (RAID 0).

I'd like your opinions on the following situation:

```

Tasks: 511 total,   1 running, 509 sleeping,   0 stopped,   1 zombie

%Cpu(s):  1.0 us,  0.4 sy,  0.0 ni, 98.4 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st

KiB Mem : 13201221+total,   562912 free, 44266796 used, 87182512 buff/cache

KiB Swap: 13421772+total, 13399577+free,   221952 used. 86669632 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND

 2048 root      20   0 41.720g 0.041t   2368 D  40.1 33.0 106:16.74 e2fsck

16118 root      20   0   12688   1692   1564 S   2.0  0.0   5:27.71 reptyr

 6809 root      20   0   22612   3272   2292 R   1.0  0.0   0:29.78 top

```

The e2fsck is checking a corrupted SAS -> RAID5 -> dmcrypt -> ext4 partition of ±25TB, which is filled with 2TB of data.

The filesystem got corrupted after multiple disks went offline due to a power failure on the SAS expander and were brought back online again.

So, the point on which I'd like your opinion is this:

e2fsck is analyzing the filesystem in an attempt to recover it, and takes a considerable amount of RAM (40GB).

it reports the following errors (as an example):

 *Quote:*   

> Inode 85033944 block 372245002 conflicts with critical metadata, skipping block checks.

 

vmstat reports the following:

 *Quote:*   

> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> 
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> ...

 

Why is there only 110MB swapped out of a memory map of 40GB? And why does vmstat still show si (swap-in) values around 1024?

I know it's not critical (the filesystem is still being repaired), but am I doing something wrong here that slows down the system?

thx

----------

## NeddySeagoon

jpsollie,

swap is only used for swapping dynamically allocated RAM.

Code and data that has a permanent home on storage devices is never sent to swap. 

It's dropped and reloaded as required, after a sync if the data is in a dirty buffer. 

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND

 2048 root      20   0 41.720g 0.041t   2368 D  40.1 33.0 106:16.74 e2fsck 
```

That's 41G of virtual RAM and 41G of resident RAM; with 128G of RAM, why not?

I would hope that most of the 41G is dynamically allocated but why write it to swap unless the RAM is needed?

That would really slow things down.
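If you want to see exactly how much of a given process actually sits in swap, rather than inferring it from the system-wide top line, the kernel reports a per-process breakdown in /proc. A minimal sketch, assuming a Linux /proc filesystem; pass the e2fsck PID (2048 in the top output above):

```shell
# Per-process memory breakdown from /proc/<pid>/status.
# VmSwap counts only this process's swapped-out anonymous memory;
# page-cache pages never show up here -- they are dropped, not swapped.
pid=${1:-$$}   # defaults to the current shell if no PID is given
grep -E '^(VmSize|VmRSS|VmSwap)' "/proc/$pid/status"
```

For the e2fsck above you would expect VmSwap to stay small as long as free RAM holds out, which matches the ~220MB of swap used in the top output.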

I hope your backups are good.

----------

## jpsollie

 *NeddySeagoon wrote:*   

> 
> 
> I hope your backups are good.

 

Well, so much for a bad day:

yesterday I thought the drive recovered (at least the raid controller told me it would), so I told myself 'ok, let's make a tape backup of the drive first', and I took the tape which is dedicated for this partition ...

After one hour tar exited, saying there were too many IO errors on the system. Guess what the status of the tape is  :Sad: 

----------

## Maitreya

Murphy and:

 *Quote:*   

> 
> 
> 1 backup is no backup, 2 backups is a half backup, 3 backups is 1 backup.
> 
> 

 

Confirmed itself again.

The number of times that raid cards have actively ruined my data is too high.

"Oh I think I found a disk part of this raid set, let me repair that for you" "NOOOOOOOO!"

----------

## jpsollie

so it seems.

So, for people interested in the current status of the recovery:

- As expected, the e2fsck utility ran out of memory at 255GB of RAM ... I am happy the swap was on SSD, otherwise I probably wouldn't have lived long enough to see it finish.

- I mounted the filesystem read-only, and started to back up everything using tar --ignore-failed-read, which recovers an undetermined part (not everything, but more than nothing).

- I found a full LTO-5 tape (currently I'm using LTO-6) of the same partition, so I can recover a part of the contents of one year ago.

So, the plan is:

-reformat the /dev/mapper/crypt-data partition; I need to dig into my documentation to find the stride size and so on, as I have forgotten a bit about it.

-untar the contents of the LTO-5 tape into the partition

-untar (and overwrite) the contents of the LTO-6 tape into the partition.
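For the reformat step, the ext4 stride and stripe width can be recomputed from the RAID geometry instead of dug out of old notes. A sketch; the chunk size and data-disk count below are placeholder values that must be replaced with the real array parameters from the controller:

```shell
# stride       = RAID chunk (stripe unit) size / filesystem block size
# stripe_width = stride * number of data-bearing disks (for RAID6: N - 2)
chunk_kb=256    # hypothetical chunk size in KiB -- check the controller config
block_kb=4      # ext4 block size in KiB
data_disks=6    # hypothetical: an 8-disk RAID6 has 6 data disks
stride=$((chunk_kb / block_kb))
stripe_width=$((stride * data_disks))
echo "mkfs.ext4 -E stride=$stride,stripe_width=$stripe_width /dev/mapper/crypt-data"
```

The script only prints the suggested mkfs.ext4 command rather than running it, so the numbers can be sanity-checked first.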

A question I'd like you people to think about is (please move this topic if it would be more suitable in another part of the forum):

What will happen with file1 if a correct (but outdated) version exists on the LTO-5 tape, and tar put the current version on the LTO-6 tape, but that file is damaged? Will ext4 see that the file is damaged? Is it enough to say 'OK, the kernel did not raise an IO error while reading the file to tape, so it is fine'?

----------

## NeddySeagoon

jpsollie,

Your new tape will contain files with gaps in them.  What 'data' appears in the gaps depends on the details of the error handling when the tape was made.

You will have broken but otherwise correct files.

That's the easy bit. Errors in files.

What happens when there is an error in a directory and a block can't be read?

You may lose it and all the subdirectories it contains. The data might still be there on the drive, but the pointers to it are damaged or missing.

It gets worse ...

Errors in filesystem metadata.

fsck is a very bad thing.  In the face of missing or conflicting data, it makes assumptions, guesses, if you like.

It often makes things worse not better.  It makes the filesystem metadata self consistent but says nothing about user data.

Rule one of data recovery. First do no harm.

That means imaging the drives from the array before raid recovery.  That lets you get back to the damaged state.

It's important that you never attempt writes to a damaged filesystem.
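A sketch of that imaging step with plain dd (GNU ddrescue is the better tool for drives with unreadable sectors, since it keeps a resumable map file, e.g. `ddrescue -d /dev/sdX sdX.img sdX.map`); the device node and image name below are placeholders:

```shell
# Take a raw image of each array member before any repair attempt.
# conv=noerror,sync keeps dd going past read errors and pads the bad
# blocks with zeros, so offsets in the image stay aligned with the disk.
src=${1:-/dev/sdX}    # placeholder: the array member to image
img=${2:-sdX.img}     # placeholder: destination image file
if [ -e "$src" ]; then
    dd if="$src" of="$img" bs=1M conv=noerror,sync
else
    echo "skipping: $src not present" >&2
fi
```

Repeat per drive, and work only on the images (or copies of them) from then on.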

Put back your good tape.  Set up some more space for the LTO-6 tape.

Validate the files from the LTO-6 tape and copy them over if they appear to be good.

----------

## jpsollie

Hi NeddySeagoon, 

Thanks for your information.

What exactly do you mean by "validate files"? If I read it correctly, I have to go through 2TB of data to make sure it is all correct? Isn't there a faster way?

if not, I learned the hard way, but thank you for your help  :Smile: 

*edit: about the broken files: so tar just fills a read error of e.g. 4096 bytes with zeroes and still writes the file to tape even though it didn't read correctly? Am I correct?

----------

## NeddySeagoon

jpsollie,

I don't know of a faster automatic way unless you have pre-damage hash values for all the files that you can validate against now.

If you know what the error handling puts in place of valid data .. maybe a block filled with some fixed byte value, you could search for blocks like that.

You might reject a few good files that way as it may not be an error to have such a block.

The error handler may leave whatever junk was in the buffer there too.

You would need to read the code of the tool you used for the backup ... or maybe its documentation.
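If the error handler does pad with a fixed byte value (zero-fill is a plausible assumption, but check your tar version's behaviour as suggested above), the search for such blocks could be sketched like this; it will false-positive on sparse files and files that legitimately contain zero blocks:

```shell
# Heuristic: flag files containing at least one fully-zeroed 4096-byte
# block -- a plausible signature of a padded read error.
has_zero_block() {
    f=$1
    size=$(wc -c < "$f")
    blocks=$(( (size + 4095) / 4096 ))
    i=0
    while [ "$i" -lt "$blocks" ]; do
        # read one 4096-byte block; strip NULs -- empty result means all zeros
        chunk=$(dd if="$f" bs=4096 skip="$i" count=1 2>/dev/null | tr -d '\0')
        if [ -z "$chunk" ]; then
            echo "$f: all-zero block at offset $((i * 4096))"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

Run over the restored tree with something like `find newbackup -type f -exec ...`, then inspect the flagged files first.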

----------

## jpsollie

 *NeddySeagoon wrote:*   

> jpsollie,
> 
> I don't know of a faster automatic way unless you have hash values for all the files pre damage that you can validate now.
> 
> If you know what the error handling puts in place of valid data .. maybe a block filled with some fixed byte value, you could search for blocks like that.
> ...

 

Well, I might as well analyze the filesystem with a hex editor ... I would like things to be a bit faster :p

so, let's summarize this:

- I reorganise the raid partition to be a raid 6 first

- I start the software dmcrypt on it again, and create an ext4 filesystem with 3 dirs: oldbackup, newbackup, and recovered.

- I untar the LTO-5 to oldbackup, and untar the LTO-6 to newbackup

now a small script comes to mind:

```

# for each relative path $f present in both trees:
[ "$(md5sum < "oldbackup/$f")" != "$(md5sum < "newbackup/$f")" ] && mv "newbackup/$f" "recovered/$f"

```

so now newbackup contains valid data, recovered contains the files to be inspected (because they are newer or damaged), and oldbackup can be removed again.

not sure where I will get with this, but at least the old files do not have to be inspected  :Smile: 

----------

## NeddySeagoon

jpsollie,

The md5(oldbackup/file) != md5(newbackup/file) test only says the files are different. It says nothing about their relative ages.

You also need to handle files that are not present on both backups elegantly.

The unchanged files do not need to be inspected, as you say.

newbackup never contains valid data; it's data to be validated.

It's either validated by showing it's not changed from oldbackup, or manually.

With a raid6 you should have tried all combinations of two drives missing.

If it works, it's a self-consistent set, raid-wise.  It cannot rebuild and self-destruct in the process, as there is no redundant data to rebuild from.

With software raid you can inspect the event count and last write time of each element. 

I don't know anything about your hardware raid.

How do you cope with files that have the same timestamp on both backups but are damaged in the new backup?

This assumes that you keep timestamps for last changed.

If the times and sizes are the same the file content is probably supposed to be the same.

You might be able to sort by some degree of confidence, so you look at high degree of confidence files first.
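The bookkeeping above (different vs. unchanged, plus files present on only one side) can be sketched as a single comparison pass over the two restored trees; oldbackup/ and newbackup/ are the directory names from the earlier posts:

```shell
# Classify every file across the two restores:
#   only-old / only-new : present in just one tree (handle these explicitly)
#   unchanged           : byte-identical, validated by the old backup
#   differs             : newer or damaged -- needs manual inspection
classify() {
    old=$1 new=$2
    ( cd "$old" && find . -type f | sort ) > /tmp/old.lst
    ( cd "$new" && find . -type f | sort ) > /tmp/new.lst
    comm -23 /tmp/old.lst /tmp/new.lst | sed 's/^/only-old:  /'
    comm -13 /tmp/old.lst /tmp/new.lst | sed 's/^/only-new:  /'
    comm -12 /tmp/old.lst /tmp/new.lst | while IFS= read -r f; do
        if cmp -s "$old/$f" "$new/$f"; then
            printf 'unchanged: %s\n' "$f"
        else
            printf 'differs:   %s\n' "$f"
        fi
    done
}
```

Sorting the "differs" and "only-new" buckets by timestamp would give the rough confidence ordering suggested above, with the caveat already noted: identical timestamps do not guarantee identical content.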

----------

