# Syncing very large number of files to another server

## humbletech99

I've got a folder with a very large number of files and subdirectories (it's actually full of maildir folders for my company) and the file count runs into the millions. I need to replicate this to another server as a backup.

The number of files is not the only consideration; the data volume is quite significant too, even over gigabit.

I've tried using rsync to try to avoid re-copying a large volume of data by only taking the differences across to the backup server, but the extremely large number of files tends to hurt rsync and take forever while it's building up file lists.

Does anybody have a better idea or another tool for replicating a directory structure with both large volume and a large number of files?

----------

## SnakeByte

hi,

as you have  *Quote:*   

> a folder with a very large number of files and subdirectories

 

you can still use rsync.

just do a 

```
# -a to recurse and preserve attributes, -R to keep the relative path on the target
for subfolder in */*; do
    rsync -aR "$subfolder" <target>
done
```

this will reduce the number of files to sync in each step.

regards

----------

## jsosic

I'm syncing 6 GB of data (small files, web server) with rsync, and it works quite fast. Rsync is done after approximately 15 seconds.

If the other server is empty, then do the first copy with scp, and later use rsync for syncing the differences.
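A sketch of that two-step approach (hostname and paths are placeholders, not from the thread):

```shell
# Seed the empty backup server with a plain recursive copy first:
scp -rp /var/mail/users backupserver:/backups/
# Subsequent runs then only transfer the differences:
rsync -a --delete /var/mail/users/ backupserver:/backups/users/
```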

----------

## humbletech99

SnakeByte: thanks for the idea, but I already did this to split the file list into multiple smaller file lists. It was still too much of a burden and takes hours to run, and I know a lot of that time is simply file list generation.

To put this in perspective for you and jsosic, I'll update you with a count from my server... as soon as it stops aggregating it!!!

----------

## humbletech99

OK here are my current counts to give you an idea of what I am trying to manage:

```
$ find . | wc -l
12537631
$ du -cshx .
179G    .
179G    total
```

So there you go: 12 million files/directories and 179G. I had to leave that du running overnight just for it to finish...!

Now you can see why my file lists take so long to build (even when I loop over each subdir independently), and why I don't want to tar the whole thing: that wastes time on a large volume of data that is already on my other server, and recopying everything every time is really out of the question...

All ideas welcome at this point.

----------

## think4urs11

-W might help a bit, but with numbers like that I'd use a file storage server as a backend and have it do regular snapshots for backup, instead of holding the same data on individual servers.

----------

## lucapost

You can use an ftp client that supports a mirror function, like lftp (mirror -R)....
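A hypothetical lftp invocation along those lines (server name, user and paths are examples); mirror -R pushes the local tree to the remote side:

```shell
# Reverse-mirror the local maildir tree up to the backup server:
lftp -u backupuser ftp://backupserver \
    -e "mirror -R /var/mail/users /backups/users; quit"
```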

----------

## SnakeByte

 *lucapost wrote:*   

> You can use an ftp client that supports a mirror function, like lftp (mirror -R)....

 

This would result in a full copy, wouldn't it?

@humbletech99

Can you give some more information about the general directory layout?

Is it symmetric? Is there a given pattern for directory/file names?

regards

----------

## aronparsons

Have you tried using rsync over NFS as opposed to tunneling via SSH? (I'm assuming you're doing something like "rsync /data/. backupserver:/backups/" since you didn't specify otherwise.) If you do this, export it read-only and asynchronous (ro,async).

What is your hard drive configuration (drive speed, interface, RAID, LVM, filesystem, etc)?

Something else that might help is to disable access time updates (the 'noatime' and 'nodiratime' mount options); this may not apply to your filesystem, but it will for ext3 and ReiserFS.
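A sketch under those assumptions (host names and paths are placeholders): export the tree ro,async on the mail server, mount it on the backup server, and run rsync there without any SSH tunnel:

```shell
# On the mail server, in /etc/exports (then run exportfs -ra):
#   /var/mail/users  backupserver(ro,async)

# On the backup server: mount without atime updates, then rsync locally.
mount -t nfs -o ro,noatime,nodiratime mailserver:/var/mail/users /mnt/mail
rsync -a --delete /mnt/mail/ /backups/users/
```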

----------

## humbletech99

 *Think4UrS11 wrote:*   

> -W might help a bit, but with numbers like that I'd use a file storage server as a backend and have it do regular snapshots for backup, instead of holding the same data on individual servers.

 

I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that.

aronparsons: Both servers have 7200rpm SATA RAID arrays spanning several TB. There is no LVM in use on them.

SnakeByte: the structure is like user/maildirs/... where the directory is split into subdirs of users, each subdir containing files and further subdirs as per the maildir structure.

----------

## think4urs11

 *humbletech99 wrote:*   

>  *Think4UrS11 wrote:*   -W might help a bit, but with numbers like that I'd use a file storage server as a backend and have it do regular snapshots for backup, instead of holding the same data on individual servers. 
> 
> I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that.

 

e.g. if you have a high number of small files, or not too many big files whose content changes, -W simply copies the whole file instead of checking the content - it lowers processor usage but probably needs a bit more bandwidth.

Depending on your exact type of data this, as said, might be an option.

Actually, as you have a mail server, normally not too many of the existing files will change, since 1 file = 1 mail (roughly speaking).

There will always be new files, and files will get deleted, but not too many files will change. So there's no real need to have the server check each and every file in depth; it might be quicker to simply transfer changed files completely.
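A sketch of such a whole-file run (host and paths are placeholders); -W disables the delta algorithm, so changed files are sent in full - less CPU, a bit more bandwidth:

```shell
# Whole-file transfer: skip the per-block change check entirely.
rsync -aW --delete /var/mail/users/ backupserver:/backups/users/
```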

----------

## SnakeByte

 *humbletech99 wrote:*   

> OK here are my current counts to give you an idea of what I am trying to manage:
> 
> ```
> find .|wc -l
> 
> ...

 

So 12 million files split by how many users?

You could give rsync on a per user directory a try.

Or tar and bzip each user directory, then copy and untar, to save both CPU (for the change check) and bandwidth.

regards

----------

## cyrillic

 *humbletech99 wrote:*   

> I've tried using rsync to try to avoid re-copying a large volume of data by only taking the differences across to the backup server, but the extremely large number of files tends to hurt rsync and take forever while it's building up file lists. 

 

I think rsync is a good choice in this situation.

Your bottleneck is most likely filesystem performance (or lack thereof).

----------

## humbletech99

 *SnakeByte wrote:*   

> You could give rsync on a per user directory a try.
> 
> 

  Tried that already in a bash script iterating over each user directory individually; it's still too slow and the file lists are too big. *Quote:*   

> Or tar and bzip each user directory, then copy and untar, to save both CPU (for the change check) and bandwidth.

  I've done a very similar thing elsewhere and have to say that it is really not quick at all - even slower than the rsync method, I think, because you have to process a huge amount of data needlessly each time. I even did timing tests and found bzip to be a poor choice due to its extreme CPU usage; gzip was better, since my gigabit network was then no longer the bottleneck as much as CPU on a dual Opteron server!
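For reference, the streaming tar/gzip pipeline being compared here looks roughly like this (host and paths are placeholders):

```shell
# Pack, compress lightly (gzip is much cheaper on CPU than bzip2),
# and unpack on the far side in one stream:
tar -C /var/mail -cf - users | gzip -1 | \
    ssh backupserver 'cd /backups && gunzip | tar -xf -'
```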

cyrillic: that's a good guess. I have observed CPU as the bottleneck on streaming zip -> copy -> unzip operations, and RAM as the bottleneck with this rsync operation, since it chews up the entire gig of RAM on the server just to build the file list before it ever starts transferring files. I expect that tarring would leave disk as the bottleneck, although I haven't tested that last one.

So I'm back to rsync. I should probably get some more RAM, but I'd love to find a solution superior to any of those discussed so far.... I'm tempted to use DRBD for this; it's not trivial - not rocket science, but perhaps a little awkward to make the existing servers and data support it. I have no free block devices, and as far as I know it's not supported to fake it with loopback devices either.

Open to suggestions

----------

## sschlueter

Are you already using rsync 3.0? If not, upgrading might help a bit:

 *Quote:*   

> Beginning with rsync 3.0.0, the recursive algorithm used is now an incremental scan that uses much less memory than before and begins the transfer after the scanning of the first few directories have been completed. This incremental scan only affects our recursion algorithm, and does not change a non-recursive transfer. It is also only possible when both ends of the transfer are at least version 3.0.0. 

 
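The incremental scan only kicks in when both sides run 3.0.0 or later, so it is worth checking both ends (hostname is a placeholder):

```shell
rsync --version | head -n 1                   # local version
ssh backupserver rsync --version | head -n 1  # remote version
```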

----------

## sschlueter

I think the main problem here is that rsync lacks the feature to maintain and utilize an index. It must scan both local and remote directories each time it is run in order to determine the differences.

That being said, I would suggest git as a solution that uses an index.

This would include the following steps:

1) creating a git repository on the original machine

2) regularly creating new revisions by committing new/changed/deleted files

3) cloning the repository on the backup machine (by using git clone)

4) regularly synching the cloned repository (by using git pull)

Step 2) is greatly simplified by an additional tool called gibak: "gibak commit" is all you have to do here. This step still requires a filesystem scan, but the author claims that it's faster than rsync's scan method.

The major advantage is that step 4) is way more efficient than using rsync because everything is indexed now.

Step 1) may not be space efficient (I haven't checked that) but keep in mind that step 2) gets you a versioned backup in addition to mere replication for free. Step 2) in itself is space efficient, by the way.

Edit: Even the scan in step 2) could be avoided if someone created a daemon that utilized the kernel's inotify feature to create the list of new/changed/deleted files. Any volunteers?   :Wink: 
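The four steps above can be sketched locally with plain git (temporary directories stand in for the two servers; gibak is left out of this sketch):

```shell
set -e
SRC=$(mktemp -d)   # stands in for the mail server's tree
DST=$(mktemp -d)   # stands in for the backup server

# 1) create a git repository around the data
git -C "$SRC" init -q
git -C "$SRC" config user.email backup@example.com
git -C "$SRC" config user.name backup

# 2) commit the current state
echo "mail one" > "$SRC/msg1"
git -C "$SRC" add -A
git -C "$SRC" commit -qm "initial snapshot"

# 3) clone the repository on the backup side
git clone -q "$SRC" "$DST/mirror"
git -C "$DST/mirror" config pull.ff only

# 2) again: commit new/changed/deleted files
echo "mail two" > "$SRC/msg2"
git -C "$SRC" add -A
git -C "$SRC" commit -qm "incremental snapshot"

# 4) sync the clone; only the indexed differences travel
git -C "$DST/mirror" pull -q
```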

----------

## humbletech99

sschlueter: thanks for the recommendations. It looks like rsync's algorithm improvements will be well worth it; I'm still on 2.x on both ends but will try rsync 3 when I get a chance.

I've not used git, but I have used subversion, and I have to say I still have my reservations about such a method... I'll look into git when I get a chance though.

Thanks again.

----------

## i92guboj

Some random bits.

-If performance is an issue, NFS will beat SSH because of the lack of encryption.

-Git might be an option to consider.

-If you use compression, use gzip instead of bzip2. It will for sure be a lot faster, and you will still save a lot of bandwidth.

Sorry if some or all of these have already been mentioned; I don't have time right now to read the whole thread.

----------

