# Storage/Backup to avoid bit-rot

## Nicias

I'm not sure if this is the best forum for this, so feel free to move this topic.

I'm looking to revamp my backup system. 

Currently I back up the following items to a pair of disks on a monthly basis (one disk each month):

A) Work and personal files (~6 GB) stored in git repos. All repos are on at least two machines. 

B) System and config files (~500GB) for three machines. 

One of my two backup disks is 2 TB; on that disk I also put:

C) Rips of DVDs (~1.5 TB), mostly just inconvenient to replace. (Backed up every two months.) 

All of these backups are being done with rsync. 

So a couple of things have made me reconsider this plan.

1) I'm pretty close to full on the 2TB drive.

2) I'm getting concerned that I'm not protecting against bit-rot.

The solution to 1) is clear: purchase a new drive, or delete files.

2) is not as clear. One of the machines does a "git gc" on each repo on a monthly basis. Does that provide any protection against bit rot?
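For what it's worth, git stores every object under the SHA-1 of its contents, so corruption is detectable in principle: `git fsck --full` re-hashes every object explicitly, while (as I understand it) `git gc` only verifies the objects it happens to repack. A quick sketch of the content-addressing, using plain coreutils rather than an actual repo:

```shell
# A git blob is hashed as "blob <size>\0<contents>"; flipping any bit of
# the contents changes the hash, which is how fsck detects rot.
printf 'blob 6\0hello\n' | sha1sum
# ce013625030ba8dba906f756967f9e9ca394464a  -
```

That hash is exactly what `git hash-object` reports for a file containing "hello\n", so a periodic `git fsck --full` on each repo would catch rot that a plain rsync would silently propagate.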

I'm clear that I'm doing nothing to protect against bit rot on C), and I'm not too concerned about it on B).

So, I guess my questions are these:

I'm also filling up the drives on the system that has the DVD rips and git repos, so I'm getting a new drive for that as well. That gives me the opportunity to change filesystems or whatever. (All machines run Gentoo except one Mac. Data is on a variety of filesystems; most of it is on ext4.)

What can I do to protect against bit rot? This seems like a case for maybe using btrfs, but I'm not sure. 

Any advice?

----------

## szatox

RAID 6 should be a good option: not only can it survive two failed drives, it can also recover from single-stripe corruption when none of the disks reports a read error. Some implementations seem to recalculate parity on every access, others only when scrubbing. Cheap SATA drives let you read (or write) 1 TB in a bit over an hour, so it's not a big deal to have cron run a scrub weekly  :Smile: 
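With kernel (md) RAID, the weekly scrub can be as simple as a cron job poking the sysfs trigger. A sketch, assuming the array is `md0` and the script name is made up:

```shell
# /etc/cron.weekly/raid-scrub (hypothetical script name)
# "check" reads every stripe and records mismatches in md/mismatch_cnt;
# writing "repair" instead would also rewrite inconsistent stripes.
echo check > /sys/block/md0/md/sync_action
```

Progress shows up in /proc/mdstat while the scrub runs, so you can see it's doing something.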

You just have to make sure when, and in what way, damaged stripes will be restored.

Btrfs, with its RAID and checksums, might be a good option. It pretty much comes down to "is it stable enough?" and "does it actually use those features for data protection?"

Perhaps kernel-based raid would also do.

----------

## Moriah

I have run my backup server essentially the same way for nearly 10 years, or maybe even more (can't remember).  I have a RAID-1 mirror with 3 drives.  This gives me redundancy even when a drive fails, or is hot-removed, or the RAID is re-genning after replacing a drive.  I never want to be in a situation where I don't have redundancy.

I currently run 3 drives in the RAID-1 mirror, all of them 7200 RPM 3 TB SATA drives.  They are mounted in a hot-swappable SATA backplane (available from MicroCenter).  I first create the RAID array, then I run cryptsetup on it so the RAID array looks like a fully encrypted hard drive.  Next I layer LVM on top of that, then an XFS filesystem on top of that.
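That layering might be sketched as below. Device names, and the volume group and mapper names, are illustrative only, and these commands are destructive, so don't run them against disks holding data:

```shell
# 3-way RAID-1 mirror, then LUKS, then LVM, then XFS -- bottom to top.
mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
cryptsetup luksFormat /dev/md0                # encrypt the whole array
cryptsetup luksOpen /dev/md0 backup_crypt
pvcreate /dev/mapper/backup_crypt             # LVM on the crypto layer
vgcreate backupvg /dev/mapper/backup_crypt
lvcreate -l 100%FREE -n backuplv backupvg
mkfs.xfs /dev/backupvg/backuplv               # XFS on top of LVM
```

One nice property of this ordering is that a pulled drive is a plain LUKS-encrypted member, so it's safe in the deposit box without any extra steps.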

It takes about 6 to 8 weeks to fill the 3 TB capacity, leaving several hundred GB of free space.  When the array fills up, I pull the oldest drive (determined by SMART run hours) and replace it with a new drive.  The pulled drive goes to a safe deposit box at the local bank.  Because the drives are all fully encrypted, if a drive gets lost or stolen, all they get is the drive; the data is still safe.  I then delete about a week's worth of the oldest backups from the mounted RAID, as they are preserved on the drive I just pulled.  Each week I delete the oldest week to keep freeing up space while keeping as many weeks of backup online as possible.  I do not delete past where the pulled drive covers!

I kick off a backup run every night at 1:30 AM.  Each machine on my network gets polled, and rsync updates a directory on the RAID array that holds the current backup set for that machine.  Then I make a hard-link copy of that machine's directory to another directory, where the nightly backup is named as: machine/yy/mm/dd/hh:mm:ss/<copy of filesystem for that night>.  This lets me easily find the contents of any file on any machine for any night.  It came in mighty handy when the tax man audited me a few years ago!

Because hard links are used for the copied files, only one copy of a file is needed if that file doesn't change for multiple nights.  That makes this a file-level host-side dedup approach.  Rsync adds a network-level client-side dedup approach, since it only sends changes over the network.  Between the two, I can hold far more than the 3 TB a simplistic full-copy approach would allow, and the nightly backup runs are much quicker.
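The hard-link trick can be sketched with plain coreutils (`cp -al`); rsync's `--link-dest` option does the same thing in one step. Paths and file names here are made up for illustration:

```shell
set -e
work=$(mktemp -d)                       # throwaway area for the demo
mkdir -p "$work/current/etc"
echo "config v1" > "$work/current/etc/app.conf"

# "Copy" the current set into a dated snapshot directory via hard links:
snap="$work/2015/03/01/01:30:00"
mkdir -p "$snap"
cp -al "$work/current/." "$snap/"

# Both names now point at the same inode, so an unchanged file costs
# no extra space no matter how many nightly snapshots reference it:
stat -c %h "$snap/etc/app.conf"         # prints 2 (the link count)
```

The next night, rsync overwrites changed files in `current` with new inodes, so older snapshots keep the old contents untouched.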

----------

## Jaglover

https://forums.freenas.org/index.php?threads/do-i-need-zfs-to-protect-against-bit-rot.5332/#post-18832

Methinks if you are paranoid (not condemning!) go ZFS or BTRFS.

----------

## Moriah

I looked at both when they first came out, but decided that zfs required too much ram and disk to be effective for me, and btrfs was not really ready for prime time.  Both of those conclusions might have changed now, as I have not looked into it for a few years.

----------

## BlueFusion

I have been incredibly happy with BTRFS, its feature set, and its prevention of bit rot. I'm running my systems on btrfs and my backup drives as another btrfs pool. Right now I have a bunch of varying-size drives in a JBOD setup with btrfs. Soon I will be replacing them with a few 4 TB drives and going RAID1 (or maybe trying out RAID6 with the 3.19 kernel additions).

I wrote my own BASH script to backup my important data and configurations to the local btrfs pool and an off site server of my own, both using rsync.

You can use the --checksum option with rsync to further guard against bit rot in transmitted data. I generally use --checksum with --dry-run and just check for differences. Only once has there been a file that differed when it should not have: the file on my remote server had diverged from the original and the local backup. This led to a quick identification and fix.  And by quick I mean it takes a few hours to run the checksum pass on 3TB of data....
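A filesystem-agnostic variant of the same idea (useful for the DVD rips regardless of whether they sit on ext4 or btrfs) is to keep a checksum manifest next to the data and verify it on a schedule. A minimal sketch with coreutils; the file names are made up:

```shell
set -e
d=$(mktemp -d)                          # stand-in for the media directory
echo "movie data" > "$d/film.mkv"

# Record checksums once, while the copy is known good:
( cd "$d" && sha256sum film.mkv > MANIFEST )

# Later, verify; exit status is non-zero if anything changed:
( cd "$d" && sha256sum -c MANIFEST )

# Simulate silent corruption and check again:
echo "movie dataX" > "$d/film.mkv"
( cd "$d" && sha256sum -c MANIFEST ) || echo "bit rot detected"
```

Unlike rsync --checksum, which only tells you two copies differ, the manifest tells you which copy is still good.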

----------

## davidm

Older post, but I will chime in anyway.  I've been extremely happy with btrfs using raid1.  It's been reasonably stable here and in my opinion is pretty safe, provided you stay away from corner cases and from raid5/6 (which will be better with kernel 3.19, when patches providing scrub functionality for those two RAID levels are said to make it in -- but note they will be largely untested, unlike btrfs raid1 or raid10).  I would stay away from single (JBOD) and raid0 (striping), though even those should at least notify you of bit rot via log entries and scrub results.

In the interests of disclosure, there have been some btrfs bugs.  The two most recent severe ones I can think of are one affecting systems using snapshots, which caused data corruption (sometimes, though rarely, irreversible), and another in the earlier 3.17 kernels which caused severe deadlocks (very minor risk of corruption, but severe hangs).  ZFS, then, is probably the more stable and mature choice.
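For reference, the raid1 setup plus the scrub that makes it self-healing looks roughly like this. Device names and the mount point are illustrative, and mkfs is destructive, so treat it as a sketch:

```shell
# Two-device btrfs raid1 for both data (-d) and metadata (-m):
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
mount /dev/sdb /mnt/backup

# Scrub reads every block, and where a checksum fails it rewrites
# that block from the good mirror copy:
btrfs scrub start /mnt/backup
btrfs scrub status /mnt/backup    # shows errors found and corrected
```

With single or raid0 the scrub still finds the bad checksums, but there is no second copy to repair from, which is the difference I'm pointing at.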

Another thing to consider is ECC RAM, if you have the choice.  One bit flip can cause all sorts of problems if undetected, and the more RAM you have, the greater the chance of it occurring.  Absent other control measures, it is possible for this to translate down into the filesystem and cause corruption.  Whether this (bit flips in memory due to the lack of ECC RAM) affects btrfs or ZFS is a highly technical discussion best left to others.  But from what I read on the btrfs mailing list, with btrfs you will at least be notified of it in all circumstances.  With ZFS there is apparently more uncertainty/debate on this.

----------

## BlueFusion

 *davidm wrote:*   

> I would stay away from single (jbod) or raid0 (stiping) though but even those should at least notify you via log entries and scrub results of bit rot.

 

I've been using JBOD (d=single,m=raid1) for a year now.  No problems and no errors reported.

 *davidm wrote:*   

> Another thing to consider is ECC ram if you have the choice.  One bit flip can cause all sorts of problems if undetected and the more ram you have the greater the chances of it occurring.  Absent other control measures it is possible for this to translate down into the filesystem and cause corruption.  Whether this (bit flips in memory due to the lack of ECC ram) affects btrfs or zfs is a highly technical discussion best left to others.  But from what I read on the btrfs mailing list with btrfs you will at least be notified of it in all circumstances.  With ZFS apparently there is more uncertainty/debate on this.

 

Both BTRFS's scrub and ZFS's resilvering functions do not write data.  ECC RAM, although preferable for any situation where data integrity is paramount, is not required for a stable system.  ANY filesystem is EQUALLY affected by ECC or non-ECC RAM: if the data in RAM being written to the filesystem had a bit flip, it doesn't matter what filesystem it is.  BTRFS and ZFS use checksums to verify the data when it is read from the filesystem.  A lot of people have claimed that if the bit flip occurs in RAM while comparing the checksum, it will write back bad data - this is simply NOT TRUE.  It will throw a checksum mismatch error, and upon re-reading the data it will work without issue (assuming the astronomical chance of the bits flipping AGAIN with non-faulty RAM does not occur).

Since I am pushing the limits of my current capacity, I am going to get a few more 4TB drives and use the RAID5 feature set come 3.19 or 3.20.  I keep regular backups so I'm not worried too much.

----------

## davidm

 *BlueFusion wrote:*   

>  *davidm wrote:*   I would stay away from single (jbod) or raid0 (stiping) though but even those should at least notify you via log entries and scrub results of bit rot. 
> 
> I've been using JBOD (d=single,m=raid1) for a year now.  No problems and no errors reported.
> 
>  *davidm wrote:*   Another thing to consider is ECC ram if you have the choice.  One bit flip can cause all sorts of problems if undetected and the more ram you have the greater the chances of it occurring.  Absent other control measures it is possible for this to translate down into the filesystem and cause corruption.  Whether this (bit flips in memory due to the lack of ECC ram) affects btrfs or zfs is a highly technical discussion best left to others.  But from what I read on the btrfs mailing list with btrfs you will at least be notified of it in all circumstances.  With ZFS apparently there is more uncertainty/debate on this. 
> ...

 

I lost an earlier btrfs d=single filesystem after a crash, which is why I switched to raid1.  :Smile:   The problem is that scrub can only report errors, not fix them, since with single there is no second copy.

Here's a huge discussion on ZFS and ecc ram on another forum:

https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/

The author of the first post gets extremely technical about why he believes ECC is needed with ZFS and tries to cite evidence.  It turns into a rather lively debate and runs to over 15 pages.   :Shocked: 

----------

## Moriah

I am mainly concerned about using these larger filesystems for my backup storage device.  I currently use RAID-1 with 3 drives in the mirror.  That way, I always have redundancy, even in times of drive failure or removal, and during recovery.  The idea of not worrying about errors because you have backups is great, unless those errors occur on your backup system.

I have looked into dedup systems, and most of the good ones are for huge storage capacities and enterprise-class servers.  My needs are considerably more modest; I am interested in dedup so I can get more stuff on the same size drives.  The obvious problem is that if you have an error in a block, and that block's contents appear in multiple places in the filesystem, dedup reduces them to a single block plus a little housekeeping info pointing to the right block.  This means a single error in a dedup system can affect multiple files.  RAID mirroring helps, but if the error occurs in RAM, it gets written to all the drives in the mirror, and you are hopelessly lost.

I have toyed with the idea of extending LVM to provide a dedup target, but never got very far with it because I am too busy trying to make a living as an independent consultant to devote a large amount of time to such a project.  Maybe in a few years, if the stock market keeps climbing...    :Confused: 

----------

