# Btrfs nas?

## The_Great_Sephiroth

Before mentioning Rockstor, don't. I like Rockstor, but due to some oddball design where it sees entire disks as block devices which cannot be partitioned, you need three disks for a RAID1 (OS, then the array). That is plain dumb and wasteful. As such, I want to build my OWN NAS using Gentoo and BTRFS.

I am thinking of doing a BTRFS RAID10 array using four 500GB disks I have lying around. This would give me 1TB of space with increased performance. My goal is to install Gentoo in as small a configuration as possible. It will be shell-only, no Apache or anything: Samba for being a domain member and hosting shares to my gaming rigs, smartmontools on a cron job to monitor disk health, and probably cron jobs for maintaining BTRFS.

I want to control disk spin-downs. I really don't want them to spin down. Spin-downs kill desktop disks much faster than laptop disks, and if I go on vacation, I'll just shut it down myself!

So what advice can you give me before I begin this project? The OS partitions will be on BTRFS also, so that if a drive dies the OS can still function. Oh, and that reminds me, how about hot-swap in the event of a failed disk?

----------

## vaxbrat

I would do the same thing myself if I only had one box. Instead I use six boxes in a ceph cluster, but it's built on btrfs mirror sets for my object stores. If you want to consider this, you would want a minimum of three boxes: two for monitors and object stores, and one as the quorum monitor, metadata server, and the host sharing samba and nfs.

Based on your comments, I assume you intend to put your system boot and root on the same four drives as the mirror set. Use btrfs as your system root along with grub2 (or another boot loader that is btrfs-savvy) and a kernel that has btrfs support built in natively, not as a module. I did a writeup on the wiki a few years back about btrfs system mirrors that may still be useful:

https://wiki.gentoo.org/wiki/Btrfs/Native_System_Root_Guide

Make sure you enable raid10 for both metadata and data.  Also use a label to make things more readable in the fstab:

```
mkfs -t btrfs -L NASMIRROR -m raid10 -d raid10 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```

(I assume your boots and roots will be /dev/sda1, /dev/sda2, etc.)

There are two chores you want to stay on top of at a minimum: defragmentation and scrubbing. For defragging, I use the autodefrag option when mounting the sets. For scrubbing I would use a crontab entry, but ceph has its own concept of background scrubbing that runs through each of its placement groups (PGs) every two weeks and effectively does the btrfs scrubs for me.
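For the cron side, a monthly scrub is plenty for a home NAS. A minimal sketch, assuming the array ends up mounted at /raid (adjust the path and schedule to your setup):

```
# /etc/crontab entry: scrub the NAS array at 03:00 on the first of each month.
# -B keeps the scrub in the foreground so cron captures the exit status
# and mails any output to root.
0 3 1 * *  root  /sbin/btrfs scrub start -B /raid
```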

Use on-the-fly compression in btrfs. At worst it's a wash, if most of your files are already-compressed images, video, etc. If your files can shrink to half the size or less, any modern CPU with SSE support can do the compress/decompress with only a few percent of a single core. If the amount of data transferred on each read/write is cut in half or more, that can be a huge increase in effective disk performance. The added latency of the compression instructions is a drop in the bucket compared to the time it takes to move data to and from disk.

After running the array for a while you might consider doing balances, but that would probably be a manual task that takes a day or so each time you schedule it. It can happen in the background, but it will slow down performance while it runs. I've only done it once or twice in my cluster, on a single object store at a time, to see whether it caused any trouble. I decided it wasn't worth the extra work.
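If you do run one, a filtered balance is much cheaper than a full rewrite of every chunk. A sketch, again assuming a /raid mountpoint:

```
# Only rewrite chunks that are less than 50% full; far faster than a full balance.
btrfs balance start -dusage=50 -musage=50 /raid

# Check on, pause, or cancel a running balance:
btrfs balance status /raid
btrfs balance cancel /raid
```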

Use subvolumes so that you can do snapshots and have the option of individual shares for nfs and samba.

At work I helped out an area with a ReadyNAS box that was running a version of debian new enough to have btrfs support. However, their management interface still defaults to doing things with ext4 or xfs on top of mdadm. I rooted in, set up the data partitions as a btrfs array by hand, and then showed their admin how to take snapshots after each daily backup. In the past, he was doing a full image of its share each night and could only fit about a week or two of backups on the ReadyNAS. When I was done, the on-the-fly lzo compression had shrunk the space taken by the share down to less than 10% of the drive space. Then I had him do a btrfs snapshot of the share each night instead of the full copy. The last time I checked, I think he had gone a full year of effectively doing a full image of the filesystem each night in the same amount of space that would previously have filled up after two weeks. The admin's face looked something like this once the realization began kicking in   :Shocked:
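That nightly-snapshot trick boils down to a couple of commands. A sketch using the nas and snapshots subvolume names from the fstab example (assumes the whole array is mounted at /raid):

```
# One-time setup: create the subvolumes on the mounted array.
btrfs subvolume create /raid/nas
btrfs subvolume create /raid/snapshots

# Nightly (e.g. from cron, after the backup job finishes): a read-only,
# dated snapshot of the share. CoW means it costs almost no extra space.
btrfs subvolume snapshot -r /raid/nas /raid/snapshots/nas-$(date +%Y%m%d)

# Housekeeping: list the snapshots and drop old ones when space gets tight.
btrfs subvolume list /raid
btrfs subvolume delete /raid/snapshots/nas-20160101
```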

Your fstab will probably look something like this, with /nas being shared read/write and /backups being shared readonly over nfs and samba:

```
LABEL=root           /                btrfs            defaults,noatime,autodefrag,compress=lzo     0   0
LABEL=NASMIRROR      /raid            btrfs            defaults,noatime,autodefrag,compress=lzo     0   0
LABEL=NASMIRROR      /nas             btrfs            defaults,noatime,autodefrag,compress=lzo,subvol=nas   0   0
LABEL=NASMIRROR      /backups         btrfs            defaults,noatime,autodefrag,compress=lzo,subvol=snapshots   0   0
```

----------

## The_Great_Sephiroth

I would use UUIDs, but I am not familiar with snapshots yet. I did some basic reading on them, but nothing major. I do intend to learn them and use them, though. I WILL be using an eSATA disk as backup as well.

You gave me a BUNCH of info and I believe that I am ready to move on this. Step one, however, is to create my new AD DC using Gentoo, then move on to the snapshots and backup. I am at work and have to keep this quick, but I wanted to thank you for your informative reply.

----------

## alexcortes

Unless the objective IS to use BTRFS, I would go with ZFS instead. Also, for NAS installations I usually prefer to have the system on a separate small disk, or even on a flash device.

https://wiki.gentoo.org/wiki/ZFS

----------

## vaxbrat

I think the telling thing about this is that while Oracle owns the IP for both zfs and btrfs (inherited from the Sun buyout), they use btrfs for their dbms backend on their bastardized version of RHEL. Also, Facebook is a major proponent of btrfs. I've personally been using it for about 4 years now, either as standalone raids or as the underlying filesystem to ceph (for about 2 years, since Firefly). The only rare hiccups I've ever had with it were due to bad memory on consumer hardware (ECC would fix that) or hard lockups from hardware or power failure. In those situations I've been able to copy just about everything off of the bad filesystem, since it mounts readonly in all but the most screwed-up of cases. When you run a clustered filesystem like ceph on top, you just drop the Object Store Daemon (OSD) that runs the bad filesystem, re-initialize, and then re-add the OSD. The cluster does all of the repair work necessary in the background to rebuild the replicas.

I might consider using zfs if I were on a bsd setup, but btrfs has a few more advantages including the ability to both grow and shrink the volume pool on the fly.

----------

## The_Great_Sephiroth

ZFS requires too much overhead. Loads of RAM, for example. I like BTRFS and use it on a few laptops here that run Gentoo. We use it on the home partition with zlib compression (saves more space, since data is duplicated) and it works fine.

----------

## alexcortes

Just to point out: ZFS uses a lot of RAM only if you use dedupe.

----------

## The_Great_Sephiroth

Yes, and BTRFS supports deduplication also, but at much less cost from what I have read. I believe that combining deduplication with zlib compression should offer a lot of space saving on a RAID array used for file storage.
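For what it's worth, btrfs dedupe is out-of-band: a userspace tool finds duplicate extents and asks the kernel to share them, so there's no standing RAM cost like ZFS's in-band dedupe table. A sketch using duperemove (one such tool; bedup is another):

```
# Hash file extents under /nas recursively (-r) and actually dedupe (-d);
# without -d it is a dry run that just reports duplicates.
# The hashfile caches checksums so repeat runs only scan changed files.
duperemove -dr --hashfile=/var/tmp/dedupe.hash /nas
```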

----------

## The_Great_Sephiroth

Vaxbrat, I am new to CoW on this level. I have used systems with CoW before, but never configured or fully understood it. Can you help point me in the right direction here? I want to understand how it works and how to set it up. I have read reports that using CoW and compression results in less compression or no compression.

----------

## brownandsticky

If you suspect the effort and cost will get too high, I'll give a nod to the Netgear RN100 series of NASes.

They use BTRFS on top of md RAID.

A dual bay is serving well; initially as a NAS and now as shared storage for Xenserver. Admittedly the throughput limits its usefulness as a Storage Repository.

----------

## The_Great_Sephiroth

I just came upon a question. How do I set up fstab with this? Each partition has a unique UUID, and the RAID has a UUID. For example, sda2 and sdb2 have UUIDs, but when formatted with BTRFS in RAID1, the RAID1 virtual device has a UUID. Can I use the RAID UUID in fstab? If not, what happens if I use the UUID for sda2 and the entire disk fails? Does it know to mount sdb2 instead? Does it not mount at all and crap itself? What?

----------

## vaxbrat

You probably want to use the UUID that you get back from btrfs fi show:

```
btrfs fi show
Label: 'cephosd0'  uuid: 87a86762-05f6-44fa-860b-f96df085d967
        Total devices 3 FS bytes used 4.00TiB
        devid    1 size 3.64TiB used 2.69TiB path /dev/sdc
        devid    2 size 3.64TiB used 2.69TiB path /dev/sdd
        devid    3 size 3.64TiB used 2.69TiB path /dev/sde

Label: 'thufirraid'  uuid: 5f6e51a3-d8e7-41e1-bdb9-3cd9be0bf7fe
        Total devices 1 FS bytes used 2.63TiB
        devid    1 size 3.64TiB used 2.64TiB path /dev/sdb
```

But as you saw in my example, I just set a LABEL and use that instead. It's also smart enough that you can just use one of the member devices or partitions.
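So, using the example uuid above, an fstab line keyed on the filesystem UUID (not a per-partition UUID) would look like:

```
# Any member device satisfies a filesystem-UUID mount on a multi-device btrfs.
UUID=87a86762-05f6-44fa-860b-f96df085d967  /raid  btrfs  defaults,noatime,compress=lzo  0 0
```

With a multi-device filesystem, just make sure `btrfs device scan` runs before the mount (udev or your initramfs normally handles this) so the kernel knows about all the members.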

----------

## vaxbrat

It's safe to turn on compression and not worry about whether it pays off. When writing an extent, the btrfs worker runs the compressor on the buffer before sending it out. If it compresses down, the result is written out. If the result is the same size or larger, the original buffer is simply stored. Basically it's the same approach taken if you try to run 7zip, jar, or the like on a directory whose contents are already compressed, such as jpg images or compressed video.

So the resulting filesystem size really depends on what you are putting on it. Resulting I/O performance may be substantially faster than expected if the compression is on the order of 2:1 or more, or it will be more on par with whatever bandwidth you have on the hardware. As a rule of thumb, the sequential write performance of a spinning disk pretty much maxes out at 100MB/sec.
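If you're curious what compression actually achieved on a given tree, the compsize tool (packaged as sys-fs/compsize on Gentoo) walks the extent metadata and reports on-disk versus uncompressed sizes per algorithm. A sketch (needs root, and only works on btrfs paths):

```
# Report compressed vs. uncompressed usage for everything under /nas,
# broken down by compression type (none, zlib, lzo, ...).
compsize /nas
```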

Because my cluster spreads out its I/O, it's not uncommon for me to see copies of 10GB video files going out to the object stores at over 200MB/sec even when they don't compress well, but that's what you get when you run with the big kids   :Very Happy:

----------

## pjp

Sorry for the interruption, I'll keep it brief...

 *vaxbrat wrote:*   

> I think the telling thing about this is that while Oracle owns the IP for both zfs and btrfs (inherited from the Sun buyout), they use btrfs for their dbms backend on their bastardized version of RHEL.

  Do you have any references for the dbms stuff (searching isn't returning relevant results)? I'm curious how it's used. I know they use and push ZFS a lot, so it may only be telling wrt the license incompatibility.

----------

## vaxbrat

 *Quote:*   

> Do you have any references for the dbms stuff (searching isn't returning relevant results)? I'm curious how it's used. I know they use and push ZFS a lot, so it may only be telling wrt the license incompatibility.

 

I don't remember where I heard that, but it would make sense. Since Oracle Enterprise Linux is a ripoff of RHEL, btrfs would be packaged by default while zfs would not.

----------

