# discussion: iscsi & NAS & high availability

## Frans

Hi,

I'm a little confused by all the commercial and open-source NAS solutions I've come across.

What I'm looking for is an open-source solution for our datacentre.

I'm thinking about a cluster of storage servers. For now 4 TiB will be enough, but I'd like to be able to expand if needed without reconfiguring the servers.

The web/ftp/mail servers will share the same storage servers, so we'll be more flexible in replacing servers, because only the services/daemons are running on the web/ftp/mail servers.

With this setup the storage servers must always be available, since a failure would be horrible.

So what about a 'RAID 1' iSCSI config? Two or three servers on separate networks that stay in sync and are ready to take over for each other.

What are my options?

My requirements/wishes:

- stable

- high availability

- access from Linux & FreeBSD

- open source!

- easy backup (disaster recovery)

- a daily snapshot, keeping the latest 6 snapshots

- easy to expand by either adding a box or adding a disk

Can you point me to documentation and examples?

----------

## nevynxxx

Two things.

1) RAID 10 (1+0, 0+1, whatever you want to call it) gives you the speed and the safety at the cost of some wasted disk space. By far the best option in my opinion.

2) For what you want, I would ring Sun, IBM, HP, EMC... Seriously, you will waste more time playing with this than it's worth when they can give you a working, out-of-the-box SAN. Specifically, HP does high-availability starter SAN kits for under £10,000.

----------

## tgh

Heh, I just happen to be spending my evening thinking about this as well.  Since we're a small shop, going with a name brand isn't something we're willing to do at first.  I'm trying to nip the proliferation of individual machines in the bud, but without the issue where, if a loaded machine goes down, I'm without those services or data until we get spare parts in.

We're thinking Xen + iSCSI over gigabit switches.  That would allow us to manage our disks better while being able to move services between servers somewhat on-the-fly.  We could have a bunch of lightweight servers for the apps but keep all the data on the iSCSI SAN.

If I understand everything correctly we need:

- (2) 24-port managed gigabit switches for the SAN fabric

- (2) server machines, each attached to one of the SAN fabric switches to act as iSCSI servers

- (3) NICs in each application server: one to talk to the user machines on the 48-port gigabit switch, and one for each of the SAN switches

The application servers would run Xen, with the guest OSes running on iSCSI devices.  Probably in a RAID1 setup where one SAN disk comes from one side of the SAN fabric and the other SAN disk comes from the other side of the SAN fabric.

If a Xen box dies, we migrate the guest VM over to another Xen box.  If one of the SAN switches or SAN storage units goes offline, the Xen boxes keep running due to RAID1.  

Cost-wise, I think it will be around:

$1500 for the base hardware for each SAN storage unit (SATA disks, 12 hot-swap caddies)

$???? for the SATA drives (probably a pair of 750GB to start)

$1500 for each 24-port managed gigabit switch

$1000 for each Xen head server

(Initially, we would start with a single Xen box, a single switch, and a single SAN unit, ramping up to multiple Xen boxes, 2 switches, and 2 to 4 SAN units.)

In our setup, we're allowed some downtime.  I can basically take everything offline after 6pm or on the weekends.  As long as I can get a service back up and running within an hour, everything is fine.  But there are still a lot of questions I have about iSCSI, mostly relating to performance.  Down the road, we could always put in better iSCSI storage units (using SCSI drives), but most of the prices I've seen are $4k-$6k just for the iSCSI box.

----------

## dice

I've been very interested in Coraid products for a while now, although I haven't used them.  I'd be very interested to hear from anyone who has either had experience with them or who has looked into them and decided on something else.  Provided that you don't need to route your disk requests, AoE is, I think, a viable alternative.

----------

## tgh

http://www.iscsi-storage.com/iscsi_ha.htm

http://www.diskdrive.com/iSCSI/reading-room/white-papers.html

And for redundancy, we would probably tie together the two 24-port gigabit switches.  So that if one goes down, you can still talk to either SAN box using the other switch.

More links:

http://zdnetasia.com/smb/features/0,39043755,39299491,00.htm

http://www.networkcomputing.com/channels/storageandservers/showArticle.jhtml?articleID=179100602&pgno=2

----------

## tgh

I've got a better handle on costs now.  For the SAN (using iSCSI) you're looking at costs of around:

$6.67/GB for a 3TiB SCSI SAN box

$2.92/GB for a 3TiB SATA SAN box (dropping to around $2.20/GiB at 5TiB)

$1.77/GB for a 3TiB home-grown SATA SAN box (dropping to around $1.00/GiB at 7.7TiB)

Those costs are merely for the hardware. Double those values if you want redundancy at the SAN level.  Plus you have to add on support contract costs (20% of the hardware costs?).  Plus the time to build the units.  The pre-built SATA iSCSI units are almost worth the cost, unless your labor costs are very low and you're looking for a low-cost stepping stone into the SAN waters.
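As a sanity check on that math, here's a small Python sketch of how the effective cost per GB moves once you double the hardware for redundancy and add a support contract. The figures plugged in are just the rough ones from this post; the function name is made up.

```python
def san_cost_per_gb(hardware_cost, capacity_gb, redundant=False, support_rate=0.20):
    """Effective $/GB: double the hardware for a redundant pair (usable
    capacity stays the same), then add a support contract on top."""
    total = hardware_cost * (2 if redundant else 1)
    total *= (1 + support_rate)          # e.g. 20% support contract
    return total / capacity_gb

# A 3 TiB home-grown SATA box at ~$1.77/GB, hardware only:
capacity = 3 * 1024                      # GiB
hw = 1.77 * capacity                     # ~$5400 of hardware

print(round(san_cost_per_gb(hw, capacity), 2))                  # 2.12 (single box + support)
print(round(san_cost_per_gb(hw, capacity, redundant=True), 2))  # 4.25 (redundant pair)
```

Which is the point being made above: "cheap" SAN storage quietly costs 2-4x the sticker $/GB once you account for redundancy and support.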

Two to three terabytes is probably the entry level where iSCSI and SAN start to make sense.  Smaller than that and it may not be worth it.  The cost per gigabyte only makes sense if you fill the SAN units halfway up with disk drives at the very start.

Once you have a stable SAN, you can do nicer things with virtualization such as moving servers on the fly to different blades / physical servers.

I've got parts ordered for a base SAN box.  I'll probably be using Gentoo and iscsitarget (iSCSI Enterprise Target) to configure the box.  Then I'll build a Xen head unit to test out virtualization and assigning iSCSI storage to the virtual servers.  Eventually we'll add a 2nd SAN box and build a fault-tolerant mesh out of gigabit switches.

----------

## Frans

Ok!

So, iSCSI is the way... What I don't get yet is how to replicate 2 iSCSI boxes.

Say I have 2 iSCSI boxes wired up redundantly, with redundant power supplies, etc...

Putting these 2 boxes in a RAID with LVM on the application server must be possible, but is this the way to do it?

What happens on a server crash, and what happens after a server crash (how do I get them back in sync)?

----------

## tgh

Therein lies the trick (of making the 2 SAN boxes redundant with each other).  Dunno if DRBD is involved or if mdadm can deal with iSCSI devices properly over the SAN (i.e. import one disk set from each SAN box, then RAID1 them with mdadm).  There's also the joy of figuring out whether you can bond the ethernet adapters while still hooking them up to two different switches, or whether you have to simply arrange one of the ethernet adapters as failover for the primary adapter.  Or whether there are special switches that can deal with NIC aggregation across a pair of switches.  After a crash, I'd imagine that mdadm would simply rebuild the RAID array across the SAN.
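For what it's worth, the mdadm-over-iSCSI idea might look roughly like this. This is a sketch only: the IQNs, portal addresses, and device names are all made up, and it assumes a working open-iscsi initiator plus root access.

```shell
# Hypothetical names throughout -- substitute your own IQNs and devices.
# 1. Log in to one target on each SAN box (open-iscsi initiator):
iscsiadm -m node -T iqn.2006-01.com.example:san-a.lun0 -p 192.168.10.1 --login
iscsiadm -m node -T iqn.2006-01.com.example:san-b.lun0 -p 192.168.20.1 --login

# 2. Mirror the two imported disks with mdadm (RAID1 across the SAN fabric):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# 3. After one SAN box comes back from a crash, re-add its half of the
#    mirror and mdadm resyncs it:
mdadm /dev/md0 --re-add /dev/sdc
cat /proc/mdstat   # watch the rebuild progress
```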

This is where a SAN consultant might be worth a few grand.

I do think iSCSI is a good starting point... it's less expensive than FC to get into at first, and you can always use the gigabit switches in the SAN for the regular network later (if you upgrade to FC down the road).

I did find that SMC makes a 16 ($240) or 24 ($320) port gigabit switch that supports aggregation.  It's not managed and not one that I'd use in production, but it's what I plan on using for initial testing.  The SAN box that I'm building uses a pair of Intel dual-port server NICs that I plan on bonding together in some manner.  I'm also planning on a 9-disk SAN to start.  2 drives in a RAID1, 1 hot-spare, and 6 drives in a RAID 1+0 setup.

Due to the costs of having redundant SAN units, we'll definitely only be putting production data on the SAN.  And using a non-redundant SAN for less time-critical tasks such as backups.  We may eventually end up with a 3-layer SAN.  A SCSI disk redundant SAN for the super-critical stuff.  A SATA disk SAN (redundant) for the not-so-critical but still important stuff.  And a slower SAN for bulk storage.

----------

## tgh

After mucking with this for a while longer...

a) DRBD seems to be the watchword for SAN unit failover.  

You'll want the highest speed link that you can get between the two SAN units in order to replicate the DRBD data.  My current plan is to bond a pair of gigabit adapters and link them to the other SAN unit via crossover cables.  That will give me a 2 gigabit DRBD backbone without saturating my primary NICs.

The actual failover setup seems to be tricky and I won't be attempting it until at least next spring when we put a 2nd SAN unit online.  Right now, it seems like everything talks to the primary SAN with automated mirroring to the secondary SAN, then if the primary SAN falls over things switch over to the secondary SAN.

Not sure about fail-over, resyncing, or things like that with DRBD.  Also not sure whether DRBD can be used with pre-built SAN units like the ones from EMC.
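To give a rough idea of the shape of it, a minimal drbd.conf resource for mirroring between the two SAN units might look like this. The hostnames, devices, and addresses are made up; protocol C is DRBD's synchronous mode, so a write completes only once both units have it.

```
resource r0 {
  protocol C;                  # synchronous: writes land on both nodes
  on san-a {
    device    /dev/drbd0;
    disk      /dev/md0;        # the local RAID array being replicated
    address   10.0.0.1:7788;   # over the bonded crossover link
    meta-disk internal;
  }
  on san-b {
    device    /dev/drbd0;
    disk      /dev/md0;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

Clients would then talk to the iSCSI target exported from /dev/drbd0 on the primary node, which is what makes the primary/secondary failover described above possible.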

b) Supposedly, Linux makes it possible to bond adapters across two switches.  The second adapter is left idle until Linux notices that the primary adapter / cable or switch has failed.  So all traffic runs across the primary switch until things go pear-shaped, then traffic starts flowing over the second switch.  (What I'm not entirely sure of, although it should be possible, is how to bond 2 pairs of NICs, then bond them again to deal with switch failure.)

- Definitely need at least 4 NICs in the SAN to do true redundancy.  Two for the primary switch, two for the backup switch.

- Servers can probably get away with only 2 NICs to link to the SAN switches (one for the primary, one for the backup).
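A minimal active-backup bonding setup along those lines might look like this (interface names and addresses are hypothetical):

```
# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=active-backup miimon=100   # poll link state every 100 ms

# Then enslave one NIC per switch, e.g. with ifenslave:
#   ifconfig bond0 192.168.10.5 netmask 255.255.255.0 up
#   ifenslave bond0 eth1 eth2   # eth1 -> primary switch, eth2 -> backup switch
```

Active-backup mode works across two independent switches precisely because only one slave carries traffic at a time, so the switches don't need to know about each other.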

c) iSCSI target on Linux seems to work very well and isn't hard to set up.  In about 15 minutes, I created an LVM volume, mapped it as an iSCSI target (/etc/ietd.conf), and exported it to a Windows XP Pro client for mounting as a block device.
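For reference, that whole exercise boils down to one command and a short config (the volume group, volume, and target names here are made up):

```
# Carve out a volume to export:
#   lvcreate -L 100G -n test0 vg0

# /etc/ietd.conf -- iSCSI Enterprise Target
Target iqn.2006-01.com.example:storage.test0
        Lun 0 Path=/dev/vg0/test0,Type=fileio
        MaxConnections 1
```

Restart ietd and the initiator (Windows, open-iscsi, etc.) should see the LUN as a plain block device.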

d) The open-iscsi initiator for Linux is much more temperamental.  The current ebuild doesn't work with 2.6.17-r4 for AMD64 and I haven't gone looking for a newer version of the open-iscsi software.

e) Some implementations use VLANs to segment the SAN traffic instead of using separate switches.  This should allow the use of jumbo frames (9KB) within the VLAN.
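If you go the jumbo-frame route, enabling it on the host side is just an MTU change (interface name hypothetical; every switch port and host on the SAN VLAN has to support the larger frames too, or performance gets worse, not better):

```shell
# Enable jumbo frames on the SAN-facing interface:
ifconfig eth1 mtu 9000
# or, with iproute2:
ip link set dev eth1 mtu 9000
```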

...

We built our 2TB SAN test unit for around $5000 so far.  It doesn't include nice things like ECC RAM (yet) or redundant power-supplies (yet) or even a hot-spare (yet).  But it has a 2-disk RAID1 and a 4-disk RAID10 for testing.  Hardest part of building a SAN unit is finding a good PCIe motherboard with enough x4 or x16 slots to hold things like ethernet cards and RAID cards.

I suspect that the dual-core Athlon64 X2 4200+ in our SAN test unit will become a bottleneck fairly quickly since we're doing Software RAID.  But it should be enough to get a basic system up and running until the quad-core (I hope) AM2 or AM3 chips come out next year.  Dunno yet if it's going to be the disk access or the network traffic that bottlenecks on the CPU.  We're using Intel PRO/1000 NICs, but I'm not sure if they offload enough traffic from the CPU to be worth it.

----------

