# Building high-capacity storage on Gentoo

## tbart

Hi there!

I have to solve the following problem, and I can't really go for a ready-made solution, as that would get too expensive. We're a small company with a limited budget.

I need to build storage for HD/SD video, mostly for archiving purposes, but -- if possible -- for working on it directly as well (that is, accessing the footage with our editing software).

The target size will be ~150 TB (yes, the number is correct), but 12-16 TB will be OK to start with, as long as it scales later on.

The editing machine is Windows-based and cannot be changed, although I'd like to.

There will soon be more than one editing machine, and the footage must be accessible to at least all of them.

I think that's the point where iSCSI, AoE, FC and other block-level technologies drop out. If I understand correctly, only one initiator can connect to a target at a time. If not, please correct me.

(If I were to use iSCSI, the on-disk file system would have to be NTFS, as the Windows box would access the block layer directly, right? No, I don't think I want NTFS on a xx TB volume...)

So I think I have to stick with CIFS/Samba for sharing. Any guesses whether CIFS will be able to fully exploit the throughput the disks provide to my Linux storage server?

I think I'll go with software RAID 6, as I want an open standard for the on-disk format -- I don't want to find myself in five years' time with a borked hardware RAID controller and no way to read my data.
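For the record, a software array like that is only a few commands with mdadm; a minimal sketch, assuming 16 member disks that show up as /dev/sd[b-q] (the device names and chunk size are placeholders for my setup, not a recipe):

```shell
# Create a 16-disk RAID 6 (14 data + 2 parity); a larger chunk size
# tends to suit big sequential video files.
mdadm --create /dev/md0 --level=6 --raid-devices=16 \
      --chunk=256 /dev/sd[b-q]

# Watch the initial sync (and any later rebuild).
cat /proc/mdstat

# Record the array so it reassembles by UUID on boot.
mdadm --detail --scan >> /etc/mdadm.conf
```

The on-disk metadata is the standard md format, so the disks can be moved to any Linux box and reassembled there -- which is exactly the vendor-independence argument.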

There will still be the write-hole problem with this solution, as I won't have a battery-backed cache. I think I'll risk it, as I do have backups and a rebuild is not too time-critical. A UPS will definitely be used.

Should I go for some sort of RAID-Z(2) instead (skipping the FUSE implementation, as I want speed... which would mean *BSD or (Open)Solaris/Nexenta/etc.) because of the write-hole problem (and other data-corruption issues)?

How often does a "normal" RAID 6 run into problems from on-disk corruption or write holes?
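A rough back-of-the-envelope on the corruption question (illustrative numbers, not measurements): consumer drives are commonly specced at one unrecoverable read error per 1e14 bits, which already makes a clean full read of a 12 TB array a coin toss:

```shell
# Expected unrecoverable read errors (UREs) over a full 12 TB read at
# a specced rate of 1 per 1e14 bits, and the chance of a clean read
# (Poisson approximation: p = exp(-expected_errors)).
awk 'BEGIN { bits = 12e12 * 8; mu = bits * 1e-14;
             printf "expected UREs: %.2f, clean-read chance: %.3f\n", mu, exp(-mu) }'
# prints: expected UREs: 0.96, clean-read chance: 0.383
```

That roughly 60% chance of hitting at least one bad sector during a full-array read (e.g. a rebuild after one drive died) is the usual argument for RAID 6's second parity drive over RAID 5.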

I will then put 16 disks or so (1 or 1.5 TB each) into some sort of enclosure.

How do I connect them to the server? Directly, i.e. buy controllers with 16 or more ports? What about SAS expanders, SATA port multipliers (PMs) and the like? Any suggestions/experience?

Any suggestions for cheap & good SATA enclosures that ideally also support hot-swapping, or at least slide-in installation in drive cages, plus the necessary interface(s) to my server?

There seem to be rather sensible enclosures with PMs included, like this one: http://addonics.com/products/raid_system/mst4.asp, but you can also get enclosures that hold both your PC and 16 disks on a so-called midplane or backplane.

What should the server hardware look like? I.e. how much CPU power do I need for software RAID 6 on xx TB? I know I need a big fat PCIe bus, and a lot of RAM will also be good, I guess.

Finally, I want to go 10GbE, but I'm not sure yet. That can be done later as well; I'll be happy to saturate 1 Gb (or 2 in a trunk) first ;->

Ah yes... I'm also considering Openfiler; maybe that's worth a look as well. On the other hand, I think I'll customize the system a lot and add additional services (DB? backup server? ...), so Gentoo feels more comfortable to me.

If you have any ideas or even better: experience with stuff like this, please post!

Even if it's just a small question you can answer, it will help me a lot, I guess!

Thanks in advance!

th

----------

## VinzC

As a poor man's high-capacity storage solution, I would have thought of a grid of computers, each with a certain number of disks, all accessible through iSCSI. These iSCSI logical units would all be combined into an LVM volume group on a Samba server, but I don't know whether that whole solution is reasonable or not.
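For what it's worth, the aggregation layer of that idea would look roughly like this on the Samba head, assuming open-iscsi and two imported LUNs (the IP, device names and volume names are made up):

```shell
# Discover and log in to a data node's iSCSI target (open-iscsi tools).
iscsiadm -m discovery -t sendtargets -p 192.168.10.11
iscsiadm -m node -p 192.168.10.11 --login

# Pool the imported LUNs into one volume group, carve out a volume,
# and put a filesystem on it for Samba to share.
pvcreate /dev/sdb /dev/sdc
vgcreate storage /dev/sdb /dev/sdc
lvcreate -l 100%FREE -n video storage
mkfs.xfs /dev/storage/video
```

The catch is that losing any one node takes the whole volume group down unless you add redundancy on top.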

A cluster of computers is IMHO a good solution for scalable storage. If you put all these machines on a high-bandwidth (Gigabit or InfiniBand) switched network, it should be somewhat viable. I hope.

I also think there are HOWTOs around for setting up clustered storage.

----------

## szczerb

How about OpenAFS? If I remember right, it is highly scalable, very fast and has a Windows client.

In short:

http://en.wikipedia.org/wiki/OpenAFS

Our docs:

http://www.gentoo.org/doc/en/openafs.xml

http://en.gentoo-wiki.com/wiki/OpenAFS_with_MIT_Kerberos

----------

## tbart

Thanks a lot for your input.

I initially thought of clustering as well, but a few rough calculations soon showed it to be a lot more expensive and/or more error-prone.

PSUs, for example, would need to be redundant in every node, or my arrays would need to be configured so that a whole node plus its disks can fail.

The same goes for network cards, network cables, mainboards, storage controllers, etc.

I know all of these components can die in a single PC as well, but I'll definitely have a redundant PSU, so that should cover roughly 60% of failure cases.

And for the rest: yes, it can and possibly will die. But the chance of one mainboard failing is roughly a tenth of the chance that any one of 10 mainboards fails. And I don't think it would be affordable to distribute the data so that a whole node can fail: either the overall capacity required gets too large, or the ratio of redundant to usable data gets too high.

I want to have this in one device, more or less.

My current research has led me to one of these two solutions:

1) (And this sounds really promising)

One good PC with a stable mainboard + RAM, a redundant PSU, and an onboard ICH10R or possibly even a card-based, well-supported SATA controller that can do PMP (port multipliers, also known as SATA PM):

http://ata.wiki.kernel.org/index.php/SATA_hardware_features

One PMP costs about 60 EUR and connects 5 SATA disks to one SATA port on the controller. That makes 4x5 = 20 drives on a mainboard with 6+ SATA ports (2 ports reserved for the OS RAID 1), which should suffice for now.
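As a quick sanity check on the numbers (same figures as above, plus what a single RAID 6 over those drives would leave usable):

```shell
ports=4        # SATA ports left for PMPs (6 on board minus 2 for the OS RAID 1)
per_pmp=5      # disks behind each port multiplier
drives=$(( ports * per_pmp ))
usable_tb=$(( (drives - 2) * 1 ))  # 1 TB disks; RAID 6 costs 2 drives of parity
echo "$drives drives, ~${usable_tb} TB usable"
# prints: 20 drives, ~18 TB usable
```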

Later, I can always add another cheap 150 EUR 4-port SATA PCIe card for another 20 drives via PMP.

Stuff this into a normal PC case with lots of 3.5" bays, or use 5-in-3 enclosures (5 x 3.5" drives in 3 x 5.25" of bay height) like these:

http://www.addonics.com/products/raid_system/ae4rcs35nsa.asp

I found that http://forums.sagetv.com/forums/showthread.php?t=25709 discusses more or less the same thing (spread all the way from page 1 to page 16).

Performance tests (I know this is a Mac, but hey...):

http://www.amug.org/amug-web/html/amug/reviews/articles/addonics/5x1/

2) (My first little wannabe enterprise-class storage on the cheap)

Big pro: a proper enclosure with drive monitoring and a failure alarm.

a) Buy something complete:

http://www.linux-cluster.de/produkte/File-Server/opteron_RAID_server.shtml

The link is German, but I guess you get the point.

5319 EUR for a complete system (redundant PSU, 16 x 1 TB disks, server hardware).

Linux comes preinstalled, so this should definitely work as expected.

b) Or build something similar, also based on SAS expanders (backplanes):

http://www.acmemicro.com/estore/merchant.ihtml?pid=5291&lastcatid=283&step=4

http://supermicro.com/products/chassis/4U/846/SC846E1-R710.cfm - couldn't find a price at first - oh yes, 1200 USD, not that much actually...

The only thing I still wonder is whether SAS expanders really work flawlessly on Linux (or are they transparent?), and if so, with which SAS controllers.

----------

## efagerho

First, there are a few things I don't think you should compromise on: the host computer and the controllers. Having administered servers for a large student union, I have seen pretty much every possible "low cost" solution, and they turned out to be expensive disasters. It's a lot better to build up from quality components (this needs a bigger initial investment, but gets cheaper in the long run). I recently built a storage solution for home use with an Adaptec 5085 RAID controller running RAID 6 across 16 disks, and I'm very happy with the performance. When I tested software RAID 6, even just 16 disks already clogged the bus pretty effectively, so software RAID is not going to work unless you distribute the system across many computers (which you didn't want). I also have some bad experiences with port multipliers, so I would not recommend them (crappy performance).

I would definitely not go with the cheap solutions that Addonics provides, or you'd just be asking for trouble. The Adaptec controller costs around 700€ (and is currently the fastest on the market price/performance-wise), the battery backup is around 150€, and I would buy two controllers and flash both with the same firmware version, keeping one as a backup (this prevents your borked-controller scenario). This is a lot less error-prone than running software RAID on a cheap SATA controller, and if something goes terribly wrong, you always have good support. As an enclosure, you would need something like the following:

http://www.pc-pitstop.com/sas_cables_enclosures/scsase16.asp

You can chain four of these enclosures behind one Adaptec card, for a maximum of 64 disks per card. When that fills up, you just buy another RAID controller for the server along with a new set of enclosures. Thus, to meet your requirements, you need a server with room for two controllers.

Depending on how many workstations you need to connect, you might want a 10GbE controller, or you can just bond the 4 adapters that come with a typical server -- but bonding never comes near a real 10GbE adapter in performance, so I would at least keep that option open (by having a PCI-X slot in the server). To summarize, my recommendation would be to buy the following to begin with:

http://www.sun.com/servers/netra/x4250/specs.xml

http://www.pc-pitstop.com/sas_cables_enclosures/scsase16.asp

http://www.adaptec.com/en-US/products/Controllers/Hardware/sas/performance/SAS-5085/ (two of these)

http://www.westerndigital.com/en/products/Products.asp?DriveID=503 (16 of these, but you'll also want a few spares)

= 3200€ + 2300€ + 1400€ + 150€ + 16*150€ = ~10000€

It's more expensive, but the components should not break on you.

----------

## tbart

Thanks a LOT for your input! That really seems interesting...

When, and on which bus technology, did you test the software RAID setup with 16 disks? I mean, do you think this is still the case with current mainboards and their increased I/O bandwidth?

Many people say (and I also tend to think this way) that in case of an error or failure, hardware RAID tends to get problematic, because the on-disk layout is proprietary (and I hate proprietary things anyway, being open-source minded). At my last company we lost a 480 GB volume after a controller was replaced with the exact same model. That's why we switched to software RAID and were happy ever after ;->

But that has always been with a small number of disks. So if you really say I'm going to lose a lot of performance with software RAID, then, well, I guess I'll have to go with a hardware solution.

As my current investigations lead me to cheaper SAS backplane solutions, I guess I'll go the "enterprise" route anyway. What do you think of the following?

http://www.supermicro.com/products/chassis/3U/936/SC936E1-R900.cfm

I could also put my PC into it, so this should be really cost-effective -- redundant PSU already included, etc. I can't really find a price for it, but a similar model (936A-R900B) costs around 920 USD, which is pretty cheap.

It features a cascadable SAS expander that is compatible with SATA II drives. So, something very similar to what you proposed.

Btw, how's the support for SAS expanders in current kernels? Or are they transparent? Anything I have to watch out for regarding the controller chip on those expanders?

Regarding 10GbE: yes, I am considering it. But I don't think I want a board with old PCI-X slots. There are already PCIe 10GbE RJ-45 controllers on the market, so if I want to upgrade to higher bandwidth later, I should be able to go with a PCIe card.

Regarding the server hardware... well, 10000 EUR is waaay too much; we cannot afford that. Wouldn't a good mainboard + CPU + RAM combination (I'm thinking of a P45-based chipset and a quad-core on a decent ~150-200 EUR mainboard with good reviews) also do the job in my case? I know a Sun server will probably last longer -- I've worked on Sun Fires for a long time, and they're pretty good machines.

Any suggestions regarding a good, say, "entry level" server mainboard?

I don't really care about some downtime (i.e. 1 or 2 days to get a replacement mainboard/RAM/CPU if something breaks). This beast is more like 90% archive with occasional access and 10% regular activity that can also be carried out on local video RAIDs on the editing machines, so every cent that can be saved has to be considered.

Another question: how many drives should I put into one RAID 6 array? Are 16 or 24 feasible, or is there some magic number below that drive count where reliability already calls for splitting into the next array?

Again: Thanks a lot for your input, very much appreciated!

th

----------

## efagerho

I think your biggest problem is the requirement to scale up to >100 TB. There's no way to get everything into just one server without going for hardware RAID controllers. Even if you buy a PC motherboard and fill every PCIe slot with SATA controllers, you're still only up to maybe 30 drives. That means you have to buy a disk controller that supports SAS expanders (enclosures with built-in expanders don't need any drivers, btw), and the only controllers supporting that are hardware RAID controllers, the cheapest being some HighPoint controller at maybe 350€. However, nothing prevents you from configuring the RAID controller to expose the disks as a JBOD and using software RAID, though I would still let the controller handle the array.

Regarding the server, I would not build it myself. You're not going to save much, as you can get an OK server for around 1500€ (e.g. a Sun Fire X2100 M2), and they tend to be a lot more robust. The servers I've had hardware problems with are usually the ones I built myself...

Regarding RAID 6, I would not go over 16 drives per array, and I would also buy quality disks. If you want to handle the array in hardware, you need enterprise drives designed for RAID controllers. The reason is that a desktop drive will stall forever on a read error, causing the controller to kick out the whole drive even if just one sector was faulty, while an enterprise drive will quickly abort with an error and let the RAID controller read the data from another drive. This gets pretty dangerous when you have lots of disks.
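That timeout behaviour (TLER/ERC) can be inspected, and on some drives tuned, with a recent smartmontools; a sketch, with /dev/sda standing in for an array member:

```shell
# Show the drive's SCT error recovery timeout, if it supports SCT.
smartctl -l scterc /dev/sda

# Try to cap read/write error recovery at 7 seconds (values are in
# tenths of a second). Enterprise drives accept this; many desktop
# drives refuse, or forget the setting on a power cycle.
smartctl -l scterc,70,70 /dev/sda
```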

What kind of budget do you have? You do realize that an enclosure with a redundant PSU and a SAS expander costs about 2000€, and you'll pay about 18*150€ for the drives (you want 2 spares). You're going to shell out 1500 + 2000 + 18*150 + 350 = ~6500€ at minimum.

----------

## tbart

Regarding the hardware vs. software RAID question: I guess I'll have to buy a RAID controller anyway, so I'll be able to try software RAID. But from what you said, I think I'll go with a BBU and hardware RAID.

You say an "OK" server. What key performance parameters do I have to take into consideration when going with hardware RAID? I mean, there's no parity work left for the CPU. It will only be the (network) file system part, so I guess I'll need a lot of RAM to cache efficiently (with XFS tuned for it), but what about CPU power? From what I remember at my last company, storage servers normally max out at a few percent CPU usage under load.

I understand the point about enterprise drives, and I've also already read a lot about this. Just out of interest: will the drive then reallocate its sector and get the data back from the controller, or will I see errors upon reading that sector when, e.g., 2 drives in a RAID 6 have failed and the controller has no other place to fetch the data from?

What budget do I have? Well... it's more like "how much do you want to spend of your private money?". Not quite, but we're only 2.5 people, and I have to watch how much I spend, as it has a rather direct effect on everyone.

I am more or less accepting the given costs of

+ Controller (+BBU)

+ Disks (enterprise, that is)

The server is also more or less OK like this, but I'm still asking myself whether I couldn't take the exact same hardware as in a turnkey server (i.e. a Sun machine, if you can get the parts...) and put it in my combined enclosure.

That would save me some rack space (although I don't have a rack yet, but well ;-> ), or at least some unnecessary material.

Where's the catch with this:

http://www.supermicro.com/products/chassis/3U/936/SC936A-R900.cfm

920 USD for a combined disk + server enclosure? It's got redundant cooling and a redundant PSU. Is the backplane buggy? How do you know which backplanes are good and which are not? I couldn't find details about the chipset used in most enclosures...

I know this one can't do cascading, and I think that's what makes it cheaper. But I can use extra controllers with external cables instead of cascading enclosures anyway. With only 12 Gb/s on a 4x SAS port, I can't imagine cascaded enclosures being that performant... As I have to start a new array every 16 disks anyway, the only advantage seems to be buying fewer controllers -- but on the other hand, this enclosure is a lot cheaper than the cascading-capable ones I found, or the one you proposed. Those often also support multipathing/failover, which is something I don't need...

So the remaining questions for me are:

- Can I get an OK server in parts as well, which I can stick into

- a cheaper enclosure? (i.e. is the one mentioned a good choice, or are there others that cost closer to 1000 EUR than 2000 EUR?)

Thanks in advance! I have a pretty good picture now, and I don't think much is missing before I can get started!

th

----------

## colyte

I believe http://bugzilla.kernel.org/show_bug.cgi?id=12309 will rule out any high-performance storage solution on Linux.

It's a long-standing (2 years or so) I/O bug in the kernel.

----------

## kewlx2525

I looked up a lot of info on RAID setups for database servers and general storage. It seems RAID 6 and RAID 5 both don't verify data on read. So even if a hard drive fails and RAID 6 rebuilds it from parity, if the parity holds bad data, you will rebuild with bad data.

Also, you should ALWAYS have a dedicated backup. Nothing worse than 150 TB of video on RAID 6, something going wrong, and you lose everything. Remember, your FS could get corrupted and kill the data.

Large storage typically isn't fast storage. If you plan to edit large video files, you may be better off with a separate 200-400 GB 15k SCSI scratch disk; a 10k Raptor would be cheaper and still fast.

ECC memory. Interesting side note: I was reading up on different distributed file systems, and it seems that some DFSs with error checking go off the wall every so many gigs of data because of checksum mismatches between nodes. Using ECC memory fixes this problem -- which means that every so many gigs of data stored, a small amount gets corrupted by memory errors, which feeds back into "BACK UP YOUR DATA".

----------

## richard.scott

Hi,

This is how I'm building a storage network, so feel free to copy   :Wink:   I'm using the following commodity hardware to create my storage solution:

1) 1 x Management Node (example: P4(ht) 3.6Ghz with 512MB Ram)

2) n x Data Node(s) (example: PIII 1.6Ghz 2 x 72GB SCSI, 2GB Ram)

All my nodes boot Gentoo and run from RAM like the LiveCD does... this enables me to load each Data Node's OS from the Management Node. Not needing to install an OS on the Data Nodes is a big time saver. When I need to expand the available space, I just pop in a brand-new server and switch it on; it boots instantly, and all I need to do is configure the disks   :Cool: 

All servers run 64-bit Gentoo with dual bonded network interfaces to help ensure high availability, and with the right switch this could give me 2 Gbps of I/O on each node! However, for now I'm thinking about connecting each interface to a different switch to test HA, i.e. connect eth0 to switch A and eth1 to switch B etc. across all servers, so if one switch dies I can automatically use the other one.
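For reference, here's roughly what the bonding looks like in Gentoo's /etc/conf.d/net with the old-style baselayout config (the interface names and address are examples; net-misc/ifenslave and a net.bond0 init-script symlink are assumed):

```shell
# /etc/conf.d/net -- enslave eth0/eth1 into bond0
slaves_bond0="eth0 eth1"
config_eth0=( "null" )
config_eth1=( "null" )
config_bond0=( "192.168.10.2/24" )
```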

Anyway, for now I have configured each Data Node with hardware RAID and exported the free space via ATA over Ethernet (AoE). I use AoE because it has lower latency than iSCSI: it doesn't use TCP/IP, so it's as fast on the backend network as it can be  :Wink: 

I have then software-RAIDed the Data Nodes' AoE devices together on the Management Node, formatted the result with XFS, and configured it to be available to the Windows machines via Samba. 
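To sketch the moving parts of that stack end to end (the shelf/slot numbers, device names and mount point are invented for illustration, not my exact config):

```shell
# On each Data Node: export the hardware-RAID device over AoE
# (vbladed from sys-block/vblade; shelf 0, slot 0, on eth0).
vbladed 0 0 eth0 /dev/sda3

# On the Management Node: discover the exports (aoetools) and
# software-RAID them together.
aoe-discover
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/etherd/e0.0 /dev/etherd/e1.0 /dev/etherd/e2.0

# XFS on top, mounted where Samba will share it.
mkfs.xfs /dev/md0
mount /dev/md0 /srv/video
```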

I'm running 64-bit Gentoo on all machines... As far as I know, a 32-bit machine has a maximum device size of 16 TB, while on 64-bit machines the limit is 8 exabytes! I'm managing my partitions on the Management Node via LVM2, which also needs a 64-bit OS to configure a device bigger than 16 TB.

Also, with a 64-bit OS I can format devices bigger than 16 TB. To me, XFS seems the most stable, and it also has a max size of 8 exabytes! See this for more info on filesystem limitations. From what I can tell, Samba doesn't impose any limit on the filesystem size it will work with, so it should be fine with XFS and 8 EiB of data.

The only limitation I can see is that each RAIDed device on the Management Node has a max size of 8 EiB... but if I need more space, I just configure another device and add a second 8 EiB to the Management Node using other servers  :Smile:  However, I'd guess that at that point my GbE backend network wouldn't cope?

Cheers,

Rich

----------

## honp

to richard.scott:

This sounds very interesting. You should write some kind of howto... :Smile: 

H.

----------

## fangorn

Another vote for the solution that VinzC and richard.scott already proposed. 

One PC with multiple disks might work for some time, but you were talking about _massive_ storage. Your solution 1) is good for 40, maybe 60 disks. And then? No more controllers, no more disks. 

With the single-controller-node solution, you can add network interfaces as long as you have free slots, and later add more and more storage nodes (with a slight degradation in performance because of the shared network connection). 

Each node could do RAID 1(0) on 2, 4 or even 8 disks itself, independent of the structure your final device is configured to use. Or even a separate RAID 60 on 10 disks.   :Twisted Evil: 

The options are limitless.   :Wink: 

----------

## VinzC

Hi Richard.

This sounds like the ultimate Gentoo Linux modular high-storage solution.  :Cool:  I was curious how to implement such a thing. It's also another good opportunity to test ATA over Ethernet -- I've never done that.

Thanks a lot for sharing this.

----------

## Insanity5902

I'm looking at a similar situation and setup.

I want to create a single point for all shares and storage. In my head right now, it's a lot like Richard.Scott put it.

Here is what I am thinking of right now.

Have a SAN network over gigabit, using a normal L3 switch.

Use Coraid's disk units for my storage nodes. The base models support 15 drives: 4K for the unit, and you can load it up with 1 TB drives (the WD RE3s) for about 3K. So for 7K you have a system with 15 TB of raw storage.

Then connect a box to act as the SAN gateway for general shares. This allows access via NFS, CIFS, AFP, FTP, whatever I like. Each server that needs dedicated storage will also get a separate connection on this SAN network for that purpose (e.g. an Exchange data store). That gives me static storage for specific servers; on the gateway machine I'm going to use LVM to manage the AoE devices, so I can add a second AoE drive and then fold it into the LVM.
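Growing the gateway's pool when a new shelf comes online would then be roughly (the VG/LV names, AoE address and mount point are placeholders):

```shell
# New shelf appears as /dev/etherd/e1.0 after discovery (aoetools).
aoe-discover

# Fold it into the existing volume group, grow the logical volume,
# then grow the filesystem (xfs_growfs for XFS) to match.
pvcreate /dev/etherd/e1.0
vgextend storage /dev/etherd/e1.0
lvextend -l +100%FREE /dev/storage/shares
xfs_growfs /srv/shares
```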

Coraid has a gateway device, running 64-bit Linux, that is only 3K, which isn't bad at all. But my concern is that down the road I could see myself changing the gateway to connect to a CX4 10GbE uplink port on the switch. That would let the gateway communicate at 10GbE while the storage nodes use bonding for 2GbE connections each, which should be more than enough for 15 drives.

Coraid also offers 44-disk units. One of them has 2 x 10GbE and 2 x 1GbE connections; that unit is 10K, but it gives you some nice room to upgrade. They also have a 44-disk unit with only 2 x 1GbE for 5K, and another with 6 x 1GbE for 10K.

Lots of options for speed and failover at fairly decent prices. 

I know my thoughts are all over the place; if I need to explain more about something, let me know. I was just hired at an ad agency that had no network admin before me. They just merged and are now about 50 strong, with no organized storage method. I've been researching this stuff for about 4 days now. We don't have the money to buy equivalent ready-made storage for 40K.

Another thought: if you buy your storage nodes in pairs, you could do some fancy LVM stuff to help distribute the load across the AoE drives.

----------

## tbart

The principle of spreading storage space as well as load sounds nice at first, but there are a lot of implications that seem highly impractical to me.

1) Direct purchasing costs

more servers = more money (more CPUs, more controllers, more chassis)

2) Indirect costs

think

- power costs

- cooling costs

- UPS costs

- rack space

3) Extremely lowered MTBF

more boxen = higher chance of something breaking

True, redundancy might be better, and nodes *may* fail without the whole system going offline. But then you still have to do something about it.

If, for example, you have 8 nodes instead of 1 server, your average error-fixing workload must mathematically be more than 8 times as high. (Why more? Because you introduce even more equipment and dependencies between the nodes: cables, UPSs, switches, etc. You even have to think of software -- the more you use, the bigger the chance you run into bugs.)

4) Sound level

more systems = louder; we only have a somewhat dedicated server room, and I have to keep sound levels down a bit

5) Ecological thoughts

I don't want to buy that much hardware in the first place (only to ditch it at some point), and I also want to keep power consumption low.

In my case, the added redundancy (and that's definitely a point in favor of the cited solutions) doesn't buy me anything, as it comes with additional maintenance effort on my side plus the other points mentioned. I think those solutions definitely have their applications -- in HA environments.

I also don't know whether the performance of any Ethernet-based system can outperform 4 x 3 Gbit SAS connections to the disk array...

My current solution will be the following (if only tech support would answer my mails regarding HD compatibility for RE3 drives...):

Supermicro SC836-E1 case for server and 16 disks

It is definitely compatible with 1 TB Seagate SATA disks (ouch... Seagate's firmware hasn't been the best in the past weeks...); I have a compatibility matrix from their tech support.

It costs only about 950 EUR excluding disks and has storage enclosure management etc. (failing drives are quickly identifiable via error LEDs -- that sounds difficult to me in the distributed setups).

The exact same case can be used to attach another set of 16 disks via daisy chaining (and again... and again..., and it's also capable of horizontal branching; the possibilities are endless).

See http://www.supermicro.com/manuals/chassis/3U/SC836.pdf for details (and forget about the statements regarding the backplane only supporting SAS drives).

PS: regarding the solution by richard.scott:

I haven't done any 64bit linuxen yet, but I imagine it being difficult to run 64bit linux on PIIIs..?

----------

## Insanity5902

Not going to argue with your points; those are mostly personal stipulations for your environment. The MTBF can indeed be seen as worse, but as you expand your storage this will rear its ugly head anyway.

 *tbart wrote:*   

> PS: regarding the solution by richard.scott:
> 
> I haven't done any 64bit linuxen yet, but I imagine it being difficult to run 64bit linux on PIIIs..?

 

Then it will also be harder for you to get anything larger than 16TB

----------

## jon123

AOE

AoE seems like a cool idea (I also wonder how you can run 64-bit on a PIII  :Smile:   ).  I think you will eventually max out your head node.  Is there a way to do N+1 redundancy with the head node?  Because if it goes down or gets slow, it's going to hurt.  Does LVM mind you growing a pool that uses replication?

Distributed file systems

I'm wary of using AFS because it's not a "distributed parallel fault-tolerant file system" -- it's just distributed. Which means that if a node goes down, either your entire array is down or just the data located on that node is down. Good info here

I really like GlusterFS, and it works great. It's great glue for combining the extra space on all your commodity systems. One of my favorite features is that it sits on top of your existing file system, so you can recover files just like normal if something gets really hosed. No special file systems or anything like that. There's even an ebuild at gpo.zugaina.org

SAN

I have also looked into building a large SAN using several SC836E2 chassis. In cascaded configuration, they give you expansion using the backplane expanders. I've read through the manual on how to chain them together, and it looks really promising. The manual has some good diagrams (pages C-22 to C-25) of how this works. It does leave a lot of questions, though.

Here are the configuration options:

Single SAS HBA with Simple Cascaded Configuration

Dual SAS HBA and Cascaded Configuration

Dual SAS HBA with Cascaded Configuration and Branching

HBA = host bus adapter, which I believe is your SATA controller.

With dual HBAs, can you have two different host servers, maybe in an N+1 configuration?

Do you need a motherboard, CPU, RAM etc. in the expansion systems? If not, can I put one in there anyway and just not use any hard drives? (I need CPUs as well as hard disk space, and these empty boxes take up a lot of rack.)

As you connect more and more expansion units, does each one appear as a new drive to the HBA? Do they add hot, or do you have to reboot?

I believe they all appear to the host machine as a JBOD? Or do they appear as one big disk? So you're limited by how many disks your controller can support?

If you were using hardware or software RAID on the host server, would you have to grow the array to include the expansion? Scary  :Sad: 

Backup?

All of these solutions are scary. Not because they are too complex, but because of the value of the data. How do you backup 100TB of data?

----------

## VinzC

 *jon123 wrote:*   

> How do you backup 100TB of data?

 

I'm afraid there's no secret for that...

----------

## magic919

 *VinzC wrote:*   

>  *jon123 wrote:*   How do you backup 100TB of data? 
> 
> I'm afraid there's no secret for that...

 

Sony PetaSite would do it.

----------

## fangorn

Depends on your definition of backup. 

If you need versioning for every day, your pretty much bound to complete solutions, and the tape robots are not really cheap. 

If you need just a recent copy of the data, build two identical sets and sync them over night and/or lunch. 

For the complexity of the setup, the multi machine setup has four levels of failure: 

1. harddisk failure in one of the nodes. can happen, but is addressable with internal RAID configuration. Just a little odd to handle for a lot of nodes.

2. node failure in hardware or software. I am using aged hardware and rock solid system configuration for my production systems. I haven't had an unintended shutdown/hang in quite some years now. But this needs a bit of planning before the buy. 

3. Network failure. A real big problem, as it would most likely corrupt the storage permanently. But as you will set up a separate network for this anyway, with a battery-backed power supply, I don't see a real problem here. And if your controllers in a single-machine solution fail, you have the same problem. 

4. Master backend failure. The same applies here as what I said for the nodes.

----------

## richard.scott

 *tbart wrote:*   

> PS: regarding the solution by richard.scott:
> 
> I haven't done any 64bit linuxen yet, but I imagine it being difficult to run 64bit linux on PIIIs..?

 

oops, yes, some data nodes are 32bit and some are 64bit   :Embarassed: 

----------

## Insanity5902

 *fangorn wrote:*   

> 3. Network failure. A real big problem, as it would most likely corrupt the storage permanently. But as you will set up a separate network for this anyway, with a battery-backed power supply, I don't see a real problem here. And if your controllers in a single-machine solution fail, you have the same problem.

 

Not sure of the official term, but there are ways to help prevent this; I believe it is called multipathing, meaning that your server has connections to the same storage via two different paths.

The only example I can think of is something like iSCSI or AoE, where your storage nodes have at least two network adapters, each one connected to a different switch. The server mounting the storage has the same thing. Then if a cord gets unplugged, or a switch goes down, a path is still live. I've heard of problems with the Linux implementation of this. Solaris's is supposed to be better, but it is written by a 3rd-party vendor. Not sure how accurate those last two statements are.
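On Linux this is usually done with dm-multipath on top of open-iscsi: you log in to the same target through both network paths (`iscsiadm -m node -T <iqn> -p <portal> --login` once per portal) and let the device mapper merge the resulting block devices into one. A minimal policy sketch for /etc/multipath.conf, with illustrative values only:

```
# /etc/multipath.conf -- treat all paths to a LUN as one device
defaults {
    path_grouping_policy  multibus    # spread I/O over every live path
    failback              immediate   # use a path again as soon as it recovers
}
```

`multipath -ll` should then list a single map with both paths under it, and losing one switch only drops one path.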

----------

## richard.scott

 *Insanity5902 wrote:*   

>  I've heard of problems with the Linux implementation of this. Solaris's is supposed to be better, but it is written by a 3rd-party vendor. Not sure how accurate those last two statements are.

 

There are both the ietd and open-iscsi daemons for iSCSI in Portage...

AFAIK ietd doesn't support multipath.  I don't know if open-iscsi does either as I've not had time to find and read the documentation   :Embarassed: 

For HA on a data node I have found that mirroring two data-node servers with DRBD and using keepalived to monitor node status works better with AoE than with iSCSI. From what I can tell, iSCSI keeps TCP comms open all the time (from a Windows box, anyway), and when the iSCSI target (i.e. the data server) goes down, Windows disconnects the iSCSI device as the connection has been terminated. The bad news is that when the iSCSI server comes back up, Windows doesn't detect this and doesn't re-connect it!
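For reference, the DRBD side of such a pair comes down to one resource file shared by both nodes; a minimal sketch (host names, addresses and devices here are examples, not taken from this setup):

```
# /etc/drbd.d/storage.res -- mirror one backing device between two data nodes
resource storage {
  protocol C;            # synchronous: a write completes only once it is on both nodes
  device    /dev/drbd0;  # the device you then export via AoE or iSCSI
  disk      /dev/sdb1;   # local backing partition
  meta-disk internal;
  on node-a { address 10.0.0.1:7789; }
  on node-b { address 10.0.0.2:7789; }
}
```

keepalived then only has to move a virtual IP to whichever node is currently DRBD Primary, so clients keep talking to the same address across a failover.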

That is why I ended up using AoE for movement of data on my back-end storage and iSCSI on my management server.

Rich.

----------

## VinzC

 *richard.scott wrote:*   

> The bad news is that when the iSCSI server comes back up, Windows doesn't detect this and doesn't re-connect it!

 

How about an abstraction layer like Samba for the Windows machines? Does that problem also occur under Linux?

----------

## richard.scott

 *VinzC wrote:*   

> How about an abstraction layer like samba for windows machines? Does that problem also occur under Linux?

 

From my simple tests, the Open-iSCSI initiator automatically re-connects to the iSCSI device if the network has issues.
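That reconnect behaviour is tunable in open-iscsi's /etc/iscsi/iscsid.conf; the relevant knobs look roughly like this (the values shown are only illustrative):

```
node.startup = automatic                       # log back in to known targets on startup
node.session.timeo.replacement_timeout = 120   # seconds to queue I/O while the session is down
node.conn[0].timeo.noop_out_interval = 5       # ping the target every 5 seconds
node.conn[0].timeo.noop_out_timeout = 5        # declare the connection dead after 5 seconds
```

A longer replacement_timeout rides out short outages without erroring I/O back to the filesystem; a shorter one fails over faster when multipath is in play.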

----------

## tbart

 *Insanity5902 wrote:*   

> Then it will also be harder for you to get anything larger than 16TB

 

I for one will definitely go for 64-bit; I just meant PIIIs won't do 64-bit..

 *jon123 wrote:*   

> With Dual HBA can you have two different Host servers maybe an N+1 configuration?

 

Two different ones: yes, definitely. This is for HA if a storage head/HBA/cable/expander fails. I don't really understand what you mean by N+1, but I guess SAS only allows two initiators.

 *jon123 wrote:*   

> Do you need a motherboard, CPU, Ram etc... in the expanded systems? If not can I put one in there anyway and not use any hard drives? ( I need CPUs as well as Hard disk space and these empty boxes take up a lot of rack )

 

No, you don't; that's the cool part. You can put any server in it if you like, too. You just have to connect the expander backplane to the next JBOD or to the storage head.

 *jon123 wrote:*   

> As you connect more and more expansions together does each one appear as a new drive to the HBA? Do they add HOT? Or do you have to reboot? 
> 
> I believe they all appear to the host machine as JBOD? Or do they appear as one big disk?

 

They appear as single drives. SAS expanders are more or less like Ethernet switches, if you like.

 *jon123 wrote:*   

> So your limited on how many disks your Controller can support?

 

In a way... the more limiting factor is the chassis. The manual says you can have a maximum of 122 drives in a single-path configuration (fewer if multipathing; half the number, if I remember correctly) when cascading chassis. (That's pretty cool, I think. After that you can put another SAS HBA into your storage head and add another 122 drives. I don't know where the performance limit is then, but I am not that much in need of performance and I won't reach 122 drives any time soon.)

 *jon123 wrote:*   

> If you were using hardware or software raid on the HOST server would you have to grow that to include the expansion?

 

Well yes, if you like, as they're only JBODs. I won't go beyond a chassis per volume, that is, 16 drives. Every time I add a new chassis I create a new volume. As they will be RAID 6, 16 drives is already quite a high drive count.
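With Linux software RAID, one 16-drive RAID 6 per chassis would come out roughly like this (device names are placeholders; two of the 16 drives go to parity, leaving 14 drives' worth of capacity):

```
# first chassis -> first volume
mdadm --create /dev/md0 --level=6 --raid-devices=16 /dev/sd[b-q]

# record the array so it assembles cleanly on boot
mdadm --detail --scan >> /etc/mdadm.conf
```

Each additional chassis would then get its own `mdadm --create /dev/md1 ...` as a fresh volume, rather than growing md0 across enclosures.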

 *jon123 wrote:*   

> How do you backup 100TB of data?

 

Well, as I am very restricted financially, the backups will be on external hard disks. I know this is far from perfect, but there's no other way for us. Either my data will be in quota'd folders and rsynced to single drives, or I do a "portable" second RAID for every chassis... All of this still makes my head ache...

----------

## GNUtoo

hello,

I'd also like to build a storage system, but for backups and for migrating data from a ton of smaller HDDs.

What I thought of is this:

RAID 6 of 4×1TB HDDs -> LVM -> LUKS

so I could increase the capacity later by:

* adding more 1TB HDDs

* adding another RAID array of bigger HDDs, which I would add to the LVM setup, and then growing the LUKS volume
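A sketch of that stack from bottom to top, with placeholder device names (note that this ordering puts LUKS on top of the logical volume):

```
mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[b-e]   # RAID 6 over 4x1TB
pvcreate /dev/md0                                                 # the array becomes a PV
vgcreate storage /dev/md0
lvcreate -l 100%FREE -n data storage                              # one big LV
cryptsetup luksFormat /dev/storage/data                           # LUKS on the LV
cryptsetup luksOpen /dev/storage/data data_crypt
mkfs.ext4 /dev/mapper/data_crypt
```

Growing later would then be vgextend (with a new array) or pvresize, followed by lvextend, `cryptsetup resize data_crypt`, and finally a filesystem resize.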

I wonder if RAID 10 or RAID 5 wouldn't be a better idea?

I have a 32-bit Sempron 2500+ with 1GB of RAM and some 3 free PCI slots (the others are taken by the gigabit NIC and the WiFi card):

To start with, I'll buy a 4-port SATA PCI card like this one:

PROMISE SATA 300 TX4

I hope it will work on a 32-bit/33MHz PCI system.

Then I'll buy 4 drives... but I heard in this thread that there are enterprise drives that have less chance of failing (and so rebuilding the array has a better chance of succeeding).

Gentoo will be on a simple IDE drive

I would also need a plan for the backups:

I need something that:

*won't back up everything (only chosen parts... I need to exclude some directories that can easily be recreated)

*is incremental (so if I make a change to a file, it keeps both versions of the file, like git)

There was a program written in Perl that does it, but I don't remember the name... it was based on rsync.

----------

## VinzC

As for backup, I recommend Bacula.

----------

