# Clustering on Gentoo - What's the best option

## eohrnberger

Hi!

I have to admit that I'm a bit new to the clustering aspects of Linux, but I've done some reading on the Internet and it's left me a bit confused as to what's the 'right' decision WRT clustering and distributed file systems for Gentoo. 

Me?  I've been running Gentoo since '02, and I have to admit that I don't think I'll go back to using a binary distro again.  Why live with someone else's compromises when you can make your own?

The Home Network:

I have a main storage system on the network that serves as a NAS for Linux as well as the TiVo using Galleon, and the Windows clients.  It has about 7 TB of online storage pools, which is using ReiserFS.  I've just about managed to squeeze out all the EIDE HDs so it's mostly SATA now.  This is where all the files are kept.  It serves video, music, software repository and file backup drop for the data files for the Windows desktop clients, etc.  This is a 64 bit Intel dual core 3 GHz machine with 4 GB RAM.  This system runs well, and since I switched to Gentoo, hasn't lost a single file.  Yeah!  It's running a 3.10.17 based kernel, and it's using EVMS as an LVM2 management interface (kinda tells you how long it's been running and evolving).

I also have 2 compute servers (likely to grow in number). Both are 64-bit Intel dual cores at 3.0 GHz, inherited cast-offs, also on the network, also running Gentoo, and presently NFS mounting from the storage / NAS server.  They are running a 3.10.25 based kernel.

All this is connected with a Gbit switch and each server is connected with a Gb NIC.

Lately, I've been playing around with VirtualBox, with a variety of guest OSs for a variety of purposes and needs.

Right now, I have a central registry of VMs on the storage system (the VirtualBox.xml file), with virtual HDs mostly on the storage machine, but a few on the compute servers (they have smaller LVM2 storage pools attached as well).  It'd be nice to be able to run a VM on the least loaded compute server, and access the virtual HD on whichever server, storage or compute, it resides.  Problem is, when running a VM across an NFS mount, it really crawls.

Reading on the Internet, the reports are that NFS is really slow in high demand / high bandwidth situations, so the thought is that to make the greatest use of the Gbit network, the network file access protocol needs to change.  Seems like this means iSCSI targets for the server and iSCSI initiators for the clients.  That part emerged nicely.

Since it appears that iSCSI is going to be the network transport protocol, and since iSCSI doesn't have a clue about filesystems (it only knows about data blocks that need to be read and written, not locks or distributed cache coherency), some sort of cluster-aware file system needs to sit on top of it to keep things in sync and not corrupt any data.  Hopefully one that is fast, but also meager in resource consumption (I don't have the hardware budget to stand up a dedicated server as a file access traffic cop, for example).

The eventual goal, which may not be possible, is to start a VM on one compute node, and have the workload migrate to other compute nodes as their demand drops off, sort of self-balancing.  These abilities seem to be specialized for extreme parallel computing, and not for heavyweight VirtualBox VMs.

Since it's a home network, there really isn't a lot of transactional demand (there's no user community of clients banging away at it), it's more like I'm the only real user, and only when I'm running VMs.

I tried emerging GFS, but that's locked to way back kernels with its need for smp_lock.h (the great big kernel lock - or so I've read).  I also tried emerging gfs2-utils, thinking that it would drag along the needed kernel changes, but that also didn't emerge due to some compiler error.

Yeah, I probably need to update the kernels, and that shouldn't be a big problem, but the first thing to do is to test the cluster setup in a one server, one client pair of VMs, just to get the needed practice in before changing the physical systems.  Since I have 64bit Gentoo VMs hosting on Windows machines as distcc servers, I started with one of those as a base, and am updating it as a starting substrate for the test VMs.

So we've got the direction and the end goal.  We have the usage pattern, and we have the first steps to testing the setup.  What's left is which cluster software and possibly which file system to use.  If at all possible, I'd like to keep the filesystems that are in place now, but hey!  I've been meaning to get an 8 port SATA controller for the PCIe slot on the storage machine, and a set of new 2 TB hard disks to put under LVM2, so if needed I can start spreading the purchases out a bit and converting (although I have to admit, I'm a bit concerned about concentrating the IO demand across so few spindles and across a single bus).

So I ask the more knowledgeable people of the forum, what's the right direction / choices to make here for continuing the evolution of my Gentoo network?  What's easy to administer, reliable (as most Linux software is) to meet this need?

----------

## szatox

 *Quote:*   

>  I've done some reading on the Internet and it's left me a bit confused as to what's the 'right' decision

Good! Clustering is tricky, there are a lot of ways to do that and each of those serves a different purpose. What is YOUR goal?

It seems you're trying to build your own vSphere using VB instead, do I get it right? A few physical hosts running VMs and allowing live migration and perhaps some other features? What is the stuff you care about?

 *Quote:*   

>  start a VM on one compute node, and have the workload migrate to other compute nodes as their demand drops off, sort of self-balancing

No idea about VB, but VMware and qemu do support live migration. If you want load balancing, you will most likely need some external app for tracking VMs, VMs' resource usage and available resources, and finally triggering the migration based on some criteria.
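To make the "external app" idea concrete, here is a minimal sketch of the decision step only, assuming each host already reports its 1-minute load average somehow. The host names, the `HostLoad` type, and the `margin` threshold are all hypothetical; actually moving a VM would be up to whatever hypervisor tooling is in use.

```python
from dataclasses import dataclass


@dataclass
class HostLoad:
    name: str          # hypothetical host name
    load_1min: float   # 1-minute load average reported by the host
    cores: int         # number of CPU cores on the host


def per_core(host):
    """Normalize load by core count so unequal hosts compare fairly."""
    return host.load_1min / host.cores


def least_loaded(hosts):
    """Return the host with the lowest load per core."""
    return min(hosts, key=per_core)


def should_migrate(current, candidate, margin=0.5):
    """Suggest a move only if the candidate is clearly less busy.

    The margin (an arbitrary illustrative value) keeps a VM from
    bouncing between hosts whose load is nearly equal.
    """
    return per_core(current) - per_core(candidate) > margin


hosts = [
    HostLoad("nas", 12.0, 2),
    HostLoad("compute1", 1.0, 2),
    HostLoad("compute2", 0.4, 2),
]
target = least_loaded(hosts)
print(target.name, should_migrate(hosts[0], target))
```

Normalizing by core count is one possible criterion among many; memory pressure or disk IO could be weighed in the same way.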

 *Quote:*   

>  Problem is, when running a VM across an NFS mount, it really crawls. 

Does it crawl due to NFS, low network bandwidth, or maybe your VMs lack hardware boost? E.g. physical hosts have virtualization disabled in BIOS. Silly as it sounds, it has happened to me. Replacing NFS with iSCSI might only help with the first issue. And it's not necessarily the only solution, sometimes you just have to change mount options (or bump the number of server processes). Doesn't VB limit IO performance? Maybe you need some particular drivers? Or use client-side caching. How much does it hurt you when a VM crashes for any reason and some data has not been stored yet?

 *Quote:*   

>  tried emerging GFS

 Hold on, are you going to let many VMs access the same storage at the same time? What is the purpose? What benefits do you expect?

----------

## eohrnberger

 *szatox wrote:*   

> *Quote:*
>
> > I've done some reading on the Internet and it's left me a bit confused as to what's the 'right' decision
>
> Good! Clustering is tricky, there are a lot of ways to do that and each of those serves a different purpose. What is YOUR goal?

 

Goals:

Ability to leverage additional PCs for running VMs (live migration may not be a requirement but a nice to have)

Ability to leverage distributed hard disk resources (eventually I'm going to run out of ways to connect hard disks to a single machine)

Increase network disk IO performance (for running VMs - pending network benchmark data to be gathered first)

 *szatox wrote:*   

> It seems you're trying to build your own vSphere using VB instead, do I get it right? A few physical hosts running VMs and allowing live migration and perhaps some other features? What is the stuff you care about?
>
> *Quote:*
>
> > start a VM on one compute node, and have the workload migrate to other compute nodes as their demand drops off, sort of self-balancing
>
> No idea about VB, but VMware and qemu do support live migration. If you want load balancing, you will most likely need some external app for tracking VMs, VMs' resource usage and available resources, and finally triggering the migration based on some criteria.
>
> *Quote:*
>
> > Problem is, when running a VM across an NFS mount, it really crawls.
>
> Does it crawl due to NFS, low network bandwidth, or maybe your VMs lack hardware boost? E.g. physical hosts have virtualization disabled in BIOS. Silly as it sounds, it has happened to me. Replacing NFS with iSCSI might only help with the first issue. And it's not necessarily the only solution, sometimes you just have to change mount options (or bump the number of server processes). Doesn't VB limit IO performance? Maybe you need some particular drivers? Or use client-side caching. How much does it hurt you when a VM crashes for any reason and some data has not been stored yet?
> ...

 

Good point.  I need to check the BIOSs and see if the virtualization option is set.  I usually do that, but then, I've inherited these chassis.
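As a first pass before rebooting into the BIOS, the CPU flags in `/proc/cpuinfo` can be inspected from the running system. The sketch below only reports whether the CPU advertises `vmx` (Intel) or `svm` (AMD). It is an assumption-laden check: a firmware setting may still block the feature even when the flag is listed, so a positive result is not conclusive. The sample text is made up for illustration.

```python
def virtualization_flags(cpuinfo_text):
    """Return the hardware virtualization flags found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # vmx = Intel VT-x, svm = AMD-V
            found.update(f for f in value.split() if f in ("vmx", "svm"))
    return found


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as fh:
        return fh.read()


# Illustrative sample; on a real host pass read_cpuinfo() instead.
sample = "processor\t: 0\nflags\t\t: fpu vme sse2 vmx lm\n"
print(virtualization_flags(sample))
```

An empty result is the more telling outcome: if neither flag shows up, the VMs are very likely running without hardware assistance.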

Yeah, I've read some articles about NFS mount parameters and tuning and such.  I'll start with network and NFS benchmark baselines as a first measurement, then start playing with the parameters.
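A baseline can be as simple as timing a sequential write into a directory, once on a local disk and once on the NFS mount. The following is a rough sketch, not a proper benchmark: the function name, sizes, and block size are illustrative choices, and writing zeros may flatter filesystems that compress data.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, block_kb=32):
    """Write size_mb of zeros into directory and return MB/s.

    fsync is called before the clock stops so that data still
    sitting in the page cache does not inflate the figure.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as fh:
            for _ in range(blocks):
                fh.write(block)
            fh.flush()
            os.fsync(fh.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the scratch file
    return size_mb / elapsed


# Small run against the system temp directory; point `directory`
# at the NFS mount to compare.
print(round(write_throughput_mb_s(tempfile.gettempdir(), size_mb=8), 1))
```

Running it before and after each parameter change gives comparable numbers, as long as the size and block size stay the same between runs.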

Eventually, I'd like to 'grow up' into a process transparent migration cluster, where plugging in another compute node increases the aggregate throughput of the cluster.  I'd like to still run all the software, such as video reformatting, and have these threads migrate around to the least loaded hardware.  Computing Nirvana, pretty much.   :Wink: 

----------

## szatox

 *Quote:*   

> Ability to leverage distributed hard disk resources 

I wonder if you really want to have a filesystem split across multiple machines, with all the disadvantages it has, or perhaps just a bigger disk shelf. I'm not saying it's bad. It's more like "you will know when you need it".

I've just checked what there is in store, and the first few links showed me 450$ internal SAS* HBA and 750$ external SAS HBA providing 16 6Gbps connectors each and 1000$ tower with 16 hotswap bays, power supply and expansion port for another shelf.

You mentioned you have 7TB of storage; this can be achieved with 2 drives. How long would it take you to actually run out of space for plugging in new drives?

* SAS controllers do speak SATA too, and do that better than SATA controllers

 *Quote:*   

> Eventually, I'd like to 'grow up' into a process transparent migration cluster, (...)and have these threads migrate around to the least loaded hardware

I think I have seen some kernel patches for this. The goal was to run parts of a single, multithreaded application on several nodes, each running its own OS. Let's say it had to form an Ethernet-connected NUMA machine. Too bad, latency was so punishing the performance completely sucked, but since you mentioned that "computing nirvana" you might be interested in this project. As a funny thing, if nothing more. And, well, who knows what would happen if those nodes were talking PCIe instead of Ethernet?

Edit: found some links on this topic, in case anyone was interested  :Smile: 

http://www.kerrighed.org -> multinode version on SMP, for kernel 2.6

http://linuxpmi.org/  -> the New Hope of OpenMOSIX, kernel 2.4 and 2.6. Hard to say if it's usable

http://www.researchgate.net/publication/43807913_Designing_a_Thread_Migration_Facility_for_Linux_Operating_System -> some paper on sharing resources between machines for parallel execution

Last edited by szatox on Sat Feb 21, 2015 7:01 pm; edited 1 time in total

----------

## eohrnberger

 *szatox wrote:*   

> *Quote:*
>
> > Ability to leverage distributed hard disk resources
>
> I wonder if you really want to have a filesystem split across multiple machines, with all the disadvantages it has, or perhaps just a bigger disk shelf. I'm not saying it's bad. It's more like "you will know when you need it".
> 
> I've just checked what there is in store, and the first few links showed me 450$ internal SAS* HBA and 750$ external SAS HBA providing 16 6Gbps connectors each and 1000$ tower with 16 hotswap bays, power supply and expansion port for another shelf.
> 
> You mentioned you have 7TB of storage; this can be achieved with 2 drives. How long would it take you to actually run out of space for plugging in new drives?
> ...

 

OK, I knew about the SAS controllers, but looked around, and it seems that they are out of my price range, except for evil bay.

At one time, I standardized on EIDE 500 GB hard disks for the storage.  I collected a number of them, and when they became more expensive than the SATA version, I started migrating to the SATA ones.  You can tell, if you look at the system's drive list below.

/dev/sda    WDCWD5000AAKB-00H8A0 

/dev/sdb    WDCWD5000AAJB-00UHA0 

/dev/sdc    ST500DM002-1BD142    

/dev/sdd    WDCWD5000AAKX-00ERMA0

/dev/sde    WDCWD5000AAKX-00ERMA0

/dev/sdf    ST500DM002-1BD142    

/dev/sdg    WDCWD5000AAKS-00UU3A0

/dev/sdh    WDCWD5002AALX-32Z3A0 

/dev/sdi    WDCWD5002AALX-32Z3A0 

/dev/sdj    WDCWD2500JD-98HBC0       <<== System drive

/dev/sdk    WDCWD5000AACS-00G8B1 

/dev/sdl    WDCWD5000AAKS-00A7B2 

/dev/sdm    WDCWD5000AAKS-00A7B2 

/dev/sdn    WDCWD6400AADS-00M2B0 

/dev/sdo    WDCWD5000AAKS-22A7B0 

(Yes, I have a second power supply providing power to the hard disks, and an external mounting for them)

Well, here's the thing.  This entire network, including the additional compute nodes I've recently integrated, is pretty much a low buck effort.  It's a hobby powering the house network, and a learning experience, wherever my interests take me.  Just a platform to get my nerd on, I guess.

 *szatox wrote:*   

> *Quote:*
>
> > Eventually, I'd like to 'grow up' into a process transparent migration cluster, (...)and have these threads migrate around to the least loaded hardware
>
> I think I have seen some kernel patches for this. The goal was to run parts of a single, multithreaded application on several nodes, each running its own OS. Let's say it had to form an Ethernet-connected NUMA machine. Too bad, latency was so punishing the performance completely sucked, but since you mentioned that "computing nirvana" you might be interested in this project. As a funny thing, if nothing more. And, well, who knows what would happen if those nodes were talking PCIe instead of Ethernet?

 

Yeah, I can see where NUMA over Ethernet isn't a good choice, even if it's a dedicated Gb/sec subnet.  NUMA wants / needs to run at memory speeds, and a general purpose network isn't going to support that.

Suffice it to say that I'm starting to play with NFS tuning to be able to run a VM across an NFS mount, and yeah, if that can't get to acceptable performance, I'll look for a higher-throughput solution, which I figure is going to be iSCSI and some sort of distributed file system.

----------

## eohrnberger

Well now.  Interesting development.

I can in fact run a VM on a compute server and have the virtual hard disk file on a remote server across an NFS mount.  The performance impact is a little bit noticeable, but well within the acceptable range, at least to me.

The bit of magic that I ferreted out of the Internet was to apply the following options to the NFS mount:

rsize=32768,wsize=32768,hard,bg,intr,retry=1000 0 0

This increases the read / write block size to 32 KB, and allows a long, interruptible timeout should the NFS server be offline for a period of time.
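For reference, the options above slot into a complete `/etc/fstab` entry as sketched below. The server name `nas`, the export path, and the mount point are placeholders, not taken from the actual setup. The trailing `0 0` are the fstab dump and fsck-pass fields rather than NFS options.

```python
def fstab_line(server, export, mountpoint, options, fstype="nfs"):
    """Build one /etc/fstab entry for an NFS mount."""
    opts = ",".join(
        f"{key}={value}" if value is not None else key
        for key, value in options.items()
    )
    # The trailing "0 0" are the dump and fsck-pass fields.
    return f"{server}:{export} {mountpoint} {fstype} {opts} 0 0"


options = {
    "rsize": 32768,   # read block size in bytes (32 KB)
    "wsize": 32768,   # write block size in bytes (32 KB)
    "hard": None,     # keep retrying requests instead of failing
    "bg": None,       # retry a failed mount in the background
    "intr": None,     # allow signals to interrupt waiting operations
    "retry": 1000,    # minutes to keep retrying the mount
}
# "nas", "/export/vms" and "/mnt/vms" are hypothetical placeholders.
line = fstab_line("nas", "/export/vms", "/mnt/vms", options)
print(line)
```

Note that the nfs(5) man page describes `intr` as deprecated on kernels after 2.6.25, so on the 3.10 kernels mentioned earlier it is likely accepted but has no effect.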

I think I'm still going to explore Ceph, as a means of distributing disk space across multiple machines and concatenating it all together into a single larger storage object as presented to the clients.

As we concluded, a NUMA type architecture using Gb Ethernet as a communications medium for process migration just isn't going to work, so giving up on that for now.

Thanks for being a sounding board on this.  Much appreciated.

----------

## szatox

ok, 3 more hints about NFS:

In /etc/exports add the "async" option. This is not completely safe, as any write to the NFS share will report success BEFORE anything is actually written. But it will boost performance and is a good pick for non-critical stuff.

Try using jumbo frames on your network: bump MTU from default 1500 to 9000. Some network devices don't support it, but if they do it reduces protocol overhead a bit.

Check if you have enough nfs daemons. I don't remember how it was done, but if you don't have enough servers to handle the load it will generate some errors you can track. General suggestions were "at least 1 per core, but 4 or 8 per core might be a better start". I doubt I would need this many, but you want to put some strain on those so your mileage might vary. 

`# Number of servers to be started up by default`

`OPTS_RPC_NFSD="8"`
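To put a number on "reduces protocol overhead a bit" for jumbo frames, here is a back-of-the-envelope calculation. It assumes NFS over TCP on IPv4 with no header options, and counts standard Ethernet framing (preamble, header, FCS, inter-frame gap); RPC and NFS headers are ignored, so real figures will be somewhat lower.

```python
ETH_WIRE_OVERHEAD = 38   # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40      # IPv4 20 + TCP 20, no options (assumed)
GIGABIT_MB_S = 1000 / 8  # 125 MB/s raw line rate


def payload_efficiency(mtu):
    """Fraction of wire time that carries TCP payload for a full frame."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_WIRE_OVERHEAD)


for mtu in (1500, 9000):
    eff = payload_efficiency(mtu)
    print(mtu, f"{eff:.1%}", f"{GIGABIT_MB_S * eff:.1f} MB/s")
```

Under these assumptions the gain is a few percent of payload bandwidth, which fits the "a bit" in the hint. Any further benefit would come from fewer packets to process per megabyte, which this arithmetic does not capture.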

----------

## eohrnberger

 *szatox wrote:*   

> ok, 3 more hints about NFS:
> 
> In /etc/exports add the "async" option. This is not completely safe, as any write to the NFS share will report success BEFORE anything is actually written. But it will boost performance and is a good pick for non-critical stuff.
> 
> Try using jumbo frames on your network: bump MTU from default 1500 to 9000. Some network devices don't support it, but if they do it reduces protocol overhead a bit.
> ...

 

Good hints, those.  Thanks.

I'm starting 8 NFS daemons on each node, so I think that's probably OK.  May have to try the jumbo frames, and see if they can get through my Gb switch.

Many thanks!

----------

## eohrnberger

Well, it's been a busy couple of weeks here.  Thought that I'd better post up and bring everyone up to speed.

I now have ZFS implemented on all 3 of the dual core x64 PCs, which includes the storage system (NAS).  All EIDE drives have been squeezed out and replaced with SATA 6.0 Gbps drives (well, there were only 2 left).

A new PCIe 4 port SATA controller was added, and the 6.0 drives plugged in (too bad both pools cover a mix of 6.0 and 3.0 drives).  Now that the slower ones are identified, I'm going to start squeezing those out and replacing them with 6.0 drives as well.

ssh is now password-less from any machine to any machine, as are all the NFS / Samba shares.  There are even 2 more x64 VM guests on Windows hosts that have been integrated to the same level.  All have been added to the local DNS server, so going from anywhere to anywhere works seamlessly, as it should.

The 2 file systems on the NAS system have both been converted to ZFS, and the vast majority of the files have been COW re-distributed across the drives in each of the storage pools.  The 2 storage pools (primary and secondary) are made of discrete sets of hard disks, 8 each, so if a drive up and dies, it'll only impact a single storage pool.

zpool iostat says that the highest throughput was around 280 MB/sec, which I guess is pretty good striping across the drives considering the mix of high and low performers in the group.  Interesting to note that in cases of heavy IO, from local pool to local pool, the system load climbs pretty quickly to over 12.  Rather worrisome.  It would be better if, under these loads, processes were pushed off of the storage machine and onto the other machines to relieve that overburden, and then migrated back after the IO load subsides.

As indicated earlier, running VMs across the NFS mount is now possible, at a slight, but acceptable, performance penalty.

Now, it's back to thinking about which SSI clustering to implement.  Are there any in portage?  (I looked but haven't found any.)  It's been a joy finding everything else I've experimented with (zfs, btrfs, ceph, etc.) in portage, just waiting to be emerged.  Kudos to the portage guys!

----------

## szatox

 *Quote:*   

> zpool iostat says that highest throughput was around 280 MB/sec, which I guess is pretty good striping across the drives considering the mix of high and low performers in the group

I'm not impressed. My cheap platters handle 170-200MBps read/write during sequential access. With 8 drives in a pool I'd expect raid5 or 6, which gives 6-7 drives with striped data. There should not be any penalty during read, and the write penalty can be mitigated with caching. My 170MBps times your 6 drives is..... 1020MBps. You have some needle's eye on your way.
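The arithmetic behind that estimate can be written out as below. It is an idealized upper bound that assumes perfect striping and ignores controller, bus, and CPU limits; the 170 MBps per-drive figure and the 280 MB/sec measurement are the ones quoted in this thread.

```python
def ideal_stripe_mb_s(drives, parity_drives, per_drive_mb_s):
    """Idealized sequential throughput: only data drives contribute."""
    data_drives = drives - parity_drives
    return data_drives * per_drive_mb_s


raid6 = ideal_stripe_mb_s(8, 2, 170)   # 6 data drives
raid5 = ideal_stripe_mb_s(8, 1, 170)   # 7 data drives
measured = 280                         # MB/s reported by zpool iostat

print(raid6, raid5)
print(f"measured is {measured / raid6:.0%} of the raid6 estimate")
```

A gap this large between the estimate and the measurement is what points to a bottleneck somewhere other than the drives themselves.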

 *Quote:*   

>  running VMs across the NFS mount is now possible, at a slight, but acceptable, performance penalty.

If Ethernet is not saturated, bumping the NFSD count from 8 to e.g. 32 might help.

 *Quote:*   

> Now, it's back to thinking about which SSI clustering to implement. Are there any in portage?

There is for example OpenMPI. What application do you want to run? (What does it support?).

Bad news is, when I tried openmpi on VMs the best results I got were on a single machine. Adding a second VM (increasing total CPU count from 1 to 2) would decrease performance. Adding a 3rd VM with yet another CPU would make it even worse. I haven't launched it on 4 CPUs as I wanted to keep one core clear for the host system. Performance wasn't important in that fun project, but it was disappointing anyway. Sure, I could have screwed up networking. Whatever, I had my own SSI cluster of gentoo VMs  :Very Happy:  everything you need is waiting for you in portage.

 *Quote:*   

> Interesting to note that in cases of heavy IO, from local pool to local pool, the system load climbs pretty quickly to over 12. Rather worrisome. Better if under these loads, processes were pushed off of the storage machine, and onto the other machines

openmpi works like this: you use ssh to launch your program on all of the cluster's nodes. One of them is the leader and the rest can be considered "children threads". The leader distributes data and collects results. You can do some load balancing, but I suppose it must be implemented in your application.
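The leader/worker pattern described here can be illustrated without MPI at all. The sketch below is not OpenMPI code: it uses local threads as stand-ins for cluster nodes, and the function names `scatter`, `worker`, and `leader` are made up for the example. It only shows the shape of "distribute data, compute partial results, collect".

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, parts):
    """Split data into `parts` nearly equal chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def worker(chunk):
    """Work done on one 'node': here, a sum of squares."""
    return sum(x * x for x in chunk)


def leader(data, workers=3):
    """Distribute chunks to workers, then gather and combine results."""
    chunks = scatter(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)


total = leader(list(range(10)))
print(total)
```

The point about load balancing shows up in `scatter`: the split is fixed up front, so a slow node holds everyone back unless the application itself hands out work in smaller pieces.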

Alternatively you could use cgroups to limit resources available to your application.

And if you feel like doing something totally crazy, you could try abusing the checkpoint-restart feature. Linux containers are said to support live migrations between physical hosts. Perhaps you could use this (kernel feature) to distribute your application. If you made the bunch of processes talk UDP to each other you would have a chance to get something really funny to work. Unfortunately I'm not familiar with it, just know it is implemented.

----------

## eohrnberger

 *szatox wrote:*   

> *Quote:*
>
> > zpool iostat says that highest throughput was around 280 MB/sec, which I guess is pretty good striping across the drives considering the mix of high and low performers in the group
>
> I'm not impressed. My cheap platters handle 170-200MBps read/write during sequential access. With 8 drives in a pool I'd expect raid5 or 6, which gives 6-7 drives with striped data. There should not be any penalty during read, and the write penalty can be mitigated with caching. My 170MBps times your 6 drives is..... 1020MBps. You have some needle's eye on your way.

 

OK.  I guess.  Still working on it.

 *szatox wrote:*   

> *Quote:*
>
> > running VMs across the NFS mount is now possible, at a slight, but acceptable, performance penalty.
>
> If Ethernet is not saturated, bumping the NFSD count from 8 to e.g. 32 might help.

 

Bumped up.  Let's see how that helps.  Thanks for the tip, yeah, I know, I think you've mentioned this before.

 *szatox wrote:*   

> *Quote:*
>
> > Now, it's back to thinking about which SSI clustering to implement. Are there any in portage?
>
> There is for example OpenMPI. What application do you want to run? (What does it support?).
> 
> Bad news is, when I tried openmpi on VMs the best results I got were on a single machine. Adding a second VM (increasing total CPU count from 1 to 2) would decrease performance. Adding a 3rd VM with yet another CPU would make it even worse. I haven't launched it on 4 CPUs as I wanted to keep one core clear for the host system. Performance wasn't important in that fun project, but it was disappointing anyway. Sure, I could have screwed up networking. Whatever, I had my own SSI cluster of gentoo VMs  everything you need is waiting for you in portage.
> ...

 

No specific parallel application.  More of a general purpose cluster, wanting to better leverage the equipment that I do have.

 *szatox wrote:*   

> And if you feel like doing something totally crazy, you could try abusing the checkpoint-restart feature. Linux containers are said to support live migrations between physical hosts. Perhaps you could use this (kernel feature) to distribute your application. If you made the bunch of processes talk UDP to each other you would have a chance to get something really funny to work. Unfortunately I'm not familiar with it, just know it is implemented.

 

I found this rather interesting:

 *Quote:*   

> 
> 
> http://www.kerrighed.org/wiki/index.php/Status
> 
>  Current Features
> ...

 

This is sort of the direction I was thinking.  The page also says that it's not ready for production work, but this really isn't production work anyway.  This is just play right now.

----------

