# Cluster filesystem for HPC

## Bircoph

Hello,

I'm setting up a small scientific cluster of five nodes. CPU and RAM on each node are good enough, but disk space is limited: 120 GB per node, and no hardware upgrade is possible.

I want to use an ~80-100 GB partition from each node to build a single distributed file system, mounted on all nodes simultaneously for user data and software; the internal network is 1 Gb/s and should handle this well. The remaining space will be used for OS files, swap, local caches and scratch space.

There are dozens of cluster file systems available, so it is hard for me to choose.

So, given the hardware limitations and project goals (mostly PBS, MPI and distcc), these are the features I need:

1) The replication ratio should be controllable; most likely I will disable replication entirely to save space. (Data will be backed up regularly to external storage, but that storage will be slow (no more than 100 Mb/s), so I need a fast internal cluster file system.)

2) The FS should be supported in kernel space. No FUSE solutions are acceptable: heavy I/O is expected, and I do not want to waste the CPU cycles available.

3) A solution should be open source, no commercial licenses.

4) User/group quota support is desirable (to protect the file system from overflow due to program errors or malicious actions).

5) Nodes holding the FS can't be dedicated to it: the very same nodes will be used for computations.

6) It is highly desirable that FS blocks located on the current computing node be accessed directly, instead of being transferred to the master node (mount source) and back to the client node; this saves network bandwidth on no-ops.

I have considered several cluster file systems; here are my findings:

1) ceph — a good design, very good documentation, full replication control, load and disk-fill balancing; it provides everything I need except quotas. It is already packaged in Gentoo, so it will be easier to support. Currently I'm looking forward to using this FS as the best choice available from my point of view.

2) ocfs2 — I can't find any way to control or disable the replication ratio, and the documentation is poor; otherwise it looks good.

3) pvfs — looks good, but there is no mention of quotas, and it can't intelligently distribute files across servers according to free space.

4) lustre — good, powerful stuff, but too complex; perhaps overkill for a small cluster. It requires heavy kernel patching on the server side, so with any problems or urgent (security) kernel updates I will be on my own. I'd like to use something less intrusive.

5) gfs2 — looks interesting, but I have bad reports from colleagues about both stability and performance outside of RH builds. It would also require gnbd, with a very complicated setup.

6) gpfs — a commercial license required.

7) glusterfs — userspace solution, unacceptable.

8) moosefs — userspace solution, unacceptable.

9) fhgfs — free to use, but no sources are available, and I do not want the pain of binary-only packages.

I have no experience setting up a distributed FS myself, except for a setup where a separate system handles the local disks and exports all storage space via NFS, so advice will be welcome.

ATM I'm leaning toward the first five mentioned, with ceph greatly preferred, but any opinion will be considered.

----------

## chithanh

10) afs

 *Bircoph wrote:*   

> 2) FS should be kernel space supported. No fuse solutions are acceptable: heavy I/O is expected, I do not want to waste CPU cycles available.

 You are very quick to discard the userspace solutions. If your jobs are I/O-bound anyway, it won't matter that some extra CPU cycles are used by the filesystem.

Benchmarks with a representative subset of your particular workload will probably answer your questions better than any speculation about the best cluster fs.

----------

## Bircoph

 *chithanh wrote:*   

> 10) afs

 

Thank you for this suggestion. I forgot that AFS may combine different volumes in the same cell. 

But the AFS volume itself is a blocking problem: if I have 100 GB partitions, it will be impossible to create a 110 GB volume; moreover, if I have 49 GB free on all nodes, I won't be able to write 50 GB into any single volume. I want more flexible storage control, so I can temporarily allow users to use large amounts of storage. AFS is also really bad at storage balancing: in general you must move volumes by hand. I found a balancing application for AFS, but it is not realtime and runs as a cron job at rather large intervals.

By the way, AFS is redundant here, and it is questionable whether the AFS cache will improve performance, because in my case network bandwidth is larger than HDD I/O rate. AFS is excellent for long-range distributed or even geographically separated setups, but when the whole cluster is in the same blade rack, other approaches may be more efficient.

 *Quote:*   

> Benchmarks with a representative subset of your particular workload will probably answer your questions better than any speculation about the best cluster fs.

 

You are right, but I'll die trying to build, install, configure and test all of them, and the choice of cluster FS may affect the overall design of the setup. That's why I'm asking for people's experience and suggestions.

----------

## Bircoph

After some research I found more info:

There is a good comparison between nfs, gfs2 and ocfs2:

http://www.gpaterno.com/publications/2010/dublin_ossbarcamp_2010_fs_comparison.pdf

and between nfs and ceph (on btrfs):

http://ceph.newdream.net/2009/01/some-performance-comparisons/

It looks like ocfs2 beats gfs2 almost everywhere, which is not surprising, because gfs2 works on top of clvm, and lvm is a tough troublemaker in terms of both performance and stability. I also prefer the simplest possible design with the smallest number of additional layers, because all of them have non-zero failure probabilities.

And I take back my words about the ocfs2 documentation: a rather good guide may be found here:

http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.6/ocfs2-1_6-usersguide.pdf

The only problem with these docs was understanding that ocfs2 does not support replication itself (just like gfs2).

I'll try ocfs2 after some preparation and will report my experience here in a few weeks or so.

----------

## John R. Graham

Is strict POSIX filesystem semantics one of your requirements? That seems to reduce the playing field somewhat.

- John

----------

## Bircoph

 *John R. Graham wrote:*   

> Is strict POSIX filesystem semantics one of your requirements? That seems to reduce the playing field somewhat.

 

Frankly speaking, I do not fully understand what is behind "strict POSIX filesystem semantics"; I looked into the Single UNIX Specification, Version 4, but the definitions look too vague to me.

But most probably yes, I need these semantics, because I need the distributed file system to behave as close to a local filesystem as possible from the applications' point of view. In particular, I have these requirements:

- standard Unix permissions; ACLs are welcome, but not strictly required;

- quota support;

- consistency guarantees for multiple and possibly overlapping writes (so POSIX locks should be supported, I guess);

- xattr support, and probably other features commonly used by local applications.

This cluster is not designed for a single computation task; it may be used in a variety of ways for many computing tasks within its technical limits, including but not limited to educational purposes.

----------

## John R. Graham

I was referring to things like this from the Ceph web site: Differences from POSIX. I've by no means made a comprehensive survey, but OCFS, for instance, does not suffer from these corner-case limitations.

- John

----------

## Bircoph

 *John R. Graham wrote:*   

> I was referring to things like this from the Ceph web site: Differences from POSIX.

 

Hmm, it's hard to say... Of these two cases, I definitely do not care about wrong df reports for sparse files, but the issue with simultaneous writes around an object boundary may be a problem... or maybe not. I can't be sure, but it is better to take some precautions.

 *Quote:*   

> 
> 
> OCFS, for instance, does not suffer from these corner-case limitations.

 

According to the ocfs2 docs, it aims to be as close to local filesystem behaviour as possible; I hope this will work well. So ocfs2 will be the first to try.

----------

## Mad Merlin

You seem to have found this already, but it's worth reiterating that GFS2 and OCFS2 are shared filesystems, not clustered filesystems.

In a shared filesystem there is exactly one copy of the data on exactly one node with the caveat that all nodes can access it concurrently once the underlying block device is exported over the network (typically via iSCSI). You can add additional copies by putting something like DRBD underneath the filesystem, but I don't think that's what you're looking for. You gain no additional (storage) performance or space by adding more nodes.

In contrast, a clustered filesystem like Ceph stores data across multiple nodes on its own. You generally gain (storage) performance and/or space by adding more nodes.

----------

## Bircoph

 *Mad Merlin wrote:*   

> You seem to have found this already, but it's worth reiterating that GFS2 and OCFS2 are shared filesystems, not clustered filesystems.
> 
> In a shared filesystem there is exactly one copy of the data on exactly one node with the caveat that all nodes can access it concurrently once the underlying block device is exported over the network (typically via iSCSI). You can add additional copies by putting something like DRBD underneath the filesystem, but I don't think that's what you're looking for. You gain no additional (storage) performance or space by adding more nodes.

 

Yes, thanks for reminding me that I need some distributed block device beneath OCFS2. And this is a problem, because I can't find an appropriate one: DRBD implies RAID-1 mirroring, NBD can't be mounted read-write on several hosts, and the RADOS block device requires Ceph, so it would be irrational to use it under OCFS2 when it would be more appropriate to use Ceph itself. Another approach may be to put something like AUFS3 on top of OCFS2, but that may lead to disaster, since the on-top FS will not be cluster-aware.

However, this is not true for GFS2, because it can work on top of CLVM (and OCFS2 can't). CLVM allows one to create a logical network-distributed block device, which can be used beneath GFS2. So this is a good combination to try, but gfs2+clvm has poor performance, even with concurrent writes from just two nodes.
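For concreteness, the stack would be assembled roughly like this (a sketch with illustrative names; it assumes every node already sees all the spare partitions as shared block devices, e.g. exported via iSCSI or gnbd, and that the cman/clvmd/dlm cluster stack is up):

```shell
# Mark each shared partition as an LVM physical volume
pvcreate /dev/sda4

# Create a clustered volume group; -c y makes clvmd coordinate
# LVM metadata changes across the nodes
vgcreate -c y clustervg /dev/sda4

# One logical volume spanning all available space
lvcreate -n gfsvol -l 100%FREE clustervg

# Format with GFS2: lock_dlm for cluster-wide locking,
# -t <clustername>:<fsname>, and one journal per node (-j 5)
mkfs.gfs2 -p lock_dlm -t mycluster:gfsvol -j 5 /dev/clustervg/gfsvol

# Mount on every node
mount -t gfs2 /dev/clustervg/gfsvol /mnt/shared
```

Note how many moving parts are involved even in this minimal variant; each layer (export, clvmd, dlm, gfs2) is a separate failure point.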

 *Quote:*   

> In contrast, a clustered filesystem like Ceph stores data across multiple nodes on its own. You generally gain (storage) performance and/or space by adding more nodes.

 

Well, perhaps I should give Ceph a try. Maybe someone has an idea how to implement disk quotas on Ceph using some external layer (selinux, aufs, nfs... I'm running out of ideas)?

Though, please note that for my setup scalability means nothing, because the hardware setup is unscalable, and any external disk storage will be at least an order of magnitude slower; so even if it is used in the future, it will be deliberately mounted at a separate mount point.

----------

## Mad Merlin

 *Bircoph wrote:*   

>  *Mad Merlin wrote:*   You seem to have found this already, but it's worth reiterating that GFS2 and OCFS2 are shared filesystems, not clustered filesystems.
> 
> In a shared filesystem there is exactly one copy of the data on exactly one node with the caveat that all nodes can access it concurrently once the underlying block device is exported over the network (typically via iSCSI). You can add additional copies by putting something like DRBD underneath the filesystem, but I don't think that's what you're looking for. You gain no additional (storage) performance or space by adding more nodes.
> 
>  
> ...

 

Ah, you appear to be right. A fascinating option for GFS2 which I had not realized up until now.

In that case, the distinction is that the filesystem itself is only aware of a single copy of the data for shared filesystems.

----------

## tmgoblin

Perhaps a ceph setup wherein you create per user or per use (sparse) disk images, mounted via loop device and formatted with a quota supporting file system if needed?
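Concretely, such a setup might look like this (a hypothetical sketch; paths, sizes and the choice of ext4 are illustrative):

```shell
# Create a sparse per-user image on the Ceph mount
truncate -s 20G /mnt/ceph/images/alice.img

# Format it with a quota-capable local filesystem
# (-F is needed because the target is a regular file)
mkfs.ext4 -F -q /mnt/ceph/images/alice.img

# Loop-mount it with user quotas enabled
mount -o loop,usrquota /mnt/ceph/images/alice.img /home/alice
quotacheck -u /home/alice && quotaon /home/alice
```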

----------

## Bircoph

 *tmgoblin wrote:*   

> Perhaps a ceph setup wherein you create per user or per use (sparse) disk images, mounted via loop device and formatted with a quota supporting file system if needed?

 

1) As discussed with the Ceph developers, if one needs to mount a Ceph filesystem on the same node where a data server is located (which is my case), FUSE must be used; otherwise a deadlock will occur under memory pressure. And I want to avoid FUSE if possible.

2) The same is true for any loopback mounts as you propose; this approach is unsafe:

https://docs.google.com/a/dreamhost.com/viewer?a=v&q=cache:ONtIKJFSC7QJ:https://tao.truststc.org/Members/hweather/advanced_storage/Public%2520resources/network/nfs_user+nfs+loopback+deadlock+linux&hl=en&gl=us&pid=bl&srcid=ADGEESgpaVYYNoh2pmvPVQ9I_bpLLcoF3GJIMKavomIHNgTb-cbii6RVtWg28poJKdHBqQgKGXzVA2NOsC25FtWMP3yywTfNkX9N26IrKVIcVA9eRz6ZGBx1_Ur0JerUrfBQlPcmcBBz&sig=AHIEtbSjGX_hCVny345iF

3) If you use a non-cluster-aware filesystem on top of a distributed disk image, it will degrade your I/O to a miserable state. Besides, such a filesystem would have to be cluster-aware anyway, like ocfs2 or gfs2.

ATM I have given up on user quotas; this is not so critical after all. Throughput and a sane CPU load are more important now.

It looks like every distributed filesystem has its very own bugs, perils, benefits and drawbacks, and none of them works ideally. Currently I plan to try OrangeFS as the PVFS2 successor, though the work is postponed for other reasons.

----------

## Bircoph

Well, I finally finished this project. In short: after thorough analysis and tests of a selected few, I chose OrangeFS (a branch of PVFS2).

The main limitation in my work was the need to use the same nodes for both storage and computing. This threw away most of the available solutions at once, because they turn out either to be incapable of mounting (and working well) on the same nodes as the data servers, or not designed to share the available CPU/RAM resources sanely with other applications.

After theoretical considerations and a search through various sources, I tested three installations in practice:

1) Ceph. A good thing in theory, maybe even the future of data storage, but it is immature currently:

a) Ceph support for multiple active metadata servers is unstable now; that means all client metadata requests will be served by a single host, and this bottleneck greatly limits scalability and performance at peak loads.

b) The Ceph kernel client can't be safely used on nodes also serving as data servers. See the discussion in previous posts.

2) GlusterFS. On the benefit side, this distributed filesystem is perhaps the simplest one to deploy and manage. Everything else is a drawback.

a) Performance is acceptable only in linear mode, where the client gets no benefit from parallel I/O to multiple nodes, so using this mode in HPC is a mockery. And even in this mode, even for large data chunks (dd bs=1M), CPU consumption is 30% of a single core (Xeon E5410). I/O speed for large chunks is about 52 MB/s, which is close to the limit of the local hard drives (57 MB/s).

b) Even in simple stripe mode performance is horrible: while good on writes (100 MB/s), it drops to an absolutely unacceptable 30 MB/s on reads. CPU utilization is terrible: it consumes about 250% of a single core, so on a single node with 2 Xeons and 8 cores one must give up 2.5 cores for nothing. Tested with large chunks; on small I/O requests all distributed filesystems behave worse.
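For reference, the large-chunk figures above come from plain dd runs along these lines (a sketch; the mount point is illustrative, and caching is suppressed so the numbers reflect the filesystem rather than RAM):

```shell
# Sequential write: 1 GiB in 1 MiB blocks, flushed to disk before dd exits
dd if=/dev/zero of=/mnt/dfs/ddtest bs=1M count=1024 conv=fdatasync

# Sequential read of the same file, bypassing the page cache
dd if=/mnt/dfs/ddtest of=/dev/null bs=1M iflag=direct

rm /mnt/dfs/ddtest
```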

3) OrangeFS. It supports neither quotas nor file locks (though all I/O operations are atomic, and this way consistency is kept without locks). But it works, and works well and stably. Furthermore, this is not a general file-storage-oriented system but an HPC-dedicated one, targeted at parallel I/O, including ROMIO support. All tests were done with stripe data distribution.

a) No quotas — to hell with quotas. I gave up on them anyway; even glusterfs supports not common uid/gid-based quotas but directory size limits, more like the way LVM works.

b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata storage (a single node), this gives +50% performance on small files and no significant difference on large ones.

c) Excellent performance on large data chunks (dd bs=1M). It is limited by the sum of local hard drive speed (do not forget that each node participates as a data server as well) and available network bandwidth. CPU consumption under such load is decent: about 50% of a single core on the client node and about 10% on each of the other data server nodes.

d) Fair performance on large sets of small files. For the test I untarred the Linux 3.1 kernel. It took 5 minutes over OrangeFS (with tuned parameters) versus almost 2 minutes over NFSv4 (tuned as well) for comparison. CPU load is about 50% of a single core (distributed between cores, of course) on the client and a few percent on each node.
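The small-file test is simply a timed unpack of the kernel sources onto the mount under test (a sketch; the tarball path and mount point are illustrative):

```shell
# Unpack tens of thousands of small files and sync,
# so the timing includes writeback to the filesystem
time sh -c 'tar xf /tmp/linux-3.1.tar.bz2 -C /mnt/dfs && sync'
```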

e) Support for the ROMIO MPI-IO API. This is a sweet treat for MPI-aware applications, allowing them to use PVFS2/OrangeFS parallel input/output features directly.

f) No support for special files (sockets, FIFOs, block devices). Thus it can't be safely used as /home, and I use NFSv4 for that task, providing users a small quota-restricted home space. Most distributed filesystems don't support special files anyway, though.

Having spent a considerable amount of time analysing different distributed filesystems, I found that the solution closest to PVFS2/OrangeFS, both in design and in features provided, is the famous Lustre. The main difference between them is that Lustre is in kernel space, while OrangeFS is in userspace except for the kernel VFS part. As a result of the kernel-space approach, Lustre supports only a few rather old and heavily patched kernels, like the RHEL ones. It would be a pain to install that on Gentoo, and even more pain to deal with such old and heavily patched kernels. On top of that, Lustre targets dedicated storage nodes only and consumes all available resources for good. So I'm not even trying it.

I found that PVFS2 was once in Portage, so I used that ebuild as a starting point. It had been removed from the tree because it lacked support for recent kernels, though the tree cleaners overdid their job and removed the package days before a new release came out that fixed support for recent kernels. This removal was all the more shortsighted because PVFS2/OrangeFS provides a FUSE client which can be used on not-yet-supported kernels.

By now I have heavily reworked that ebuild, accommodating new or previously missed features of the filesystem, as well as heavily patching it from trunk to fix some bugs, all in interaction with upstream. I also fixed some build issues and sent them patches as well. This ebuild (orangefs-2.8.5) may be found in my overlay ("bircoph", registered in layman). I plan to post it on Bugzilla and probably move it to the science overlay, where I can support it; this program will fit the science overlay really well, because the filesystem targets scientific applications. It is less likely to go into the main tree, because it is extremely hard to put anything upstream these days, though proxy maintaining is still a thinkable solution.
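For anyone wanting to try it, installing from the overlay should look roughly like this (assuming layman is already installed; the sys-cluster category is my guess at the package location and may differ):

```shell
# Add and sync the overlay
layman -a bircoph

# Install the filesystem (ebuild orangefs-2.8.5 from the overlay)
emerge -av sys-cluster/orangefs
```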

Currently all kernels up to 3.2 are supported. Support for 3.3 is under way, but it will take time, because considerable changes are required. However, the filesystem is fully functional even without kernel support:

1) The FUSE client may be used instead of the kernel-based one. Functionality is about the same.

2) The OrangeFS server is userspace and does not need kernel support at all.

3) It is possible to use the filesystem without the VFS at all (yes, no mounts and no usual file access): the ROMIO API allows applications to access data directly using parallel I/O routines.

----------

## TequilaTR

Very nice summary. The OrangeFS solution really sounds promising.

----------

## Bircoph

I opened bug 411173 for this package:

https://bugs.gentoo.org/show_bug.cgi?id=411173

----------

