# ext3 on 4TB array?

## humbletech99

Hi,

Has anybody used ext3 on a 4TB array before?

I'm using GPT to create one huge partition on a 4500GB /dev/sdb and want to know the most reliable filesystem to use on it. I've read that ext3 can do it if I use a 4KB block size. Is there a better choice? (I'm looking for reliability above all else.)

Any advice anybody can give regarding filesystems of this size would be most appreciated.

----------

## NeddySeagoon

humbletech99,

I hope you have a UPS on that 4.5TB array. Just think how long fsck will take to run.

ext3, reiserfs and xfs are all candidates, with ext3 being the most mature and best supported with recovery tools.

You will have up-to-date validated backups and a UPS, so you always get clean shutdowns though - won't you?

----------

## humbletech99

haven't got a UPS yet. Perhaps I should just add a battery backup unit to the raid card (I think it supports one), although that may protect the array rather than the filesystem on it, which could still go down dirty? I'll investigate putting it on a UPS.

----------

## humbletech99

Can somebody advise me on the best options to give to mke2fs for a 4TB partition?

The array is 10x500GB Sata disks on a 3ware 9550SX controller, with a 256K chunk size on the array itself.

Should I use "-T largefile" or "-T largefile4"?

Also, using "-O sparse_super" seems to be the way to go for this, and I think "-b 4096" is a safe bet to add.
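
For what it's worth, here's the kind of invocation you could try first, exercised on a small loopback image rather than the real /dev/sdb so it's safe to experiment with (image path and size are placeholders, not a definitive recipe):

```shell
# Sketch only: trying the discussed mke2fs options on a small loopback
# image instead of the real array. (Path and size are placeholders.)
img=/tmp/ext3-test.img
dd if=/dev/zero of="$img" bs=1M count=0 seek=64 2>/dev/null  # 64MB sparse file

# -b 4096         4KB blocks (needed for ext3 above 2TB)
# -T largefile    one inode per MB (fewer inodes, smaller inode tables)
# -O sparse_super fewer backup superblocks, saves a little space
# -j              create the ext3 journal
mke2fs -F -q -j -b 4096 -T largefile -O sparse_super "$img"

# inspect what mke2fs actually picked
dumpe2fs -h "$img" 2>/dev/null | grep -E 'Block size|Inode count'
```

Once you're happy with the reported geometry, point the same options at the real device.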

----------

## think4urs11

 *humbletech99 wrote:*   

> I'm using GPT to create one huge partition on a 4500GB /dev/sdb and want to know the most reliable filesystem to use on it. I've read that ext3 can do it if I use a 4KB block size. Is there a better choice? (I'm looking for reliability above all else.)

 

hmm, 4.5 TB might be a bit too big

 *ext3 FAQ wrote:*   

> Ext3 can support files up to 1TB. With a 2.4 kernel the filesystem size is limited by the maximal block device size, which is 2TB. In 2.6 the maximum block device limit (on 32-bit CPUs) is 16TB, but ext3 supports only up to 4TB.

 

about largefile(4): it depends

largefile creates one inode per megabyte, largefile4 one per 4 megabytes

i.e. if the FS will hold mostly very big files, use largefile4; otherwise largefile
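
To put rough numbers on that (back-of-envelope only, assuming the classic 128-byte ext3 inode):

```shell
# Back-of-envelope: inode counts for a ~4500GB filesystem under the two
# -T settings (128-byte inodes assumed, as with classic ext3 defaults).
fs_bytes=$(( 4500 * 1024 * 1024 * 1024 ))
inode_size=128

for ratio in 1048576 4194304; do   # largefile: 1MB/inode, largefile4: 4MB/inode
    inodes=$(( fs_bytes / ratio ))
    table_mb=$(( inodes * inode_size / 1048576 ))
    echo "bytes-per-inode=$ratio -> $inodes inodes, ~${table_mb}MB of inode tables"
done
```

So largefile4 caps you at roughly 1.15 million files/directories on an array this size; if you run out of inodes, the only fix is a reformat.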

----------

## humbletech99

ext3 formats over 4TB okay without probs. I read it can do up to 32TB.

I know what largefile does from the man page, I just haven't heard a satisfactory explanation of the implications and how specifically this ties into block sizes and space wastage etc...

If the inode size resulted in space wastage, and the minimum size was 4K, then why bother with block sizes less than 4K? Why would it even be an option?

I know the inode holds file info, I just don't fully understand how it all links together.

The filesystem will have a lot of small files as well, so I probably won't do largefile...
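
On the wastage question: inodes and blocks waste space in different ways. The inode tables are a fixed overhead chosen at mkfs time, while block-size slack is per file, because a file's last block is only partly used. A quick sketch (the 600-byte file is just an example):

```shell
# Quick sketch of block-size slack: a file always occupies whole blocks,
# so the tail of its last block is wasted. (600-byte file is an example.)
file_size=600
for bs in 1024 2048 4096; do
    blocks=$(( (file_size + bs - 1) / bs ))   # round up to whole blocks
    used=$(( blocks * bs ))
    slack=$(( used - file_size ))
    echo "block size $bs: occupies $used bytes, wastes $slack"
done
```

With millions of small files, that per-file slack is why sub-4K block sizes exist at all; the inode itself is separate overhead (it holds the file's metadata and block pointers, not its data).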

----------

## chunderbunny

With something that large wouldn't you be better off using GPFS or Lustre?

----------

## zeek

I have a 1+ TB array at work using XFS.  Tons of small files and good performance.  The array is almost full of 4-8K files being served by NFS.

Caveat: running xfs_check and xfs_repair (the fsck equivalents) requires a lot of RAM (> 4 GB).  This means you can't check or repair the array if you're using 32-bit linux (I'm using amd64).  I don't have that much RAM in my server, so I just add a 32GB swapfile from the local disk.  Actually I've never had to run xfs_check/xfs_repair.  I don't know if the ext3 fsck has this RAM limitation.

The xfsdump package includes a backup utility and a defragmenter: xfs_fsr.  The defrag is completely safe to run and a nice feature to have on big arrays.

Whatever you do, don't use ReiserFS.

----------

## zeek

 *chunderbunny wrote:*   

> With something that large wouldn't you be better off using GPFS or Lustre?

 

GPFS and Lustre require special hardware (usually a fibre channel SAN) that allows direct concurrent access to the block device.  They're like NFS, except they don't have the single point of failure that NFS has.

You'll only want to use these kinds of filesystems if you have multiple machines mounting them.  They're extremely slow compared to a regular filesystem like ext2.

----------

## simeli

i'd certainly go for xfs on an array that large. xfs has been designed from the start for extremely large filesystems and offers very good performance, the possibility of guaranteed i/o, and a slew of other advanced features. the codebase for xfs is very mature and well tested.

one thing to consider however: xfs aggressively caches data in ram, so a ups is a must if you don't want to lose data on a power outage. xfs also only journals metadata and not data, so the filesystem will be consistent but the files in question may be filled with zeros and thus unusable. xfs does not try to guess what the data should be like ext3 does; it relies on the applications doing atomic writes. so this is actually not a shortcoming of xfs but a design decision.

----------

## humbletech99

I'd rather the performance was a little less and the reliability a little more, so I think I'm leaning towards sticking with ext3 for now. I don't have a UPS yet, and data integrity on the filesystem is paramount.

----------

## simeli

reliability is often confused with what a filesystem may optionally do. most filesystems are only supposed to recover A CONSISTENT STATE of the METADATA ONLY from events such as unclean shutdowns.

'ext3' additionally can make an attempt at recovering the contents of files too, if ordered or data journaling is enabled. 

However the proper way to ensure data (as opposed to metadata) recoverability is to ensure the application handles it, using atomic data transactions, because that's the only way. even if 'ext3' often succeeds blindly, that is not the right way and not what the term reliable implies.

Large-scale filesystems like JFS and XFS, designed for mission-critical applications, don't make any attempt at data recovery, because indeed that should be handled by the applications themselves.

Many people who don't understand this then complain that these two filesystems cause loss of data...

again, with no ups in place xfs is not the best choice. beware of extremely long (hours or even days) fsck times with ext. a "quick reboot" after a year may not be all that quick.
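
For reference, the "atomic writes from the application" idea usually boils down to the write-temp-then-rename pattern; a minimal sketch (paths made up):

```shell
# Sketch of the write-temp-then-rename pattern (paths made up): rename()
# is atomic on the same filesystem, so readers see either the old file or
# the complete new one, never a half-written file.
target=/tmp/important.conf
tmp="${target}.tmp.$$"

printf 'new contents\n' > "$tmp"
sync "$tmp" 2>/dev/null || sync   # flush the data before swapping it in
mv -f "$tmp" "$target"            # atomic replace on the same filesystem

cat "$target"
```

After a crash you end up with either the old file or the new one, which is exactly the guarantee metadata-only journalling can't give you on its own.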

----------

## humbletech99

you can't rely on having an atomic application working on the filesystem though. it could be any regular joe app from windows or linux accessing the data when the system goes down for any reason, and you still don't want to lose data.

I think the job of the filesystem should be to make sure no data is lost and that filesystem corruption does not occur; that is inherent in the term "robust filesystem".

I understand what you are saying, but we aren't some big player like IBM which can get everything else to work around the XFS filesystem in order to make it a good choice. Also, this filesystem isn't going to be used by some ACID-compliant database, so it needs to have as many of its own safeguards as possible; it's simply a big file repository...

----------

## NeddySeagoon

humbletech99,

You pay your money and take your pick. 

To get any deeper into your decision-making process you need to do some formal reliability analysis and determine what failures you are really trying to guard against.

Maybe you should have a 'hot mirror' system on a separate power supply, even in a separate building, so the server side has no single point of failure. It depends on what you are prepared to pay to achieve the reliability you want and what you perceive the threats to your data to be.

You will still need backups to recover files that users accidentally delete, or realise they didn't mean to do that, a few days later. That will be an impressive amount of tape.

----------

## humbletech99

we don't back up to tape, only hard disk, usually 4x400GB once a week...

I'm most worried about filesystem corruption, since this is not something the raid array can help me with... therefore this is the most likely single point of failure that I can identify, apart from a hardware failure, for which I can just replace the component. I can afford to lose 1 disk out of 10 and still have all our data.

----------

## simeli

if one disk fails in such a big array, the rebuild process is going to take a very long time, during which you must not lose another disk. your statement that you can lose one out of ten disks suggests you are using a raid 5 setup. i would probably go for raid 6, allowing for two disks to die, just to be safer during the rebuild. using different makes/models of disks also reduces the chance of several disks dying at the same time.

----------

## NeddySeagoon

humbletech99,

You need to do a rational failure analysis based on your hardware selection to see where single points of failure exist, what the probability of each failure is, and the cost to guard against it.

Then you can make fix it/live with it decisions for them all.

e.g. Do you have a single PSU? Are all the drives on the same controller? On the same PCI bus?...

Provided your backup works, it sounds like the most you lose is the files changed over the last week.

Presumably you rotate backups to guard against backing up already trashed data and losing your backup too.

You speak of 'our' as in a business of some sort. Do you have a disaster recovery plan?

What happens to your data if the building burns down?

Does the business fold or can you get the data back in a day or so.

----------

## humbletech99

well, this machine will be used as a nightly backup server, from which we take our weekly backups, so people shouldn't miss it if it goes down for 3 hours to rebuild the raid array...

it will also store older data that will be backed up less frequently, say once a month.

our backups go offsite and we have a couple of locations with other servers that we could do some work with in the event that the main building burns down...

I think the fact that the machine isn't mission critical and won't be under heavy loads means that I'm not too worried, I already have a spare raid controller on the shelf and the other components aren't of too much interest. Yes there is a single power supply, but again it's not a serious concern...

----------

## zeek

 *humbletech99 wrote:*   

> I'd rather the performance was a little less and the reliability a little more, so I think I'm leaning towards sticking with ext3 for now. I don't have a UPS yet, and data integrity on the filesystem is paramount.

 

XFS is 100x more reliable than ext2/3.  I think you've misunderstood the warning.  Once you do a `sync` (on any fs) the data has been flushed to disk.  This applies to XFS as well.

The problem stems from unflushed buffers, i.e. if the power is cut in the middle of a file copy.  XFS will give you 0s for the data that was allocated but never synced.  ext2/3 will give you the previous contents of those sectors.  When people see the 0s they think that XFS deleted their data, but when they see corrupted files with old data in them they just think the file got corrupted by the power failure.

Another thing to consider: ext3 is journalling functionality tacked on top of ext2, and ext2 was written when linux filesystems were measured in tens and hundreds of megabytes.  Do you think it scales up nicely to filesystems in the TB range?

----------

## humbletech99

I haven't noticed any problems with ext3 at this size; it can do up to 32TB, so I assume that 4 isn't pushing it too much. Since I'm much more familiar with ext3, I'd prefer to stick with it since I know its tools, unless there is some really good reason to change. I've read that ext3's journalling is very good and safer than most other meta-journalling, although I guess it depends on the mode of journalling you use.

downside is, running something like badblocks takes an absolute age...

----------

## NeddySeagoon

humbletech99,

The badblocks program is from a bygone age when drives were manufactured with surface defects and these were mapped out by the filesystem.

Modern drives, from about 4GB up, are still manufactured with surface defects, but spare sectors and tracks are provisioned too. The drive maps out faulty sectors to spares by itself, both when it's new and throughout its life. This means that the operating system (and the badblocks program) never sees any bad blocks. They are hidden by the drive.

However, using smartmontools, you can read the drive's internal error log to see what it's been doing. This has worked fine with IDE drives for a number of years. Before it works with SATA drives, libata needs ioctl passthrough, which has been around as a patch for a while now. I'm unsure if it's in the mainline kernel or the gentoo-sources patch set yet.
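
For reference, the usual smartmontools incantations look like this (device name hypothetical; older SATA setups may also need `-d ata` or the passthrough patch mentioned above):

```shell
# For reference (device name hypothetical): poking at a drive's internal
# logs and self-tests with smartmontools.
smartctl -H /dev/sda          # overall health self-assessment
smartctl -A /dev/sda          # attributes, e.g. Reallocated_Sector_Ct
smartctl -l error /dev/sda    # the drive's internal error log
smartctl -t long /dev/sda     # start an offline long self-test
```

The Reallocated_Sector_Ct attribute is the one that shows the invisible sparing at work: it counts sectors the drive has quietly remapped.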

----------

## humbletech99

ok, so you're saying badblocks is a waste of time, which I can understand. but if the drive itself is faulty, or the cable or whatever, then I still think it will show up in a badblocks run. i have found disks that gave many errors in badblocks and binned them, and they were all over 4GB.

----------

## NeddySeagoon

humbletech99,

Drives can only hide bad sectors until the spares are all used up. After that, further bad blocks are visible to the operating system and the drive is end-of-life. With smartmontools you can read the internal error log to see the bad block mapping over time.

Practice on an IDE drive if you have one handy.

----------

## humbletech99

thanks, will give smartmontools a try...

----------

## troymc

Furthermore, bad blocks can only be discovered when they are accessed, and are only spared out on a write(). If found during a read(), the drive doesn't know what the correct data is, so it only marks the block bad and spares it on the next write().

Badblocks is simply one tool to systematically test each block on the drive.

Importantly, this also includes the areas of a drive that have no user data, thus catching bad blocks before you lose data to them.

And its non-destructive read/write mode (-n) triggers the drive's automatic block sparing, too!

What's not to like? (other than the fact that it can take a day or so to run on a large modern drive)

troymc

----------

## humbletech99

probably more like a week on this array...

----------

## boniek

I would consider switching to OpenSolaris and trying out ZFS. Read up on it; maybe it's just the solution for you.

----------

## humbletech99

linux-only shop I'm afraid, and solaris looked a little dated last time I checked... besides, given enough time, linux will finish solaris off, along with most other things...

----------

## troymc

 *boniek wrote:*   

> I would consider switching to OpenSolaris and trying out ZFS. Read up on it; maybe it's just the solution for you.

 

Can't wait for the linux port to come out.

If ZFS lives up to expectations, it's going to be one awesome filesystem.

RAID, LVM, and zettabyte-scale filesystems all rolled into one.

I haven't tried OpenSolaris yet, but ZFS is due to be released next month for real (x86 & sparc versions).

troymc

----------

## humbletech99

indeed zfs looks very promising. perhaps it will become the de facto filesystem instead, if it's released open source?

----------

## boniek

ZFS is released under the CDDL licence, which is incompatible with the GPL, which in turn means that we won't see a kernel implementation anytime soon. I heard some guys are working on a FUSE port though.

----------

## humbletech99

of course Sun would rather you switch to Solaris...

----------

## jtang

 *zeek wrote:*   

>  *chunderbunny wrote:*   With something that large wouldn't you be better off using GPFS or Lustre? 
> 
> GPFS and Lustre require special hardware (usually a fibre channel SAN) that allows direct concurrent access to the block device.  They're like NFS, except they don't have the single point of failure that NFS has.
> 
> You'll only want to use these kinds of filesystems if you have multiple machines mounting them.  They're extremely slow compared to a regular filesystem like ext2.

 

actually, you don't need specialised hardware to run lustre or gpfs; it is possible to use a standard block device (/dev/hda, /dev/sda etc.) on a normal machine using the required kernel(s).

gpfs and lustre scale up in terms of performance as you add more machines/disks to the setup. it's kinda useless for a one-machine setup. and regarding the "slow" speeds, that's probably true if you don't tune the block sizes to what you want to do.

but in a typical HPC environment (which gpfs and lustre are designed for) you do large blocked IOs, and it's great for streaming data in parallel off a number of storage servers.

lustre looks nice in terms of performance and being able to scale up, but the current 1.4.x series looks awfully painful to administer on a large scale, whereas GPFS has a really nice "everyone is a client" model with designated tasks for specific sets of nodes in your system.

----------

## sentenza

Hello,

I'm actually running many of these 3ware controllers, mainly with multiple raid1 arrays, but I think my comments will also apply to this discussion...

 *humbletech99 wrote:*   

> haven't got a UPS yet. Perhaps I should just add a battery backup unit to the raid card (I think it supports one), although that may protect the array rather than the filesystem on it, which could still go down dirty? I'll investigate putting it on a UPS.

 

Well, the Battery Backup Unit can only help you in one case: you want to use write cache. That nice little feature, used to achieve the incredible write performance you see in marketing sheets, has a really bad side effect: caching your write IOs in the controller cache will render your FS mad (any type of FS) in case of power loss, hard reset, kernel crash... Still, your array will be valid. That is what the BBU is for: saving your cached IOs so they can be applied at (re)boot. Using write cache without a BBU is just insane if you care about your data. I'm using write cache with a BBU under database load and can confirm two nice effects: first, the array rocks; second, I can sleep at night  :Smile: 

Now on XFS: i've used it for years, mainly for file servers, and yes it rocks:

- limits are just insane as long as you don't have a SAN with a thousand drives
- performance is great and consistent, except when approaching a full filesystem (ie > ~98%)
- you can grow the filesystem quite easily when needed, however you must use a volume manager for that
- small or big files have quite equivalent performances (based on experience with a 500GB raid5 with > 1M files)
- metadata is really fast once loaded at least once: perhaps a little advantage, but try to rsync 1M files and you'll understand this point
- XFS is based on well-designed, proven technology developed by SGI, not really a little player when you begin to talk about BIG things

Yeah, XFS also has caveats:

You can lose data because of the journaling mechanism used, but this has already been well explained in this thread (one precision: this only applies to files open when the crash occurs). However this should be quite easily overcome by keeping your system as stable as possible: use proven, reliable hardware, don't play too much with compiler options, kernel options or "experimental" features, choose a redundant PSU (you can't trust those things), USE A UPS, USE A UPS, USE A UPS, and USE A UPS!

The xfs tools are different from the ext tools. Seems quite logical, no? Just try them and you will discover they're also good tools, however google won't give millions of pages for your searches... Perhaps also a sign that fewer people need to use them than the ext tools  :Wink: 

Never ever play with an external log on your root filesystem!!! Booting can really become difficult if you do. However an external log is definitely good for performance... really good.

Ext3 is a good filesystem, but frankly, you're talking about backups, and trust me, when things go wrong you can't afford to risk a 2-day fsck on your 2TB FS. ie: it's always when you need your backups that the backup server will have a problem, that you will need to reboot, and that you won't have foreseen the damn fsck  :Mad: 

Regarding the mkfs options, using largefile4 means that your average file size should be > 4M and that you won't have millions of directories: 1 inode per 4M means fewer inodes, so less wasted space and generally quicker FS creation and allocation. However if you have too many little files or directories, you end up with 600G of free space but no possibility to add files because you're running out of inodes. Each file or directory uses one inode, and the only solution to get more of them is to format your fs once again with a more considered setting. I experienced it on a critical system, and trust me, you don't want that to happen! However if you're using backup software that creates large chunks, or where you can set their size, it's a good solution. I think bacula does that, and I know it's certainly the best free backup solution on linux at the moment.

One last consideration: you're talking about backup, so please think one more time about tapes... Yes I know, disk is cheaper, performs better and... is less reliable. Tapes are still the best solution for backups in my opinion, because they have the big advantage that you can really simply add more of them, and so back up more data, extend your retention policy, and send tapes offsite. Disks are good for snapshots, not backups. Sony makes nice, compact, quite affordable autoloaders with AIT tapes; you should take a look at them. However you certainly know the biggest problem with tape: someone has to change them and nobody wants to do it   :Laughing: 

Just my 2 eurocents...

----------

## zeek

 *sentenza wrote:*   

> 
> 
> you can grow the filesystem quite easily when needed, however you must use a volume manager for that
> 
> 

 

Would this not be true for all filesystems?

 *sentenza wrote:*   

> 
> 
> small or big files have quite equivalent performances (based on experience with a 500GB raid5 with > 1M files)
> 
> 

 

Yup, I have a 1TB array filled with 2-8K files and performance is great.

 *sentenza wrote:*   

> 
> 
> You can lose data because of the journaling mechanism used, but this has already been well explained in this thread (one precision: this only applies to files open when the crash occurs). However this should be quite easily overcome by keeping your system as stable as possible: use proven, reliable hardware, don't play too much with compiler options, kernel options or "experimental" features, choose a redundant PSU (you can't trust those things), USE A UPS, USE A UPS, USE A UPS, and USE A UPS!
> 
> 

 

This is also true for all metadata-journalling filesystems.  ie: create a new file, write, close it, and before fsync is called power is lost.  When power is restored xfs will realize that the data was not flushed and overwrite the disk blocks allocated for the file with 0s.  Reiserfs and Ext3 (metadata journal) will give you a file with the previous contents of the disk blocks allocated for the file (ie junk).

----------

## sentenza

 *zeek wrote:*   

> 
> 
> Would this not be true for all filesystems?
> 
> 

 

Definitely... You can't grow your partitions with fdisk...

 *zeek wrote:*   

> 
> 
> This is also true for all metadata journalling fs.  ie: create a new file, write, close it, and before fsync is called power is lost.  When power is restored xfs will realize that the data was not flushed and overwrite the disk blocks allocated for the file with 0s.  Reiserfs and Ext3 (metadata journal) will give you a file with the previous contents of the disk blocks allocated for the file (ie junk).

 

Yes... As said, already well explained in this thread...   :Wink: 

----------

## troymc

 *zeek wrote:*   

>  *sentenza wrote:*   
> 
> you can grow the filesystem quite easily when needed, however you must use a volume manager for that
> 
>  
> ...

 

Not for ZFS.

Of course, it's just now being released. 

The 1/06 release of Solaris 10 officially releases it. I'm playing with it now.

troymc

----------

## Limit

Nice thread!

...And what about recovery with ext2/3, xfs and the others -- which filesystem is most tolerant of power disconnects or unplanned reboots? RAID1/5/6/10 can only prevent hardware failure, not software failure.

----------

## humbletech99

i'd go with ext3 for this. not the best performer, but I'd rather not lose any data.

also see the previously mentioned battery backup unit. got mine.

----------

