# very bad raid 5 performance

## R!tman

Hi all,

I have got 4 sata drives in a raid 5 like this:

```
# cat /proc/mdstat 

Personalities : [raid1] [raid5] 

md1 : active raid5 sdd3[3] sdc3[2] sdb3[1] sda3[0]

      467973120 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

      

md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]

      40064 blocks [4/4] [UUUU]

      

unused devices: <none>
```

I use reiserfs on them.

The performance is really bad:

```
# hdparm -Tt /dev/md1

/dev/md1:

 Timing cached reads:   3612 MB in  2.00 seconds = 1805.37 MB/sec

 Timing buffered disk reads:  108 MB in  3.01 seconds =  35.85 MB/sec
```

Even a single drive should be 50% faster than the whole raid 5.

Here is another example:

```
 $ time cat /scratch/big.file > /dev/null

real    0m47.697s

user    0m0.004s

sys     0m3.984s

 $ du -h /scratch/big.file

2.2G    /scratch/big.file
```
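That works out to well under what a single drive manages; a quick back-of-the-envelope check (assuming du's "2.2G" means roughly 2.2 * 1024 MB):

```python
# Throughput implied by the `time cat` run above.
# Assumption: du's "2.2G" is roughly 2.2 * 1024 = 2252.8 MB.
file_mb = 2.2 * 1024
elapsed_s = 47.697  # the "real" time reported by `time`
print(f"{file_mb / elapsed_s:.1f} MB/s")  # 47.2 MB/s
```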

What is wrong with it?

I use all four drives on the nforce4 sata controller of the Asus A8N-SLI Deluxe. This controller supports SATA II and should therefore manage 3Gb/s (roughly 300MB/s) per port. So, this is not the limiting factor.

----------

## bollucks

hdparm is not smart enough to test a raid array's performance; it is only testing one hard drive. Use a real benchmark like bonnie++ instead. Reiserfs is also not so good on raid.

----------

## R!tman

 *bollucks wrote:*   

> hdparm is not smart enough to test a raid array's performance; it is only testing one hard drive. Use a real benchmark like bonnie++ instead. Reiserfs is also not so good on raid.

 

I did not try bonnie++, indeed. But doesn't this say all?

```
$ time cat /scratch/big.file > /dev/null

real    0m47.697s

user    0m0.004s

sys     0m3.984s

 $ du -h /scratch/big.file

2.2G    /scratch/big.file
```

[edit]

bonnie++ stats come in some minutes.

[/edit]

----------

## R!tman

I do not really know what this all means. I just used some example bonnie++ commands from the internet.

 *bollucks wrote:*   

> Reiserfs is also not so good on raid.

 

What is a good raid 5 filesystem?

[edit]And what is a good chunk size?[/edit]

So here are the bonnie++ stats:

```
# bonnie++ -u root -s 1024 -r 512 -n 5

Using uid:0, gid:0.

Writing a byte at a time...done

Writing intelligently...done

Rewriting...done

Reading a byte at a time...done

Reading intelligently...done

start 'em...done...done...done...done...done...

Create files in sequential order...done.

Stat files in sequential order...done.

Delete files in sequential order...done.

Create files in random order...done.

Stat files in random order...done.

Delete files in random order...done.

Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-

Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--

Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP

clark            1G   364  99 80414  19 21170   5  2124  97 59295   8 959.7   9

Latency             31730us    4122ms     233ms   16482us   34591us    3631ms

Version 1.93c       ------Sequential Create------ --------Random Create--------

clark               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--

              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP

                  5 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++

Latency             40947us    1526us    1088us   27593us      83us     603us

1.93c,1.93c,clark,1,1110409349,1G,,364,99,80414,19,21170,5,2124,97,59295,8,959.7,9,5 \

,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++, 31730us,4122ms, \

233ms,16482us,34591us,3631ms,40947us,1526us,1088us,27593us,83us,603us,1730us, \

4122ms,233ms,16482us,34591us,3631ms,40947us,1526us,1088us,27593us,83us,603us
```

and

```
bash-2.05b# bonnie++ -u root -s 1 -r 0 -n 2 -d /test

Using uid:0, gid:0.

Writing a byte at a time...done

Writing intelligently...done

Rewriting...done

Reading a byte at a time...done

Reading intelligently...done

start 'em...done...done...done...done...done...

Create files in sequential order...done.

Stat files in sequential order...done.

Delete files in sequential order...done.

Create files in random order...done.

Stat files in random order...done.

Delete files in random order...done.

Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-

Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--

Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP

clark            1M   351  98 +++++ +++ +++++ +++ +++++ +++ +++++ +++ 10657  20

Latency             33544us      32us      35us    4815us      11us     677ms

Version 1.93c       ------Sequential Create------ --------Random Create--------

clark               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--

              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP

                  2 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++

Latency               144us      10us     818us     263us      10us     588us

1.93c,1.93c,clark,1,1110410176,1M,,351,98,+++++,+++,+++++,+++,+++++,+++,+++++,+++, \

10657,20,2,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,33544us, \

32us,35us,4815us,11us,677ms,144us,10us,818us,263us,10us,588us
```

and

```
bash-2.05b# bonnie++ -x 3 -u 0 -n1

Using uid:0, gid:0.

format_version,bonnie_version,name,file_size,io_chunk_size,putc,putc_cpu,put_block,put_block_cpu, \

rewrite,rewrite_cpu,getc,getc_cpu,get_block,get_block_cpu,seeks,seeks_cpu,num_files,max_size, \

min_size,num_dirs,file_chunk_size,seq_create,seq_create_cpu,seq_stat,seq_stat_cpu,seq_del, \

seq_del_cpu,ran_create,ran_create_cpu,ran_stat,ran_stat_cpu,ran_del,ran_del_cpu,putc_latency, \

put_block_latency,rewrite_latency,getc_latency,get_block_latency,seeks_latency,seq_create_latency, \

seq_stat_latency,seq_del_latency,ran_create_latency,ran_stat_latency,ran_del_latency

Writing a byte at a time...done

Writing intelligently...done

Rewriting...done

Reading a byte at a time...done

Reading intelligently...done

start 'em...done...done...done...done...done...

Create files in sequential order...done.

Stat files in sequential order...done.

Delete files in sequential order...done.

Create files in random order...done.

Stat files in random order...done.

Delete files in random order...done.

1.93c,1.93c,clark,1,1110410072,2G,,319,99,77938,19,16697,4,1731,96,43070,7,410.5,7,1,,,,, \

+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,38190us,2932ms,340ms, \

26782us,51842us,3942ms,41788us,121us,531us,171us,10us,278us

Writing a byte at a time...done

Writing intelligently...done

Rewriting...done

Reading a byte at a time...done

Reading intelligently...done

start 'em...done...done...done...done...done...

Create files in sequential order...done.

Stat files in sequential order...done.

Delete files in sequential order...done.

Create files in random order...done.

Stat files in random order...done.

Delete files in random order...done.

1.93c,1.93c,clark,1,1110410072,2G,,328,99,61914,18,18969,5,1714,95,23683,4,329.7,6,1,,,,, \

+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,36842us,775ms,385ms, \

28124us,109ms,3964ms,38056us,28us,367us,1451us,112us,10225us

Writing a byte at a time...done

Writing intelligently...done

Rewriting...done

Reading a byte at a time...done

Reading intelligently...done

start 'em...done...done...done...done...done...

Create files in sequential order...done.

Stat files in sequential order...done.

Delete files in sequential order...done.

Create files in random order...done.

Stat files in random order...done.

Delete files in random order...done.

1.93c,1.93c,clark,1,1110410072,2G,,299,95,58416,17,13012,3,1540,95,18668,3,342.3,6,1,,,,, \

+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,52042us,788ms,258ms, \

50056us,148ms,4029ms,36770us,10us,7215us,261us,123us,540us
```

----------

## R!tman

I did another test with iozone. Here the raid looks quite good compared to my old system with only one hd.

old system:

```
# iozone -s 4096

        Iozone: Performance Test of File I/O

                Version $Revision: 3.226 $

                Compiled for 32 bit mode.

                Build: linux 

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins

                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss

                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,

                     Randy Dunlap, Mark Montague, Dan Million, 

                     Jean-Marc Zucconi, Jeff Blomberg,

                     Erik Habbinga, Kris Strecker.

        Run began: Thu Mar 10 12:43:36 2005

        File size set to 4096 KB

        Command line used: iozone -s 4096

        Output is in Kbytes/sec

        Time Resolution = 0.000001 seconds.

        Processor cache size set to 1024 Kbytes.

        Processor cache line size set to 32 bytes.

        File stride size set to 17 * record size.

                                                            random  random    bkwd  record  stride                                   

              KB  reclen   write rewrite    read    reread    read   write    read rewrite    read   fwrite frewrite   fread  

freread

            4096       4  104049  314064   512255   486165  397089  266838  392078  385437  394798   278637   286135  468387   

451354

iozone test complete.
```

and the raid 5 system:

```
# iozone -s 4096 

        Iozone: Performance Test of File I/O

                Version $Revision: 3.226 $

                Compiled for 64 bit mode.

                Build: linux-AMD64 

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins

                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss

                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,

                     Randy Dunlap, Mark Montague, Dan Million, 

                     Jean-Marc Zucconi, Jeff Blomberg,

                     Erik Habbinga, Kris Strecker.

        Run began: Thu Mar 10 12:27:55 2005

        File size set to 4096 KB

        Command line used: iozone -s 4096

        Output is in Kbytes/sec

        Time Resolution = 0.000001 seconds.

        Processor cache size set to 1024 Kbytes.

        Processor cache line size set to 32 bytes.

        File stride size set to 17 * record size.

                                                            random  random    bkwd  record  stride                     

              

              KB  reclen   write rewrite    read    reread    read   write    read rewrite    read   fwrite frewrite   

fread  freread

            4096       4  323947 1037443  2026691  2119975 2020731 1160033 1967279 1410868 1952300   297132   985321 15

76018  2020731

iozone test complete.
```

Here the raid seems to perform quite well, but I would like to see this performance in daily use too.  :Sad: 

----------

## firetwister

ReiserFS had problems (filesystem corruption) on a software raid5; I thought these were solved. bollucks, do you know something more about the disadvantages of reiserfs on raid?

R!tman, how high is the cpu load? I had a raid0 on 2 ata drives and it was about 10-30 percent on an 850mhz athlon. Hdparm performed worse on the raid0 than on a single drive; I think that's usual. My experiences with the raid were not so good: the performance was certainly not double that of a single drive, maybe a third to half faster, but the (onboard) controller was bound to the pci bus, which certainly was the bottleneck.

Did you have a look at your dmesg for something like that?

```
sdb: assuming drive cache: write through
```

That would be quite unusual, but it could explain part of your performance problem; "write back" is the desired mode for best performance. My notebook harddrive copies big files nearly as fast as your raid! Keeping in mind that the access time increases a little bit while the transfer rates should dramatically increase, you can be damn sure there is something wrong  :Wink: 

Did you repeat the tests under different circumstances? E.g. removing all unnecessary modules (sound, dri, lan, ...). I managed to get over 170 buffer underruns while burning (a cd at 16x) and listening to music at the same time, but that was with an os I'm very happy to have gotten rid of  :Very Happy:  ! 

I found something interesting on the Asus page  *Quote:*   

> The nForce4 chipset incorporated four Serial ATA and two parallel connectors with high performance RAID functions in RAID 0, RAID 1, RAID 0+1 and JBOD. The Silicon Image controller provides another four Serial ATA connectors for RAID 0, RAID 1, RAID 10, and RAID 5 functions.

  If that is a real raid5 controller with an xor engine and so on, you should really give it a try, but I strongly assume that the parity calculations are still done by the cpu. http://www.tweakers.net/reviews/557/1 recently made a raid5 benchmark with several controllers; the cpu load varied between 1.22% and 31.90%!

Maybe lspci and google will help you find that out.

----------

## bollucks

 *firetwister wrote:*   

> ReiserFS had problems (filesystem corruption) on a software raid5; I thought these were solved. bollucks, do you know something more about the disadvantages of reiserfs on raid?

 

No it was just a recent comment somewhere on lkml... Yeah I know how little help that is.

----------

## R!tman

 *firetwister wrote:*   

> R!tman how high is the cpu load?

 

CPU load is 0% to 1% when idle. When the system was reconstructing the raid, cpu load NEVER went over 20%, most times 12%-16%.

Oh, regarding the reconstruction: while the raid was being reconstructed, gkrellm showed me more than 30MB/s for EACH of the hds. But when copying files, I get about 40MB/s, so 10MB/s for each hd.

 *firetwister wrote:*   

> Did you have a look at your dmesg for something like that?
> 
> ```
> sdb: assuming drive cache: write through
> ```
> ...

 

```
# dmesg | grep write

PCI bridge 00:09 from 10de found. Setting "noapic". Overwrite with "apic"

SCSI device sda: drive cache: write back

SCSI device sdb: drive cache: write back

SCSI device sdc: drive cache: write back

SCSI device sdd: drive cache: write back
```

 *firetwister wrote:*   

> Did you repeat the tests under different circumstances? E.g. removing all unnecessary modules (sound, dri, lan, ...). I managed to get over 170 buffer underruns while burning (a cd at 16x) and listening to music at the same time, but that was with an os I'm very happy to have gotten rid of  ! 

 

No, I did not try that. I only tried booting with the livecd 2004.3. Then I did

```
mdadm --assemble /dev/md1 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```

But the results were the same.

 *firetwister wrote:*   

> I found something interesting on the Asus page  *Quote:*   The nForce4 chipset incorporated four Serial ATA and two parallel connectors with high performance RAID functions in RAID 0, RAID 1, RAID 0+1 and JBOD. The Silicon Image controller provides another four Serial ATA connectors for RAID 0, RAID 1, RAID 10, and RAID 5 functions.  If that is a real raid5 controller with an xor engine and so on, you should really give it a try, but I strongly assume that the parity calculations are still done by the cpu. http://www.tweakers.net/reviews/557/1 recently made a raid5 benchmark with several controllers; the cpu load varied between 1.22% and 31.90%!
> 
> Maybe lspci and google will help you find that out.

 

The Silicon Image controller is just another of those pseudo hardware raid controllers, but I will check out that link.

----------

## firetwister

You can adjust the maximum speed for reconstruction; afaik the default is not the maximum. For those who don't know how to adjust this in /proc (including me  :Wink: ), it can be done with powertweak.

 *Quote:*   

> But when copying files, I get about 40MB/s, so 10MB/s for each hd.

 

A 300MB file copied to the array will be "split" across the disks: 300MB of data plus 100MB of parity, so roughly 100MB per disk. You probably knew that, but it wasn't clear in the above statement.
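To make that concrete, a quick sketch of the arithmetic for a 4-disk raid5, where each stripe holds 3 data chunks plus 1 parity chunk:

```python
# Data + parity spread for a 300MB file on a 4-disk RAID 5 (sketch).
disks = 4
data_mb = 300
parity_mb = data_mb / (disks - 1)            # 100.0 MB of parity
per_disk_mb = (data_mb + parity_mb) / disks  # 100.0 MB written to each disk
print(parity_mb, per_disk_mb)  # 100.0 100.0
```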

I think there are few people complaining about ext3, in case you feel uncomfortable with reiserfs  :Wink: 

Unfortunately I have no more ideas. Damn, I wanted to buy a nforce pro board and build a raid5 :evil:

----------

## R!tman

 *firetwister wrote:*   

> You can adjust the maximum speed for reconstruction; afaik the default is not the maximum. For those who don't know how to adjust this in /proc (including me ), it can be done with powertweak.

 

Thanks, I didn't know that.

 *firetwister wrote:*   

>  *Quote:*   But when copying files, I get about 40MB/s, so 10MB/s for each hd. 
> 
> A 300MB file copied to the array will be "split" across the disks: 300MB of data plus 100MB of parity, so roughly 100MB per disk. You probably knew that, but it wasn't clear in the above statement.

 

That 40MB/s with 10MB/s per hd was when I did this

```
cat /scratch/big.file > /dev/null
```

 *firetwister wrote:*   

> I think there are few people complaining about ext3, in case you feel uncomfortable with reiserfs 
> 
> Unfortunately I have no more ideas. Damn, I wanted to buy a nforce pro board and build a raid5:evil:

 

I hope I can get an external hd from a friend. Then I will copy my system there and do some further testing. I will rebuild the raid with different chunk sizes and different filesystems (including reiser4). 

I would appreciate command suggestions for iozone and bonnie++. Best would be some that do not take too long, maybe approximately 10 minutes.

Thanks for posting, I will keep you up to date.

[edit]Oh, do you know a way to get hdparm fully working with sata drives? I read some posts here in the forum but could not find specific information (so far). 

The strange thing is, I saw people being able to simply use

```
hdparm -i /dev/sda
```

This does not work for me, I get some errors. I cannot post them, because I am not at home currently. 

[/edit]

----------

## firetwister

Do you get errors like this?

```
Jan 17 14:04:54 ryqvy kernel: raid5: switching cache buffer size, 4096 --> 1024

Jan 17 14:04:58 ryqvy kernel: raid5: switching cache buffer size, 1024 --> 4096
```

Or, in other words: do you have filesystems with different block sizes on the raid?

----------

## MaDDeePee

 *bollucks wrote:*   

> hdparm is not smart enough to test a raid array's performance; it is only testing one hard drive. Use a real benchmark like bonnie++ instead. Reiserfs is also not so good on raid.

 

Yes, but it's still enough to compare whether something is wrong in your system.

Sorry, I can't explain what your problem is, but I can confirm that you MUST get beyond my RAID0 results with your RAID5! Even in hdparm! (I'm also on reiserFS)

```
bash-2.05b# hdparm -tT /dev/md1

/dev/md1:

 Timing cached reads:   3852 MB in  2.00 seconds = 1926.29 MB/sec

 Timing buffered disk reads:  282 MB in  3.01 seconds =  93.61 MB/sec

bash-2.05b# hdparm -tT /dev/md2

/dev/md2:

 Timing cached reads:   3728 MB in  2.00 seconds = 1863.35 MB/sec

 Timing buffered disk reads:  312 MB in  3.01 seconds = 103.64 MB/sec

bash-2.05b#

```

----------

## firetwister

more interesting stuff  :Smile: 

 *Quote:*   

> RAID-5
> 
> On RAID-5, the chunk size has the same meaning for reads as for RAID-0. Writing on RAID-5 is a little more complicated: When a chunk is written on a RAID-5 array, the corresponding parity chunk must be updated as well. Updating a parity chunk requires either
> 
>     * The original chunk, the new chunk, and the old parity block
> ...

 

http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO-5.html
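The parity itself is plain XOR over the data chunks of a stripe, which is why a small write needs those extra reads; a minimal sketch with made-up chunk contents:

```python
# RAID-5 parity is the XOR of all data chunks in a stripe. A small write
# reads the old chunk and old parity, then writes the new chunk and
#   new_parity = old_parity XOR old_chunk XOR new_chunk
from functools import reduce

def parity(chunks):
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*chunks))

old = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
p_old = parity(old)

new_chunk = bytes([40, 50, 60])  # overwrite chunk 1 only
p_new = bytes(a ^ b ^ c for a, b, c in zip(p_old, old[1], new_chunk))

# same result as recomputing parity over the whole stripe:
assert p_new == parity([old[0], new_chunk, old[2]])
print("read-modify-write parity matches a full recompute")
```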

----------

## firetwister

Maybe I found it!

On the Linux Raid mailing list there are some posts about poor raid5 performance in Linux 2.6; under some conditions 2.4 is twice as fast! I think you would get some problems with an nforce4 board and 2.4, though.

http://marc.theaimsgroup.com/?l=linux-raid&m=110288290029516&w=2

----------

## R!tman

 *firetwister wrote:*   

> Do you get errors like this? 
> 
> ```
> Jan 17 14:04:54 ryqvy kernel: raid5: switching cache buffer size, 4096 --> 1024
> 
> ...

 

No, I have only reiserfs on it.

 *MaDDeePee wrote:*   

> Sorry, I can't explain what your problem is, but I can confirm that you MUST get beyond my RAID0 results with your RAID5! Even in hdparm! (I'm also on reiserFS)

 

I totally agree with you. The performance of the raid 5 should not be THIS bad. Like I said, even an old ata drive was faster than this raid 5 with 4 pretty fast new sata drives.

 *firetwister wrote:*   

> more interesting stuff
> 
> ...

 

I had read quite a few raid guides and knew most of that already. Only in one of the often-linked sw-raid-howtos do they use a 64kB chunk size instead of 128kB. But this alone will not cause the raid to be THAT slow. 

BTW, 64kB is the standard chunk size of mdadm.

 *firetwister wrote:*   

> Maybe I found it!
> 
> On the Linux Raid mailing list there are some posts about poor raid5 performance in Linux 2.6; under some conditions 2.4 is twice as fast! I think you would get some problems with an nforce4 board and 2.4, though.
> 
> http://marc.theaimsgroup.com/?l=linux-raid&m=110288290029516&w=2

 

That is interesting. But I will not go back to a 2.4 kernel  :Smile: .

Tomorrow morning I will know whether or not I get an external hd to back up my system and thus be able to run some tests.

Any command lines for bonnie++ and/or iozone would be very welcome.

Also chunk-, block- and stride-size suggestions, as well as filesystem suggestions to test.

----------

## R!tman

I am testing right now...

In a raid0 configuration with standard settings I get over 200MB/s in hdparm when using reiser4.

With standard settings, reiser4 and raid5, I get 60MB/s.

But testing the single drives, I get about 65MB/s each.

[edit]

OMG, using a chunk size of 128k instead of the 64k standard, with reiser4, I get 159MB/s. YEAH. 

Some further tuning will be done  :Very Happy: . 

I would never have expected that the chunk size has THAT much of an influence.
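The chunk size sets how much data one full stripe carries; writes of at least a full stripe can compute parity without reading anything back. A quick sketch for my 4 drives:

```python
# Data carried by one full stripe on a 4-disk RAID 5: (disks - 1) chunks.
disks = 4
for chunk_kb in (64, 128):
    print(f"chunk {chunk_kb}k -> {(disks - 1) * chunk_kb}k data per full stripe")
```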

[/edit]

[edit2]

 :Sad: 

I still get that ~160MB/s, but only while the system is reconstructing the raid. When it finishes that and is ready for use, it drops to about 80MB/s.

What the hell is wrong?

[/edit2]

----------

## yottabit

 *R!tman wrote:*   

> I hope I can get an external hd from a friend. Then I will copy my system there and do some further testing. I will rebuild the raid with different chunk sizes and different filesystems (including reiser4). 

 

Just hang tight. I've already been benchmarking SATA softRAID-0 performance of various chunk sizes (4 KB to 4 MB) for sequential reads/writes of files from 64 KB to 2 GB for ext2, ext3, Reiser3.6, Reiser4, JFS, XFS, SMB, CIFS, and VFAT  :Twisted Evil: .

I'll be done in a few days to a week.

J

----------

## R!tman

 *yottabit wrote:*   

>  *R!tman wrote:*   I hope I can get an external hd from a friend. Then I will copy my system there and do some further testing. I will rebuild the raid with different chunk sizes and different filesystems (including reiser4).  
> 
> Just hang tight. I've already been benchmarking SATA softRAID-0 performance of various chunk sizes (4 KB to 4 MB) for sequential reads/writes of files from 64 KB to 2 GB for ext2, ext3, Reiser3.6, Reiser4, JFS, XFS, SMB, CIFS, and VFAT .
> 
> I'll be done in a few days to a week.
> ...

 

Would you please post some stats, even if they are not finished yet? You can write a personal message if you like.

Although, the main problem on my machine seems to be raid5. Raid0 performs pretty well, in my eyes. I get up to 220MB/s with hdparm. Copying a big file was about 145MB/s, even when using Reiser4.

[edit]

This little raid0 test proves that the sata controller has the necessary bandwidth.

[/edit]

----------

## yottabit

Okay, here are the prelim stats.

Notable Notes:

- All KB/MB/GB numbers are base-2, not base-10.
- Results are KB/s first-instance write,read of a 2 GB file with maximum record length (64 MB), as reported by Iozone 3.226 with kernel caching disabled (full Iozone parameters: -aIMopg 2g -i 0 -i 1). The goal was to test performance with very large sequential reads & writes. More detailed results, including small files and small record lengths, are available in the individual test sheets.
- "Native" means the standalone test disk without any striping.
- Stripe sizes outside the 4K to 4096K range are not supported by the md driver in stock kernels.
- Reiser 3.6 was formatted with the default 4096 block size because an 8192 block size refused to mount.
- Reiser 4, at this time, does not seem to support disabling the kernel cache, which resulted in the drastic performance deviations from the other filesystems (the write performance is likely close, but the read performance is obviously not even possible for these disks' interface specifications; see the detailed log). Also note that Reiser 4 seems to have some seriously disturbing shortcomings compared to Reiser 3.6 regarding large file accesses.
- VFAT also doesn't support disabling the kernel cache, go figure, so I set the minimum and maximum file size to 2 GB to work around kernel caching problems I've had in the past.
- SMBFS and CIFS performance was measured to a remote WinXP workstation over directly-connected Gigabit Ethernet (jumbo frames, 64K TCP window size), with storage on an IBM/Hitachi 180GXP 180 GB 7200 RPM 8 MB cache drive in an external Firewire400/IEEE1394 enclosure with an Oxford 911 ATA bridge. Moo. 3:-o

Hardware:

```
Processor      AMD Athlon XP 2100+

Core Logic     nVidia nForce2 (ASUS A7N8X-Deluxe)

Memory         Corsair (Samsung), 1 GB, DDR266 (3 modules, multi-channel)

Controller     Promise S150 TX4 PCI

Test Disks     (2) Hitachi 7K250, 250 GB, SATA-100, 7200 RPM, 8 MB Cache

NIC            D-Link DGE-530T, Gigabit Ethernet, Point-to-Point, Jumbo Frame 9000

Kernel         Gentoo Linux 2.6.11-mm2

I/O Scheduler  Deadline
```

And the stats so far:

```
       ext2         ext3         Reiser 3.6   Reiser 4    JFS          XFS         VFAT :o)

Native 35797,59364  35116,54926  31529,55640  27772,56984 38272,59910  38648,59908 17450,58635

4K     39148,116473 38712,106688 33801,108084 28592,62522 39098,116346
```

(Sorry about any line-wrapping. I can't seem to get HTML to turn on for the post so I can use tables. I guess the admins have HTML code off.)

Keep in mind that these stats are for sequential writes/reads of huge files, per my testing requirement. The stats for small files will also be available in the detailed data of my report.

J

----------

## dannysauer

 *bollucks wrote:*   

>  *firetwister wrote:*   reiserfs had problems (file system corruption) on a softraid5 I thought these were solved. bollucks do you know something more about the disadvantages of reiserf on raid? 
> 
> No it was just a recent comment somewhere on lkml... Yeah I know how little help that is.

 

FYI, I currently have a backup system with 80 million files in 5 million directories running under an LVM-partitioned software RAID5 (on firewire disks, which suck ass re: speed, BTW).  The filesystem is reiser, and I've had 0 problems with the nightly snapshots on this volume.  I've been running another software RAID - with ReiserFS - under less stressful conditions for over 5 years now, also with no problems (except that I really need to update that machine someday soon).  The ext2 system on another partition is just fine, and there's no significant difference in performance between any of those, though an unclean shutdown and the ensuing fsck will definitely favor the reiserfs partition.  :Smile: 

Yeah, there was a problem with a specific release of Reiser and software RAID, but that was fixed pretty quickly.  Reiser has been totally stable for me on lots of systems for a long time, is super-awesome with small files, and has some of the best recovery tools around - but there's always someone willing to spread some FUD (usually because they dislike a Reiser developer, from what I've seen).  That's probably what you read - 'cause there are a few kernel developers who really dislike good ol' Hans 'n crew.

----------

## R!tman

hmm... I found out interesting new things.

This raid 5

```
mdadm -C -v /dev/md1 -l5 -n4 -c128 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```

had these two benchmark results:

writing (good):

```
time dd if=/dev/zero of=/mnt/gentoo/test5.tmp bs=1024k count=4096

44.790s => 91MB/s 
```

reading (really bad)

```
time dd of=/dev/null if=/mnt/gentoo/test5.tmp bs=1024k count=4096

2m39.804s => 25MB/s
```

and the same with this raid0:

```
mdadm -C -v /dev/md2 -l0 -n4 -c64 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
```

writing:

```
time dd if=/dev/zero of=/mnt/gentoo/scratch/test0.tmp bs=1024k count=4096

28.838s => 142MB/s
```

reading:

```
time dd of=/dev/null if=/mnt/gentoo/scratch/test0.tmp bs=1024k count=4096

54.247s => 75MB/s
```

Usually, read speed should be greater than write speed!
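For reference, the MB/s figures above are just bs * count = 4096 MB divided by the elapsed real time:

```python
# dd moved bs * count = 1024k * 4096 = 4096 MB in each run above.
total_mb = 4096

def mb_per_s(minutes, seconds):
    return total_mb / (minutes * 60 + seconds)

print(f"raid5 write: {mb_per_s(0, 44.790):.1f} MB/s")  # 91.4
print(f"raid5 read:  {mb_per_s(2, 39.804):.1f} MB/s")  # 25.6
print(f"raid0 write: {mb_per_s(0, 28.838):.1f} MB/s")  # 142.0
print(f"raid0 read:  {mb_per_s(0, 54.247):.1f} MB/s")  # 75.5
```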

----------

## yottabit

Are you doing this from LiveCD? IIRC the LiveCD loads fs drivers with debug options on... that may have something to do with it.

You're right, read speed should be much faster than write speed, array or not. And all my benchmarking so far confirms.

----------

## R!tman

 *yottabit wrote:*   

> Are you doing this from LiveCD? IIRC the LiveCD loads fs drivers with debug options on... that may have something to do with it.
> 
> You're right, read speed should be much faster than write speed, array or not. And all my benchmarking so far confirms.

 

Yes, the 2004.3 livecd for amd64 and a reiser4-enabled one for amd64.

----------

## yottabit

I would recommend against Reiser4 (Reiser 3.6 is great, though), and benchmarking should be done from an installed system simply because I believe the LiveCD enables debug modes in the filesystem drivers (which slow them down).

If you're looking for a generally good filesystem to choose for your installation, I would highly recommend Reiser 3.6 (reiserfs) or the good ol' ext3. I've been using Reiser 3.6 for many many years now and I love its benefits.

If you're going to be storing primarily huge files, go with JFS or XFS. (XFS seems to just barely out-perform JFS so far in my benchmarking.)

----------

## R!tman

Like firetwister already mentioned in a personal message (thank you), 2.4 kernels seem to handle raid5 better. 

Here are a few tests I made with a livecd that contains several kernels. 

I did this to test reading

```
time dd if=/mnt/test.tmp of=/dev/null bs=1024k count=4000
```

and this for writing:

```
time dd of=/mnt/test.tmp if=/dev/zero bs=1024k count=4000
```

And here are the results:

```
kernel 2.4.7:

Raid5: reading: 0m28.300s writing: 0m45.355s

Raid0: reading: 0m30.414s writing: 0m27.254s

2.6.7 kernel:

Raid5: reading: 1m9.246s writing: 0m46.719s

Raid0: reading: 0m30.848s writing: 0m36.378s

my kernel (2.6.9-r14, no livecd):

Raid5: reading: 1m7.024s writing: 1m3.255s

Raid0: reading: 0m30.362s writing: 0m31.161s
```
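Converting the raid5 read times to MB/s (4000 MB per run) shows the gap:

```python
# Each dd run moved bs * count = 1024k * 4000 = 4000 MB.
raid5_reads = {
    "2.4.7":     28.300,
    "2.6.7":     69.246,
    "2.6.9-r14": 67.024,
}
for kernel, secs in raid5_reads.items():
    print(f"{kernel}: {4000 / secs:.0f} MB/s raid5 read")
```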

Something is wrong with 2.6 kernels in combination with raid5.

I would like to test a 2.4 kernel from my own system, no livecd. But I think I will run into trouble, because I run a udev-only system. Am I right?

----------

## apanjocko

i am reading up on this as i am designing a fileserver on my own,

and it DEFINITELY seems to be the 2.6 kernel tree that is causing the problem. i'm quite sure i will run 2.4 on it and be happy.

hope i don't miss out on a lot of nifty features  :Smile: 

/d

----------

## dannysauer

 *apanjocko wrote:*   

> i am reading up on this as i am designing a fileserver on my own,
> 
> and it DEFINITELY seems to be the 2.6 kernel tree that is causing the problem. i'm quite sure i will run 2.4 on it and be happy.
> 
> hope i don't miss out on a lot of nifty features 

 

You'll miss out on better scheduling and driver updates, in exchange for a speed difference that's not much in practice (if you're pegging throughput numbers frequently, you should really spend the extra couple hundred bucks and get a 3Ware RAID card - I'm using both where appropriate, and the 3Ware cards rock).  Also note that a RAID will behave differently with one large file as compared to several small files.  You generally want a larger stripe size for a bunch of small files, and a smaller stripe for large files.

Either way, it's worth noting that the md maintainer is aware of the problem, and is actively trying to solve it.

home page: http://cgi.cse.unsw.edu.au/~neilb/SoftRaid

talking about RAID5 performance in 2.6: http://cgi.cse.unsw.edu.au/~neilb/01102979338

----------

## rasmussen

BTW, is your RAID controller on the PCI-bus?

With my Promise SATAII TX4 I found that increasing the latency using setpci can improve the throughput with 5-10MB.
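
Something like this, in case it helps (the 02:04.0 address is just a placeholder - look up your controller's address with lspci first):

```shell
# Find the SATA controller's PCI address
lspci | grep -i sata

# Read the current latency timer (value is printed in hex)
setpci -s 02:04.0 latency_timer

# Raise it (e.g. to 0xb0) so the card can hold the PCI bus
# longer per burst; needs root, and the right value is
# hardware-dependent, so experiment carefully.
setpci -s 02:04.0 latency_timer=b0
```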

----------

## R!tman

 *rasmussen wrote:*   

> BTW, is your RAID controller on the PCI-bus?
> 
> With my Promise SATAII TX4 I found that increasing the latency using setpci can improve the throughput with 5-10MB.

 

I hope you meant me with "your..."  :Smile: . 

But no, it is part of the nforce4 chipset. 

Sata2 on PCI only seems kind of weird in my eyes. Isn't PCI much too slow for SATA2? 

That will probably not play a role with 1 or 2 drives on the controller, but with e.g. 4 like I have, the drives are faster than the PCI bus.

Please correct me if I am wrong.

----------

## dannysauer

 *R!tman wrote:*   

> Sata2 on PCI only seems kind of weird in my eyes. Isn't PCI much too slow for SATA2? 
> 
> That will probably not play a role with 1 or 2 drives on the controller, but with e.g. 4 like I have, the drives are faster than the PCI bus.
> 
> Please correct me if I am wrong.

 

The SATA controllers on a motherboard are generally connected to the processor via an internal PCI bus.  At 64 bits of bus width and 66MHz, typical 64-bit PCI is capable of roughly 500MB/s (that's megabytes).  Even at 32 bits wide with the same 66MHz clock, there's about 264MB/s of bandwidth there.  PCI-X runs at 133MHz and is 64 bits wide, which gets up to 1GB/s.
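
The arithmetic behind those figures is just clock times bus width (theoretical bus maxima; real-world throughput is lower):

```shell
# Peak bandwidth in MB/s = clock in MHz * bus width in bytes
echo "$((66 * 8)) MB/s"    # 64-bit/66MHz PCI    -> 528  (~500MB/s)
echo "$((66 * 4)) MB/s"    # 32-bit/66MHz PCI    -> 264
echo "$((133 * 8)) MB/s"   # 64-bit/133MHz PCI-X -> 1064 (~1GB/s)
```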

SATA2 is supposedly capable of 300MB/s per channel.  However, no drive out there yet comes anywhere close to filling that up with one drive per channel.  Even 32-bit PCI can handle several commonly available hard drives, and no one is seriously running a high-performance file server on a machine that limits them to 32-bit PCI anymore - most server-quality boards have at least 64-bit PCI.  :Smile: 

----------

## R!tman

 *dannysauer wrote:*   

>  *R!tman wrote:*   Sata2 on PCI only seems kind of weird in my eyes. Isn't PCI much too slow for SATA2? 
> 
> That will probably not play a role with 1 or 2 drives on the controller, but with e.g. 4 like I have, the drives are faster than the PCI bus.
> 
> Please correct me if I am wrong. 
> ...

 

Thanks for clearing that up  :Smile: 

----------

## geoffwa

 *Quote:*   

> The SATA controllers on a motherboard are generally connected to the processor via an internal PCI bus. At 64 bits of bus width and 66MHz, typical 64-bit PCI is capable of roughly 500MB/s (that's megabytes). Even at 32 bits wide with the same 66MHz clock, there's about 264MB/s of bandwidth there. PCI-X runs at 133MHz and is 64 bits wide, which gets up to 1GB/s.
> 
> 

 

Most modern motherboards have some SATA ports that hang off the north/south bridge. ICH6s have a whopping 4 SATA ports connected this way. On the other hand, if you have an onboard Silicon Image SATA controller, odds are it's going through the PCI bus just like everything else.

 *Quote:*   

> i am reading up on this as i am designing a fileserver on my own,
> 
> and it DEFINITELY seems to be the 2.6 kernel tree that is causing the problem. i'm quite sure i will run 2.4 on it and be happy.
> 
> hope i don't miss out on a lot of nifty features 
> ...

 

Although 2.6 may be slower than 2.4 for I/O operations, the impact of heavy I/O on the rest of the system is greatly reduced. One of the easier ways to hang a 2.4 machine is a script that forks and does recursive directory listings from root. That gums things up so badly the machine stops responding to network traffic. On a 2.6 machine the same thing has no noticeable effect (other than on hard disk reads/writes, of course).

----------

