# Software Raid5, 10x500gb drives poorly performing. Ideas?

## colyte

I'm fully aware of http://bugzilla.kernel.org/show_bug.cgi?id=12309, which really keeps me from enjoying my hardware.

But it can't explain away the performance I get.

Computer hardware: Intel Q9450, 4GB RAM. SMP enabled kernel.

Raid5 10x500GB hardware:

8x Seagate 7200.11 SATA2 drives as JBOD on an Areca ARC-1220, and 2x Seagate 7200.11 SATA2 drives on the Intel ICH9R.

hdparm doing its thing:

```
hdparm -Tt /dev/md0

/dev/md0:

 Timing cached reads:   11792 MB in  2.00 seconds = 5902.27 MB/sec

 Timing buffered disk reads:  1172 MB in  3.00 seconds = 390.43 MB/sec
```

Bonnie++ benchmarks:

http://www.dump.no/files/15939361214c/test.16gb.html

http://www.dump.no/files/15939361214c/test.8gb.html

The system has gone through tons of kernel upgrades. I guess the most stable parameter is my CFLAGS:

CFLAGS="-march=nocona -O2 -pipe -ftracer -funroll-loops -fpeel-loops -funswitch-loops"

Current kernel config: http://www.dump.no/files/15939361214c/kernel.config

Just swapped to gentoo-sources today; I'd been using vanilla sources for the last year or two. Most if not all relevant kernel options have been tried, IO schedulers and so forth.

Due to the bug mentioned up top I rebuilt my raid some time ago, went with 512k chunks and made ext3 with a matching stripe width. (I'm storing mostly files between 6GB-15GB).

(Reason for rebuilding, aforementioned bug + 6x500gb raid5 + XFS = total lockup)
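
For reference, here is how the stride/stripe-width arithmetic works out for this layout (a sketch; the mke2fs flags shown are illustrative, not the exact command used):

```shell
# ext3 stride/stripe-width arithmetic for a 512 KiB chunk, 4 KiB block,
# 10-disk RAID5 (one chunk per stripe holds parity, so 9 data disks):
CHUNK_KB=512
BLOCK_KB=4
DISKS=10
DATA_DISKS=$((DISKS - 1))

STRIDE=$((CHUNK_KB / BLOCK_KB))         # 512 / 4 = 128 blocks
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))   # 128 * 9 = 1152 blocks

echo "stride=$STRIDE stripe-width=$STRIPE_WIDTH"
# A matching (hypothetical) mkfs invocation would look something like:
echo "mke2fs -j -b 4096 -E stride=$STRIDE,stripe-width=$STRIPE_WIDTH /dev/md0"
```

Note that the tune2fs output below reports a RAID stripe width of 1280 (stride x 10); mke2fs normally computes stripe-width over the data disks only (9 for RAID5), so that value may be worth double-checking.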

```
4395451392 blocks level 5, 512k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
```

tune2fs -l /dev/md0 output:

```
tune2fs 1.41.3 (12-Oct-2008)

Filesystem volume name:   <none>

Last mounted on:          <not available>

Filesystem UUID:          c17d4488-34bb-4678-b50a-47afdfcd26aa

Filesystem magic number:  0xEF53

Filesystem revision #:    1 (dynamic)

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

Filesystem flags:         signed_directory_hash 

Default mount options:    journal_data

Filesystem state:         clean

Errors behavior:          Continue

Filesystem OS type:       Linux

Inode count:              274718720

Block count:              1098862848

Reserved block count:     10988628

Free blocks:              26995488

Free inodes:              274539282

First block:              0

Block size:               4096

Fragment size:            4096

Reserved GDT blocks:      762

Blocks per group:         32768

Fragments per group:      32768

Inodes per group:         8192

Inode blocks per group:   512

RAID stride:              128

RAID stripe width:        1280

Filesystem created:       Tue May 20 21:48:53 2008

Last mount time:          Sun Jan 25 14:35:09 2009

Last write time:          Sun Jan 25 14:35:09 2009

Mount count:              6

Maximum mount count:      -1

Last checked:             Fri Jan 16 06:10:38 2009

Check interval:           15552000 (6 months)

Next check after:         Wed Jul 15 07:10:38 2009

Reserved blocks uid:      0 (user root)

Reserved blocks gid:      0 (group root)

First inode:              11

Inode size:             256

Journal inode:            8

Default directory hash:   tea

Directory Hash Seed:      f6ed98c9-ef49-4706-900f-493cc064820f

Journal backup:           inode blocks
```

Any ideas why the performance is so shitty?

I'm using software RAID even though I have a perfectly nice hardware RAID controller, because I like the flexibility and control. I also have CPU power to spare, so it's not really an issue.

Must say, Opensolaris + ZFS looks mighty fine compared to this. That coming from a '00 gentoo'er.

----------

## colyte

Shamelessly bumping.

----------

## NeddySeagoon

colyte,

Show us your lspci and explain how the drives are connected.

To read any data at all, the system has to read 9 drives. What parallelism is there for drive access?

----------

## colyte

lspci:

```
00:00.0 Host bridge: Intel Corporation 82X38/X48 Express DRAM Controller (rev 01)

00:01.0 PCI bridge: Intel Corporation 82X38/X48 Express Host-Primary PCI Express Bridge (rev 01)

00:06.0 PCI bridge: Intel Corporation 82X38/X48 Express Host-Secondary PCI Express Bridge (rev 01)

00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)

00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)

00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)

00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)

00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 02)

00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 02)

00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)

00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)

00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)

00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)

00:1f.0 ISA bridge: Intel Corporation 82801IR (ICH9R) LPC Interface Controller (rev 02)

00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)

00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)

01:00.0 VGA compatible controller: nVidia Corporation GeForce 8800 GT (rev a2)

02:00.0 PCI bridge: Intel Corporation 80333 Segment-A PCI Express-to-PCI Express Bridge

02:00.2 PCI bridge: Intel Corporation 80333 Segment-B PCI Express-to-PCI Express Bridge

03:0e.0 RAID bus controller: Areca Technology Corp. ARC-1220 8-Port PCI-Express to SATA RAID Controller

05:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

07:01.0 Multimedia audio controller: C-Media Electronics Inc CMI8788 [Oxygen HD Audio]
```

I thought this explained how my drives were connected:

 *Quote:*   

> Raid5 10x500GB hardware:
> 
> 8xhard(Seagate 7200.11 SATA2) drives over JBOD on an Areca 1220 and 2x (Seagate 7200.11 SATA2) on Intel ICH9R

 

But in any case md0 consists of 10 drives.

8 drives connected to Areca-1220 which in turn is connected to the motherboard on a pci-e bus.

2 drives are connected to the motherboard itself on the ICH9R controller.

Not sure what you mean about parallelism, though, at least in this context. Do you mind elaborating?

----------

## NeddySeagoon

colyte,

There are your two controllers.

```
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02) 

03:0e.0 RAID bus controller: Areca Technology Corp. ARC-1220 8-Port PCI-Express to SATA RAID Controller 
```

The first pair of digits is the PCI bus number. This tells us that your two controllers are on different buses, which is good.

Bus 00 has to share its bandwidth with lots of other junk, which is bad, but they look like they are only low-bandwidth devices, so it's not as bad as it could have been.

As the onboard controller has 6 ports, I would try splitting the RAID drives 5 per controller, to see if that improves performance.

The Areca card can only pass data from one drive at a time over the bus, so even if it can address all 8 drives at the same time, which I doubt, data transport may still be an issue.

----------

## colyte

Thanks for the reply if nothing else.

Certainly I'll try to split the drives. If that gives me a leg up performance-wise, it certainly won't hurt.

(I won't be able to do this before tomorrow, but I'll run the tests I did earlier for comparison.)

But as I see it, it's not a real solution. It would be as if I were using the hardware RAID controller with 8 drives plugged in and a RAID5 array created on top. Areca or 3ware, for instance, would never be able to sell products with this kind of performance.

Something is awfully wrong somewhere, and it deeply pains me. Take into account the aforementioned IO bug along with problems like this, and it more or less excludes Linux from being anything but a desktop box, ironically enough.

----------

## Akkara

 *colyte wrote:*   

> Certainly I'll try to split the drives. If that gives me a leg up performance wise it certainly wont hurt.
> 
> [...]
> 
> But as far as I see it, it's not a real solution. As if I were using the hardware raid controller, with 8 drives plugged in with a raid5 array created on it. Areca or 3ware for instance would never be able to sell products with this type performance. 
> ...

 

Well, in fairness, even if you try to get the Areca to do raid in hardware, it can't know about the other two drives hanging off the motherboard controllers, and sooner or later all the data will have to traverse the busses to make a 10-drive raid.

You *might* get better performance with a raid 5+0 arrangement with each raid set on its own controller, but I'm no expert in raid to know for sure.

 *Quote:*   

> ...ext3... (I'm storing mostly files between 6GB-15GB)

 

I would recommend looking into XFS for this. It has live defragmentation to keep your data contiguous. (But bear in mind that one of the older kernels, I think one of the 2.6.19's, had an XFS corruption bug that has since been fixed.)

----------

## colyte

As I said in my initial post:

 *Quote:*   

> Due to the bug mentioned up top I rebuilt my raid some time ago, went with 512k chunks and made ext3 with a matching stripe width. (I'm storing mostly files between 6GB-15GB). 
> 
> (Reason for rebuilding, aforementioned bug + 6x500gb raid5 + XFS = total lockup) 

 

A smidge of IO activity and the load went up so high that my box hard-locked with XFS. There was a substantial difference going to ext3; for one thing, the load doesn't go as high, nor does my box hard-lock.

----------

## Dairinin

What's your readahead settings for mdX?

----------

## colyte

My readahead is untouched and defaults to:

```
blockdev --getra /dev/md0
18432
```
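
(For reference, blockdev --getra reports readahead in 512-byte sectors, so 18432 sectors works out to 9 MiB:)

```shell
# blockdev --getra/--setra values are in 512-byte sectors.
RA_SECTORS=18432
RA_MIB=$((RA_SECTORS * 512 / 1024 / 1024))
echo "md0 readahead = ${RA_MIB} MiB"
```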

Now I tested with 5 drives on the motherboard SATA controller and 5 drives on the Areca controller.

Here are the bonnie++ results:

http://www.dump.no/files/b232fb5ff27c/test.5x5.8gb.html

http://dump.no/files/b232fb5ff27c/test.5x5.16gb.html

Also tested with:

```
blockdev --setra 16384 /dev/sda[x]

echo 1024 > /sys/block/sd[x]/queue/read_ahead_kb

echo 256 > /sys/block/sd[x]/queue/nr_requests

blockdev --setra 65536 /dev/md0

echo 1 > /sys/block/sd[x]/device/queue_depth (disabling NCQ)
```

Got the following bonnie++ results:

http://dump.no/files/b232fb5ff27c/test.modified.16gb.html

http://dump.no/files/b232fb5ff27c/test.modified.8gb.html

Finally, I attached the drives back as 8 on the Areca and 2 on the motherboard controller. I also configured the drives with different SCSI IDs instead of 1 SCSI ID with different LUNs.

Got the following bonnie++ results, but to be frank they're all the same:

http://dump.no/files/b232fb5ff27c/test.scsi.id.16gb.html

http://dump.no/files/b232fb5ff27c/test.scsi.id.8gb.html

I've more or less tried everything between heaven and earth. 

I sincerely hope someone with more intimate knowledge around this could provide me with an eye opening experience.

----------

## Cyker

I'm starting to wonder if the problems AMD64 users have with their systems becoming unusable during heavy IO are the same ones affecting us RAID people too.

I was looking at mine and during a 30MB/s read+write op via a gigabit network line, the IOwait was flatlining at near-100% on both cores for the whole op.

From array to array, I get 50MB/s sustained with one core flatlining.

If I tell the kernel to do a resync, iostat and bwm-ng report a sustained transfer of 250MB/s!

----------

## colyte

Well I've had both RAID and AMD64 for a long time. AMD64 more or less since the launch of the platform.

Curious to know, what do you do to see the IOwait on different cores?

And how do you tell the kernel to re-sync?

Well, it's all essentially very sad; my RAID performance should be staggering. It's just reached a boiling point for me. Expensive hardware that's currently not really worth a damn.

This thread is my last hope.

Just bought some cheap AMD hardware that's OpenSolaris-compatible to give ZFS + the Solaris kernel a whirl, and to say bye to GNU/Linux as any form of server OS till the day I can get confirmation that things have changed.

What I find most surprising is that there aren't more data center users who have complained, especially considering the IOwait bug. It's been there since 2.6.19, around 2006, and we're now in 2009.

----------

## Dairinin

Just an assumption: your chunk size is too big for your setup. I mean, the stripe size is 512k*9 = 4.5M. Each write operation requires 512k from each drive and a checksum calculation over 4.5M of data. Maybe the latency is too big to achieve good results, as the IO system has no sliding windows and the like to compensate for it. The huge cache does help, so the bonnie results show good sequential writes. But iowait must be very high when pdflush wakes up.

Again, your huge chunk does not allow for effective IO parallelism. With my controller (Adaptec), I've never seen request sizes larger than 512k in iostat, so there's a chance your controller does not merge requests and fires them at the disks as-is. In that case one read request just reads one chunk, and your array has the average linear read speed of a single disk.
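
To put numbers on that (a quick back-of-envelope sketch):

```shell
# Full-stripe size for a 10-disk RAID5 with 512 KiB chunks: one chunk per
# stripe holds parity, so 9 chunks carry data.
CHUNK_KB=512
DATA_DISKS=9
STRIPE_KB=$((CHUNK_KB * DATA_DISKS))   # 4608 KiB = 4.5 MiB
echo "full stripe = ${STRIPE_KB} KiB"
```

Any write smaller than that full stripe forces md to read data or parity back before it can recompute the checksum, which could be where the latency is going.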

----------

## colyte

Well I suppose that makes sense.

The funny thing is I only went with that high a chunk size because of the content I have on it, files of 8-16GB or so.

And the reason I even recreated the RAID was that 5x500gb discs with only the defaults when creating the RAID5 array, as well as XFS, got me hard locks in early 2007.

Load shot through the roof and hard-locked my box. And given the nature of how XFS works, this wasn't exactly a good thing.

What would your ideas be, Dairinin, for chunk size, stride size, and everything else tuned for 8-16GB files? You had some nice thoughts in terms of write capabilities, so I'd love to hear your thoughts on it!

Be that as it may, my read values are just ridiculous. My 10-drive RAID5 array should be giving me big numbers, not dismal ones like these. This is still very much an issue.

----------

## Cyker

No, you are right. If you're mostly dealing with huge files, large stripe sizes will give a significant boost to linear reads and writes. BUT, for best effect you must also have equivalently large block sizes (i.e. NOT the standard 4k) and the correct stride values too.

It will absolutely cripple performance for small files, especially when there are lots of them, but for large files the filesystem should fly!  :Smile: 

That said, I think 512k is a bit excessive...! Maybe I am just not brave enough...!  :Wink: 

As for your other questions: to make the array resync (basically a consistency check, fsck for the RAID  :Smile: ), run

```
echo check >> /sys/block/mdX/md/sync_action
```

WARNING: This will take several hours and will hammer the RAID array for the entire time, so only do this when the system isn't going to be used much and not on a hot day!  :Wink: 

You can monitor its progress using 

```
cat /proc/mdstat
```

 or

```
mdadm -D /dev/md0
```

Running some sort of disk throughput monitor like iostat or bwm-ng is quite interesting while it's doing the resync  :Smile: 

```
  bwm-ng v0.6 (probing every 1.000s), press 'h' for help

  input: disk IO type: rate

  /         iface                   Rx                   Tx                Total

  ==============================================================================

              hda:           0.00 KB/s            0.00 KB/s            0.00 KB/s

              hdc:           0.00 KB/s            0.00 KB/s            0.00 KB/s

              hdd:           0.00 KB/s           99.90 KB/s           99.90 KB/s

              hdb:           0.00 KB/s            0.00 KB/s            0.00 KB/s

              sda:        2549.45 KB/s           59.94 KB/s         2609.39 KB/s

              sdb:        2417.58 KB/s           59.94 KB/s         2477.52 KB/s

              sdc:        2761.24 KB/s           71.93 KB/s         2833.17 KB/s

              sdd:        2801.20 KB/s           47.95 KB/s         2849.15 KB/s

              md0:        8123.88 KB/s           23.98 KB/s         8147.85 KB/s

  ------------------------------------------------------------------------------

            total:       18653.35 KB/s          363.64 KB/s        19016.98 KB/s

```

(This will be MUCH higher during a resync!)

To keep an eye on the iowait times, you can use top (hint: press '1' to see per-cpu stats), but top sucks; htop blows its socks off. Emerge, configure, and use that instead  :Wink: 

```
1  [||||||                               13.2%] Tasks: 156 total, 2 running                     

2  [|||||||||||||||                      27.4%] Load average: 0.69 0.75 0.72                    

Avg[|||||||||||||                        20.4%] Uptime: 3 days, 22:50:42                        

Mem[|||||||||||||||||||||||||||||||1046/3042MB] Swp[|||||                           196/1953MB] 

Avg:  1.0% sy:  4.2% ni: 14.6% hi:  0.3% si:  0.3% wa:  0.6%                                    

Mem:3042M used:1046M buffers:19M cache:1890M                                                    

Swp:1953M used:200800k                                                                          

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                           

18020 cyke      32  12  987M  490M 10236 S  1.0 16.1  6h58:22 nice -n 10 opera -nomail          

 8064 cyke      39  19  463M  240M 10168 S 33.0  7.9 31h08:47 ktorrent --nofork                 

 7871 cyke      20   0  171M 50956 11272 S  0.0  1.6  6:17.16 /home/cyke/local/lib/thunderbird-1

27489 cyke      30  10  126M 42416  7048 S  0.0  1.4  2:15.28 /bin/sh /home/cyke/local/bin/acror

 6969 cyke      20   0 35472 25544  2024 R  2.0  0.8 51:34.05 Xvnc :1 -desktop X -auth /home/cyk

28748 cyke      39  19 59072 14372  2724 S  0.0  0.5  0:02.16 /opt/opera/lib/opera/9.63//operapl

27303 cyke      20   0 30464 13076 10728 S  0.0  0.4  0:00.81 kedit [kdeinit] -caption KEdit -ic

 7700 cyke      20   0 25096  8224  4048 S  0.0  0.3 14:48.57 xchat                             

 7085 cyke      20   0 32988  7856  5772 S  0.0  0.3  0:38.84 kicker [kdeinit]                  

31479 gopher    20   0 29748  7368  4688 S  0.0  0.2  0:02.69 kicker [kdeinit]                  

31535 gopher    20   0 20412  7032  3776 S  0.0  0.2  5:09.86 xchat                             

 7081 cyke      20   0 30700  6816  4848 S  0.0  0.2  0:22.01 kwin [kdeinit] -session 10b5ef5365

14818 cyke      20   0 30464  6728  4416 S  0.0  0.2  0:00.64 kedit [kdeinit] -caption KEdit -ic

31476 gopher    20   0 27928  6636  4472 S  0.0  0.2  0:00.62 kdesktop [kdeinit]                

 7635 cyke      20   0 11268  6400   992 S  0.0  0.2  0:01.00 urxvt -rv -sl 8000 +ut +st -bd 8 -

F1Help  F2Setup F3SearchF4InvertF5Tree  F6SortByF7Nice -F8Nice +F9Kill  F10Quit                
```

(This looks even better in colour, and you can tell htop what to display!)

Appendix:

From my mdadm notes, other nifty stuff you can turn on:

```
Enable/disable write-intent bitmap:

mdadm /dev/mdX -Gb internal

mdadm /dev/mdX -Gb none
```

This is a journal for the array; if there is a power cut or something, it means the consistency check doesn't have to check the entire array (saves many hours!), but it has a corresponding performance penalty (not really noticeable outside benchmarks).

In addition to things like...

```
blockdev --setra 8192 /dev/md0

blockdev --setra 2048 /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

...increasing the stripe_cache_size can provide massive boosts to write speed:

```
echo 8192 > /sys/block/md0/md/stripe_cache_size
```
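
One caveat: stripe_cache_size is counted in pages per member disk, so the memory cost grows with the array. For a 10-disk array, 8192 works out to about 320 MiB:

```shell
# md stripe cache memory = stripe_cache_size * page size (4 KiB) * member disks
STRIPE_CACHE=8192
PAGE_KB=4
MEMBERS=10
CACHE_MIB=$((STRIPE_CACHE * PAGE_KB * MEMBERS / 1024))
echo "stripe cache = ${CACHE_MIB} MiB"
```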

----------

## erdrick

 *Quote:*   

> 
> 
> Current kernel config: http://www.dump.no/files/15939361214c/kernel.config
> 
> Just swapped to gentoo sources today, been using vanilla sources the last year or two. Most if not all options in the kernel have been tried, IO schedulers and so forth. 
> ...

 

Snippets of your posted kernel config:

 *Quote:*   

> 
> 
> #
> 
> # Processor type and features
> ...

 

Have you tried setting the IO scheduler to noop? Noop is usually what is recommended for RAID. Also, I believe your Hz setting is way too high for a multicore processor; try 300Hz or 100Hz, and perhaps change your preemption to voluntary or none.
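
A quick sketch of checking and switching the scheduler per disk (device names are illustrative; the actual write requires root, so the command is only printed here):

```shell
# List each disk's current IO scheduler and show the command that would
# switch it to noop. DISKS is a hypothetical member list for this example.
DISKS="sda sdb"
CMDS=0
for d in $DISKS; do
  f="/sys/block/$d/queue/scheduler"
  [ -r "$f" ] && echo "$d: $(cat "$f")"
  echo "would run: echo noop > $f"
  CMDS=$((CMDS + 1))
done
```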

----------

## mauricev

What parameters to do you pass to bonnie++? Where do the latency numbers come from? What results are you looking at that makes you think your performance is poor? The "per char" results?

----------

## colyte

For bonnie++ I used "bonnie++ -d <directory on raid5 array> -s <8g>/<16g> -u <local user>".

The block write speed (~60MB/s) is total and utter shit for a RAID5 array with 10 disks. An OpenSolaris raidz1 (which is basically RAID5) with 4x1TB disks has a write speed of around 140MB/s.
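
For a rough ceiling check (assuming, and this is only an assumption, ~80 MB/s sustained per 7200.11 drive), a 10-disk RAID5 streaming write could in theory approach the aggregate of the 9 data disks:

```shell
# Back-of-envelope sequential-write ceiling, ignoring parity computation cost.
PER_DISK_MB=80   # assumed single-drive sustained rate; not measured here
DATA_DISKS=9
CEILING=$((PER_DISK_MB * DATA_DISKS))
echo "theoretical ceiling ~ ${CEILING} MB/s"
```

Even against a far more conservative ceiling, ~60 MB/s is off by an order of magnitude.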

Not to mention, even with 2.6.30, the load/latency issues found in http://bugzilla.kernel.org/show_bug.cgi?id=12309 are still there.

Depressing.

----------

