# mdraid 5 speed, 3 disk vs 4 disk vs more?

## eccerr0r

Not another mdraid 5 speed problem thread???

Actually, no, I just wanted to post an observation which somewhat baffles me.

For more than a decade I've been running 4-disk RAID5s so that I'd only pay a 25% overhead ratio.  I was getting miserable write rates (less than the speed of a single disk), and on a good day read rates of about the speed of two raw disks.

My first RAID was PATA, and the second was SATA.  The PATA RAID worked beautifully for many years with absolutely no disk failures - a few minor disk kicks, but those were easily recoverable.  On the other hand, I had all sorts of problems with my first SATA RAID.  Lost disks left and right.  It got to the point where I stopped putting fresh disks into the array and ran degraded.

... up until a few days ago, when another disk kicked the bucket.  A 4-disk RAID with 2 good members?  Bye-bye data :(

Fortunately it wasn't that bad.  I knew I was running a ticking time bomb, so I frequently backed up to a bigger 3-disk RAID5 on another, slower machine - separate, so that it counts as an offline backup.  When that last-straw disk died and took the old RAID down and out, it took a few hours, but I finally got the new 3-disk RAID5 (with its 33% space overhead) up, running, and booting once more.  I only had to do one restore, and since I happened to have backed up just before the failure, I have a very fresh data set minus one volume on the LVM.

However, one thing sort of surprised me: the RAID seemed faster than before.  Well, the disks themselves are faster, so that much is expected, but the other surprise was that write speed through the filesystem is faster than a single disk for the first time since I've been running MDRAID.  One would expect the 4-disk RAID5s to spread writes across more spindles, but the 3-disk array is, relatively speaking, faster for some reason (sequential burst):

"TTFS" is Through The File System (and LVM and MDRAID)

| Array | Single Read | TTFS Read | TTFS Write |
|---|---|---|---|
| Old 120Gx4 PATA RAID5 | 58MB/sec | ~110MB/sec | ~50MB/sec |
| Old 500Gx4 SATA RAID5 | 90MB/sec | ~180MB/sec | ~80MB/sec |
| New 2Tx3 SATA RAID5 | 180MB/sec | 313MB/sec | 200MB/sec |

Is there some sort of bottleneck on a 4-disk MDRAID5 that a 3-disk one avoids?  Or perhaps these "new" 2T disks are just that much better?  Note that the two SATA sets were benchmarked on the same CPU and the same ICH.

Not quite SSD territory of course, but this is pretty sweet.  I was thinking about someday reshaping to a 4-disk RAID, but now I'm worried I'd lose this disk bandwidth.
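For what it's worth, the numbers above can be compared against a naive ideal of (n-1) times the single-disk rate (parity chunks carry no data in a full stripe).  A quick sketch of that arithmetic, using the posted rates:

```python
# Compare observed sequential throughput to a naive (n-1) * single-disk ideal.
arrays = {
    "500Gx4 SATA": {"n": 4, "single": 90,  "read": 180, "write": 80},
    "2Tx3 SATA":   {"n": 3, "single": 180, "read": 313, "write": 200},
}

for name, a in arrays.items():
    ideal = (a["n"] - 1) * a["single"]   # MB/s if all data disks streamed in parallel
    print(f"{name}: ideal {ideal} MB/s, "
          f"read {a['read']/ideal:.0%} of ideal, "
          f"write {a['write']/ideal:.0%} of ideal")
```

By this measure the 3-disk array runs much closer to its ideal on both reads and writes, which is exactly the observation in question.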

----------

## Mad Merlin

I've got a 4x6T raid5 and a 12x4T raid6 array here (both mdadm) and I ran some tests for you:

4x6T raid5 (WD Reds, X79 SATA ports):

```
$ dd if=/dev/zero of=bigfile bs=1G count=64 conv=fdatasync
64+0 records in
64+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 245.646 s, 280 MB/s
$ dd if=bigfile of=/dev/null bs=1G iflag=direct
64+0 records in
64+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 288.211 s, 238 MB/s
```

That read speed is pretty sucktastic... not actually sure why it's even slower than writes, but it's still faster than a single drive.

12x4T raid6 (Seagate Barracudas, 4x on X79 SATA ports, 8x on an LSI SAS2308-based HBA):

```
# dd if=/dev/zero of=bigfile bs=1G count=64 conv=fdatasync
64+0 records in
64+0 records out
68719476736 bytes (69 GB) copied, 142.274 s, 483 MB/s
# dd if=bigfile of=/dev/null bs=1G iflag=direct
64+0 records in
64+0 records out
68719476736 bytes (69 GB) copied, 55.4539 s, 1.2 GB/s
```

Write speed here doesn't scale well (no surprise there), but reads get a decent improvement.

My understanding is there are a number of tunables for raid5 and raid6 (like stripe size, and setting it so that it's in alignment with the drives) that can have a noticeable impact on sequential read/write speeds. To be honest, I've never looked into it much, because by the time I found out about it, SSDs existed.
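For reference, a few of those knobs look roughly like this (a sketch only - the device names are placeholders, and the chunk and cache values are illustrative, not recommendations):

```shell
# Chunk (stripe unit) size is chosen at array creation time:
mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=512K /dev/sd[bcde]1

# The raid5/6 stripe cache can be enlarged at runtime (costs RAM):
cat /sys/block/md0/md/stripe_cache_size
echo 8192 > /sys/block/md0/md/stripe_cache_size

# Read-ahead on the md device also affects sequential reads:
blockdev --getra /dev/md0
blockdev --setra 65536 /dev/md0
```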

Areal density on spinning rust is still steadily improving and aside from RPM, that's the primary factor that affects sequential read/write speeds. Your older 500G drives are going to be using much less dense platters than your newer 2T drives, so your single drive numbers look about right.

All that aside, sequential read/write speeds almost never matter; IOPS is what matters, and for that you'd need several hundred spinning rust drives (and not in raid5 or raid6! Those modes don't improve IOPS - they usually worsen it) to even get close to the performance of a single low-end SSD.

If you care about performance, raid10 all the way, ideally with SSDs.

----------

## eccerr0r

Read speed has always been significantly higher than a single drive; I was just wondering why the 3-disk solution was "closer" to "theoretical".  With the number of spindles increasing, I'd imagine both reads and writes should get faster than a single disk, but real throughput is much lower than theoretical and flattens out faster than expected as more disks are introduced - diminishing returns.

Barring that one read issue, it looks like your disks scale somewhat (I don't know how fast your individual member disks are - that would be curious to know), but clearly the 12-disk array is nowhere near 12x the performance of a single disk.  It doesn't look too bad, however: read looks fairly decent, and at least write speed far exceeds that of a single disk (and possibly two).

Though IOPS is important, it depends on how many IOPS your system actually needs to do.  At least for me, the number of IOPS this RAID needs to do is not high, as there are few consumers of the data.  However, the data is consumed in large chunks most of the time, so sequential speed does make a difference in this case.

I suspect I might have to stay with the 3-disk solution: though the overhead is higher (33% vs 25%), it's still not as bad as RAID10 (50% overhead).  The performance drop from going to a 4th disk seems costly.

----------

## krinn

Did you check whether your 2T disks use write-back instead of write-through caching? Combined with a huge cache it can do wonders (until a power failure).
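If you want to check, something like this should show it (device name is a placeholder; `hdparm -W` with no argument only reports the current setting):

```shell
# Report the drive's volatile write cache state (1 = write-back enabled)
hdparm -W /dev/sda

# Disable / re-enable the write cache (trades speed for safety on power loss)
hdparm -W0 /dev/sda
hdparm -W1 /dev/sda
```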

----------

## Mad Merlin

Those drives are in the 150-160M/s sequential read/write range.

There should be no point in the curve where adding another drive to a RAID 5 or RAID 6 makes sequential reads/writes to the array slower. If there is, and you care, you should probably look into the aforementioned tunables for the array (many of these affect the on disk format of the array and so are probably best chosen at creation time for the array).

----------

## szatox

*Quote:*

> I was just wondering why the 3-disk solution was "closer" to "theoretical".

Not sure how big the actual numbers are, but it might have something to do with rotational latency (or jitter in general). Say, with a single drive you expect to wait half a revolution on average before the disk is positioned over that particular sector. With 2 disks the expected worst case rises to 2/3 of a revolution, with 3 disks 3/4, and it approaches one full revolution with an infinite number of disks. The total speed is limited by the most unlucky member, so increasing the number of disks pushes the expected wait from the average towards the pessimistic estimate.
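A quick simulation backs up that intuition - assuming each member's rotational wait is an independent uniform fraction of one revolution, the expected worst-case wait across n disks comes out to n/(n+1):

```python
# Monte Carlo estimate of the slowest disk's rotational wait, as a
# fraction of one revolution, versus the analytic value n/(n+1).
import random

random.seed(0)

def expected_max_wait(n_disks, trials=100_000):
    # Average over many trials of the maximum of n uniform(0,1) waits.
    return sum(max(random.random() for _ in range(n_disks))
               for _ in range(trials)) / trials

for n in (1, 2, 3, 12):
    print(n, round(expected_max_wait(n), 3), round(n / (n + 1), 3))
```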

Other things that can affect it are read block size, write block size, stripe size and the random/sequential IO ratio. With heavily random reads it might be preferable to use enormous stripes to increase the chance that a particular read only hits a single drive - at the cost of a huge performance hit when you update that data.

Sequential reads and writes spanning multiple drives will be much faster, but generate more IO operations, so you really don't want to go all random in this case. Perhaps one of them was better tuned for the test you ran on it.

----------

## Cyker

Are they all going through the same controller?

I had an issue like this on my old server box. My theory was that 4 of the SATA sockets went through the AMD chipset, but the other 3 went through some add-on chip hooked to the PCI bus rather than the HyperTransport bus, and that chip was noticeably holding back the rest of the array, esp. when the PCI bus was under heavy load from e.g. the network.

----------

## eccerr0r

For me, the 500G disks and the 2T disks were connected to the same controller (but at different times, as I don't have enough SATA ports) during my tests.

The ratio between the expected vs the observed rates was worse for the 4 disk versus the 3 disk setup - this is the issue that got me wondering.

I tried tuning the 120G and 500G disk sets; however, I just plopped in the 2T disks without additional tuning (so by default they inherited the 500G tuning, if there was any) and made this observation...

The only other possibility that may explain things is that I had dissimilar-branded disks, though they were of similar speeds.  The 2TB RAID members are identical.

----------

## Cyker

I wonder if it could also be a vibration thing?

Supposedly one of the reasons they sell those more expensive NAS HDDs is that they incorporate vibration compensation sensors. I personally have never noticed any significant performance difference between arrays of conventional HDDs and the special NAS ones, but I normally mount my HDDs in acoustically damped cages which (I'd assume!) also damp the vibrations.

----------

## eccerr0r

Possible, alas the 500G disks were inserted in the same hot swap bay that the 2T disks are now inserted in.  

All disks are supposed to have some vibration compensation; it's absolutely necessary to keep the heads on track.

----------

## Goverp

I don't think anyone's mentioned the filesystem in use, but could stride be involved here? If you copy a 3-disk RAID5 to a 4-disk one without changing the stride, that might impact performance.

----------

## eccerr0r

That's a possibility (though I did run mke2fs fresh on both RAIDs).  I found https://busybox.net/~aldot/mkfs_stride.html and played with the numbers a bit - it seems the stride/stripe widths are "weird" with 4 disks (because there are actually 3 valid data chunks at a given position across the disks).

For 3 disks it's an even 2 data chunks, and perhaps this is why things seem better there without stripe tuning.  I was also playing with a degenerate 2-disk RAID5 and should test throughput on that too.

I would have thought that with small enough chunks it might not make that big a difference, but maybe not.  However, if this is the issue, it sounds like growing the array is a bad idea because it implies a reformat to change the stride/stripe width... but wait - these are tunables with tune2fs (though existing blocks would need to be reallocated to take advantage of the change).

Sounds like I'm stuck with the 3 disks for now, or at least some more experimentation with tune2fs is needed.  And perhaps I could force-assemble the bad RAID to see what its stripes/strides were.
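The arithmetic mke2fs does here is simple enough to check by hand - assuming 4K filesystem blocks and md's default 64K chunk, stride is blocks-per-chunk and stripe-width is stride times the number of data disks:

```python
# ext2/3/4 stride and stripe-width from md geometry (4K fs blocks assumed).
def ext_stride(chunk_kb, block_kb=4):
    # filesystem blocks per md chunk
    return chunk_kb // block_kb

def ext_stripe_width(chunk_kb, n_disks, parity_disks=1, block_kb=4):
    # data blocks per full stripe (parity chunks excluded)
    return ext_stride(chunk_kb, block_kb) * (n_disks - parity_disks)

# 64K chunk: 4-disk RAID5 -> stride 16, stripe-width 48 (3 data disks)
#            3-disk RAID5 -> stride 16, stripe-width 32 (2 data disks)
print(ext_stride(64), ext_stripe_width(64, 4), ext_stripe_width(64, 3))
```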

----------

## Mad Merlin

Out of curiosity, do you have 4x2T drives available for testing right now? You alluded to having seen lower sequential read/write performance with the 4-disk set than the 3, but it wasn't clear whether you were extrapolating from past arrays or had actually tested it. I'd be curious to see the specific numbers if you do indeed have a 4x2T set to play with.

----------

## eccerr0r

It was an extrapolation based on the 4x500G disks that I had running, and how performance did not scale relative to the number of disks.

I don't have a spare to grow it.

----------

## eccerr0r

I moved the defective RAID to another system and force-assembled it.  (Wow, the temporary P4 system I'm using to play with these disks - 7 HDDs including the RAID members, the boot disk, and two others - eats so much power that all the RAID disks drag a 400W PSU to its knees.  Had to swap in a different PSU for now.)

I figure forced assembly should be safe because the dropped disk appears healthy enough to read, and once it got dropped the RAID was dead anyway - it makes no sense to continue without enough disks - so, minus the sectors it was trying to write (and the write hole), all the disks should be in sync with each other.

So I looked at the ext3fs striping:

```
RAID stride:              16
RAID stripe width:        48
```

Looks like mke2fs detected the stride and stripe width properly.  I haven't fscked the volume yet; I wonder how bad it's going to be - with ext3 journalling + LVM + mdraid, I wonder whether the journal is even going to be in sync.
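For anyone following along, the force assembly and a safe first look can be done roughly like this (device and VG names are hypothetical; `-n` keeps e2fsck strictly read-only):

```shell
# Force-assemble the degraded array from the surviving members
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1

# Bring up LVM and inspect without writing anything
vgchange -ay
tune2fs -l /dev/vg0/root | grep -i 'raid\|state'
e2fsck -n /dev/vg0/root
```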

----------

