# SATA: Slow cached reads revisited

## kristoffer

As the unfortunate author of this thread, I am seeing quite low cached-read speeds with my SATA drive:

```
$ hdparm -tT /dev/sda

/dev/sda:

 Timing cached reads:   1404 MB in  2.00 seconds = 701.84 MB/sec

 Timing buffered disk reads:  176 MB in  3.00 seconds =  58.57 MB/sec
```

As I understood the other thread, this didn't have much to do with the actual hard drive, but more with the communication between memory and CPU. Anyway, as you might notice, I do get better performance than the former poster, but I have confirmed with memtest86+ that my two identical PC-3200 memory sticks (with DDR400 chips) indeed run at full speed, i.e. 200 MHz with working dual channel. Shouldn't that give me a memory bandwidth of around 3 GB/s, which hdparm should report for cached reads? I tend to see people post hdparm cached-read results at that speed, so I guess I should too, or am I mistaken (if so, please enlighten me!)? What could be wrong?

----------

## Desintegr

I have an AMD64 3000+ with 2x512MB Kingston DDR 400 (dual channel enabled) and I get the same result as you (~800 MB/s).

My chipset is an nForce3 Ultra (Gigabyte K8NS-939).

No problem with disk speed (Maxtor SATA): ~50-55 MB/s.

I also tried with Ubuntu Feisty (32-bit) and got ~800 MB/s too.

I've also found an interesting thread: http://www.mail-archive.com/debian-amd64@lists.debian.org/msg21903.html

----------

## kristoffer

The result I posted earlier (701.84 MB/sec cached) was measured while I was running a couple of processes that used the hard drive; without them it's slightly above 800 MB/s. However, I get the exact same results with Debian 4.0. My much weaker laptop, a Dell Latitude X1 with two unpaired memory sticks (256 MB and 1024 MB) and thus no dual channel, gets something similar.

If it's of any importance, I run an Athlon64 3200 (2000 MHz) with two paired PC-3200/DDR400 512 MB memory modules (Corsair value) with working dual channel on an ASUS A8V Deluxe motherboard.

----------

## eccerr0r

just want to make sure... what version of hdparm and what CFLAGS?  can you run the SAME binary of hdparm under both situations?

IIRC there were some hdparm changes in some versions, plus since this is a cpu/ram benchmark, optimization will come into play.

----------

## kristoffer

 *eccerr0r wrote:*   

> just want to make sure... what version of hdparm and what CFLAGS?  can you run the SAME binary of hdparm under both situations?
> 
> IIRC there were some hdparm changes in some versions, plus since this is a cpu/ram benchmark, optimization will come into play.

 

I'm using hdparm-6.9 with very sane CFLAGS, namely "-march=k8 -pipe -O2". But are you positive that the compiler optimisations will increase memory throughput by a factor in the range 2 to 5? That would definitely surprise me.

As a side note I read the following in the link that Desintegr provided:

 *Quote:*   

> The changelog for hdparm v6.9 has:
> 
> "fix X2 over-reporting of -T results"

 

I don't think that affects me since my CPU is single core, but perhaps it can explain things for others with a similar problem. Or could it be that these 3000+ MB/s reports I've seen are affected, and should be halved? Anyway, dual-channel DDR400 should do better than ~800 MB/s, so I still see this as a problem.

----------

## Desintegr

 *kristoffer wrote:*   

> 
> 
> As a side note I read the following in the link that Desintegr provided:
> 
>  *Quote:*   The changelog for hdparm v6.9 has:
> ...

 

X2 doesn't mean Dual-Core. It means « times 2 » like in 2x3=6.

You should try other benchmark tools.

----------

## eccerr0r

 *kristoffer wrote:*   

> I'm using hdparm-6.9 with very sane CFLAGS, namely "-march=k8 -pipe -O2". But are you positive that the compiler optimisations will increase memory throughput by a factor in the range 2 to 5? That would definitely surprise me.
> 
> 

 

Yes.  A poorly compiled loop can be substantially slower than a well-tuned one.  Poor prefetching, whether inserted explicitly by the compiler or caused by memory optimizations that don't understand the architecture, can reduce speed greatly.

The best approach is to compare with all other things being equal.  I just wanted to make sure you didn't run gentoo's hdparm and compare it to ubuntu's hdparm with their respective kernels.

Also wanted to make sure we're not chasing a phantom issue...  This is a synthetic benchmark result, and apparently hasn't been shown to be a true memory performance issue - if you see real software running at half speed, then it needs to be looked at.  Does it take twice as long to rotate a large bitmap image (a memory-intensive operation)?

----------

## kristoffer

 *Desintegr wrote:*   

> X2 doesn't mean Dual-Core. It means « times 2 » like in 2x3=6.

 

I thought that in this context, X2 referred to X2 as in "AMD Athlon64 X2 Dual-Core" and similar. Are you sure about your interpretation?

----------

## Desintegr

 *kristoffer wrote:*   

>  *Desintegr wrote:*   X2 doesn't mean Dual-Core. It means « times 2 » like in 2x3=6. 
> 
> I thought that in this context, X2 referred to X2 as in "AMD Athlon64 X2 Dual-Core" and similar. Are you sure about your interpretation?

 

Use an hdparm from before the patch: you'll get ~1400 MB/s. Try one from after: you'll get ~700 MB/s.

----------

## albright

This is definitely hdparm - I noticed the drop in -T results on both a Centrino laptop and an AMD X2 desktop after an hdparm upgrade ...

Here's my current results, first for laptop:

```
/dev/hda:

 Timing cached reads:   1242 MB in  2.00 seconds = 620.81 MB/sec

 Timing buffered disk reads:  108 MB in  3.02 seconds =  35.79 MB/sec
```

and for amd:

```
/dev/sda:

 Timing cached reads:   1516 MB in  2.00 seconds = 757.79 MB/sec

 Timing buffered disk reads:  192 MB in  3.06 seconds =  62.64 MB/sec
```

The -T reading used to be twice as fast :(

But I didn't notice any real difference ;)

----------

## kristoffer

 *eccerr0r wrote:*   

> A poorly compiled loop can be substantially slower than a well-tuned one. Poor prefetching, whether inserted explicitly by the compiler or caused by memory optimizations that don't understand the architecture, can reduce speed greatly.

 

I know that optimisations matter quite a lot (I have in fact written a C compiler, so I know about some of the techniques used), but can they improve things by a factor of 2 or more? Also, since gcc is no toy compiler, I'm quite sure it's aware of the underlying architecture of amd64. Or maybe you were referring to a situation where I used completely wrong CFLAGS, including the wrong -march? Anyway, I tried recompiling hdparm without -O2 and with -O3 and didn't notice any difference, so CFLAGS don't seem to be the issue here.

 *Desintegr wrote:*   

> Use hdparm before the patch : you'll get ~1400 MB/s. Try after : you'll get ~700 MB/s.

 

I stand corrected. Seems like a weird "bug" or whatever, though.

Still, there is no explanation why some DDR400 users get 3000+ MB/s results. Even if that was with <=hdparm-6.6, the correct speed should be around 1500 MB/s, which is still twice as fast as most people in this thread. Is that simply due to recent improvements in motherboard memory architecture? I mean, my board is 3 years old, so there's definitely some room for improvement there. Perhaps my board can't utilize the full speed of DDR400? Seems stupid, and I doubt it, but I have no clue otherwise.

----------

## eccerr0r

 *kristoffer wrote:*   

> I know that optimisations matter quite a lot (I have in fact written a C compiler, so I know about some of the techniques used), but can they improve things by a factor of 2 or more? Also, since gcc is no toy compiler, I'm quite sure it's aware of the underlying architecture of amd64. Or maybe you were referring to a situation where I used completely wrong CFLAGS, including the wrong -march? Anyway, I tried recompiling hdparm without -O2 and with -O3 and didn't notice any difference, so CFLAGS don't seem to be the issue here.
> 
> 

 

The question is whether there are two binaries out there, both executable on the target platform but with the "wrong" optimization.  How about -Os vs -O2, and -march=k8 vs -march=nocona?  I can't say that these will definitely cause poor optimization and degraded performance, but the possibility exists; it can't be discounted as a possible source.  Look at the P4 and how poorly it runs existing code, yet if you tune for that architecture, it performs much better.  It's even worse for Itanium: gcc versus Intel's compiler, gcc tends to lose.  Granted, you were likely using the best available options, but how should I know?  I'm saying this as a generic problem, not as a specific solution, and you should do binary-for-binary comparisons.  Always compare versions and compile options.

So it does look like after all you were chasing a phantom problem.  Next time use a "real" benchmark, such as code you run day-to-day, before claiming a problem.  Sounds like you are some sort of CS student; you should go and read the hdparm source code and see what it's actually doing.   My guess is that it's doing incomplete transfers to the disk controller due to word or transfer size limits and hence not measuring your true memory speed, but I'm too lazy to read the code since I don't really care...

----------

## kristoffer

 *eccerr0r wrote:*   

> The question is whether there are two binaries out there, both executable on the target platform but with the "wrong" optimization.  How about -Os vs -O2, and -march=k8 vs -march=nocona?  I can't say that these will definitely cause poor optimization and degraded performance, but the possibility exists; it can't be discounted as a possible source.  Look at the P4 and how poorly it runs existing code, yet if you tune for that architecture, it performs much better.  It's even worse for Itanium: gcc versus Intel's compiler, gcc tends to lose.  Granted, you were likely using the best available options, but how should I know?  I'm saying this as a generic problem, not as a specific solution, and you should do binary-for-binary comparisons.  Always compare versions and compile options.
> 
> So it does look like after all you were chasing a phantom problem.  Next time use a "real" benchmark, such as code you run day-to-day, before claiming a problem.  Sounds like you are some sort of CS student; you should go and read the hdparm source code and see what it's actually doing.   My guess is that it's doing incomplete transfers to the disk controller due to word or transfer size limits and hence not measuring your true memory speed, but I'm too lazy to read the code since I don't really care...

 

Sure and all, but I'm still of the impression that something is wrong and that it's not entirely hdparm's fault. I'm not interested in an argument or wherever this is headed; I'm simply interested in why some people get such high ratings on that test compared to my rig. Anyway, I took a quick look at the source as you suggested and, not surprisingly, it's pretty straightforward:

For -T (cached reads), hdparm allocates a 2 MB buffer, then reads (with the read() syscall) the first 2 MB of data from the specified device into that buffer, which I guess is done in order to put the data in the cache right before the timer starts. After that the timer is started and the above read is looped as fast as possible for 2 seconds. The number of iterations is multiplied by 2 MB and divided by 2 seconds, giving the reported speed for cached reads. The time taken for overhead operations like lseek() and the timer itself is accounted for as well.

I'm definitely no pro on these low-level things, but I would appreciate it if anyone could point out any sources of error in that measurement with respect to a modern computer architecture and the Linux kernel's buffer cache for disk reads.

A problem with benchmarks is that they have to be compared to something. I don't know what to compare my hdparm results to except for other people's results, some of which are surprisingly much higher than mine with very similar hardware. Sure, I can also compare them among different versions of software and kernels on my computer, which I have done with no alarming differences (except for the *2 issue with older versions of hdparm, which is now sorted out). What I'm interested in is learning whether Linux handles my hard drives efficiently, and unless someone has an idea why hdparm's method of testing is wrong, I don't know whether I'm chasing a phantom or not. The fact that the theoretical speed of DDR400 is 4 times higher than what hdparm measures my Linux kernel making of it (and what I see other people reporting) is enough to make me skeptical.

I will try to look into some other means of benchmarking, which might tell me that my disk caching is working properly and that hdparm does its thing wrong somehow. But still, for all I care, other programs I use might access the same disk areas with methods similar to hdparm's and thus suffer from the same penalties. If that's the case, I would be interested in sorting out this hdparm thing despite other benchmarks. In fact, I sometimes feel that my Gentoo system isn't as snappy as it should be, especially when it comes to starting new processes, so I thought it could be stuff like this causing it.

----------

