# Need help tweaking performance for highly interactive server

## mciann

This may not belong here, but the help I think I need relates to the kernel, so here goes...

I am trying to use Linux as a file server with an application that is extremely sensitive to delay and jitter (think VoIP).  The application needs to write thousands of files very quickly.  I have tried several flavors of Linux, and Gentoo is providing superior results to anything else I have tried.  In fact, Gentoo running Samba seems to significantly outperform Windows on the same hardware, as demonstrated by the two graphs linked below (these are enormous, I know, but I need the visual resolution so you can see what is going on):

Windows:

http://www.mciann.com/windozegraph.jpg

Gentoo:

http://www.mciann.com/linuxgraph.jpg

These are graphs of file write times.  The items of interest are the dark blue line (file write time) and the red blocks (delay related error condition).

Although the Windows run went without error, the baseline is all over the place.  This will not work in a production environment.  The Gentoo baseline is rock solid, but slowly ramps up over time.  I have been able to determine that this ramping effect is caused by the number of files in a directory.  If I cause the application to write to a new subdirectory every 1000 files, for example, the ramp "resets".  Although it would seem that this problem should be directly related to the file system, I have tried JFS, Reiser, XFS, and ext2 filesystems, and all demonstrate this problem.  Perhaps there is some sort of directory entry cache parameter somewhere that I could change?

I know that the "right" answer to this is to change my application so that it creates new subdirectories every few thousand items, but this isn't an option for me, mainly because I am not the developer.

The baseline would be golden if it would just not ramp!  Can anyone offer any insight that could help me?  Thanks!

----------

## Tsonn

One alternative would be to have a cron job moving files into subdirectories every ten minutes, half hour, or whatever...
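For instance, something like this run from cron (the hot-directory path and archive naming are placeholders, and it assumes the application only ever writes to the top level of that directory):

```shell
#!/bin/sh
# Sweep top-level files out of a "hot" directory into a fresh,
# timestamped subdirectory, so the directory never grows unbounded.
# In production you'd also want to skip files still being written.
rotate_hotdir() {
    src="${1:?usage: rotate_hotdir /path/to/hotdir}"
    dest="$src/archive-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest"
    # Move only regular files at the top level; earlier archives stay put.
    find "$src" -maxdepth 1 -type f -exec mv {} "$dest"/ \;
}
```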

----------

## smart

Na, that's not what he is looking for.

But as a sidenote... what you describe seems to match ReiserFS perfectly, so despite this not being the real issue, you should most probably choose that.

Regarding the ramping effect itself, my guess would be Samba, not the kernel, as long as you see no swapping.

In any case, to rule the latter out (2.6 kernels especially are, in my opinion, not ideal with their default in that respect), at least for the test I would do "echo 0 > /proc/sys/vm/swappiness". Check the exact spelling though, I'm "offline" :)

If that nails the ramping, then make big time sure you've got enough RAM in the box, because once it is in real need of swapping, it will definitely cause a bump up in response time, which then goes down to normal again. In short, the VM is not "prepared" for new requests, but at the same time it doesn't fiddle as long as there is no real need. With a RAM-wise well equipped box, and with RAM prices nowadays, that would normally be the better choice IMHO.
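For the record, the spelling is right; a quick way to apply it for a test and then watch whether any swapping still happens under load:

```shell
echo 0 > /proc/sys/vm/swappiness   # takes effect immediately, lost on reboot
cat /proc/sys/vm/swappiness        # verify it took
vmstat 1                           # watch the si/so columns while the test runs
```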

----------

## mciann

I am running a twin Xeon 3.2 gig machine with 3 gigabytes of RAM.  I am not swapping at all.  I did try Reiser, and found that XFS just barely outperformed it for this application.

----------

## smart

Yep, 3G, fine. The point is, if you run kernel 2.6, "I am not swapping at all" might be a lie without you knowing it. It might still swap when you don't expect it. So you may try the tweak or not.

XFS outperforming Reiser in this task is, I think, a bit odd.

The other thing, if not done (it's not described), is that in this case I'd give that data section its own hard drive to work on, so that e.g. logging et al. don't influence the data work. Since you're looking at milliseconds, seek time counts. It doesn't change the ramping (again, my guess is on Samba), but since you asked for continuity most, it would shave some ripple.

----------

## Redux

The most important factor here is probably your hard drives and the interface. The ramp-up that is occurring is probably the backlog of files waiting to be written to the disk. They are waiting to be written, either in the HD cache or in system memory.

You probably need to look at a high speed RAID controller that supports on board RAM for extra cache (up to 1GB depending on the card) and some SCSI drives put in RAID 0 to give you the high speed writing that you desire.

It is an expensive solution. The RAID card alone will cost between $200 and $400 depending on the model. Adaptec and Promise are both good manufacturers that make cards that will do the job for you.

----------

## mciann

I am using a Perc 4DI 128 meg caching RAID controller with 15,000 rpm ultra320 drives in a RAID 0 + 1 stripe.

smart:

Thanks - I'll doublecheck my reiser results, and try your vm tweak.

----------

## smart

You might try the deadline scheduler, possibly setting writes_starved to 1, over the default anticipatory scheduler.

Didn't find details about that controller... if it does write-through (batteries on it?) then you might want to try pulling it and doing software RAID. Otherwise, if that's configurable, switch it to write-back if performance counts most.

----------

## smart

BTW, how about the other end. Do you use Gigabit Ethernet ? Switched ? Could you even do a direct link maybe ? Set socket options in smb.conf ? All those files are relatively small, I guess... tried tcp_low_latency ? Is there a typical file size for those files ... maybe increase the packet size to match that nicely ?

----------

## mciann

Okay - new information.

I repeated my XFS vs. Reiser tests.  I got quite unexpected results.

I was able to confirm that XFS indeed outperforms Reiser for my application, but I also got spectacularly better results when I switched the volume back to XFS.  I figured out that this is the first time I have tried XFS on a separate logical volume from the root.  Me == bonehead for not catching that sooner.  I keep forgetting what I have tried on Mandrake but not yet tried on Gentoo.

----------

## smart

Still cannot believe that XFS is better than Reiser at small, quick, many-file access. But you gave a hint. Which is "logical volume".

Beware. This time it might not be your friend. Go with cleanly physical distinctions in these tests, or you might get wildly wrong stats, 'cause otherwise you have no idea where your logical volume sits, especially relative to the other stuff.

Have group one, a physical group, and put on it whatever you want, be it LVM.

Then have group two, a physical group, with discs being part of nothing else, and best, not even LVM, and have this be your storage for many small-file accesses. If the controller does write-back, make it 0+1 only, done by the controller, on a directly physical partition. Then smack Reiser on it, and compare to XFS.

Until you do the test this way, I give nought on the results, 'cause as mentioned, you've got no clue where your HD heads scrub around. In one test they may just have to go next to your logging partition, in the next they will have to fly all over the disc, in the next they sit on different platters just below, or on a different drive. No clue. Physical distinction is key in these comparisons.

So forget about all the measurements you did so far; if I got your setup right, they are all worthless.

Even worse, misleading.

Worse even, it seems you have two logical volumes at hand, one with Reiser, one with XFS. This means completely different conditions for the two. You need to take the same thing and run it, once Reiser, once XFS.

Oh, just to make sure I didn't forget to mention: same thing, physically.

I'm keen to say: until you get "Reiser beats XFS on many small files, XFS beats Reiser on few big files", consider that your test method is flawed, or that something else is wrong.

----------

## mciann

I would not extrapolate anything from what I am doing and try to make judgements about the overall performance of one filesystem versus another (the quality of my data or testing practices notwithstanding :) ).  What I am doing is extremely specific, and has a very narrow range of demands upon the system.

That said...

 *Quote:*   

> Still cannot believe that XFS better than reiser in small quick many file access.

My test situation does not access files at all.  I need to write many small files quickly.

 *Quote:*   

> but you gave a hint. Which is "logical volume".
> 
> Beware. This time it might not be your friend. Go with cleanly physical distinctions in these test or you might get wildly wrong stats, 'cause in no instance you know where you logical volume sits, mostly relative to the other stuff.

logical volume in this conversation != LVM.  The drive arrangement consists of four drives in a hardware RAID 0+1 configuration.  The operating system sees one physical disk, which stripes across two drives and is then mirrored to a second pair configured the same way.  Drive redundancy is a production requirement of the application, so I can't plop a physical disk straight into the operating system (unless I do software mirroring, and I don't think anyone can suggest that would work better than hardware RAID 1).  Given that requirement, it made the most sense to span disks (you get twice the per-disk performance), and mirror the spans (RAID 0+1).  I did try separate, unmirrored physical disks in Mandrake, just to test, but the performance was less than what I am seeing on the RAID 0+1 array, and since it won't work in production, it didn't make sense to pursue the issue.  I would like to have separate physical disks presented to the operating system, but that would require 8 drives, and my cage only holds 6.

The physical disk has 4 partitions.  sda1 is boot.  sda2 is swap.  sda3 is root.  sda4 is /mnt/vol1 (my Novell background is showing.  Sorry :) )

The test results I described were derived in the following manner:  

```shell
mkreiserfs /dev/sda4
mount /dev/sda4 /mnt/vol1
# (run test)
mkfs -t xfs -f /dev/sda4
mount /dev/sda4 /mnt/vol1
# (repeat test)
```

So a more accurate way to describe what I was doing would be to use the word partition rather than "logical volume".  Sorry for creating that confusion.

----------

## smart

 *Quote:*   

> unless I do software mirroring, and I don't think anyone can suggest that would work better than hardware RAID 1

 

That depends; it has been proven to be better done in the OS than in hardware for quite a few RAID controllers. If the policy is write-back, though, the controller definitely wins :)

What you call span, I call stripe, I guess...

I never suggested you use single-disk installations.

 *Quote:*   

> The physical disk has 4 partitions. sda1 is boot. sda2 is swap. sda3 is root. sda4 is /mnt/vol1 (my Novell background is showing. sorry Smile

 

That's expected to be close to worst, since you guarantee interference between OS activity and data writes in the worst possible manner regarding disk access. Since you do not know what your current affairs with the kernel & system are (cronjobs cleaning up, swapping, whatever), you also cannot compare the efficiency of sda4.

 *Quote:*   

> I would like to have seperate physical disks presented to the operating system, but that would require 8 drives, and my cage only holds 6. 

 

6 disks is perfect, that would have been my next question to ask you. 8 are not really necessary for the task, since your speed concern is with the data, not with the OS. 

My suggestion is to take 2 for system (mirrored, sda) and 4 for data (striped and mirrored, sdb).

sda1 boot

sda2 swap (if needed)

sda3 root

sda4 lvm whatever

Now, if you have an estimate of what your maximum data size will be, take sdb and partition it down to something like double that size, and thus rule out the rest of the disc.

sdb1 highperf data

If you want to make use of the rest of the disc, you might use it for backup copies after hours, but don't use it for anything but that data while you want it to give you best performance.

If you MODIFY the data you write to that disc, then you might consider rearranging the data once a day by creating an sdb2, same size as sdb1. Once during the night, when there is no more data activity, delete everything on sdb2, copy all files from sdb1 to sdb2, clean sdb1, and copy everything back to sdb1.
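A rough sketch of that nightly shuffle (the mount points are placeholders for sdb1/sdb2; run it from cron only when nothing is writing):

```shell
#!/bin/sh
# Nightly "copy off and back" to lay the data partition out fresh.
# Pass the real mount points as arguments, e.g.:
#   rearrange /mnt/data /mnt/scratch
set -e

rearrange() {
    data="$1"     # sdb1, the high-performance partition
    scratch="$2"  # sdb2, same size, used only for this shuffle

    rm -rf "${scratch:?}"/*       # start from an empty scratch area
    cp -a "$data"/. "$scratch"/   # copy everything off sdb1
    rm -rf "${data:?}"/*          # clean sdb1
    cp -a "$scratch"/. "$data"/   # copy it all back in one pass
}
```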

That seems to me to be the best you can do regarding disk backend in your case. You could possibly do a bit on networking side as well and maybe in the last stage try deadline scheduler.

Give a hint on the expected filesize and your network connection over which the filesystem is accessed.

----------

## mciann

Ok, just verified that the array is in write-back mode.  I'll try your 6 disk configuration suggestion.  That sounds like it would work better.

I fear, however, that all of this hardware/kernel tweaking is only going to delay the onset of the ramping, not eliminate it entirely.  I think your initial suggestion that the problem is with Samba and not the O/S carries a great deal of weight.  I've asked about this on the Samba mailing list, but never got a response.  I hate to cross-post, but would it be appropriate to ask for help from the networking and security forum (here) for Samba-specific tuning hints?

P.S. - I've tried TCP_NODELAY and increasing SO_RCVBUF and SO_SNDBUF, plus a good many other Samba tweaks I can't remember just now.  I've been working on this for weeks now.
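For reference, those socket options go in the [global] section of smb.conf; the 16K values below are just the buffer sizes discussed in this thread, not universal recommendations:

```
[global]
    # TCP_NODELAY disables Nagle batching; buffer sizes are workload-dependent
    socket options = TCP_NODELAY SO_RCVBUF=16384 SO_SNDBUF=16384
```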

----------

## smart

If you've got a typical file size to go on, you can base your decision on what size to set the RCV and SND buffers. I myself have set them both to 16384. But again, it might be possible to do better.

What's in between your two machines networkwise ?

How much data do you want to move ? How much is the network connection saturated ? Are the files roughly same size all the time ?

I guess nobody would object to you requesting here what you asked on the Samba list, at least in my opinion. This is a different platform. Doubts would probably arise if you did so in different topics or threads here on Gentoo.

----------

## mciann

There are four files that must be copied for every document transaction.  2 are approximately 200K each, and 2 are less than 10K.

The network consists of Cat5E cabling with a Netgear GSM712 gig-over-copper switch (I can't get the company to spend money on networking gear).  Intel Pro/1000 Ms are in all the hosts.  I haven't tried jumbo frames yet, but I doubted that the Win2k client machines could deal with them.  We're pumping right at 100 Mbits of traffic, so we aren't really taxing the network.

----------

## smart

Switches also may know about two modes comparable to write-through and write-back. The latter they call store-and-forward; we want the other one (fast-forward??, dunno). If the two machines are close to each other, maybe you can use a dedicated pair of NICs to do a crossover connection.

You could try "echo 1 > /proc/sys/net/ipv4/tcp_low_latency", compare measurements on this.

The 16384 should be ok for your case.
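To keep these /proc tweaks across reboots, they can also go in /etc/sysctl.conf (assuming your init scripts apply it at boot; the values are the ones suggested in this thread):

```
# /etc/sysctl.conf
vm.swappiness = 0
net.ipv4.tcp_low_latency = 1
```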

----------

## mciann

The comparison you are thinking about is cut-through vs. store-and-forward.  It is only really meaningful in older Cisco switches, where the backplane latency was high enough for you to want to attempt cut-through.  The idea of cut-through is that the switch makes a frame forwarding decision before the entire frame has been received.  The downside to this is that if the frame is malformed, the switch can't do anything to deal with it, and the receiving station gets the workload of having to figure out that it is a bad frame.

There really isn't a reason to put a modern switch into cut-through mode, especially to resolve a latency problem.  Modern switches have traffic management and hardware queueing features that will do that much better.  Also, modern high-performance switches, such as the Nortel 8600, actually have a multi-layer buffer design that implements cut-through WRT the central processor (where the MAC forwarding table is) and pumps frames straight to the x-mit ASIC, where bad-frame detection can be done.

That said, we did attempt to use direct wiring (with a previous OS).  The problem is that two hosts have to talk to the server.  The overhead of driving the second network card eliminated the benefit, and things worked better when using the switch.

Besides all of that, if what we are trying to do is eliminate a ramp effect that directly relates to the number of files in a directory, how can network behavior be a factor?

/network engineer in a previous life

----------

## smart

 *Quote:*   

> The comparison you are thinking about is cut-through vs. store-and-forward. It is only really meaningful in older Cisco switches, where the backplane latency was high enough for you to want to attempt cut-through. The idea of cut-through is that the switch makes a frame forwarding decision before the entire frame has been received. The downside to this is that if the frame is malformed, the switch can't do anything to deal with it, and the receiving station gets the workload of having to figure out that it is a bad frame. 

 

Well described. One of the good things about modern hardware is that broken frames nearly never happen anymore. So you can usually just pass them right on.

 *Quote:*   

> There really isn't a reason to put a modern switch into cut-through mode, especially to resolve a latency problem. Modern switches have traffic management and hardware queueing features that will do that much better. Also, modern high performance switches, such as the Nortel 8600, actually have a multi-layer buffer design that implements cut-through WRT the central processor (where the mac forwarding table is) and pumps frames straight to x-mit ASIC, where bad frame detection can be done. 

 

No matter how they are designed, they can do one of two things:

- first receive the whole packet, then verify its checksum, and then decide to forward or drop it,

or

- not wait until they have received the whole packet, and forward immediately what is received.

 *Quote:*   

> There really isn't a reason to put a modern switch into cut-through mode, especially to resolve a latency problem.

 

Yes, you would, for exactly that reason: to reduce latency. Performance is not an issue anymore, with routers/switches having sufficient CPU that they can calculate the checksum "along" the packet coming in. But they cannot make a decision with that before the packet is fully received either. So it's buffered and then sent.

 *Quote:*   

> The overhead of driving the second network card eliminated the benefit, and things worked better when using the switch. 

 

That surprises me a bit. If those Gigabit cards sit in normal PCI slots and the bus were quite stuffed, I could see that. Are you sure the data traversed the direct link ? Reconfigured the mount to go to the other IP in the other network ? Really used a different network (if you use two IPs out of the same net, responses to requests will all go out the same NIC until you configure source-based routing)... No bad words intended, just trying to make sure we find out why this discrepancy between expectation and observation.

 *Quote:*   

> Besides all of that, if what we are trying to do is elminate a ramp effect that directly relates to the number of files in a directory, how can network behavior be a factor? 

 

Right, that was not meant to be related to the ramp effect though; I'm just trying to get the overall latency down. If there's enough distance between the baseline at the beginning and what we need, the ramp effect would then more probably not hit the limit of acceptable response time.

Along that line, we can also try deadline scheduler...

/nutty dude, entire life

----------

## mciann

I read about the deadline scheduler, but couldn't find any good information as to how to implement it.  How would I do so?

----------

## smart

Two things came to mind meanwhile.

If I remember right, the NIC you use is an active card with something like an i960 for offloading the CPU. I'm not fully aware what the current situation is, but historically, Linux didn't support this as an active card: it used the interface HW on it directly and not the i960/active capability, due to lack of license/support for the firmware. But I think Intel itself has meanwhile changed that. Try modinfo on the driver module you use for that card. It should offer module options to tune the card with respect to throughput vs. latency or CPU offloading... if so, make use of them; if not, check Intel's web pages to see if you can get a better driver module.

The other one is the suggestion to make sure that you compiled kernel with memory mapped io support for networking.

----------

## smart

You can decide about the scheduler in the kernel configuration under

General Setup -> Configure Standard Kernel Features

There you can switch off the no-op, anticipatory, and CFQ schedulers. Then you get deadline automatically.

With it you should get options regarding a read/write preference ratio. Its default is 2:1 for reads (by the setting of "2"). You could change that to "1" for reads/writes 1:1.

Would have to check which sysctl that is....
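On some later 2.6 kernels the deadline tunables are exported through sysfs rather than sysctl; the path below is an assumption, so check whether your kernel actually has it:

```shell
# Assumes sysfs is mounted and deadline is the active scheduler for sda
cat /sys/block/sda/queue/iosched/writes_starved    # default is 2
echo 1 > /sys/block/sda/queue/iosched/writes_starved
```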

----------

## mciann

Score. 

Your e1000 recommendation about offloading checksums was right on the money.  It gave me about a 15% performance improvement (in terms of delaying the ramp).  Thanks!

What is the right way to automate module loading with command line arguments?

I've also recompiled with deadline scheduler support, but can't find the tunable parameter in /proc.

----------

## smart

/etc/modules.conf for the e1000
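A sketch of what that might look like. The option name below (XsumRX, for receive checksum offloading) comes from Intel's out-of-tree e1000 driver and may differ in your driver release, so confirm with `modinfo e1000` first:

```
# /etc/modules.conf
alias eth0 e1000
# XsumRX=1 enables receive checksum offload (name varies by driver version)
options e1000 XsumRX=1
```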

static int writes_starved = 2;

seems it's not offered via sysctl, though.

Just try the one as is or modify

/usr/src/linux/drivers/block/deadline-iosched.c:

static int writes_starved = 2;    /* max times reads can starve a write */

to

static int writes_starved = 1;    /* max times reads can starve a write */

but then, maybe not. Try as is and see if it helps, otherwise maybe just forget about it.

----------

## DrWilken

Have you got noatime and nodiratime in your /etc/fstab for the filesystem in use?
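Something like this in /etc/fstab (device, mount point, and filesystem taken from your earlier posts; the dump/pass fields are the usual zeros):

```
# /etc/fstab — skip atime updates on the data volume
/dev/sda4   /mnt/vol1   xfs   noatime,nodiratime   0 0
```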

----------

## mciann

I have been using noatime, but not nodiratime.  I'll give that a try.

----------

