# solved: nfs extremely slow

## IanBallantyne

hi,

i've got a problem with an nfs server that i've not been able to solve so far.  quite simply, it's extremely slow in transferring data, averaging around 10-20kb/sec, often with breaks of several seconds during transmission.  mounting the share is immediate and shows no problems.

i've tried numerous ideas, mounting async, various rsize and wsizes, soft mounts with a 5 second timeout, noatime, all to no avail.  the mount command i have on the client, a 2.4 kernel system, at the moment is

mount -t nfs -o "rw,user,async,soft,timeo=50,intr,rsize=1024,wsize=1024,noatime"  192.168.1.7:/mnt/radinew-home/midori-home /mnt/wallaroo/data-priv

the server is on a 2.6 kernel with nfs-utils 1.1.2-r1

on the server:

# uname -a

Linux wallaroo 2.6.23-gentoo-r5 #3 SMP Sun Jan 20 23:16:51 CET 2008 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 5000+ AuthenticAMD GNU/Linux

in the /etc/exports on the server:

/mnt/radinew-home/midori-home 192.168.1.0/24(rw,no_root_squash,async,no_subtree_check)

i've also tried manually starting nfsd making it use v3:

/usr/sbin/rpc.nfsd -N 4 3

running exportfs gives the following:

/mnt/radinew-home/midori-home

                192.168.1.0/24

both systems have corresponding entries for each other in /etc/hosts

on the server, rpcinfo returns

# rpcinfo -p

   program vers proto   port

    100000    2   tcp    111  portmapper

    100000    2   udp    111  portmapper

    100024    1   udp  32770  status

    100024    1   tcp  53119  status

    100005    1   udp  33377  mountd

    100005    1   tcp  39380  mountd

    100005    2   udp  33377  mountd

    100005    2   tcp  39380  mountd

    100005    3   udp  33377  mountd

    100005    3   tcp  39380  mountd

    100021    1   udp  33855  nlockmgr

    100021    3   udp  33855  nlockmgr

    100021    4   udp  33855  nlockmgr

    100003    2   udp   2049  nfs

    100003    3   udp   2049  nfs

    100021    1   tcp  57105  nlockmgr

    100021    3   tcp  57105  nlockmgr

    100021    4   tcp  57105  nlockmgr

    100003    2   tcp   2049  nfs

    100003    3   tcp   2049  nfs

and on the client:

# rpcinfo -p

   program vers proto   port

    100000    2   tcp    111  portmapper

    100000    2   udp    111  portmapper

as far as i can tell, everything that is needed for the nfs server is running

# ps -ef | grep rpc

nobody    5402     1  0 Nov11 ?        00:00:00 /sbin/rpc.statd --no-notify

root      5415     2  0 Nov11 ?        00:00:00 [rpciod/0]

root      5416     2  0 Nov11 ?        00:00:00 [rpciod/1]

root      6276     1  0 Nov11 ?        00:00:00 /usr/sbin/rpc.mountd

root     26419  5349  0 16:05 pts/0    00:00:00 grep --colour=auto rpc

# ps -ef | grep nfs

root      6657     2  0 Nov11 ?        00:00:00 [nfsd]

root      6658     2  0 Nov11 ?        00:00:00 [nfsd]

root      6659     2  0 Nov11 ?        00:00:00 [nfsd]

root     26423  5349  0 16:07 pts/0    00:00:00 grep --colour=auto nfs

looking at the traffic using wireshark, i can see that it comes in bursts: some data, then nothing for 5 seconds, followed by a retransmit from the client which earns a response of "RPC duplicate...", followed by more data at what i would consider a normal speed over my 100mbit network, then again the pause, retransmit, duplicate...
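The 5-second gap matches the timeo=50 in the mount command above, since timeo is given in tenths of a second. A minimal sketch of that arithmetic follows; the helper names parse_mount_options and timeo_seconds are made up for illustration and are not part of any NFS tool.

```python
# Hypothetical helpers, for illustration only: they parse the -o string
# from the mount command and convert timeo (tenths of a second) to seconds.

def parse_mount_options(opts: str) -> dict:
    """Split a comma-separated mount option string into a dict.

    Flags without a value (e.g. "soft") map to True; key=value
    options keep their value as a string.
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def timeo_seconds(options: dict) -> float:
    """Return the initial retransmit timeout in seconds."""
    return int(options["timeo"]) / 10.0


# Option string copied from the mount command in the post.
OPTS = "rw,user,async,soft,timeo=50,intr,rsize=1024,wsize=1024,noatime"

options = parse_mount_options(OPTS)
print(timeo_seconds(options))  # 5.0
```

So the client waits out one full timeout before retransmitting, which suggests the original request or its reply is being lost somewhere rather than the server being slow to answer.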

the exact behaviour can vary slightly, sometimes immediately after mounting the share.  sometimes when i do a directory listing, it's almost immediate (1/2 second), other times it can take 10 or 15 seconds.

i've been able to exclude a network problem as copying large amounts of data with scp -rp shows no delays and usually pulls 2mb to 10.8mb/sec, slower on smaller files, faster with large, which is normal behaviour.  also, nfs mounts from the same client to an older suse 9.3 system work without problems.  i've even tried turning off tcp window scaling with echo 0 >/proc/sys/net/ipv4/tcp_window_scaling with no change.

locally on the server there are no disk access problems, data reads off the raid-1 at around 220-240mb/sec.

i've run out of ideas where to look, and would be very grateful for any suggestions anyone can offer me.

Last edited by IanBallantyne on Fri Nov 14, 2008 1:45 am; edited 1 time in total

----------

## danomac

What network cards are in the server and client?

----------

## IanBallantyne

server:

# lspci | grep net

00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev f3)

# ifconfig -a

eth0      Link encap:Ethernet  HWaddr 00:19:db:27:58:ef

          inet addr:192.168.1.7  Bcast:192.168.1.255  Mask:255.255.255.0

          inet6 addr: fe80::219:dbff:fe27:58ef/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:3417998 errors:8940 dropped:0 overruns:0 frame:8940

          TX packets:2981300 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:2038070581 (1.8 GiB)  TX bytes:7736647833 (7.2 GiB)

          Interrupt:22 Base address:0x2000

client:

# lspci | grep net

00:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139 (rev 10)

# ifconfig eth1

eth1      Link encap:Ethernet  HWaddr 00:02:44:39:94:E5

          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0

          inet6 addr: fe80::202:44ff:fe39:94e5/10 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:2506193289 errors:39330 dropped:184 overruns:19 frame:0

          TX packets:1909054655 errors:0 dropped:0 overruns:5 carrier:0

          collisions:0 txqueuelen:100

          RX bytes:822717850 (784.6 Mb)  TX bytes:2854713325 (2722.4 Mb)

          Interrupt:10 Base address:0xa000

the switch is an older surecom 100mbit switch.  given that the server is the only gigabit capable machine, 100mbit is ok for now.

although since everything else across this network seems to work, i'm not sure what this information provides...

----------

## danomac

The only reason I asked was because I had problems with NFS/Samba with the Realtek chips 8139 and 8169, which you appear to have on one of your clients. 

I scratched my head for many hours because of this, but eventually replaced it with a 3com card I had lying around - problem went away.

I'm not sure if it's a hardware or driver issue; eventually I just gave up.

Do you have a spare card lying around? May as well test it instead of wasting umpteen hours like I did.

----------

## gentoo_ram

For the O.P.,  both of those ethernet interfaces are showing errors.  That can't be good for performance.  Any kind of error will surely cause retries which will probably slow things to a crawl.

Do you have a bad switch?  Bad cables?  I would solve the hardware issues first.  There's really no reason to have any errors on point-to-point network links using CAT-5/6 cables.

And I've used the parameters "rsize=8192,wsize=8192" to really increase data throughput on the NFS connections.  But, again, solve those ethernet errors first.
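To put a number on those errors, here is a rough sketch that pulls the RX counters out of the ifconfig lines posted above and works out the error fraction. The function name rx_error_rate is made up for this example, and the regex assumes the older "RX packets:N errors:N" layout shown in this thread; newer net-tools versions print a different format.

```python
import re

# Hypothetical helper: extracts "RX packets:N errors:N" from old-style
# ifconfig output and returns errors as a fraction of received packets.
def rx_error_rate(ifconfig_text: str) -> float:
    match = re.search(r"RX packets:(\d+) errors:(\d+)", ifconfig_text)
    if match is None:
        raise ValueError("no RX counters found in ifconfig output")
    packets = int(match.group(1))
    errors = int(match.group(2))
    return errors / packets if packets else 0.0


# Counter lines copied from the ifconfig output earlier in the thread.
SERVER = "RX packets:3417998 errors:8940 dropped:0 overruns:0 frame:8940"
CLIENT = "RX packets:2506193289 errors:39330 dropped:184 overruns:19 frame:0"

for name, text in (("server", SERVER), ("client", CLIENT)):
    print(f"{name}: {rx_error_rate(text):.4%} of received packets had errors")
```

The absolute percentages look small, but on a healthy switched link they should be zero. Running ifconfig twice a few minutes apart and comparing the counters shows whether the errors are still climbing.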

----------

## IanBallantyne

indeed there are errors.  i totally missed that one :-/  this is new in my net here.  i will solve this, then try again.  

thanks for pointing that out.

----------

## IanBallantyne

thanks for pointing out the packet error problem.  it was right under my nose but i didn't see it and feel a little silly...  

in the end, the problem was the switch itself; it seems to have fallen over and started dying a slow death.  power off, power on, switch completely dead.  new (old) switch in place and everything's working again.

thank you for helping me.

----------

## gentoo_ram

Excellent!  Yeah, bad networking infrastructure can cause all kinds of annoying problems.  Glad you got this problem figured out.

----------

