# NFS server (RPC connection) fails every few days [SOLVED]

## dfelicia

I've been running a Gentoo file server for a few years, and it's been rock solid. Files are served via Samba, FTP, and NFS. Lately, after NO changes to the box, NFS clients have started getting RPC errors. The first few times, I chalked it up to a fluke and just restarted NFS. It's been happening every other day for a week, though, so something is wrong.

Once in the "bad" state, the server can't even connect to itself:

```
# showmount
rpc mount dump: RPC: Unable to receive; errno = Connection refused

# showmount -e 127.0.0.1
rpc mount export: RPC: Unable to receive; errno = Connection refused
```

(nothing in /var/log/messages or dmesg)

I haven't touched anything on the server in over a year (not even synced Portage). How can I debug this? Running `/etc/init.d/nfs restart` "fixes" it for a while, but the problem always returns. (I'm so far out of sync with Portage now that I don't want to try anything that involves upgrading components.)

Last edited by dfelicia on Sat Jan 06, 2007 2:34 am; edited 2 times in total

----------

## mjrosenb

I've had some problems with NFS in the past. The faults usually traced back to bad wiring, bad NICs, or bad memory. As far as I can tell, NFS uses UDP to be fast, so any transfer problems show up as weird-looking errors and corruption. How many computers are remotely mounting the NFS partition?

----------

## dfelicia

 *Quote:*   

> How many computers are remotely mounting the NFS partition?

 

20+

The weird thing is that this has been rock-solid for years.  Next time it happens I guess I should try rpcinfo to see if I can narrow this down to one of: portmapper, mountd, nfsd.  When/if I find the culprit, I can start the offending daemon with debug output enabled.

Any way a change in the network could be causing this?  I know ITS had a switch die recently, causing them to have to reroute the VLANs.

----------

## mjrosenb

 *Quote:*   

> Any way a change in the network could be causing this?

 

As far as I can tell, as long as the connection is reliable, the configuration doesn't really matter.

PS: It's always fun to see something like this:

```
mjrosenb@Guide:distfiles%>md5sum vim-7.0-r1.tar.bz2
e8c8bb2b5c03f49d40e9736b7d479b9b  vim-7.0-r1.tar.bz2
mjrosenb@Guide:distfiles%>umount /usr/portage && mount /usr/portage && md5sum vim-7.0-r1.tar.bz2
02d869a473426b952cc85526f92b4120  vim-7.0-r1.tar.bz2
mjrosenb@Guide:distfiles%>umount /usr/portage && mount /usr/portage && md5sum vim-7.0-r1.tar.bz2
d1531f4e49faee910923b156a8221177  vim-7.0-r1.tar.bz2
```

In that particular case, the problem was my router randomly corrupting packets.
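If you want to test for this kind of silent corruption without eyeballing hashes, something like the following might help. It is only a minimal sketch, assuming `md5sum` is installed; `checksum_stable` and `REMOUNT_CMD` are names I made up for illustration, not part of any tool.

```sh
#!/bin/sh
# Hypothetical helper: checksum FILE several times (default 3) and
# report whether every pass produced the same MD5.
# REMOUNT_CMD is an optional command run before each pass, for example
# a umount/mount pair, so the data is re-read over the network instead
# of coming from the local cache.
checksum_stable() {
    file=$1
    passes=${2:-3}
    first=
    i=0
    while [ "$i" -lt "$passes" ]; do
        if [ -n "${REMOUNT_CMD:-}" ]; then
            sh -c "$REMOUNT_CMD" || return 2
        fi
        sum=$(md5sum "$file") || return 2
        sum=${sum%% *}              # keep only the hash, drop the file name
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "MISMATCH on pass $((i + 1)): $sum != $first" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "stable: $first"
}
```

With `REMOUNT_CMD="umount /usr/portage && mount /usr/portage"` set, calling `checksum_stable vim-7.0-r1.tar.bz2 5` would repeat the experiment above five times and return non-zero on the first differing hash.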

----------

## dfelicia

I've narrowed this down to rpc.mountd crashing.

```
ogre # rpcinfo -u localhost nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting

ogre # rpcinfo -u localhost portmapper
program 100000 version 2 ready and waiting

ogre # rpcinfo -u localhost mountd
rpcinfo: RPC: Timed out
program 100005 version 0 is not available
```

I added the option `-d all` for debug output, but it still crashes without a trace.

I'm really at a loss for how to proceed.
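One thing that might at least pin down when mountd dies is a small watchdog that runs the same `rpcinfo -u` probes on a schedule and timestamps the result. This is a rough sketch, not a tested fix: `probe_rpc` and `check_services` are hypothetical names, and it assumes `rpcinfo` exits non-zero when a program does not answer.

```sh
#!/bin/sh
# RPCINFO can be overridden for dry runs; by default the real tool is used.
RPCINFO=${RPCINFO:-rpcinfo}

# Return 0 if SERVICE ($2) answers a UDP probe on HOST ($1).
# Assumption: rpcinfo exits non-zero on "not available" / timeout.
probe_rpc() {
    "$RPCINFO" -u "$1" "$2" >/dev/null 2>&1
}

# Print one timestamped status line per service and
# return 1 if any of them failed to answer.
check_services() {
    host=${1:-localhost}
    status=0
    for svc in portmapper nfs mountd; do
        if probe_rpc "$host" "$svc"; then
            state=up
        else
            state=DOWN
            status=1
        fi
        echo "$(date '+%Y-%m-%d %H:%M:%S') $host $svc $state"
    done
    return "$status"
}
```

Calling `check_services localhost` from cron every minute and appending the output to a log file would give a time of death for mountd, which can then be matched against client activity or other logs.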

----------

## dfelicia

Ah haaaa.  I think I am getting closer:

 *Quote:*   

> 
> 
> https://rhn.redhat.com/errata/RHSA-2004-072.html
> 
> When mounting a directory, rpc.mountd could
> ...

 

```
ogre ~ # showmount
Hosts on ogre:
192.168.1.1

ogre ~ # nslookup 192.168.1.1
Server:         1.2.3.4
Address:        1.2.3.4#53

** server can't find 1.1.168.192.in-addr.arpa: NXDOMAIN

ogre ~ # nslookup dfelicia3
Server:         1.2.3.4
Address:        1.2.3.4#53

Name:   dfelicia3.mycompany.com
Address: 192.168.1.1
```

So it looks like I may be able to blame the network for this one. I'll get ITS to look at why reverse lookups are failing, and hopefully that puts an end to the crashes.
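To check the other clients the same way, a loop like this could flag every address that lacks a reverse mapping. It is a sketch under assumptions: `reverse_name` and `check_reverse` are hypothetical helper names, and it uses `getent hosts` rather than `nslookup` so that `/etc/hosts` is consulted as well as DNS.

```sh
#!/bin/sh
# Build the PTR query name for an IPv4 address by reversing its octets
# and appending .in-addr.arpa. Returns non-zero if the input does not
# have exactly four dot-separated fields.
reverse_name() {
    echo "$1" | awk -F. '
        NF == 4 { print $4 "." $3 "." $2 "." $1 ".in-addr.arpa"; ok = 1 }
        END     { exit !ok }'
}

# Print every address given as an argument that has no reverse mapping.
# Returns 1 if at least one address could not be resolved.
check_reverse() {
    missing=0
    for ip in "$@"; do
        if ! getent hosts "$ip" >/dev/null 2>&1; then
            echo "no reverse mapping for $ip ($(reverse_name "$ip"))"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_reverse 192.168.1.1` should print a line naming `1.1.168.192.in-addr.arpa` for as long as that lookup keeps failing.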

----------

## dfelicia

I upgraded nfs-utils and my kernel, and the crashes stopped.

----------

