# Is NFS flaky as hell (or is it just me?)

## Sivar

I've always used SMB for shares in the past. It isn't a particularly elegant protocol, and has surprisingly high overhead, but Samba works fine and its shares are compatible with Microsoft systems.

My laptop has a Gentoo partition which recently ran out of space during a major update, thanks to the size of the distfiles directory. I wasn't about to sift through the files to see what I needed, nor was I about to unlink them and have Portage download everything again (OpenOffice and KDE are rather large), so I decided to NFS-mount my home server's distfiles directory and copy over what the laptop had.

NFS flake-out #1: It took about 2 minutes to NFS-mount the share, even though both the server and the laptop were connected to the same gigabit switch with no firewall between them. Both systems have each other's information in /etc/hosts, and I used the IP, not the hostname, to mount the share.
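For the record, the mount itself was nothing exotic -- roughly the following, with a made-up server address (192.168.0.2) standing in for my real one:

```
# 192.168.0.2 is a placeholder for the server's actual address
mount -t nfs 192.168.0.2:/usr/portage/distfiles /usr/portage/distfiles
```
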

NFS flake-out #2: Sometime during the night, I got an "NFS stale file handle" error, and the system update halted. A post on the Linux Kernel Mailing List looks relevant:  *LKML wrote:*   

> I think I have found a bug in the nfs server code that always returns a failure on subtree checks. I described my findings in a post to this list, but nobody answered.

 "...always returns a failure on subtree checks..."? Um, oops?!

NFS flake-out #3: The next morning, I attempted to remount the distfiles share with the "no_subtree_check" export option added, as suggested in that thread, and (after waiting about two minutes again) mount reported that the share didn't exist. Hmm. So I looked at my server's /etc/exports and found that the share was still quite present and accounted for.

Odd.

So I restarted NFS, since it was exhibiting many signs of having been written by Microsoft (flaky, randomly fails, poor testing), and restarting the software should therefore help, just like it does for Windows server software.

Nope, that didn't work either. So I went to my distfiles directory to check whether it had grown too large or anything looked suspicious and... it wasn't there. Mounting the distfiles directory via NFS had led to the directory's complete deletion! Let me repeat that, because it is such a profoundly massive software screw-up: NFS HAD DELETED THE ENTIRE SHARE. What if I had been sharing my entire /usr directory? Or /etc? That's what backups are for, but you shouldn't need to roll tape because a protocol and software which should be insanely mature screws up about as badly as it possibly could.

Not to sound like post hoc thinking, but note that the directory's contents were exported, not the directory itself--it should not even have been possible to delete the directory, regardless of permissions. Further, I do not know of any ebuild script which would request the deletion of the entire distfiles directory, so I doubt that Portage had anything to do with it. Furthermore, it is exceedingly unlikely that this was a security breach of some kind, because the NFS share is protected by a default-deny stateful external (not on the same machine) firewall AND the share was configured to allow access only from the internal network, which at the time had exactly one Linux system connected--the laptop.

The problem was not flaky hardware--the server runs on a Tyan SMP server-grade board with ECC memory and the hard drive which stores the distfiles directory is a Seagate Cheetah X15-36LP SCSI drive on an Adaptec 29160 64-bit server controller. The ethernet card, which has never been problematic in any way, is a PCI-X Intel Pro/1000MT server adapter.

CFLAGS are a conservative 

```
CFLAGS="-march=athlon -O2 -fomit-frame-pointer -pipe"
```

 and my /etc/exports contained 

```
/usr/portage/distfiles 192.168.0.*(rw,anonuid=250,anongid=250,sync)
```

 (prior to adding no_subtree_check). UID/GID 250 is "portage".
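For anyone following along, the same export line with the suggested option added would look like this:

```
/usr/portage/distfiles 192.168.0.*(rw,anonuid=250,anongid=250,sync,no_subtree_check)
```

After editing /etc/exports, running `exportfs -ra` should re-export everything without a full NFS restart.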

This post is more of a warning than a request for assistance. While it is possible that this was all an isolated incident, NFS (which has been around for decades) screwed up so massively in this case that I think I'll stick with Samba.

----------

## db_404

 *Quote:*   

> NFS flake-out #1: It took about 2 minutes to NFS mount, even though both the server and laptop were connected to the same gigabit switch with no firewall between them. Both systems have each other's information in /etc/hosts and I used the IP, not the hostname, to mount the drive.

 

Not sure about the other problems--they may be related to this one--but are you running portmap on your client (net-nds/portmap)? It needs to be running in your default runlevel or NFS partitions will take forever to mount.
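On Gentoo that would be something like the following (assuming the stock init scripts):

```
# start portmap now, and make sure it comes up at boot
/etc/init.d/portmap start
rc-update add portmap default
```
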

----------

## Sivar

 *db_404 wrote:*   

>  *Quote:*   
> 
> NFS flake-out #1: It took about 2 minutes to NFS mount, even though both the server and laptop were connected to the same gigabit switch with no firewall between them. Both systems have each other's information in /etc/hosts and I used the IP, not the hostname, to mount the drive.
> 
>  
> ...

 

Ahh--that was it. That fixed #1. Thanks!

----------

## xbmodder

umount all partitions

----------

## jamapii

I have a box where /etc/exports is sort of nontrivial, like exporting /usr/portage read-only and /usr/portage/distfiles read-write, and restricting the allowed clients.

I had to change /etc/exports to make it work at all, and even then, once some clients mount what they need, the server goes into "permission denied" or "stale NFS handle" mode. When the server is restarted, it works again, so I have a cron job to restart the server.
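The cron job is just a one-liner along these lines (the schedule and path here are illustrative -- nfsserver_restart is the local script whose output appears below):

```
# /etc/crontab -- restart the NFS server nightly (illustrative time and path)
0 4 * * *   root   /usr/local/sbin/nfsserver_restart
```
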

This is an old SuSE installation; the NFS stuff is still from SuSE. The NFS weirdness has always been there. Restarting it says:

```
server:~ # nfsserver_restart 
Shutting down kernel based NFS server                                      failed
Shutting down kernel based NFS server                                      done
Starting kernel based NFS server                                           done
```

The kernel is now 2.4.27.

Overall, it mostly works for me.

----------

