# Lock-up during large NFS transfers

## Tarball

Hi,

I am in the process of setting up a file server on my network.  The machine I am using is a dual 533MHz Celeron (Abit BP6 mobo) with 128MB RAM and a 160GB disk on the HighPoint HPT366 ATA controller (which is on the mobo).

I got Gentoo installed no problem.  I have NFS set up and it seems to be working ok.  For files of a few hundred MB I'm getting transfer rates of 10MB/s.

I am having a problem with extremely large files.  I was trying to transfer a couple of DVD ISOs I had created, each about 3GB.

The problem is that when I am copying one of these ISOs across the network, the server locks up once the transfer reaches about 1.8 - 2.0GB.  I can't ping the server, nor ssh into it.  And if I have a keyboard/monitor connected (it is intended to be a headless box), the caps lock key even stops toggling the LED, so it is a pretty damn hard lockup and I have to hit the big button.

Unfortunately, I'm a bit stuck when it comes to debugging the problem.  There are no messages in the system logs and NFS doesn't seem to write any info anyway.

Any ideas how I can debug this problem, or has anybody had a similar problem?

Under moderate usage the box stays up for many days; it only locks up when I try to transfer multi-GB files.

Cheers

----------

## neysx

I have just tried to transfer a 3.8GB file over NFS and it worked as expected. The server is even similar to yours: mine is a dual P3 with a 160GB Samsung HD on a Promise PDC20265 onboard controller.

It looks more like a hardware failure. Maybe you can try to create a 4GB file on your server and then zip it; that should stress your disk+controller a bit  :Wink: 

```
dd if=/dev/urandom of=/bigfile bs=1024k count=4000

gzip -9 /bigfile
```

Hth

----------

## Tarball

I'll try that when I get home from work.

Just as a matter of reference, what settings are you using for NFS?

(I believe) I'm using NFS v3, with the settings which seem to be the de facto standard on the Gentoo forums:

rw,sync,rsize=8192,wsize=8192,timeo=14,intr,hard   (this is from memory, but I think those are the settings!)

Also, I am using gentoo-dev-sources ( 2.6.7-gentoo-r8 )

----------

## neysx

Nothing special as far as NFS is concerned. I use vanilla-2.6.7 at both ends. *server wrote:*   

> # exportfs -v
> 
> /mirrors        basil.a.la.maison(rw,wdelay,no_root_squash)

  *client wrote:*   

> # mount
> 
> polly:/mirrors on /mnt/polly type nfs (rw,wsize=8192,addr=10.0.0.3)

 BTW, make sure you did not select [ ] Preemptible Kernel in your kernel config. Servers with a high I/O load can b0rk your file system if it is on. I have noticed it can lose DMA and corrupt your disk, but I haven't seen any hard lockups, yet.
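If you're not sure whether the running kernel was built with preemption, a quick check (assuming CONFIG_IKCONFIG_PROC was enabled, or that the source tree the kernel was built from is still at /usr/src/linux) would be:

```
# If the kernel exposes its own config, look for the preemption options
zcat /proc/config.gz | grep PREEMPT

# Otherwise, check the .config in the kernel source tree
grep CONFIG_PREEMPT /usr/src/linux/.config
```

`CONFIG_PREEMPT=y` means preemption is on; `# CONFIG_PREEMPT is not set` means it is off.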

----------

## neysx

Interestingly, a French guy has a very similar problem when moving large files across the net with the same hardware.

I'll keep you posted on that thread if you want (I suppose you don't speak French).

----------

## Pyros

The stress test finished successfully. No hang-up...

I've just played with a 2GB file and zipped it, and it's ok.

I've already hung the box just by transferring several ISOs (700MB each).

I'll try tonight to recompile the kernel (with the latest version) without the preemptible option...

I'll tell you tomorrow...

----------

## Tarball

I have the preemptible kernel compiled in at the moment; I'll recompile tonight.

Incidentally, I'm not sure it is hardware related.  I have Samba running on the server with a share set up for the same directories as the NFS exports (so I can 'see' the files on my Xbox).

When I tried copying the file using the Samba share it did manage to copy it completely, although it only managed a transfer rate of 4MB/s compared to the 10MB/s over NFS.

Pyros: which IDE channel is your hard disk on?  Is it on the HPT controller?

----------

## Pyros

I'm using the HPT controller (the HD is actually hde).

My system hangs during transfers even with FTP or Samba...

Strangely, sometimes I'm able to transfer many gigs without any problem... I can't work out when or why it occurs.

----------

## Tarball

Ok, I created a large file on the server:

```
dd if=/dev/zero of=big_file bs=1M count=3500
```

No problems creating the file.

I then ran gzip, which completed ok; I then gunzipped the file, which again completed ok.

I then tried to copy the file from the server to my desktop machine.  After 2.7GB, the server locked up.

I'm gonna try moving the disk from hde1 to hdc1.

----------

## kerframil

If you're using a 2.6 kernel then use NFS over TCP; it's better despite the apparent "EXPERIMENTAL" status of the option. That entails enabling the feature on the server (CONFIG_NFSD_TCP=y) and mounting with the appropriate option:

```
mount -o tcp,rsize=8192,wsize=8192 ...
```

I believe that the TCP implementation is poor in 2.4 kernels, however.
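To make the TCP mount permanent, an /etc/fstab line along these lines should work (the server name and mount point here are just the examples quoted earlier in this thread):

```
# /etc/fstab sketch -- "polly:/mirrors" and "/mnt/polly" are examples;
# "tcp" switches the transport, the rest matches the options discussed above
polly:/mirrors  /mnt/polly  nfs  tcp,rsize=8192,wsize=8192,hard,intr  0 0
```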

----------

## Tarball

kerframil: From what I have read, the only benefit of NFS over TCP is on a lossy network.  As far as I can tell my network is very stable and I am getting high transfer rates.  Are there any other advantages?

I've moved the disk I was sharing over NFS from hde1 (HPT366 IDE controller) to hdc1.  I've just tried copying my 3.5GB file and it copied without problems at a consistent 10MB/s transfer rate.

Tomorrow, I'll try another couple of big file copies and see how it behaves.

----------

## kerframil

 *Tarball wrote:*   

> kerframil: From what I have read, with NFS over TCP the only benefit is if you have a lossy network.  As far as I can tell my network is very stable and I am getting high transfer rates.  Is there any other advantages?

 

I don't know the particulars, but I have a distinct recollection of reading a post by one of the heavyweight kernel devs on the LKML stating in no uncertain terms that TCP was the way to go with 2.6. Perhaps more development effort has been focussed upon the associated code.

As for advantages, well, it's a darn sight easier to tunnel as well (say, over SSH)  :Wink: 

EDIT: This isn't the post I was thinking of but is nonetheless an interesting read. One would assume that Trond Myklebust knows what he's talking about  :Cool: 

Personally, I don't like the idea of using a fundamentally unreliable transport protocol for something like NFS, despite its roots. It also touches upon how the choice of block size might affect performance. I suspect that 8k may not necessarily be the optimal setting for most workloads these days (the impression I get is that lower read/write sizes are chiefly useful for working around UDP fragmentation issues and packet loss). Remember that the maximum size of a UDP datagram tends to be fairly limited, although I have no idea what the practical maximum is in this case. Section 5.8 of this document has an interesting insight into the potential drawbacks of UDP (reliability aside).

----------

## Pyros

Finished compiling the kernel.

Just tried to copy a 3.4GB file from hde1 with Samba, with no errors...

Need more tests before drawing any conclusion.

----------

## Tarball

Just had the server lock up while transferring from hdc1, so it doesn't seem to be something specific to the HPT IDE controller.

Pyros: what changes have you made to the kernel?

----------

## Pyros

Got the latest gentoo-dev-sources 2.6,

then removed the preemptible kernel option,

compiled and rebooted   :Very Happy: 

still have to do more tests ...

----------

## Tarball

Pyros, have you tried running a uni-processor kernel?

----------

## damg1nc

Quick sanity check... is DMA enabled on the drive?  I've seen servers crash on large file transfers when DMA was turned off.
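As a sketch of how to check (hdparm needs to be installed; substitute your actual device for /dev/hde, which is just the drive mentioned earlier in the thread):

```
# Query the current DMA setting; "using_dma = 1 (on)" means it is enabled
hdparm -d /dev/hde

# Enable DMA if it is off
hdparm -d1 /dev/hde
```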

----------

## Tarball

I'm pretty sure it is on but I will check when I get home.

I have noticed that it doesn't lock up on every large transfer.  Last night I was repeatedly transferring a 3.5GB file.

The first time it copied the file ok but crashed the second time.  On other occasions, it's crashed the first time I tried the transfer.

When looking for information on getting debug info from the kernel, I read about an NMI oopser patch, but it seems to be in reference to the 2.2 kernels; is there something equivalent for the 2.6 kernels?
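For what it's worth, 2.6 kernels on x86 have an NMI watchdog built in, so no separate patch should be needed. A sketch of how to enable and verify it (the bootloader syntax depends on whether you use grub or lilo):

```
# Append this to the kernel line in your bootloader config:
#   nmi_watchdog=1
# After a reboot, the NMI count in /proc/interrupts should keep rising
# if the watchdog is armed; a hard lockup should then produce an oops
# on the console instead of a silent freeze:
grep NMI /proc/interrupts
```

Run the grep a few seconds apart; if the count doesn't increase, the watchdog isn't active.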

----------

## Pyros

No... I want that stupid thing to work with SMP!

I didn't buy a dual-processor board to end up with a single poor Celeron...

Still no crash since my recompile without the preemptible option...

I'll tell you...

----------

## Tarball

I have compiled a kernel with SMP disabled and I haven't had it crash yet.

I have copied 5 x 3.5GB files.

What kernel version are you running, Pyros?

----------

## Pyros

Current version:  2.6.7-gentoo-r9 #1 SMP

still no crash...

cross fingers

----------

## jdgill0

What version of NFS are you using? It makes a difference, as version 2 has a file size limit of 2GB; version 3 overcomes that limit. I have not used version 4, which I believe is supported under 2.6 kernels.
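A quick way to confirm which protocol version the client actually negotiated, rather than assuming (the server name and mount point here are the examples from earlier in the thread):

```
# The options column of /proc/mounts shows the negotiated version
# (look for "v3" or "vers=3" among the options):
grep ' nfs ' /proc/mounts

# Or force version 3 explicitly when mounting:
mount -o nfsvers=3,rsize=8192,wsize=8192 polly:/mirrors /mnt/polly
```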

----------

## Tarball

I am using NFS v3.

I am having better luck now that I have turned off 'Preemptible Kernel', but I have had to mount the directory with 'async' because I was getting a good download rate from the server (10-11MB/s) but very poor upload (500KB/s).

So far no crash with SMP enabled though.
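For reference, 'async' can also be set server-side in /etc/exports: it lets the server acknowledge writes before they reach the disk, which can similarly boost upload speed but risks data loss if the server dies mid-write. A sketch based on the export quoted earlier in this thread:

```
# /etc/exports sketch -- "basil.a.la.maison" is the client name quoted
# earlier; "async" trades write safety for throughput
/mirrors  basil.a.la.maison(rw,async,no_root_squash)
```

Run `exportfs -ra` after editing to re-export with the new options.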

----------

