# High Network Utilization crash?

## eccerr0r

Anyone seen this happening?  Just looking for some "me too's" at least, even if it's not solved...

I try to dump a whole bunch of data (a network computer-to-computer copy of data off another HD) through Gbit ethernet using Linux-3.0.6-gentoo, and it completely crashes the box.   However, if I switch to 2.6.35-r4 (gentoo-sources as well) the problem goes away and I can complete the copy.  Weird!  Linux-3.0.3-vanilla also crashes.

I need to determine whether it's the writing to disk or just the network activity that is causing the problem, but I've ruled out NFS as the culprit (using NFS, it crashes after starting to dump the data, but I was also able to trigger the crash by using 'netcat' to do the same machine-to-machine copy).

Crashing system (haven't gotten any debug information from it because the machine seems to simply hang with no debug data):

Linux-3.0.6-gentoo

destination disks: MDRAID RAID5 on SATA ICH

x86 (Core2 Quad, 32 bit mode)

r8110s based Gbit ethernet on a Gigabyte EP43-UD3L board

ATI RadeonHD 5770 (FGLRX)

4GB RAM/64G PAE

The machine I was copying from

Linux-2.6.21-Custom

source disk: plain single disk SATA on ICH

x86 (Core2 Duo, 32 bit mode)

Marvell Gbit Ethernet on a Foxconn G965MA board

G965 chipset

4GB RAM/64G PAE

Both machines are connected via a Gbit ethernet switch.

Weird...

----------

## Hu

What if you have the receiving netcat write everything to /dev/null?  This will allow you to reproduce the network load without involving the receiving disk.  Similarly, you could try using dd bs=1M if=/dev/zero of=foo on the "receiving" system to generate a substantial disk load with no network involvement.
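A concrete way to run those two isolation tests (sizes, port, and the `crashing-box` hostname are placeholders; netcat option syntax also differs between the traditional and OpenBSD variants):

```shell
# Disk-only load, no network: write a large file of zeros and
# force it out to disk.  Small example size here; scale count
# up to approximate the failing copy.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fsync

# Network-only load, no receiving disk: on the crashing box run
#   nc -l -p 9000 > /dev/null
# and on the source box
#   dd if=/dev/zero bs=1M count=4096 | nc crashing-box 9000
# (traditional netcat takes -l -p PORT; the OpenBSD variant
#  takes -l PORT)
```

If the disk-only test crashes the box, the RAID/SATA path is suspect; if the netcat sink does, it's the network stack or NIC driver.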

----------

## eccerr0r

The machine is otherwise stable... it looks like netcat over the network works (114MB/sec), as does dumping to disk (140MB/sec... blah, crappy raid...) when tested individually.

I think I have some more clues now though; this might still be NFS after all.  I have a feeling what's crashing is the file locking.  I think I had the exports read-only and then it would work fine, but if they were exported read-write then it would crash.  Of course this still isn't expected behavior...

hmm..need more testing.
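One way to test the read-only vs. read-write theory is to flip the export options and re-export (`exportfs -ra`) between attempts. A sketch of the relevant /etc/exports lines, with a hypothetical path and client subnet:

```
# /etc/exports -- toggle between these to see if rw alone triggers it
/mnt/data  192.168.0.0/24(ro,no_subtree_check)
#/mnt/data 192.168.0.0/24(rw,sync,no_subtree_check)
```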

----------

## depontius

I've seen something like this, a year or two back.  I've since tweaked aspects of my installation to make-it-not-happen, though it wasn't out-and-out crashes, rather occasional nearly-minute-long hangs.

My nfs server is an 800MHz P-III, and the problem could be tripped by a single Athlon 64 client.  This was back in the timeframe when Firefox had just started using sqlite to store its instance information and filesystem delays/problems were emerging with the fsync operations.  It seemed to me that the sqlite fsync on a much faster client against a slower server was overwhelming the server.  I moved .firefox over to local disk and symlinked it back to nfs-mounted /home.  The problems went away.
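The move-and-symlink workaround looks roughly like this; throwaway directories stand in here for the real NFS-mounted home and the local disk:

```shell
# Stand-ins for the NFS-mounted home and a local-disk location.
NFS_HOME=$(mktemp -d)
LOCAL=$(mktemp -d)
mkdir "$NFS_HOME/.firefox"        # pretend this is the profile

# Move the profile to local disk, leave a symlink behind on NFS.
mv "$NFS_HOME/.firefox" "$LOCAL/.firefox"
ln -s "$LOCAL/.firefox" "$NFS_HOME/.firefox"

ls -ld "$NFS_HOME/.firefox"       # now a symlink onto local disk
```

Firefox still finds the profile through the symlink, but the fsync-heavy sqlite traffic lands on local disk instead of NFS.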

----------

## krinn

Did you check dmesg for trouble with IRQs? Under heavy load, many motherboards with buggy parts show failures, and the bad IRQ is then thrown out of the IRQ table. The funny part is that any device using it is not reset to grab another one, but is left as-is in that bad state. If that happens to the IRQ your HDD controller is using, you can expect slowdowns/freezes and crashes; same for the network card. It can happen just because too many devices use the same IRQ, or because some device doesn't like sharing its IRQ with anyone.

But I suppose it might not be that, as your dmesg should have already reported it.
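A quick, read-only check for that kind of IRQ trouble, using nothing beyond procfs and dmesg:

```shell
# Devices sharing an interrupt show up on the same line here.
cat /proc/interrupts

# The kernel logs "irq NN: nobody cared" followed by
# "Disabling IRQ #NN" when it throws an IRQ out of service.
dmesg 2>/dev/null | grep -iE "nobody cared|Disabling IRQ" || true
```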

----------

## depontius

As I said, these problems were a year or two back.  I don't remember all of the steps I took at the time to try to diagnose this.  I also remember that at that time there were known problems with nfs under heavy load, so I had a heavy presumption that that was the problem, rather than anything more fundamental.

It would be fairly easy to "go back" into the trouble realm: I would just have to remove the .firefox symlink from /home and "cp -a" the .firefox from local space back to /home.  Enough other things may have changed, though, that I wouldn't get back into trouble.  One of those other changes was to start using cachefilesd to cache my nfsv4 /home, but AFAIK that doesn't cache writes, so if that's the problem, it should still exist.  It is possible, however, that the nfsv4 write path is changed sufficiently by cachefilesd that that alone would ease my old problems.

----------

## eccerr0r

More interesting is that it seems to die at around the same place each time when I start the particular copy.

Also, not much data needed to transfer before it crashed.  I think only a few KB of this multi-GB transfer had gone over when the crash occurred, which suggests the locking mechanism needed at the beginning of the transfer could be at play.  I don't think the amount of data matters; it seems a specific packet is sufficient to hang the server.

Unfortunately I destroyed the source copy of the data in question, so this will probably remain a bit of a mystery unless I build another data set that can repeat it...  But I deeply suspect this is a software issue rather than a hardware one, though I can't pin it on one or the other yet.

(I was copying everything off that one disk because I wanted to convert that core2duo to a 64-bit install!)

----------

## depontius

Come to think of it, I recently moved something over 5G onto NFS with no problem.  The photos had been pulled off of SD cards onto my laptop.  I don't yet have an SD reader for my deskside machine, and my laptop isn't set up for my NFS.  (It's a work laptop.)  I used scp to copy the photos from the laptop onto the deskside - in NFS space, which meant that they were being copied from the laptop through the deskside to the NFS server.

This is over a full-duplex 100Mbit LAN.  Even though it's really 2 one-way transfers, I suspect that there's enough handshaking overhead that neither got the full 100Mbit rate.  Either my network is in better shape several years later, or that little bit of double-transfer degradation made some difference, or it simply mattered that I started the transfer and walked away, only occasionally checking to see how it was going.  OTOH, nothing crashed.

----------

## HeissFuss

I haven't heard of this issue crashing an entire system, but THP was added in 2.6.38 and is known to cause application slowness/crashes if you have a lot of filesystem writes pending or are otherwise low on free memory.  Did you enable transparent huge pages in your 3.0 kernels?
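Checking is quick; the sysfs file only exists when the kernel was built with CONFIG_TRANSPARENT_HUGEPAGE:

```shell
# Current THP mode; the bracketed value is the active one,
# e.g. "[always] madvise never".
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null \
    || echo "THP not built into this kernel"
```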

----------

## loopx

 *HeissFuss wrote:*   

> I haven't heard of this issue crashing an entire system, but THP was added in 2.6.38 and is known to cause application slowness/crashes if you have a lot of filesystem writes pending or are otherwise low on free memory.  Did you enable transparent huge pages in your 3.0 kernels?

 

Wooooow, thank you very much. I was thinking that NFS was big s***, but in fact it's the problem you pointed out. So now, I hope to fix that one day, because it's very annoying to be stuck for a minute when copying over NFS and trying to google to find out why it's hanging ...

EDIT: check that : http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt

I've done this :

```
# valid values for this knob are always / madvise / never
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

and the problem seems to be gone  :Smile: 

In the kernel, I will now try to rebuild with that new setting :

```
 .config - Linux/x86_64 3.0.6-gentoo Kernel Configuration

 CONFIG_TRANSPARENT_HUGEPAGE_MADVISE:

 Enabling Transparent Hugepage madvise, will only provide a
 performance improvement benefit to the applications using
 madvise(MADV_HUGEPAGE) but it won't risk to increase the
 memory footprint of applications without a guaranteed
 benefit.

 Symbol: TRANSPARENT_HUGEPAGE_MADVISE [=y]
 Type  : boolean
 Prompt: madvise
   Defined at mm/Kconfig:333
   Depends on: <choice>
   Location:
     -> Processor type and features
       -> Transparent Hugepage Support (TRANSPARENT_HUGEPAGE [=y])
         -> Transparent Hugepage Support sysfs defaults (<choice> [=y])
```

 :Wink: 

EDIT2: I confirm : it works very well now  :Smile: 

----------

## depontius

Does this have to do with THP on the NFS server, client, or both?

I have an i686 machine serving NFSv4 and several amd64 clients.  The server is running some level of 2.6.39, but since it's i686 it doesn't have THP.  I believe the clients are all running THP.  I had a bout of performance problems a bit over a year ago, but they had gone by the wayside.  About a week or two back, I noticed another "temporary hang" reminiscent of the bad old days.  I'm wondering if I'm having the THP problem.

----------

## eccerr0r

The server is the machine crashing for me, so I suppose that's the machine that needs to have attention...

Unfortunately I don't have transparent huge pages enabled (but regular huge pages are enabled)...

----------

## loopx

In my case, the server is a Synology DS411+II, which has no problem. The client (Gentoo) had the problem, but with the new configuration in the kernel, no more problem  :Wink: .

Yes, NFS is like a "slow device" over a 100Mbit network. I think it was loading memory up with data to write to NFS; now that's limited, and so there are no more hangs at all. I was experiencing a random hang every 1-2 minutes, each lasting ... 1 or 2 minutes ... :-/

now it works like a charm  :Smile: 
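That write-buffering theory lines up with the kernel's dirty-page thresholds, which cap how much unwritten data can pile up before writers are forced to block on writeback. They can be inspected read-only:

```shell
# Percent of memory allowed to fill with dirty (unwritten) pages:
# the first starts background writeback, the second blocks writers.
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio
```

Lowering these (via sysctl) is a common mitigation when a fast writer stalls against a slow NFS target.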

----------

## loopx

 *eccerr0r wrote:*   

> The server is the machine crashing for me, so I suppose that's the machine that needs to have attention...
> 
> Unfortunately I don't have transparent huge pages enabled (but regular huge pages are enabled)...

 

At my workplace, we have one server running EXT4 and NFS for VMware ESXi (used as backup for virtual HDDs + thin provisioning). The server has high nice time and transfers are not as fast as FTP, but I think this is normal with NFS. I've not checked the THP settings ...

----------

## eccerr0r

Hmm... I guess I'm seeing this pop up again, sort of...

Once again it's the same two machines - a c2q with i686 3.2.1-gentoo-r2 and a c2d x86-64 machine also running 3.2.1-gentoo-r2.

I mounted the c2q's export on the c2d, and when starting to copy a bunch of files from the c2q to the c2d, the copy process would hang in D state, basically requiring a client-side reboot to clear up.

Still no signs of what's going on - no dmesg messages...  

Now what's weird: copying from/to my athlonxp machine through nfs works perfectly fine!!!

Ugh...
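When a copy d-state hangs like that, you can at least see where it is stuck; the ps line is unprivileged, and the sysrq trigger (root only) dumps blocked-task kernel stacks into dmesg:

```shell
# List processes in uninterruptible sleep (state D) and the
# kernel function they are waiting in.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# With root, dump kernel stacks of all blocked tasks to dmesg:
#   echo w > /proc/sysrq-trigger
```

The wchan column usually points at the NFS or block layer and narrows down which side is wedged.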

----------

