# Large file transfer kills LAN. [UPDATE: not solved]

## platojones

I've just realized that fairly large network transfers kill my network completely.  I've just tried to scp a large file (~160 MB) between one Gentoo Linux box and another, and it killed the networking completely on the target box.  It does this every single time at around ~10 MB.  After scp stalls, I cannot even ping the target box, though it seems to be running fine otherwise.   The only way I've found to recover from this is to reboot the target box.  The source box is completely unaffected by this.  I'm tempted to blame the driver/pcmcia-cs for this, but I have no real idea.  I get no log messages indicating any errors at all on the target box when this happens, except that networking services complain about losing networking, which I would expect since that is exactly what happens.  The troubled box has the following hardware/software config:

Dell Inspiron 5000e laptop, configured as a router/firewall.  It runs Linux 2.6.8-r3/r1 (tried both, with the same problem).  The network cards use the Asix AX88190 driver and are controlled through PCMCIA-CS version 3.2.5-r1.  I don't know what else to add, but if anybody has an idea about what might be going on here, I'd sure love to hear about it.  BTW, I tried kernel versions with pre-emptive scheduling turned off and had the same problem.  Routing traffic through this box as a gateway has no effect whatsoever, i.e., when acting as a gateway/router, I've never had a problem with it.  This one has me completely baffled.

Last edited by platojones on Sun Sep 26, 2004 4:51 am; edited 3 times in total

----------

## platojones

Well, so far nobody seems to have this problem.  One more fact that seems to be relevant is that a simple 'ifconfig eth1 down' followed by 'ifconfig eth1 up' brings the interface back.  Another possibly relevant fact is that the interface is extremely slow: it's set to 100BaseT, but it gets a maximum real transfer rate of only 1.2 Mb/s while it's up.  

BTW, this is a laptop (Dell Inspiron 5000e) I'm using as a Gateway, and it's got 2 Linksys NP100 v2 Network Anywhere cards, one for the extern IF and one for the internal lan IF.  

So, any takers on this one?

----------

## lwithers

Is there anything helpful in dmesg after the interface goes down?

----------

## platojones

Sadly no.  That's what is so baffling about this: there are no log entries of any kind that might hint at the source of the problem.  Furthermore, ifconfig shows no errors or overflows.   I suspect driver issues, but it's hard to prove or disprove when you have nothing to go on.

----------

## ReneeTeunissen

Hi, 

Please check the status of your network card. 

If it stalls and you are able to log in on this box (on the console, for example), run ifconfig on the network cards and see if any errors are reported. 

It could be a bad cable or hub/switch. Check if you see any transmit/receive errors on all network cards.

----------

## platojones

After it goes down, here's the ifconfig output:

```
eth1      Link encap:Ethernet  HWaddr XX:XX:XX:XX:XX:XX
          inet addr:192.168.9.1  Bcast:192.168.9.255  Mask:255.255.255.0
          inet6 addr: xxxx::xxx:xxxx:xxxx:xxxx/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:63514 errors:0 dropped:0 overruns:0 frame:125
          TX packets:56708 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:25431344 (24.2 Mb)  TX bytes:47514199 (45.3 Mb)
          Interrupt:11 Base address:0x320
```

Nothing at all in any logs, except later when nmbd goes down because it can't find the link anymore.  I have to explicitly do an 'ifconfig down' and an 'ifconfig up' to bring the link back.  A flood ping (ping -fs65507) has the same effect as the large file transfer.  Seems like some internal overflow somewhere deep in the kernel/driver, but it's happening quietly, so it's hard to pin it down.  Link stays up and works fine as an interface for the gateway, probably because the gateway never pushes so much data so quickly, but again, that's just a guess.
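The counters in the ifconfig output above can also be checked mechanically rather than by eye. What follows is only a minimal sketch in Python: the function name `nonzero_error_counters` and the choice of which fields count as "error" fields are assumptions made for illustration, and the sample text is copied from the RX/TX lines posted above. It does not query a live interface.

```python
import re

# RX/TX lines copied from the ifconfig output posted above.
IFCONFIG_SAMPLE = """\
RX packets:63514 errors:0 dropped:0 overruns:0 frame:125
TX packets:56708 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
"""

# Assumption for this sketch: these are the counters that should stay at
# zero on a healthy link.
ERROR_FIELDS = ("errors", "dropped", "overruns", "frame", "carrier", "collisions")


def nonzero_error_counters(ifconfig_text):
    """Return {(direction, field): value} for every non-zero error counter."""
    found = {}
    for line in ifconfig_text.splitlines():
        line = line.strip()
        # Lines starting with "RX packets"/"TX packets" carry per-direction
        # counters; anything else is treated as link-wide.
        if line.startswith(("RX packets", "TX packets")):
            direction = line.split()[0]
        else:
            direction = "link"
        for name, value in re.findall(r"(\w+):(\d+)", line):
            if name in ERROR_FIELDS and int(value) > 0:
                found[(direction, name)] = int(value)
    return found


if __name__ == "__main__":
    print(nonzero_error_counters(IFCONFIG_SAMPLE))
```

Run against the posted output, the only counter this flags is `frame:125` on the RX line. Every other error field is zero, so that one value may be worth watching across a stall.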

----------

## platojones

OK, after just removing my foot from my mouth, I can report that I did get some logging output from the flood ping:

```
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 56796, status=0xda nxpg=0xdb.
Aug 29 12:33:57 furrball kernel: eth1: next frame inconsistency, 0xdb
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 5396, status=0x12 nxpg=0x13.
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 32912, status=0x17 nxpg=0xd7.
Aug 29 12:33:57 furrball kernel: eth1: next frame inconsistency, 0xd7
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 5396, status=0x12 nxpg=0x13.
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 32912, status=0x17 nxpg=0xd7.
Aug 29 12:33:57 furrball kernel: eth1: next frame inconsistency, 0xd7
```

I didn't see that with the large file transfer though, so I'm not sure if it is an artifact of the flood ping itself.
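To make the "bogus packet size" messages easier to compare between runs, they can be pulled apart with a small parser. This is a hedged sketch: the regular expression is written against the exact message layout shown above, the helper name `parse_bogus_packets` is made up for illustration, and 1518 bytes is the maximum standard (non-jumbo) Ethernet frame size, used here only as a sanity threshold.

```python
import re

MAX_ETHERNET_FRAME = 1518  # bytes: largest standard (non-jumbo) Ethernet frame

# A few of the lines posted above, used as sample input.
LOG_SAMPLE = """\
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 56796, status=0xda nxpg=0xdb.
Aug 29 12:33:57 furrball kernel: eth1: next frame inconsistency, 0xdb
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 5396, status=0x12 nxpg=0x13.
Aug 29 12:33:57 furrball kernel: eth1: bogus packet size: 32912, status=0x17 nxpg=0xd7.
"""

# Matches only the "bogus packet size" lines; other kernel messages are skipped.
BOGUS_RE = re.compile(
    r"(?P<iface>\w+): bogus packet size: (?P<size>\d+), "
    r"status=(?P<status>0x[0-9a-f]+) nxpg=(?P<nxpg>0x[0-9a-f]+)"
)


def parse_bogus_packets(log_text):
    """Return one dict per 'bogus packet size' message found in log_text."""
    events = []
    for match in BOGUS_RE.finditer(log_text):
        events.append({
            "iface": match.group("iface"),
            "size": int(match.group("size")),
            "status": int(match.group("status"), 16),
            "next_page": int(match.group("nxpg"), 16),
        })
    return events


if __name__ == "__main__":
    for event in parse_bogus_packets(LOG_SAMPLE):
        oversized = event["size"] > MAX_ETHERNET_FRAME
        print(event["iface"], event["size"], "oversized" if oversized else "plausible")
```

Every size in the sample is far above 1518 bytes, so whatever the card handed to the driver in those cases cannot have been a real frame length.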

----------

## ReneeTeunissen

What you could do is use the world-famous math solution I learned in my discrete math classes, called trial and error  :Wink: 

Did you try to transfer a file the other way around? Does that fail as well? The same goes for swapping network cards and cables to see if that gives a solution. 

If you think it's a bad driver, try another brand or model of network card that uses a different driver. Does the same thing happen if you do a file transfer over localhost? 

Please let me know if you find anything weird...

----------

## ReneeTeunissen

About your latest answer: try other cables, if possible, or route them differently. Though this does not seem to be the problem, since you do not have transmit/receive errors. Did you check all cards on both ends for errors?

----------

## platojones

Renee,

   Sorry for the delayed response, but it took a while to do the hardware swaps, and I was without networking for the duration.  Well, trial and error is a very useful methodology for networking (which, as they say, is just one letter away from 'notworking'), but alas, card and cable swaps have had no effect.  Should have tried that first anyway, but the lack of ifconfig errors seemed to indicate something else going on.  I feel relatively confident that it's not a hardware issue at this point.  I have confirmed that the errors I posted earlier about the flood ping definitely do not happen with the large file transfer shutdown, so I think those errors are incidental to the underlying problem.  I've verified both ends of the link, btw, and no errors are on either side.  Of course, it's the gateway link which goes down, so I'm quite sure that the client is not at fault here.  I've also tried shutting down the other interface on this machine (eth0, which goes to the cable modem) and shutting down the firewall to see if those had any impact, and they do not.  Oh, I do have another idea...I'll try a live cd, and see if I can get a basic link set up to test.  That's about all I can come up with at this point.

----------

## platojones

*Bump*

----------

## DJPooky

You seem to have a problem like I do. ( https://forums.gentoo.org/viewtopic.php?t=217407 )

So you don't have to read the link: for me it happens during any data transfer. An emerge, a download, or even after a lot of web surfing. 

I get ZERO errors, nothing seems to be wrong.

I'm fairly sure it's not a hardware thing because it works in Windows and it works when I'm booting from the live CD.

Tell me if I can help, or better yet, if you've figured it out. Thx

Pooky

----------

## platojones

Your problem seems very similar.  I've started noticing that my link goes down after a period of data transfer as well, not just large file transfers.  The file transfers do it immediately and every time.  This is the most baffling issue I've had with Linux yet.  I feel confident that it's not broken hardware, since I have 2 cards and when I swapped them, I got the exact same results.  I'm currently using the 2.6.8 gentoo-dev-sources kernel, but I went back to see if the problem still exists with the 2.6.7 series, and if anything, it was much worse.  The link would go down almost immediately when I started transferring a file, where it would go considerably further with the 2.6.8 series.  I've been thinking for a while now that it's the kernel driver (axnet_cs) that's the culprit, but since I get no diagnostics, it's hard to tell.

----------

## platojones

Ok, after digging around the forums a bit more, I noticed a kernel config someone posted and it gave me an idea to try.  I'm running a Dell Inspiron 5000e laptop, so all my networking is handled through the pcmcia interface provided by the kernel and the pcmcia_cs package.  I looked at my config and, for some reason, I had a networking driver built into the kernel that had nothing to do with pcmcia.  I removed it and now my link stays up consistently.  I've probably just jinxed it, but I'm on my fourth large file transfer without the link going down, where previously I couldn't even get through one.   Of course, I have no idea how this worked, but I've been so baffled by this I didn't even have any idea about where to start to find it.  

There is still a huge problem performance-wise, since I have a 100 Mbit link (confirmed) on both sides, but I get dreadful throughput.  This is with scp:

bigfile.zip                                   37%   61MB 660.8KB/s   02:37 ETA

Yes, I do have dma turned on for my hard drive.  Nevertheless, the link stays up now.  Just in case someone has a similar setup to mine, here's what I did.  For some reason I had the Ethernet CONFIG_8169 turned on, though with a laptop, this is obviously not needed.  When I turned it off and rebuilt the kernel, everything stays up now.  Ok, so now to work on the throughput.  I wonder if it's because I have IPV6 built into the kernel....

----------

## ReneeTeunissen

 *platojones wrote:*   

> Of course, I have no idea how this worked, but I've been so baffled by this I didn't even have any idea about where to start to find it.

 

Me too. I see the same problem you describe a lot with friends/family using Win-XP, but so far not with Linux. I think you should mail this to the author of the network driver, and tell him/her which driver you disabled to get it working again. This is *really* something for a FAQ or howto.. 

 *platojones wrote:*   

> 
> 
> There is still a huge problem performance-wise, since I have a 100 Mbit link (confirmed) on both sides, but I get dreadful throughput.  This is with scp:
> 
> bigfile.zip                                   37%   61MB 660.8KB/s   02:37 ETA
> ...

 

OK, you tried the obvious  :Wink: 

What performance do you get if you copy it to the same machine, thus using loopback and not the network? Keep in mind that the PCMCIA interface does not have the bandwidth that PCI does, so there _might_ be a few bottlenecks in this game. 

Use the loopback connection to get the fastest network driver on earth, and see how the transfer does there. 

If this goes at about the same speed, the "problem" might reside in the PC hardware. Remember that scp has to encrypt everything before it goes over the wire, which means heavy CPU load.

If you are on your own (private) network without unencrypted wifi links, so that network/password sniffing is not a serious issue, you could try ftp and see how that does. 

ftp sends your passwords in cleartext over the wire, so they might get caught if someone is sniffing your network. In private/home network situations, using unencrypted links/protocols is probably not a risk, though I can't judge that for you.

So try to compare things and see how it goes. If you have a slow CPU, scp might be the problem. Using ftp for the same transfer would rule that out if the performance is about the same. Try the loopback to see if the network is the problem. If they are in the same range, it might be the hard disk that troubles you.   

Anyway, do some testing, think about the results, and draw your conclusions. You might post them here so we can debate them  :Wink: 

Btw, a 100 Mbit/s full-duplex link gives you roughly 12.5 MByte/s of theoretical transfer speed (100 / 8), but you will never get that far. Remember the latencies in hardware, drivers, etc., on top of the packet overhead. As you might know, Ethernet, IP, TCP and the application protocol each add several bytes to every packet before it goes over the wire. This all brings down your transfer speed.
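The overhead argument can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes a 1500-byte MTU (as shown in the ifconfig output earlier in the thread), plain 20-byte IP and TCP headers with no options, and the usual Ethernet per-frame costs (14-byte header, 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap). It also assumes that scp's "KB" means 1024 bytes. The function names are made up for illustration.

```python
LINK_MBIT = 100                  # nominal link speed in Mbit/s
MTU = 1500                       # bytes, as reported by ifconfig
IP_HEADER = 20                   # bytes, assuming no IP options
TCP_HEADER = 20                  # bytes, assuming no TCP options
ETH_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble + inter-frame gap


def raw_bytes_per_s(link_mbit):
    """Line rate in bytes per second, ignoring all protocol overhead."""
    return link_mbit * 1_000_000 / 8


def tcp_goodput_bytes_per_s(link_mbit, mtu=MTU):
    """Best-case TCP payload rate for full-sized frames on an idle link."""
    payload = mtu - IP_HEADER - TCP_HEADER   # application bytes per frame
    on_wire = mtu + ETH_OVERHEAD             # bytes the frame occupies on the wire
    return raw_bytes_per_s(link_mbit) * payload / on_wire


if __name__ == "__main__":
    raw = raw_bytes_per_s(LINK_MBIT)
    best = tcp_goodput_bytes_per_s(LINK_MBIT)
    observed = 660.8 * 1024                  # the scp rate quoted above
    print(f"raw line rate:    {raw / 1e6:.2f} MB/s")
    print(f"best-case TCP:    {best / 1e6:.2f} MB/s")
    print(f"observed via scp: {observed / 1e6:.2f} MB/s "
          f"({100 * observed / best:.1f}% of best case)")
```

Under these assumptions, protocol overhead alone costs only about five percent of the line rate. It cannot by itself explain a transfer running at well under a tenth of the link speed.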

----------

## platojones

Renee,

   I did some testing with loopback, and you were right: scp is slow.  I also happened upon a thread that confirms this, with a person getting the same speeds I am!

https://forums.gentoo.org/viewtopic.php?t=201024&start=0&postdays=0&postorder=asc&highlight=netcat

I found that I could get speeds upward of 3.2 MB/s on the loopback interface using scp.  With a samba file transfer, from one machine to another, the rate stayed pretty consistent at 1.2 MB/s.  But samba has some rather large overhead as well, so that may not be too bad. 

Then I installed 'netcat', which has relatively low overhead, and began transfer tests that way.  And once again, my interface began going down, just like it had before.  I could not accurately judge the speed at which netcat was transferring data, since the link kept going down, but judging by the size of the transferred file and the extremely short time netcat was able to run, I would not be surprised if the rates were > 4 MB/s.  

This seems to confirm that if I stay with 'slower' transfer mechanisms, the link will stay up, but when I use very fast transfer tools, like netcat, it takes the link down.  *Sigh*.  It seems that the problem is still there, only it's not as severe since I removed that unneeded driver.   Of course, removing that driver may be a red herring and something else has made the link seem more stable, but I've changed nothing on my box since I began testing (other than removing that driver), so I don't know why else the link would have gotten better.

I'm still betting there is something wrong with my pcmcia network driver (wrong in the sense that it doesn't support my card very well).  Ah well, more mysteries to solve.


----------

## platojones

Oh, one other test I did that I should have posted earlier, but didn't.  The problem only appears to be one way.  It's only when the gateway box (the one whose interface goes down) is RECEIVING that there are problems.  Sending to the other box works with no problems; in fact, a netcat UDP transfer goes about 5 times faster.  So, the link stays up when transmitting, and the speed is very good.  On receiving, the eth1 link gets rather poor performance and the link drops.

----------

## ReneeTeunissen

Hmm. What you could try is to reduce the load on the bus and the interrupts.

What happens if you unhook the uplink interface of the gateway box? Or if you leave the cards in place and swap the card IP settings and cables? It could be a "too many interrupts" issue: if an interrupt arrives while a previous one is still being handled and gets missed, things like this might happen. That should not occur with current hardware speeds, but you could test whether this is the case.  

I did some tests with my laptop and a 3Com PCMCIA card (600 MHz Celeron Toshiba laptop and an AMD 1800, 512 MB server), with file transfers in both directions, and had no problems. I get up to 3 MB/sec if I 'copy' chars from /dev/zero to the other end with netcat and redirect everything to /dev/null, so there are no disks involved in the transfer. 
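The same disk-free idea (zeros in, discard on the other end) can be sketched in Python for anyone without netcat at hand. This is only an illustrative stand-in, not the netcat test itself: `measure_throughput` is a made-up helper, it runs sender and receiver in one process, and with the default host it only exercises the loopback interface. To test a real link, the receiving half would have to run on the other machine.

```python
import socket
import threading
import time


def measure_throughput(total_bytes=8 * 1024 * 1024, host="127.0.0.1",
                       chunk=64 * 1024):
    """Send total_bytes of zeros over TCP; return (bytes_received, MB_per_s)."""
    received = [0]
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))            # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def sink():
        # Receiver: count the bytes and throw them away (like > /dev/null).
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk)
                if not data:
                    break
                received[0] += len(data)

    receiver = threading.Thread(target=sink)
    receiver.start()

    payload = bytes(chunk)            # zero-filled buffer (like /dev/zero)
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total_bytes:
            n = min(chunk, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()                   # wait until the receiver has seen EOF
    elapsed = time.perf_counter() - start
    server.close()
    return received[0], received[0] / elapsed / 1e6


if __name__ == "__main__":
    got, rate = measure_throughput()
    print(f"received {got} bytes at {rate:.1f} MB/s")
```

Because no file is read or written, whatever rate this reports reflects the network stack and CPU only, which is the point of the /dev/zero to /dev/null test.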

See also http://people.redhat.com/dledford/benchmark.html

And try ping -f with large packet sizes to see how things work out. 

Btw, try searching Google for your driver/card, or try to borrow a card of a different brand / with a different driver from a friend. 

I don't know a solution for your problem, only some trial and error and things you could try. 

Or look into the code of your driver to see if there is an email address in it, and send an email to the author asking whether he/she has heard about your problem before (and knows a solution or a way to get more info).

Best of luck with testing, and if I can help you out, I will, so keep posting  :Wink: 

----------

## ReneeTeunissen

see as well: http://pcmcia-cs.sourceforge.net/ftp/BUGS

----------

## platojones

Renee,

   Thank you for the advice you've given thus far.  It's smart stuff.   I knew this was no ordinary problem, since I didn't even notice it until I tried transferring a large file, which I very seldom do.  This is a tough one, no doubt about it!  I'll try your recommendations as usual and see what I come up with.  I'll keep posting updates here, so if anybody else comes across this issue, at least they will know that they are not alone.  This will get fixed, and when it does, everybody will know about it.  Since this hasn't been a huge inconvenience to me, it's been a very interesting puzzle.

Stay tuned,

Richard

----------

## platojones

Well, I've had a breakthrough here, so it's time for an update.  I really don't think anybody else has experienced a problem like this one, but if somebody does, I've got the solution.  

Turns out it's a kernel/driver problem.  Apparently, this problem exists in at least the 2.6.7 - 2.6.8 series kernels.  That explains why it was so tough to track down.  In a fit of desperation, I tried the 2.6.9 series (2.6.9-mm2, to be exact), and the problem has disappeared.  I'm baffled that I could find no mention of this problem here or on Google for that matter, but I may not have been formulating the query correctly.  Who knows.  

I can't say for certain that it was the kernel, per se, and not the axnet_cs driver that comes with this kernel.  I suspect the latter, since the box I'm transferring from still has the old 2.6.8 kernel and has never had this problem.  Thus, I strongly suspect that the 2.6.8 axnet_cs driver is to blame.    It's interesting that I'm getting a 40-50% reduction in throughput with the new kernel, but I'll take that for a dramatic increase in stability.  There is a chance that I can tune that up a bit, since I'm not really all that familiar with the config parameters of the new kernel yet.

So, thanks for reading, and if you do encounter this problem, just upgrade to the mm series kernel.

----------

## platojones

Hah, this is really starting to get on my nerves.  Ok, the only reason I thought it was solved was because, for whatever reason, this new kernel had much slower throughput.  When I tried to netcat using udp, which is extremely fast, same problem!  But the new kernel is giving me error messages at least, which the previous kernels did not.  Here's what I'm getting:

```
Sep 26 00:41:44 furrball kernel: eth1: bogus packet size: 14248, status=0xaf nxpg=0xbe.
Sep 26 00:41:44 furrball kernel: eth1: next frame inconsistency, 0xbe
Sep 26 00:41:44 furrball kernel: eth1: bogus packet size: 23229, status=0x9a nxpg=0x77.
Sep 26 00:41:44 furrball kernel: eth1: bogus packet size: 60245, status=0xc1 nxpg=0xf7.
Sep 26 00:41:44 furrball kernel: eth1: next frame inconsistency, 0xf7
Sep 26 00:41:44 furrball kernel: eth1: bogus packet size: 23229, status=0x9a nxpg=0x77.
Sep 26 00:41:44 furrball kernel: eth1: bogus packet size: 60245, status=0xc1 nxpg=0xf7.
```

So, there's the problem.  I think this is the heart of the matter.  2 things that google came up with on this:  A shared memory issue or possibly bad cables.  I eliminated the latter with another set of cables.  Anybody know anything about a shared memory issue with Linksys Network Anywhere NP100 cards?  This only happens under relatively fast/high network loads.

----------

## ReneeTeunissen

See http://www.redhat.com/archives/redhat-list/1999-May/msg01693.html

 *Quote:*   

> >I get the following msg on my screen repeated:
> 
> >eth0: bogus packet size: 65535, status=0x1 nxpg=0x(div.)
> 
> I once had the same problem. If I remember correctly it was related to my 
> ...

 

----------

