# rsync pull messing up network connection after 45 secs

## fsavigny

Sorry for this accusatory headline, but the limited space did not allow for niceties. Of course I do not know that it is the rsync pull; it just looks like that. But the behaviour is absolutely reliable and reproducible.

I am a long-term internet user, but not a networking person, which is why I have tried different scenarios to somehow pinpoint the problem. But many of my observations might be useless, or just normal and expected behaviour. I hope that at least something useful is among them.

So let's go:

I am trying to sync files between my big ("home") laptop and my small netbook (which I take to work) using rsync. I have a router supplied by my telephone company, which I use to access the internet, and which assigns IP addresses using DHCP. When I plug in both computers to the router, they become automatically mutually visible under their host names (let's call them 'laptop' and 'netbook'), and provided 'laptop' is running rsyncd and I have defined a module in /etc/rsyncd.conf, I can start pulling files from it using

```
rsync -aPv laptop::module destination_dir/
```

on 'netbook'.

So far, so perfect and cool. But now, the odd stuff:

After about 30 to 60 secs of file transfers (I have not been able to spot any specific event which triggers it, but this is reliable, and the time span until the halt is quite similar each time), the rsync output on 'netbook' suddenly stops somewhere in mid-action, like a movie that freezes. The reason is obviously that 'laptop' has just lost its connection to the router. (What I find a little odd, too, is that this rsync process on 'netbook' seems to remain in limbo like that forever, rather than timing out at some point, but I doubt this might be in any way relevant here.)

On 'laptop', this is how this breakdown is apparently recorded in dmesg:

```
[ 2846.816367] tg3 0000:09:00.0 eth2: DMA Status error.  Resetting chip.
[ 2847.017821] hrtimer: interrupt took 15771768 ns
[ 2848.637209] tg3 0000:09:00.0 eth2: 0x00000000: 0x167314e4, 0x40100506, 0x02000002, 0x00000010
[ 2848.637214] tg3 0000:09:00.0 eth2: 0x00000010: 0xf9ef0004, 0x00000000, 0x00000000, 0x00000000
[ 2848.637218] tg3 0000:09:00.0 eth2: 0x00000020: 0x00000000, 0x00000000, 0x00000000, 0x01f91028
            ... about 200 more such lines follow ...
[ 2848.638058] tg3 0000:09:00.0 eth2: 0x00007020: 0x00000000, 0x00000000, 0xaf000406, 0x00000000
[ 2848.638062] tg3 0000:09:00.0 eth2: 0x00007030: 0x00010000, 0x000000c4, 0x00100000, 0x00000000
[ 2848.638066] tg3 0000:09:00.0 eth2: 0x00007400: 0x00000000, 0x002aaa2a, 0x00000000, 0x00000000
[ 2848.638071] tg3 0000:09:00.0 eth2: 0: Host status block [00000005:00000053:(0000:017a:0000):(017a:01b1)]
[ 2848.638076] tg3 0000:09:00.0 eth2: 0: NAPI info [00000052:00000052:(01bb:01b1:01ff):0179:(0041:0000:0000:0000)]
[ 2848.740912] tg3 0000:09:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
[ 2848.843144] tg3 0000:09:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
[ 2848.945266] tg3 0000:09:00.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
[ 2848.966990] tg3 0000:09:00.0 eth2: Link is down
[ 2850.856928] tg3 0000:09:00.0 eth2: Link is up at 100 Mbps, full duplex
[ 2850.856936] tg3 0000:09:00.0 eth2: Flow control is off for TX and off for RX
```
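
With about 200 register lines in the middle, the actual driver events are easy to miss. A minimal sketch for filtering them out, assuming the register lines always have the `0xXXXXXXXX: 0x...` form seen in the excerpt; `tg3_events` is a made-up helper name, and the three sample lines are copied from the log above:

```shell
# Keep only tg3 driver messages and drop the register dump lines,
# which look like "...: 0x00000010: 0xf9ef0004, ..." in the excerpt.
tg3_events() {
    grep -E 'tg3 ' | grep -Ev ': 0x[0-9a-f]{8}: 0x'
}

# Demonstration on three lines taken from the log above.
printf '%s\n' \
    '[ 2846.816367] tg3 0000:09:00.0 eth2: DMA Status error.  Resetting chip.' \
    '[ 2848.637209] tg3 0000:09:00.0 eth2: 0x00000000: 0x167314e4, 0x40100506, 0x02000002, 0x00000010' \
    '[ 2848.966990] tg3 0000:09:00.0 eth2: Link is down' | tg3_events
```

On the real machine the same filter would be fed from the kernel log, e.g. `dmesg | tg3_events`.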

Now, what I usually do when the link between 'laptop' and the router breaks down (which happens maybe once in a couple of weeks) is to restart the network, which in my case means running

```
rc-service net.eth2 restart
```

Under ordinary circumstances, this reestablishes the network link. But after this rsync incident, the odd thing is that the output of this command looks more or less normal, but the IP address that 'laptop' now gets assigned is an unusual one: Normally, it's always 192.168.2.100, but after this rsync mess-up, it always seems to be some totally different address, such as 169.254.48.75. And the second oddity is that it seems to be useless: When I try to ping the router (which has just assigned me the IP address):

```
> ping 192.168.2.1
connect: Network is unreachable
```

this is what I get. Charmingly but uselessly, however, I can ping "myself", i.e. 169.254.48.75 (or whichever other new address I have been assigned this time).

These matters (IP address, whom I can ping) remain the same no matter how often I do the net.eth2 restart.

Next step: I do a reboot. This fixes the net link between laptop and the router; interestingly, however, the IP address I get is the same unorthodox one I got before the reboot (e.g. 169.254.48.75), and more oddly, the automatic downloading of email, which is always done from ~/.profile (and seems to be the first attempt to use the net after startup) fails. But AFTER THIS FIRST FAILURE, all other attempts to use the net - pinging the router, pinging something on the internet, downloading email again - suddenly work again.

Next idea: Running

rc-service net.eth2 restart

again, just to see what happens. Result: I get the "classic" IP back, i.e. 192.168.2.100.

The problem does not seem to occur when I sync files in the opposite direction. Although I ran this test just once, I transferred 3 GB from netbook to laptop without any hitch, by invoking

```
rsync -aPv netbook::module destination_dir/
```

on 'laptop'.

Now, my layman's suspicion is that the router gets somehow confused or upset with the network traffic taking this unusual path. But why would any device allow, say, 50 MB of data to pass, and then suddenly cut one of the connections? That does not sound technological to me. And why does it apparently happen only when files are transferred from A to B, but not from B to A?

Or is it the hardware on 'laptop'? The kernel driver?

In short, can anybody help me pinpoint the problem better?

Thank you so much,

Florian

----------

## Hu

Please lay off the enter key.  Your post is much more readable without jumping to a new line every 70 characters.  The forum will natively wrap regular text at a width that works well for the reader's browser.

The "weird" IP address is from the APIPA range.  It is assigned when your DHCP client is required to configure something and does not get a valid response from a DHCP server.  This could happen if your network card is so confused that it cannot talk to the DHCP server.  The dmesg entry "DMA Status error.  Resetting chip." looks bad.  The last line of dmesg seems to claim the device recovered, but I am doubtful about that since you say it continues not to work and stays broken even across a reboot.
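
A minimal shell sketch of how to recognise such an address, assuming only that the APIPA range is the IPv4 link-local block 169.254.0.0/16; the function name `is_link_local` is made up for illustration, and the two sample addresses are the ones from the first post:

```shell
# Succeeds (exit status 0) if the IPv4 address lies in 169.254.0.0/16.
# This is a plain string match, not a full address validator.
is_link_local() {
    case "$1" in
        169.254.*.*) return 0 ;;
        *)           return 1 ;;
    esac
}

for addr in 192.168.2.100 169.254.48.75; do
    if is_link_local "$addr"; then
        echo "$addr: link-local, typically self-assigned when DHCP got no answer"
    else
        echo "$addr: not link-local"
    fi
done
```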

What kernel version are you using?  Has this ever worked?  Does it happen if you stream an equivalent volume of data quickly between the two machines using some other protocol?  I assume when you refer to a reboot that you mean a warm reboot, where the machine remains powered but you start the OS again.  Please try a cold reboot instead, where you tell Linux to halt and turn off power.  Wait 10-15 seconds, then turn power on again.  The magic recovery after trying to use the network after a reboot sounds very odd.  Based on what you have said, I would blame either the kernel driver or the NIC firmware on the machine which loses connectivity.

----------

## fsavigny

Hi Hu,

thank you for responding, and so quickly! Sorry about the line breaks in my original posting - actually it was my editor which did this - I had to collect the information across several reboots, and so started to write in an editor. No need for that now. (But if there is any more, I will switch off the line-breaking behaviour).

 *Quote:*   

> The last line of dmesg seems to claim the device recovered, but I am doubtful about that since you say it continues not to work and stays broken even across a reboot.

 

Not quite! It stays broken across a (warm) reboot only for the very first attempt to use the network. From the second attempt on, it works fine. (I gather that is what you are calling the "magic recovery").

My kernel version is 3.10.7.

No, this has never worked before. Actually, it's the first time I have tried this. Which is also why I have no experience whatsoever with other data sharing protocols (I mean I do know how to use FTP, but not in this situation).

 *Quote:*   

> Does it happen if you stream an equivalent volume of data quickly between the two machines using some other protocol?

 

Would ping -f do the job? (As I said, I am naive with networking matters.)

I have just sent 19065193 packets of 64 B in an average round-trip time of 0.149 sec from 'laptop' to 'netbook', and then got bored with it and stopped it. It seems to be the equivalent of 1.2 GiB of data sent back and forth, which has taken amazingly long, come to think of it (of course not 19065193 * 0.149 sec, but about 40 minutes at least, I think). There was no packet loss.
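
A rough cross-check of these numbers, as a sketch. It assumes that the 0.149 figure is in milliseconds (the unit ping normally reports) rather than seconds, and that every packet was 64 bytes as stated:

```shell
packets=19065193       # packet count from the ping flood
bytes_per_packet=64    # packet size as stated above
rtt_ms=0.149           # assumption: average round trip in milliseconds

# One-way data volume in bytes (integer arithmetic is sufficient here).
total_bytes=$((packets * bytes_per_packet))
echo "one-way volume: $total_bytes bytes"

# Floating-point parts are left to awk.
awk -v bytes="$total_bytes" -v n="$packets" -v rtt="$rtt_ms" 'BEGIN {
    printf "one-way volume: %.2f GiB\n", bytes / 1073741824
    printf "duration at one packet per round trip: %.0f min\n", n * rtt / 1000 / 60
}'
```

Under those assumptions the one-way volume comes to about 1.14 GiB, and strictly sequential round trips would take roughly 47 minutes, which fits the estimate of about 40 minutes at least.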

I have also tried a cold reboot (one), with apparently exactly the same behaviour: The same APIPA range IP address, downloading email did not work; the next ping and everything that followed worked.

One more test I did, by the way: I reset the router before I warm-rebooted 'laptop', i.e. I pulled its power plug and then plugged it in again. Interestingly, 'laptop' got the APIPA address from before the reboot, which amazed me (in other words, same behaviour).

Hmm. Does that lead any further?

Again, many thanks for your help!

Best regards,

Florian

----------

## Hu

By default, ping uses small packets, as you noted.  The ping flood test you did is interesting, since it seems to suggest that large volumes of small packets are fine, but moderate volumes of large packets are a problem.  Using ping -f can be hard on a network, but if the test involves only your machines, that is fine.  According to the documentation, ping -f sends packets as fast as they come back or one hundred times per second, whichever is more.  Therefore, your test may not have had much data in flight concurrently, which is another possible explanation for why it did not fail.  Can you try a test using chargen as a data source?  That should produce more concurrent traffic and may also use larger packets, more like how rsync is behaving.

The behaviours with a cold reboot and with the router cycled are just weird.  With either of those reports alone, I would blame the other component.  With both those reports together, I do not know what is at fault.  I still lean toward a problem with the NIC firmware, but a cold reboot ought to have prevented that.  I suppose the reboot might not be truly cold if the NIC remained powered up using laptop battery power.  Could you repeat the test with the battery disconnected and wall power removed during the 5-10 second window after the system halts?

----------

## fsavigny

Hi Hu,

I am very sorry for not having replied earlier: I'm a very busy teacher, and simply did not have the time. But it does feel like that was rude of me.

 *Quote:*   

> Can you try a test using chargen as a data source? That should produce more concurrent traffic and may also use larger packets, more like how rsync is behaving.

 

I'm so sorry, but it seems that would be something that I would also need a little help with. I looked that up on Wikipedia (somehow the most informative source), which made me install xinetd, but that package does not explain how to do that. The Wikipedia article says to 

```
$ telnet localhost chargen
```

as a test case, but in my case, that only produced

```
Trying ::1...
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused
```

(with xinetd running). Hmm. What exactly do I have to do?

(But just one idea: The ping documentation says you can specify the packet size, apparently without any limit. Would that help, too?)

 *Quote:*   

> Could you repeat the test with the battery disconnected and wall power removed during the 5-10 second window after the system halts?

 

Just to be sure: Do you mean shut down the system, then remove the laptop battery within 5 to 10 secs, then put it back in (after ... seconds?), and then boot it again? Or do you mean rebooting within 5 to 10 seconds?

Thanks again, for being so attentive, and my apologies again for staying silent for so long.

Best regards, Florian

----------

## Hu

Two days does not seem long to me.  No need to apologize.

If you installed xinetd, you have the ability to run chargen, but it may not be configured yet.  Edit /etc/xinetd.d/chargen-stream to configure it.  See man xinetd.conf for details on the configuration language, or ask here if that is insufficient.

Larger ping packets might help, but I would rather reproduce this with TCP so that we get the same flow characteristics.

I meant halt the system, unplug power, remove the battery, plug in wall power, conduct your test until the connection fails.  When it fails, then halt the system, unplug wall power, let it sit for 5+ seconds, restore wall power, then turn it on.  The goal is to ensure that the relevant hardware is temporarily completely unpowered (including lack of battery backup), since that gives the best chance that it will be initialized from a good default state on boot.

----------

## fsavigny

 *Quote:*   

> Two days does not seem long to me. No need to apologize.

 

It seemed a little rude given how quickly you had responded to my inquiry, and given that you are doing me a favour.  :Smile:  Thank you very much!

 *Quote:*   

> I meant halt the system, unplug power, remove the battery, plug in wall power, conduct your test until the connection fails. When it fails, then halt the system, unplug wall power, let it sit for 5+ seconds, restore wall power, then turn it on. 

 

OK! I guess this means that the hardware will be in a, uhm, pristine state when I conduct the test, and the measure I am supposed to take afterwards is to check if powering it off will restore this pristine state.

One thing I noted: You have not included an instruction to remove the battery after the test. Was that intended?

As to the chargen test, I have made some progress, but not succeeded yet: I am now able to run the chargen service on localhost, i.e.

```
> telnet localhost chargen
```

works as shown in the Wikipedia chargen article (it was a matter of simply activating the chargen service), but I have trouble sending characters to the other machine. From 'laptop' I get

```
> telnet netbook chargen
Trying 192.168.2.101...
telnet: Unable to connect to remote host: Connection refused
```

I figured that I would probably have to have something on the 'netbook' side accepting this connection, and tried to start telnet there. But simply doing that did not make any difference, and when I tried to 

```
> telnet laptop
Trying 192.168.2.100...
telnet: Unable to connect to remote host: Connection refused
```

on netbook, I got this similar message. So it seems my two machines are not ready to trust each other telnet-wise. There are no config files for telnet on either machine, and network-naive as I am, I do not get any wiser reading the telnet manpage.

Could you perhaps give me a hint on how to make them connect and trust each other via telnet? Thank you so much!

Best regards!

Florian

----------

## Hu

Right, we want to see whether the hardware can be made pristine again at all.  After that, we can try to narrow down why it needs that help.

Yes, it was intended, since I never included instructions on when to put the battery in, either.  I intended for you to remove it once and leave it out for the duration.  :Wink: 

You probably have the xinetd listener bound to localhost, rather than wildcard.  Change the bind directive to listen to 0.0.0.0, restart xinetd, and try again.
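
A sketch of what a chargen entry with a wildcard bind could look like, written to a scratch file so it can be compared with the real /etc/xinetd.d/chargen-stream before anything is changed. The attribute names are the ones described in man xinetd.conf; the values other than `bind` are assumptions about a typical internal chargen entry, and the scratch file name is arbitrary:

```shell
# Scratch location for the example; the real file is /etc/xinetd.d/chargen-stream.
example_file="${TMPDIR:-/tmp}/chargen-stream.example"

cat > "$example_file" <<'EOF'
service chargen
{
        id          = chargen-stream
        type        = INTERNAL
        socket_type = stream
        protocol    = tcp
        user        = root
        wait        = no
        bind        = 0.0.0.0
        disable     = no
}
EOF

# Show the two lines that decide whether the service runs and where it listens.
grep -E '^[[:space:]]*(bind|disable)[[:space:]]*=' "$example_file"
```

The real file only takes effect once xinetd has been restarted.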

----------

## fsavigny

 *Quote:*   

> Yes, it was intended, since I never included instructions on when to put the battery in, either. I intended for you to remove it once and leave it out for the duration ;-)

Oh. Right. 

(Sorry - for some reason, I was convinced the laptop would not run without a battery at all. That's why I automatically assumed I would have to put it in again before switching the laptop on again.  I do not know why I thought that; I seem to recall I often did that with my previous one.)

While I am not making much headway re: chargen (see below on that), I have now been able to conduct the power-off test using the original rsync scenario. I must have run this test without a battery in the laptop about, like, 50 times. My observations are:

1. The amount of data (total size in bytes) which is transferred before the network adaptor fails seems to be random. The smallest was a bit more than 100M, the largest more than 2G, with any number in between also occurring. But the moment of failure always comes.

2. The file which is being transferred while the network adaptor fails seems to have a size of at least 1.3M (in any case, that was the smallest). I am not at all sure, however, whether this could be simply a matter of statistical probability. (But there are many files smaller than that.)

3. Halting the system and then unplugging the power cable seems to lead to a "clean" start of the network on the next boot, i.e. the network address assigned is then 192.168.2.100, but the length of time it must be unplugged for that to happen seems to vary somewhat:

- Unplugging it for 20 secs was almost always enough, even though I think I had one case (of about 15 to 20) where it did get this 169.* number on the next boot.

- Unplugging it for 15 secs led to a 192.168.2.100 address in slightly less than half of the cases and to a 169.* address in a bit more than half of the cases.

Doing this (i.e. unplugging for 15 secs) for a second time after a 169.* address always seems to lead to a 192.168.2.100 address on the next boot, but I have not tested this more than 3 times.

4. The 'magic recovery' when the laptop has got a 169.* address is 100% reliable, i.e. it always happens.

As to chargen, I have installed and I run xinetd on netbook as well, and set the "bind" option in /etc/xinetd.conf to 0.0.0.0 on both computers, but I still get

```
telnet netbook chargen
Trying 192.168.2.101...
telnet: Unable to connect to remote host: Connection refused
```

telnet is such an unfamiliar beast to me that I have no clue what to do.

Does any of this lead any further?

Best regards, 

Florian

----------

## Hu

The results with removing power make me fairly confident that the problem lies in the laptop, not in the router.  The original messages about DMA lead me to believe this is somehow related to having a large volume of traffic in flight concurrently.  Large contiguous files are more likely to satisfy this than a group of scattered small files, especially if the total size exceeds system caching.  Working chargen would be nice for making this easier to reproduce, but is not required, since it seems like you can reproduce this quite readily with the rsync test.  We still do not know whether the problem is card firmware or the kernel driver, but I am inclined to believe that it is at least in part a firmware bug.  The kernel ought to initialize the card the same way on every boot, so if the firmware gets into a state where sometimes the kernel probe initializes it correctly and sometimes it does not, then in my opinion, that is a firmware bug.

You probably need to set the bind option on the individual services, specifically chargen-stream.  That is why I previously said: *Hu wrote:*   

>  Edit /etc/xinetd.d/chargen-stream to configure it. See man xinetd.conf for details on the configuration language, ...

 

Would you be willing to try a newer kernel?  I do not see any commits in newer kernels that sound applicable, but it is not impossible that the problem was fixed without noting it as such.

----------

## fsavigny

 *Quote:*   

> Edit /etc/xinetd.d/chargen-stream to configure it. See man xinetd.conf for details on the configuration language, ...

 

Ah, OK. I had added lines to those files to enable the service, but put the bind directive in the global config file only. I've added a 

```
bind = 0.0.0.0
```

line to /etc/xinetd.d/chargen-stream on laptop. On netbook, however, which runs an Ubuntu system, there is only a file called /etc/xinetd.d/chargen (without the -stream). It contains two blocks called 

```
service chargen
{
...
}
```

but one of them contains the line

```
id = chargen-stream
```

To me that looks as if this should be equivalent to the service chargen-stream line on laptop, so I have added the same "bind = 0.0.0.0" directive to this block, restarted the xinetd service on both machines, and run the telnet command again. The result was unfortunately still the same:

```
telnet netbook chargen
Trying 192.168.2.101...
telnet: Unable to connect to remote host: Connection refused
```

The problem is probably that I am completely unfamiliar with both (x)inetd and telnet (I do not even have more than a very foggy idea of what they might be there for), and I simply do not have the time to come to a deeper understanding. 

But as you have written, it does seem a 100% reliable observation that on the rsyncs, the network always dies while a larger file (at least 1MB) is being transferred. So if I am too stupid to get this telnet thing running, would it perhaps seem sound to simply test your hypotheses next?

 *Quote:*   

> We still do not know whether the problem is card firmware or the kernel driver, but I am inclined to believe that it is at least in part a firmware bug. The kernel ought to initialize the card the same way on every boot, so if the firmware gets into a state where sometimes the kernel probe initializes it correctly and sometimes it does not, then in my opinion, that is a firmware bug.

 

 *Quote:*   

> Would you be willing to try a newer kernel? I do not see any commits in newer kernels that sound applicable, but it is not impossible that the problem was fixed without noting it as such.

 

Well, compiling and running a newer kernel would be something I can do, but the configuration process always takes very long in my experience (except, perhaps, if I can use some pre-configured, catch-all configuration, such as is used for distros). However, I am already running  3.10.7-gentoo, which seems about the latest kernel offered as gentoo-sources. So would I have to download the latest, bleeding-edge sources from kernel.org, perhaps?

I have never had to do with firmware before, but from what the (German) Wikipedia article says about it, updating firmware as an end user can be anything from very simple to impossible, and from safe to having the potential to destroy the hardware component. Having read this, I think I understand why you are suggesting trying out a newer kernel first. But would a firmware update be an option at all?

Best regards,

Florian

----------

## Hu

A firmware update is usually a last resort, due both to the dangers you mentioned and to the fact that some manufacturers rarely release firmware updates for anything but the most critical problems, such as issues that would otherwise warrant a product return/recall.

Yes, I think we should set aside the xinetd/chargen tests for now.  I had suggested that because I thought it would be easier to do that than to ask you to re-run the rsync tests, but it is proving to cost you more time than it would save.

You can get a more recent kernel through Portage by adding sys-kernel/vanilla-sources to /etc/portage/package.accept_keywords, so that it accepts testing versions of that package, then running emerge --oneshot sys-kernel/vanilla-sources.  You can optionally specify a version.  For this purpose, I would suggest using the latest 3.12 series, which is currently the newest stable kernel.
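
The Portage steps above could be sketched as follows. `accept_testing` is a hypothetical helper, not a Portage tool, and the sketch assumes that /etc/portage/package.accept_keywords is a plain file; if it is a directory on your system, point the helper at a file inside it instead:

```shell
# Hypothetical helper: add a package atom to an accept_keywords file, once.
# A bare atom in that file accepts the testing (~arch) versions of the package.
accept_testing() {
    atom=$1
    keywords_file=$2
    grep -qxF "$atom" "$keywords_file" 2>/dev/null ||
        printf '%s\n' "$atom" >> "$keywords_file"
}

# On the Gentoo machine, as root (left as comments because they change the system):
#   accept_testing sys-kernel/vanilla-sources /etc/portage/package.accept_keywords
#   emerge --oneshot sys-kernel/vanilla-sources
```

Calling the helper twice leaves a single entry, so it is safe to re-run.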

----------

