# Network hangs for ~60 seconds, randomly

## mv

Hello,

I have a problem that is difficult to reproduce:

I am connected through a router to a DSL network. Occasionally (every 5-120 minutes) the network "hangs" in the sense that all running connections (ssh, wget, streaming a movie in a browser, loading a new browser page, pinging a known IP, opening a new ssh connection) stall for 20-300 seconds (usually ~60 seconds). There is no error message or other sign of what is wrong. Afterwards, everything works again as it should.

The strange thing is that during that time, I can connect with an Android phone over the same (WLAN) router without any problems.

So I guess that something on my Gentoo host must be misconfigured...

Is there an "obvious" candidate, e.g. a missing kernel or dhcpcd option? How could I debug this issue?

Last edited by mv on Fri May 08, 2015 12:04 pm; edited 2 times in total

----------

## NeddySeagoon

mv,

An r8169 network card with a missing firmware patch?

dmesg will tell you all about it.  It may even add a 60 sec delay to your boot while the driver looks for firmware.
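A quick way to check this could be to filter the kernel log for firmware messages from the driver. The sketch below assumes the in-kernel r8169 driver; the function name `firmware_messages` is made up for illustration, and it reads from stdin so it can also be tried on a saved log.

```shell
# Hypothetical helper: keep only r8169 lines that mention firmware.
# Reads a kernel log on stdin, e.g. the output of dmesg.
firmware_messages() {
    grep -i 'r8169.*firmware'
}

# On a live system (may need root, depending on kernel settings):
#   dmesg | firmware_messages
```

No output means the driver logged nothing about firmware, which by itself does not prove that the firmware patch was loaded.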

----------

## mv

 *NeddySeagoon wrote:*   

> An r8169 network card with a missing firmware patch?

 

Bingo!  Once more, you are my hero! ;)

 *dmesg wrote:*   

> [   10.348786] r8169 0000:02:00.0: Direct firmware load for rtl_nic/rtl8168g-2.fw failed with error -2
> 
> [   10.348789] r8169 0000:02:00.0 lan0: unable to load firmware patch rtl_nic/rtl8168g-2.fw (-2)

 

(However, this message appears only once, during boot; nothing is logged when the delay happens.)

 *lspci wrote:*   

> 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)

 

I have now installed the linux-firmware package, and the error message has vanished.

I will report in some hours/days whether it helps.

I do not really understand it, though, because the same machine/card worked before on another router (in another country) without any issues (and without the firmware patch)...

----------

## NeddySeagoon

mv,

Firmware loading is only attempted at boot.  In theory, it's not needed but it exists to fix bugs.

----------

## mv

Apparently in my case it really fixes a bug: so far, I have had no hangs for several hours.

It is still strange that the problem did not occur at my previous place (with another router): apparently, whether the bug is triggered depends on the router hardware or software.

As so often: Thank you very much for your help.

----------

## mv

I posted too early: the hang occurred again, and now rather often.

Apparently it was just by accident that it worked for some hours.

----------

## mv

Bump... Nobody?

----------

## NeddySeagoon

mv,

Try running wireshark or tcpdump on the interface.  I don't know what to tell you to look for and you need to see a pattern.

If it is traffic related, the traffic causing the problem may not even get onto the wire.

Try a plug-in NIC if that's possible.  Replace the ethernet cable, try a different router port ...

Just change one thing at a time unless you plan a binary search.
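To see a pattern, it may help to record when the network was reachable and then list the gaps, so that they can be lined up with the tcpdump timestamps. The following is only a sketch: `find_gaps` and the file name `ping-ok.log` are hypothetical, 192.0.2.1 is a placeholder address, and the collection loop assumes the iputils `ping`.

```shell
# Hypothetical helper: read epoch timestamps of successful pings (one per
# line) and print "last_ok next_ok gap" for every gap above the limit.
find_gaps() {
    awk -v limit="${1:-20}" '
        NR > 1 && $1 - prev > limit { print prev, $1, $1 - prev }
        { prev = $1 }'
}

# Collecting the timestamps (assumption: iputils ping, placeholder address):
#   while sleep 1; do
#       ping -c 1 -W 1 192.0.2.1 >/dev/null 2>&1 && date +%s
#   done >> ping-ok.log
# Afterwards, list all outages longer than 20 seconds:
#   find_gaps 20 < ping-ok.log
```

Changing one thing at a time and comparing the gap lists before and after gives a rough measure of whether a change helped.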

----------

## Cyker

I reckon it's the NIC.

I have had extreme variability with later-generation Realtek NICs in Linux; some work great, but with others I've just had no end of problems. The later revision 8111s are especially flaky.

On my old Asus Socket 939 mobo I had to completely disable MSI/MSI-X capabilities in the kernel to make it work. I never got the one on my A88XM-E45 to work stably, it would freeze every now and then and even hard-lock the whole system under sustained high load.

Ended up buying a PCIe NIC based on an Intel chipset and haven't had any trouble since!

----------

## mv

Thanks for the input.

Since I started logging with tcpdump -i lan0 >logfile, the problem has not recurred. Maybe logging itself has some effect on it? But it might be just an accident.

Yesterday, I removed CONFIG_PCI_MSI and CONFIG_PCI_MMCONFIG from the kernel options - I had them enabled only because the kernel help text strongly recommended them. I do not really understand what they do and hope that removing them will not influence any USB devices, because my USB controller uses PCI. My other PCI devices are:

 *Quote:*   

> 00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
> 
> 00:02.0 VGA compatible controller: Intel Corporation 4th Generation Core Processor Family Integrated Graphics Controller (rev 06)
> 
> 00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
> ...

 

So far, I have not noticed any change in sound, graphics, or hard disk, so perhaps I really do not need these two kernel options. Or do I now risk ruining my hard disk, e.g. could it be that the kernel now misses some hardware interrupt and fills a track with random data?

I cannot test the system next week, so I can report on the success only much later...

----------

## NeddySeagoon

mv,

CONFIG_PCI_MSI is a good thing to have. In the dim and distant past, when an IRQ was acknowledged, the IRQ service routine had to search all the devices that could raise that particular IRQ before it could run the right interrupt service routine.

MSI means Message Signalled Interrupt.  There is some initial setup. Devices that can raise an MSI are programmed with the address of the service routine.

When their IRQ is acknowledged, they place the address of the service routine on the bus, so the CPU can jump straight to it.

Real systems have always done it this way, it's just come late to the PC.

Both ways work. MSI reduces the overhead.  MSI capable devices will cooperate with the old way. 

CONFIG_PCI_MMCONFIG memory maps the PCI configuration space for those devices that support it.

This is faster than using I/O ports to read the configuration space, which is the old way.

Again, both ways work.
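Before rebuilding the kernel, it may be worth checking how the NIC's interrupt is actually delivered. The sketch below is an assumption-laden illustration: `irq_mode` is a made-up helper, it expects /proc/interrupts-style input on stdin, and it only matches lines whose last field is exactly the interface name (shared IRQs or multi-queue names would need a looser match).

```shell
# Hypothetical helper: report whether an interface's interrupt line in
# /proc/interrupts-style input (stdin) is delivered via MSI.
irq_mode() {
    awk -v dev="$1" '
        $NF == dev { found = 1; print (/MSI/ ? "msi" : "legacy") }
        END { if (!found) print "not found" }'
}

# On a live system:
#   irq_mode lan0 < /proc/interrupts
```

As far as I know, the kernel boot parameter pci=nomsi disables MSI system-wide on a kernel built with CONFIG_PCI_MSI, which would allow testing the fallback without a rebuild.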

----------

## mv

Thanks for the explanation. So it seems I should disable both only as a fallback.

I will continue logging and see how it works out with both enabled. As mentioned, I cannot test for a week now (or maybe even longer), so do not expect a report any time soon.

----------

## mv

So far no fix. The hangs continue at random intervals; tcpdump did not provide anything useful.

Edit: Currently trying with the net-misc/r8168 module.

----------

## mv

Now I found a way to "remove" the hang:

When the system hangs, executing 

```
/etc/init.d/dhcpcd restart
```

 makes the hang go away immediately.
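Until the cause is found, this workaround could be automated, and the timestamps it prints would also document when the hangs happen. This is a minimal sketch under assumptions: `check_and_recover` is a hypothetical helper, 192.0.2.1 is a placeholder address, and the `ping` options are those of the iputils version.

```shell
# Hypothetical helper: run a probe command; if it fails, print a timestamped
# note and run the recovery command.
check_and_recover() {
    probe=$1
    recover=$2
    if sh -c "$probe" >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') probe failed, running: $recover"
    sh -c "$recover"
}

# Possible use as root (placeholder address 192.0.2.1):
#   while sleep 10; do
#       check_and_recover 'ping -c 1 -W 2 192.0.2.1' '/etc/init.d/dhcpcd restart'
#   done
```

This only hides the symptom; it does not explain why restarting dhcpcd helps.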

So my new conjecture is that the problem lies in the dhcp server of the router; e.g. the lease time sent by the router for dhcp is wrong, or something similar.

(The router is from a telephone company, and I can hardly configure anything.)

I played by switching on/off various options of dhcpcd, but so far nothing helped.

I use only IPv4, and the problem is not related to name resolution, because it also occurs when I try e.g. to ping a fixed (external) address.

Has anybody a hint which switches I might still try?

----------

## khayyam

 *mv wrote:*   

> When the system hangs, executing '/etc/init.d/dhcpcd restart' makes the hang go away immediately.

 

mv ... ummm, you're using dhcpcd exclusively to provide 'net'? The above would suggest so as otherwise why would you restart it. If not then 'dhcpcd -k interface' should similarly work (and will point to dhcpcd, or the provisioning of the lease, as the probable cause). 

 *mv wrote:*   

> Has anybody a hint which switches I might still try?

 

If it is as you suspect the lease then 'dhcpcd -4 -U interface' might provide some clue, you might also try setting '--release'.

best ... khay

----------

## mv

 *khayyam wrote:*   

> mv ... ummm, you're using dhcpcd exclusively to provide 'net'?

 

Yes.

 *Quote:*   

> If not then 'dhcpcd -k interface' should similarly work

 

You mean then followed by a restart of dhcpcd?

This looks bad: This doesn't reestablish the connection (like /etc/init.d/dhcpcd restart does).

 *Quote:*   

> If it is as you suspect the lease then 'dhcpcd -4 -U interface' might provide some clue

 

I suppose that this looks also bad, that is, that the lease is long enough to not break down after a few minutes:

 *Quote:*   

> dhcp_lease_time='86400'
> 
> dhcp_message_type='5'
> 
> dhcp_rebinding_time='75600'
> ...

 

I also realized that it is related to network activity: if I download a huge file, the connection practically never breaks down. The longer the period of inactivity, the higher the chance that the net service has to be restarted.

----------

## khayyam

 *mv wrote:*   

>  *khayyam wrote:*   If not then 'dhcpcd -k interface' should similarly work 
> 
> You mean then followed by a restart of dhcpcd? This looks bad: This doesn't reestablish the connection (like /etc/init.d/dhcpcd restart does).

 

mv ... sorry, I didn't mean '-k', I was thinking of '--rebind,-n'.

 *mv wrote:*   

>  *khayyam wrote:*   If it is as you suspect the lease then 'dhcpcd -4 -U interface' might provide some clue 
> 
> I suppose that this looks also bad, that is, that the lease is long enough to not break down after a few minutes:
> 
>  *Quote:*   dhcp_lease_time='86400'
> ...

 

That is a lease time of 24hrs with the renewal time of 12hrs. That is normal, renewals happen at half lease time, and most home routers have 24hrs set as the lease time.
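The timers can be checked by hand: by default (RFC 2131), renewal happens at half the lease time and rebinding at 7/8 of it, which matches the dhcp_rebinding_time='75600' in the quoted output. A small sketch of the arithmetic, where `lease_timers` is just an illustrative name:

```shell
# Default DHCP timers for a given lease time in seconds:
# renewal (T1) at 1/2 of the lease, rebinding (T2) at 7/8 of the lease.
lease_timers() {
    lease=$1
    echo "renewal=$((lease / 2)) rebinding=$((lease * 7 / 8))"
}

lease_timers 86400   # prints: renewal=43200 rebinding=75600
```

Neither timer is anywhere near the 5-120 minute interval of the hangs, so the lease timing alone is an unlikely trigger.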

 *mv wrote:*   

> I also realized that it is related to network activity: if I download a huge file, the connection practically never breaks down. The longer the period of inactivity, the higher the chance that the net service has to be restarted.

 

What sort of interface are we talking about here, wired, wireless? Are you doing any packet filtering? Anything you might have set via sysctl related to net.{core,ipv4}, or similarly with regard to .config?

best ... khay

----------

## mv

Sorry for my late reply: Currently, it takes many hours until the problem occurs, and then the network is back very quickly, so testing is very slow at the moment.

(There have been other phases where the problem occurred every 1-2 minutes, and the network did not come back at all until I did /etc/init.d/dhcpcd restart).

 *khayyam wrote:*   

> What sort of interface are we talking about here, wired, wireless?

 

Wired (ethernet cable to the provider's router+modem; the latter over telephone cable to the provider).

 *Quote:*   

> Are you doing any packet filtering?

 

Normally yes, but the problem also occurred with empty iptables (and policy accept) after some hours.

 *Quote:*   

> Anything you might have set via sysctl related to net.{core,ipv4}

 

This is an interesting idea. Of course, I used the defaults from basesystem's /etc/sysctl.conf

However, I had changed a few values - I forgot why:

```
ipfrag_time 20 (default seems to be 30 meanwhile)
igmp_max_memberships 1 (default is 20)
ip_local_port_range 1024-61000
```

I have now reset these three settings to their defaults, and since then the error has not occurred (though this does not necessarily mean anything).

The igmp* setting is probably unimportant, since netstat -g says that it is not supported on my system at all.

I guess the ip_local_port_range is not important either, though I find it strange that the system's default is 32768-61000, although http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml says that it should "officially" be 49152-65535.

So if the removal really solved the issue (though I still have some doubts), I suppose that the ipfrag_time of "only" 20 (instead of 30) seconds was the reason. That is hard to believe too, isn't it?
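To confirm which values are really in effect after such a change, the three settings can be read back from /proc/sys. This is a sketch only: `show_net_sysctls` is a hypothetical helper, and the optional root argument exists merely so that it can be tried against a copy of the tree.

```shell
# Hypothetical helper: print the three IPv4 settings discussed above.
# The optional argument is the sysctl root (default: /proc/sys).
show_net_sysctls() {
    root=${1:-/proc/sys}
    for key in ipfrag_time igmp_max_memberships ip_local_port_range; do
        file=$root/net/ipv4/$key
        if [ -r "$file" ]; then
            printf 'net.ipv4.%s = %s\n' "$key" "$(cat "$file")"
        else
            printf 'net.ipv4.%s = (not available)\n' "$key"
        fi
    done
}

# On a live system:
#   show_net_sysctls
```

Comparing this output before and after editing /etc/sysctl.conf shows whether a changed value was actually applied.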

 *Quote:*   

> --rebind

 

I have had no chance to test this yet (there has been no hang for many hours; at least none that I noticed).

 *Quote:*   

> or similarly with regard to .config?

 

Not sure what you mean here. Deleting the comment lines with unset options to save space, I have the following networking section in my .config:

```
#
# Networking options
#
CONFIG_PACKET=m
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=m
CONFIG_XFRM_USER=m
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IP_TUNNEL=m
CONFIG_NET_IPGRE=m
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
```

----------

## krinn

Here's a real good candidate for you mv: https://bugs.gentoo.org/show_bug.cgi?id=477356

As the note may not be in all ebuilds, you might not have seen it, or maybe you've been a bad boy :)

----------

## khayyam

 *mv wrote:*   

> Sorry for my late reply: Currently, it takes many hours until the problem occurs, and then the network is back very quickly, so testing is very slow at the moment. (There have been other phases where the problem occurred every 1-2 minutes, and the network did not come back at all until I did /etc/init.d/dhcpcd restart).

 

mv ... no problem, I'm going to assume the problem is fixed (based on the above) but bump if this is not the case and I'll look closer.

 *krinn wrote:*   

> Here's a real good candidate for you mv: https://bugs.gentoo.org/show_bug.cgi?id=477356
> 
> as the note may not be in all ebuild, you might have not seen it, or maybe you've been a bad boy

 

krinn ... could be, I have duid commented in dhcpcd.conf, pass -4 as a parameter, and don't have the reported issue. Definitely not the messiah, definitely a very bad boy ;)

best ... khay

----------

## UberLord

 *mv wrote:*   

> Now I found a way to "remove" the hang:
> 
> When the system hangs, executing 
> 
> ```
> ...

 

If you suspect it's a dhcpcd issue, look for any messages it may have logged around the time it hung.

----------

## mv

The problem is not solved. I can now confirm that 

```
dhcpcd -n lan0
```

 makes the network come back (lan0 is my ethernet interface, of course). There are no messages logged at all, and with tcpdump I never found anything either; packets are just delayed when the network hangs. That is why I am still afraid that the culprit is the ethernet controller. But in that case, I do not understand why sending the above command makes the network come back. (Concerning the ethernet controller: installing the firmware made no difference, and neither did using the r8168 driver instead of the in-kernel r8169 driver. All recent tests are with firmware and the r8169 driver.)

The only log messages I find from dhcpcd are the expected ones when I send the above command:

 */var/log/all wrote:*   

> 06:01:41 26.01.16 [dhcpcd] unknown option `noipv4ll'
> 
> 06:01:41 26.01.16 [dhcpcd] sending commands to master dhcpcd process
> 
> 06:01:41 26.01.16 [dhcpcd] control command: dhcpcd -n lan0
> ...

The "duid" thing is probably not related, since this is commented out in my config and I am using "clientid" instead (though the problem also appeared with "duid" instead of "clientid"). The installed version of dhcpcd is

 *installed wrote:*   

> dhcpcd-6.10[embedded -ipv6 -udev elibc_glibc kernel_linux]

  My current dhcpcd config file is (comments removed)

 */etc/dhcpcd.conf wrote:*   

> waitip
> 
> hostname
> 
> clientid
> ...

I am aware that noipv4ll is superfluous in the config due to -ipv6 (hence the log error above) and that the last line is also superfluous (I have only the default hooks installed).

I had removed "slaac private", "required dhcp_server_identifier" and most of the options (except the line with domain_name_servers) in various combinations without any effect. I had also tried to start dhcpcd with the option -E and had tried to add "option persistent".

Edit: In the recent tests, I had added back in all /proc/sys/net/*/ the options 

```
secure_redirects 1
log_martians 1
```

 for security reasons; neither is the default. I am currently trying again without these and will report back if omitting them surprisingly solves the problem.

Edit 2: I am using a hardened (PAX) kernel; the only explicit network-related option activated there is CONFIG_GRKERNSEC_NO_SIMULT_CONNECT. But the issue has persisted across many kernel versions; I doubt that this is really related, though for lack of other ideas, I will perhaps also try disabling it.

----------

## mv

As expected, the change from the previous Edit 1 did not help.

Instead of redirecting the tcpdump output, I looked at it now on screen, and it continued during the hang. The hang started somewhere during the following output (probably near the end).

I already observed earlier that the ARP messages (with "oui unknown") always appear in the neighborhood of the hang. However, these messages often appear also when there is no hang, so it might be a red herring.

I edited the output manually to shorten it and to avoid publishing local hardware, IP addresses, or hostname.

The hang occurred while I was connected with ssh to machine xxx.xxx.xxx.xx.

When asking for my external IP with whatsmyip after the hang, it was not yyy.yyy.yyy.yy, so maybe the problem occurred at the moment when the provider wanted to assign me a new external IP? (On the other hand, I cannot imagine this, because sometimes the hangs come in succession every few minutes).
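To check whether the ARP messages really cluster around the hangs, one could count ARP lines per minute in a saved tcpdump log and compare the result with the hang times. This is only a sketch: `arp_per_minute` is a made-up name, and it assumes the default tcpdump text output, in which the second field of an ARP line is "ARP,".

```shell
# Hypothetical helper: count ARP lines per minute (HH:MM) in tcpdump text
# output read from stdin. Assumes lines like
#   09:18:55.000001 ARP, Request who-has ... tell ..., length 28
arp_per_minute() {
    awk '$2 == "ARP," { count[substr($1, 1, 5)]++ }
         END { for (m in count) print m, count[m] }' | sort
}

# On a saved log (the file name is an example):
#   arp_per_minute < logfile
```

If minutes with hangs do not stand out from minutes without them, the ARP traffic is more likely a red herring.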

 *tcpdump wrote:*   

> 09:18:54.438675 IP 192.168.0.11.47746 > xxx.xxx.xxx.xx.xxx: Flags [P.], seq 89381:89425, ack 565796, win 1424, options [nop,nop,TS val 2487969 ecr 1145468934], length 44
> 
> 09:18:54.447200 IP xxx.xxx.xxx.xx.xxx > 192.168.0.11.47746: Flags [P.], seq 565796:565880, ack 89425, win 385, options [nop,nop,TS val 1145468944 ecr 2487969], length 84
> 
> [ many similar ]
> ...

 

----------

## khayyam

mv ...

I think I've just blown a fuse trying to make sense of your obfuscation :) ... are you saying, wrt 'whatsmyip', 'yyy.yyy.yyy.yy' is the IP of the external interface? If so then "who-has yyy.yyy.yyy.yy [...] tell yyy.yyy.yyy.yy" is gratuitous and I'm fairly sure you shouldn't see this locally.

best ... khay

----------

## Logicien

 *mv wrote:*   

> also using the r8168 driver instead of the in-kernel r8169 driver had made no difference.

 

Why not compile r8169 support as a module instead of in the kernel image? More than one problem has been resolved by this change, e.g. with wireless and Bluetooth support.

----------

## mv

 *khayyam wrote:*   

>  are you saying, wrt 'whatsmyip', 'yyy.yyy.yyy.yy' is the IP of the external interface?

 

No. I know neither which IP yyy.yyy.yyy.yy is (it is not a local IP like 192.168...), nor which external IP I had at the moment when the log was written. When I checked immediately after the log, the external IP was different from yyy.yyy.yyy.yy. According to what you say, it was probably different all of the time. So I simply don't know which machine had the IP yyy.yyy.yyy.yy. Perhaps some machine from outside which has accessed my IP by accident.

----------

## mv

 *Logicien wrote:*   

>  *mv wrote:*   also using the r8168 driver instead of the in-kernel r8169 driver had made no difference. 
> 
> Why not compile r8169 support as a module instead of in the kernel image?

 

I had not written this clearly: It is compiled as a module. By "in-kernel" I meant from the original kernel sources, in contrast to net-misc/r8168 which has externally maintained sources.

----------

