# hints on network troubleshooting

## Strupniveral

Hello guys, 

I'm looking for some hints on how to find the root cause of a network issue I've been experiencing recently.

To be honest, I'm not sure the root cause is network-related, even though the only side effect I see is network-related.

The "incriminated" box is a small gentoo-based home server, used as both media server and router/firewall for the other PCs.

I have had this machine running for the last 2+ years without a single issue. Starting 2-3 weeks ago, I've been experiencing network hangs where all the PCs are unable to perform any network-related activity: all the internet traffic times out, as does any other activity directed at the gentoo box (ping, ssh, DHCP requests - using both IP and hostname).

Note that the issue is clearly between my clients and the server and (mainly) network-related: I'm able to log in directly on the gentoo box, every command runs fine (even if a bit slower than usual while the issue is in progress), and from that box I'm able to reach any host on the Internet. Also, I can see the "rx_missed" counter increase drastically on the internal interface while the issue is in progress.

While there's no recent change (at least last 3 months) on the gentoo box itself, the issue started when we added a new smartphone and a voip phone to the picture. I'm not really sure if this could be the issue, but the network hung twice the same day while we were running some test calls - on the other days it takes at least 48h to appear after a system restart.

With the issue in progress, I can't see any relevant error in the log files, load average is fine (<0.3), RAM usage is fine (about 150M out of 1G in use and 800M cached - periodically releasing the cache didn't solve the issue) and the number of connections shown by netstat (both incoming and outgoing) is normal.

My current idea is to check the connections forwarded by the gentoo box - but I don't know how to list them...
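On a Linux router, forwarded and NATed connections are not local sockets, so netstat does not list them; the kernel keeps them in the netfilter connection-tracking table. Below is a minimal sketch for summarising that table per client. It assumes connection tracking is enabled and that the kernel exposes `/proc/net/nf_conntrack` (older kernels use `/proc/net/ip_conntrack`, and `conntrack -L` from conntrack-tools prints similar lines). The function name `count_by_source` is made up for this example.

```shell
# Sketch: count tracked connections per originating address.
# Reads conntrack-style lines on stdin (assumed source: /proc/net/nf_conntrack).
# count_by_source is an illustrative name, not an existing tool.
count_by_source() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^src=/) {           # the first src= field is the original direction
                count[substr($i, 5)]++    # strip the leading "src="
                break
            }
        }
    }
    END { for (addr in count) print count[addr], addr }' | sort -rn
}
```

It would typically be run as root with `count_by_source < /proc/net/nf_conntrack`. Where available, comparing `/proc/sys/net/netfilter/nf_conntrack_count` with `nf_conntrack_max` in the same directory shows whether the table is close to full.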

Also, I'm running out of ideas, so I'm wondering if any of you have any hints to share.

Thanks in advance and Regards,

Luca

----------

## Quincy

Sounds like a real problem with your new phone, as everything else was running fine until then.

What happens when the issue is in progress and you disconnect the phone? Does everything get back to normal?

My first guess would be a duplicated IP address between both systems (server and phone).

----------

## Strupniveral

Thanks for the reply Quincy.

I'm sure we can exclude IP-related stuff from the game: the gentoo box also runs a DHCP server; the phone itself gets the proper IP every time (addresses are assigned based on MAC address); plus, the issue doesn't appear immediately after the device connects to the network.

Regarding the other question: no, the issue doesn't disappear once the device is disconnected.

What I noticed is that restarting the eth interface stops the "rx_missed" counter from increasing and the devices are able to connect again - but after doing so the issue re-appears sooner (minutes to hours). A full restart of the machine grants hours to days of "correct" activity.

That's why I was wondering about connections not being correctly closed/handled. I set up a small cron job to collect the output of "lsof -i | wc -l" and other stats to see if some value is maxed out / increases incorrectly...
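A minimal sketch of what such a sampling job could look like. The log path, the timestamp layout and the function names are assumptions made for this example, not the script actually used here.

```shell
#!/bin/sh
# Sketch of a periodic sampler; LOG and all function names are illustrative.
LOG="${LOG:-/tmp/net-stats.log}"

# Print one "date time ==> value" line for the value passed as $1.
format_sample() {
    printf '%s ==> %s\n' "$(date '+%Y/%m/%d %H.%M')" "$1"
}

# Count the lines printed by `lsof -i` (header line included);
# prints 0 if lsof is not installed.
count_inet_sockets() {
    if command -v lsof >/dev/null 2>&1; then
        lsof -i 2>/dev/null | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Append one sample to the log; intended to be called once per cron run.
record_sample() {
    format_sample "$(count_inet_sockets)" >> "$LOG"
}
```

With a final `record_sample` line added, the script could be run from cron every few minutes, so that the samples taken just before and during a hang can be compared afterwards.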

But any other kind of check to perform is more than welcome.


----------

## Strupniveral

Ok, it happened again and I had a chance to run some other tests:

1. I was able to SSH from outside my LAN to my gentoo box

-> This confirms the issue is only related to internal LAN

2. The issue is not related to the number of open connections

-> the output of "lsof -i | wc -l" reported the following values

```
2012/01/03 19.06 ==> 57
[...]
2012/01/04 21.00 ==> 74
2012/01/04 21.05 ==> 60  <-- This was taken while the issue was present.
2012/01/04 21.11 ==> 57  <-- This was taken after restarting both ifaces, dhcpd and dns server
```

-> The same holds for the output of "lsof | wc -l", ranging from ~2.0k to ~2.4k

3. The only value I can see changing among the rx_* / tx_* values is "rx_missed" on the internal iface.

```
2012/01/03 19.06 ==> 27896
[...]
2012/01/04 21.00 ==> 27896
2012/01/04 21.05 ==> 28352  <-- Again, this was taken while the issue was present.
2012/01/04 21.11 ==> 28352  <-- This was taken after restarting both ifaces, dhcpd and dns server
```

As far as I know, "rx_missed" counts the packets the interface could not handle because they arrived faster than the iface/system could process them.

That could make sense given that a new device appeared on my network, but I think it's highly improbable that this is the issue: I used to have a laptop connected to the same network that is not there anymore, and I don't believe a smartphone can generate a higher workload than a PC...

So, I'm wondering if there's a way to actually see why the incoming queue is growing...
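One way to narrow this down is to pin each jump of the counter to a time window, which can then be matched against a packet capture or the logs for the same minutes. A minimal sketch, assuming the samples are stored in the `date time ==> value` layout shown above; `show_increments` is a name made up for this example.

```shell
# Sketch: print only the samples where the counter changed, with the difference.
# Input lines: "<date> <time> ==> <value>"; anything else (e.g. "[...]") is skipped.
# show_increments is an illustrative name, not an existing tool.
show_increments() {
    awk '$3 == "==>" && $4 ~ /^[0-9]+$/ {
        cur = $4 + 0
        if (seen && cur != prev) {
            delta = cur - prev
            print $1, $2, (delta > 0 ? "+" : "") delta
        }
        prev = cur
        seen = 1
    }'
}
```

Fed with the four "rx_missed" samples listed above, it prints a single line, `2012/01/04 21.05 +456`, so the 456 missed packets fall in the five minutes before that sample. A shorter sampling interval would narrow the window further.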

----------

## Quincy

I also don't think that a single device can overflow the queue of your network card by idling around.

Is there any correlation between network load and your problem?

Perhaps the phone was just connected by chance at almost the same time, since the problem does not vanish when it is disconnected from the network.

Actually I'm thinking of some driver/hardware problem in the server. Which network card/driver are you using?

BTW: Where do you get "rx_missed" from? ifconfig and such just report "errors", "dropped" and "overruns" in my case...

----------

## AngelKnight

What's in dmesg around the time the hangs occur?  Possible hardware problems?  Have you updated the kernel lately?  If so, perhaps a driver regression?

----------

## Strupniveral

 *Quincy wrote:*   

> Is there any correlation between network load and your problem?

 

None that I can see - it works without any issue while streaming video to the LAN and then hangs during ordinary web surfing.

Maybe it's something that builds up over time, but I can't find any evidence of that.

 *Quincy wrote:*   

> Actually I'm thinking of some driver/hardware problem in the server. Which network card/driver are you using? 

 

I have two identical network cards:

```
# lspci | grep Ethernet
00:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet (rev 10)
00:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet (rev 10)
```

and I'm using the "r8169" driver, built directly into the kernel.

 *AngelKnight wrote:*   

> Have you updated the kernel lately? If so, perhaps a driver regression?

No, I haven't updated anything recently, so it's the same kernel & driver version as when it was working.

Given that, I'd exclude anything related to the kernel/driver: they worked fine for a long time, and a driver problem should affect both interfaces, not just one...

 *AngelKnight wrote:*   

> Possible hardware problems?

I thought about a HW issue in the interface that happens "randomly", so that the system is able to initialize the card and use it for a while before it occurs, but I find this hard to believe for the following reasons:

1. restarting the interface with the init script fixes the issue - I wouldn't expect that in case of a HW issue

2. while in dmesg I can see the device being initialized at boot, nothing similar is logged when restarting the interface

3. while the issue is in progress, I can see the "rx_missed" counter increase with the ethtool command, the interface is always listed by ifconfig and there is no error in messages/dmesg, so it seems to me the HW is always handled correctly by the OS

 *Quincy wrote:*   

> BTW: Where do you get "rx_missed" from? ifconfig and such just report "errors", "dropped" and "overruns" in my case...

 

It's one of the values shown by "ethtool" command (ethtool -S <network_device>)
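For scripting, a single counter can be pulled out of that output. A minimal sketch, assuming the usual `name: value` layout that `ethtool -S` prints; `get_counter` is a name made up for this example and `eth1` in the usage note is only a placeholder for the internal interface.

```shell
# Sketch: print the value of one counter from `ethtool -S` output read on stdin.
# $1 is the counter name and defaults to rx_missed.
# get_counter is an illustrative name, not part of ethtool.
get_counter() {
    awk -v name="${1:-rx_missed}" -F': *' '
        { key = $1; gsub(/^[ \t]+/, "", key) }   # drop the leading indentation
        key == name { print $2; exit }'
}
```

Usage would look like `ethtool -S eth1 | get_counter rx_missed`, which prints just the number and makes it easy to log the counter at regular intervals.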

----------

## Strupniveral

Ok, I think I might have tracked down the issue; however I'll wait a while to mark the thread as solved, just to be more sure about the issue itself.

Short version: the Galaxy Nexus WiFi antenna has some compatibility issue with (old) wireless APs

(sorry for the lack of details, but I couldn't find any factual data).

Long version: I spent a few hours troubleshooting the issue and ran some searches on Google.

I'm reporting all my searches and conclusions here, hoping someone else will benefit from them.

1. Excluded the issue was load related

Starting a big file transfer from the gentoo box to one pc and from the internet to another pc (via the gentoo box) works smoothly.

Doing some random browsing from the PCs and from the mobile causes the hang.

2. The issue is not at OS/application level

No error in log files. All the services are acting fine in terms of resource usage. CPU/ram/vmem/proc # don't change before and after the issue

3. No hardware issue

Everything was working fine before the mobile was purchased. While the device is not in use everything works fine - both cable-connected and wifi-connected devices.

The issue starts a while (minutes to a few hours) after the device connects to the wifi network.

4. The device is known to have issues with the WiFi antenna

When I became strongly convinced the issue was related to the phone, I ran some specific queries on Google, which returned a significant number of results.

While I wasn't able to find any official statement, Google turns up a lot of users complaining about the behavior of the WiFi. Many of them had working networks that started to behave strangely when the device was introduced in their homes.

There are many "software" suggestions out there (statically assigning an IP instead of using DHCP, forcing 2.4GHz vs 5GHz, forcing a specific channel, etc.); none of them worked in my case (and no one else who reported a similar issue confirmed it was solved in any of those ways).

Some users reported that they got a new AP and their issues disappeared (no one stated that the issue reappeared after buying a new AP).

5. My solution (workaround would be better)

I turned off my wireless AP and shared the network connection from a PC (sharing its ethernet connection via the wireless card), so that I could spot which link would hang and act accordingly.

Now, after 4 days with this setup, the issue hasn't happened again.

Again, there was no change in daily network usage: the old devices are working as before and the new phone is being used without any constraint.

So it looks like there is some incompatibility between the Galaxy Nexus and some access points - it's not clear in which way, however.

I'll run some other tests and let you know.

----------

