# How do I diagnose intermittent network outages?

## pente

My gentoo desktop had no network problems until it was moved to its current network, a standard home network. The network has a router with several devices connected over wifi; the desktop is the only device connected with an ethernet wire. The other devices on the network include another gentoo machine and a mac; everything is using dhcp.

Frequently (more than once a day) and unpredictably the desktop loses network access for seconds to minutes, sometimes as long as 15 minutes. The other devices do not experience this. When this happens, ping fails to reach the router, other devices on the network, and devices outside the network. Ping sometimes gives "Destination Host Unreachable" error messages and sometimes prints nothing.

Example of pinging other devices on the same network. Note when the network comes back, several pings return at the same time.

```
From 192.168.1.239 icmp_seq=93 Destination Host Unreachable

From 192.168.1.239 icmp_seq=94 Destination Host Unreachable

From 192.168.1.239 icmp_seq=95 Destination Host Unreachable

64 bytes from 192.168.1.79: icmp_seq=96 ttl=255 time=2238 ms

64 bytes from 192.168.1.79: icmp_seq=97 ttl=255 time=1218 ms

64 bytes from 192.168.1.79: icmp_seq=98 ttl=255 time=195 ms

64 bytes from 192.168.1.79: icmp_seq=99 ttl=255 time=3.10 ms

64 bytes from 192.168.1.79: icmp_seq=100 ttl=255 time=4.56 ms

64 bytes from 192.168.1.79: icmp_seq=101 ttl=255 time=2.83 ms
```

Pings from other devices on the network do not reach the desktop when it experiences an outage. There is no network degradation except when it goes out entirely. Outages seem to be unrelated to network usage. Power-cycling the router or restarting dhcpcd during an outage does not solve it.

I don't have any ideas how to go about diagnosing a problem like this, or what information might be useful. Any suggestions would be appreciated.

----------

## DespLock

Hi pente, 

to have any idea of what is going on with your network, you (and we) need some more information. 

Please run the following commands once:

```
uname -a
lspci -vk    # only the part about your network device(s) is needed
lsusb

# If you are using OpenRC, also run:
rc-update
cat /etc/conf.d/net
```

The following commands should be run twice, once while the network is stable and once while it is not working:

```

ip a

ip route show

```

Further information:

a) are you using a firewall on the desktop or the router?

b) are you using networkmanager, with or without systemd?

EDIT:

It's most likely one of these three reasons:

1) a defective network cable

2) some kind of wrong configuration

3) bug with dhcpd (see https://forums.gentoo.org/viewtopic-t-1124935-highlight-dhcpd.html).

Last edited by DespLock on Thu Jan 14, 2021 2:25 am; edited 2 times in total

----------

## Buffoon

I'd run ethtool to see the status of connection for starters.

----------

## Tony0945

Bad cable? Can you use another, or is it snaking through the walls? Bad switch? Is it connected to the switch built into the router or a different one? Try a different slot on the switch.

Swapping cables is a pretty standard test. Did you terminate the cable, or is it premade? Corroded connections? Cat-6?

One test I made in a similar situation was to connect to an AP. PC had no problems. Intermittent on the 60 foot Cat-6 cable. Wound up replacing the connector on the other end. Didn't like the way the wires looked. Solved the problem. Great cable. US government surplus. US government always buys top quality supplies (well except for mess hall food and the raw food was probably grade A).

----------

## pente

Thanks for the tips so far.

```

$ uname -a

Linux athena 4.20.4-gentoo #2 SMP Sat Jan 26 04:26:04 2019 x86_64 Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz GenuineIntel GNU/Linux

$ lspci -vk

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V

        Subsystem: ASRock Incorporation Ethernet Connection (2) I218-V

        Flags: bus master, fast devsel, latency 0, IRQ 29

        Memory at efc00000 (32-bit, non-prefetchable) [size=128K]

        Memory at efc3c000 (32-bit, non-prefetchable) [size=4K]

        I/O ports at f080 [size=32]

        Capabilities: [c8] Power Management version 2

        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+

        Capabilities: [e0] PCI Advanced Features

        Kernel driver in use: e1000e

 $ lsusb

Bus 002 Device 002: ID 8087:8001 Intel Corp. 

Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 001 Device 002: ID 8087:8009 Intel Corp. 

Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 004 Device 003: ID 067b:2731 Prolific Technology, Inc. USB SD Card Reader     

Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub

Bus 003 Device 010: ID 05f3:0007 PI Engineering, Inc. Kinesis Advantage PRO MPC/USB Keyboard

Bus 003 Device 009: ID 05f3:0081 PI Engineering, Inc. Kinesis Integrated Hub

Bus 003 Device 004: ID 05e3:0745 Genesys Logic, Inc. Logilink CR0012

Bus 003 Device 002: ID 046d:c52b Logitech, Inc. Unifying Receiver

Bus 003 Device 005: ID 046d:c52b Logitech, Inc. Unifying Receiver

Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

$ rc-update

            alsasound | boot                                   

               binfmt | boot                                   

             bootmisc | boot                                   

              cgroups |                                 sysinit

              chronyd |      default                           

               cronie |      default                           

                cupsd |      default                           

                 dbus |      default                           

                devfs |                                 sysinit

                dmesg |                                 sysinit

              elogind | boot                                   

                 fsck | boot                                   

              hddtemp |      default                           

             hostname | boot                                   

              hwclock | boot                                   

              keymaps | boot                                   

            killprocs |                        shutdown        

    kmod-static-nodes |                                 sysinit

                local |      default nonetwork                 

           localmount | boot                                   

             loopback | boot                                   

                  lvm | boot                                   

              metalog |      default                           

              modules | boot                                   

             mount-ro |                        shutdown        

                 mtab | boot                                   

          net.enp0s25 |      default                           

             netmount |      default                           

     opentmpfiles-dev |                                 sysinit

   opentmpfiles-setup | boot                                   

               procfs | boot                                   

                 root | boot                                   

         save-keymaps | boot                                   

    save-termencoding | boot                                   

            savecache |                        shutdown        

                 sshd |      default                           

                 swap | boot                                   

               sysctl | boot                                   

                sysfs |                                 sysinit

         termencoding | boot                                   

                 udev |                                 sysinit

         udev-trigger |                                 sysinit

              urandom | boot                                   

$ ls /etc/conf.d/net*

/etc/conf.d/net-online  /etc/conf.d/net.wlp0s20u1  /etc/conf.d/netmount

$ ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000

    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo

       valid_lft forever preferred_lft forever

    inet6 ::1/128 scope host 

       valid_lft forever preferred_lft forever

2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000

    link/ether XXXXXXX brd ff:ff:ff:ff:ff:ff

    inet 192.168.1.239/24 brd 192.168.1.255 scope global dynamic noprefixroute enp0s25

       valid_lft 86107sec preferred_lft 75307sec

    inet6 XXXXXX/64 scope link 

       valid_lft forever preferred_lft forever

3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000

    link/sit 0.0.0.0 brd 0.0.0.0

$ ip route show

default via 192.168.1.1 dev enp0s25 proto dhcp src 192.168.1.239 metric 2 

192.168.1.0/24 dev enp0s25 proto dhcp scope link src 192.168.1.239 metric 2 

```

I'm not aware of any firewalls, and am not sure how I'd go about checking for one. I have networkmanager installed but I don't know what it does and don't think I am "using" it. No systemd. I'm not familiar with how to use ethtool, and a quick look at the documentation didn't help.

I forgot to assess the premade cable. I can inspect the whole cable and it appears fine. I think it's the same one as was being used previously without incident. I will replace it when I have the chance. I have changed which slot it is plugged into and will cycle through all of them if I continue to experience problems.

dhcpcd version 9.1.4.

----------

## Tony0945

I use netifrc so I can't help with networkmanager.  What does "ifconfig -a" show?

----------

## pjp

Do your logs have anything relevant? Particularly dmesg for interfaces (likely beginning with eth and enp). Also dhcp logs, probably in /var/log/syslog or /var/log/messages. Especially useful if you can identify a time period when the problem occurred. Don't forget to verify that your time is correct so that the logged time is appropriate. Also, check whichever system is handing out the addresses.
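
One way to identify the time period is to log timestamped pings and pull the outage windows out afterwards. A minimal sketch, assuming iputils ping (its `-D` option prefixes each line with an epoch timestamp in brackets) and assuming the error lines carry that timestamp too; the function name `outage_windows` is made up for illustration:

```sh
# Read `ping -D` output on stdin and print one line per outage:
# the first failed ping, the first successful ping after it, and the gap.
outage_windows() {
    awk '
        # Only consider lines that start with "[epoch.micros]".
        match($0, /^\[[0-9.]+\]/) {
            ts = substr($0, 2, RLENGTH - 2)
            if ($0 ~ /Unreachable/) {
                # Remember when the current outage started.
                if (start == "") start = ts
            } else if ($0 ~ /bytes from/ && start != "") {
                printf "outage from %s to %s (%.0f s)\n", start, ts, ts - start
                start = ""
            }
        }
    '
}
```

For example, leave `ping -D 192.168.1.1 > ping.log` running, then run `outage_windows < ping.log` after an incident. The epoch values can be turned into wall-clock time with `date -d @EPOCH` and compared against `dmesg -T`.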

----------

## DespLock

Hi pente,

at first glance your config seems to be ok and I can't find anything unusual except for

```

Linux athena 4.20.4-gentoo

[...]

dhcpcd version 9.1.4.

```

Both versions are outdated. Is this the state of the whole system?

Please at least make a note when you manually edit the output of commands or logs, as in:

```

[...]

link/ether XXXXXXX

[...]

inet6 XXXXXX/64

[...]

```

Does your /etc/conf.d/net contain more entries than this for that interface?

```

config_enp0s25="dhcp" 

```

 *Quote:*   

> Do your logs have anything relevant?

Please check the output of the following for any entries that stand out:

```

dmesg | grep e1000e

dmesg | grep enp0s25

dmesg | grep firmware

```

What i would do next: 

1) try another port on the router for the network cable and make sure the cable is properly plugged in on both sides

2) test another cable

3) if you haven't applied updates for some time, I would try a current live CD/USB such as Fedora and see if the problem still occurs

4) enable logging for dhcpcd if you aren't logging it already (man 5 dhcpcd.conf)

5) don't forget to run these again in case of an incident:

```

ip a

ip route show 

```
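
Since the outages are short and unpredictable, it may be easier to let a script take the snapshot. A rough sketch, not a finished tool: `capture_on_failure` and the `INTERVAL` variable are names invented here, and the probe and snapshot commands are passed in as strings so they can be adapted:

```sh
# Run the probe command repeatedly; on the first failure run the
# snapshot command once and stop.
#   $1 = probe command, $2 = snapshot command, $3 = max attempts (default 10)
# Returns 0 if a failure was captured, 1 if the probe never failed.
capture_on_failure() {
    probe=$1
    snapshot=$2
    tries=${3:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ! sh -c "$probe" >/dev/null 2>&1; then
            sh -c "$snapshot"
            return 0
        fi
        i=$((i + 1))
        # Pause between probes; INTERVAL is in seconds (default 1).
        sleep "${INTERVAL:-1}"
    done
    return 1
}
```

A possible invocation, assuming the router is at 192.168.1.1 as in the routing table posted above: `capture_on_failure 'ping -c1 -W2 192.168.1.1' '{ date; ip a; ip route show; dmesg | tail -n 20; } >> /tmp/outage.log' 86400`.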

----------

## pietinger

pente,

first of all, diagnosing intermittent network outages is one of the hardest jobs.

You said the only change was moving your desktop to another router. Therefore my first thought is: cable or router.

If it is not the cable, it should be the router. And yes, buggy ones do exist.

You are using dhcp. This is the first thing I would change. Give your desktop a static IP address. Make no other changes (no updates either); change only one thing at a time.

If you still have problems, you know it wasn't dhcp ... and we have to investigate further ... else   :Smile: 
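
For reference, with netifrc a static setup is a few lines in /etc/conf.d/net. A sketch only, assuming the interface name and addresses from the `ip a` and `ip route show` output posted above; the DNS server is a guess (the router itself), and the address should be outside the router's DHCP range or reserved there:

```sh
# /etc/conf.d/net (netifrc): static address instead of DHCP.
# Address and gateway are taken from the posted output; adjust as needed.
config_enp0s25="192.168.1.239/24"
routes_enp0s25="default via 192.168.1.1"
# Assumption: the router also answers DNS queries.
dns_servers_enp0s25="192.168.1.1"
```

After editing the file, restart the interface service (net.enp0s25 in the rc-update listing above) so the new settings are applied.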

----------

## pietinger

 *pente wrote:*   

> I'm not aware of any firewalls, and am not sure how I'd go about checking for one.

 

Usually if you don't know, then you don't have one ...  :Wink:  And indeed, you don't have a firewall on your desktop (your rc-update output shows neither iptables nor nftables being started).
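
The same check can be scripted. A small sketch that scans `rc-update` output for firewall services; the list of service names is an assumption and certainly not exhaustive, and `firewall_services` is a made-up name:

```sh
# Read `rc-update` output on stdin and print any service whose name
# matches a (hypothetical, incomplete) list of firewall services.
firewall_services() {
    awk -F'|' '
        {
            # The service name is the first column, padded with spaces.
            name = $1
            gsub(/[ \t]/, "", name)
        }
        name ~ /^(iptables|ip6tables|nftables|firewalld|ufw|shorewall)$/ {
            print name
        }
    '
}
```

Usage would be `rc-update | firewall_services`; no output means none of the listed services is enabled. Rules loaded by other means would only show up in `iptables -S` or `nft list ruleset`, run as root.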

----------

## Buffoon

 *Quote:*   

>  I'm not familiar with how to use ethtool, and a quick look at the documentation didn't help.

 

How very interesting, you can post here but you have no access to search engines to search for "how to use ethtool". Well, ethtool would show you instantly if this is the cable causing the trouble. Like this, my working connection:

```

        ...

        Speed: 1000Mb/s

        Duplex: Full

        ...

 
```

But of course, if you can't use it then don't. Keep guessing, more fun?

----------

## DespLock

 *Quote:*   

> 
> 
>    ...
> 
>         Speed: 1000Mb/s
> ...

 

Doesn't that just show that the two network cards auto negotiated a network connection @1000Mb/s and full duplex?   :Cool: 

There are still a number of possible reasons if they don't.

EDIT: and that output might look the same while the connection is stable. I would rather say that the intermittent occurrence is a better indicator of a defective cable than your posted output.

----------

## Tony0945

Buffoon, no need to be nasty. He said he found the documentation but didn't understand it. No crime in that.

----------

## Tony0945

Pente, what is the make and model of the new router, and of the old router if you still have it?

----------

## gengreen

I had a similar problem a while ago. My controller is:

 *Quote:*   

> Ethernet controller: Intel Corporation Ethernet Connection (7) I219-V

 

If I remember correctly, the problem came from the e1000e module.

To mitigate the connectivity losses, I used ethtool as follows:

```

ethtool -K eno1 gso off gro off tso off tx off rx off

ethtool -s eno1 speed 10 duplex full

```

A more recent kernel could solve your problem; I'm on 4.19.152 and haven't experienced the problem again.

Note that changes made with ethtool are temporary; restarting your network interface resets them to the defaults.

You can check basic information with a simple

```
ethtool eno1
```

Obviously, replace eno1 with your own interface name.
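
To compare the link state during and outside an outage, the interesting fields can be pulled out of that output. A minimal sketch; `link_summary` is a name invented for this example, and it only looks at the `Speed`, `Duplex` and `Link detected` lines:

```sh
# Read `ethtool IFACE` output on stdin and print a one-line summary.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/         { speed = $2 }
        /^[ \t]*Duplex:/        { duplex = $2 }
        /^[ \t]*Link detected:/ { link = $2 }
        END { printf "speed=%s duplex=%s link=%s\n", speed, duplex, link }
    '
}
```

For example `ethtool eno1 | link_summary`, run once while the network is fine and once during an outage. If the link renegotiates at a lower speed or drops during the outage, that would point at the physical layer.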

----------

## pente

I saw some odd things in dmesg, more on that at the end. I have no information on the old router, it was in a commercial setup, not a typical home router. New router is asus tm-ac1900.

```

$ ethtool enp0s25

Settings for enp0s25:

        Supported ports: [ TP ]

        Supported link modes:   10baseT/Half 10baseT/Full 

                                100baseT/Half 100baseT/Full 

                                1000baseT/Full 

        Supported pause frame use: No

        Supports auto-negotiation: Yes

        Supported FEC modes: Not reported

        Advertised link modes:  10baseT/Half 10baseT/Full 

                                100baseT/Half 100baseT/Full 

                                1000baseT/Full 

        Advertised pause frame use: No

        Advertised auto-negotiation: Yes

        Advertised FEC modes: Not reported

        Speed: 1000Mb/s

        Duplex: Full

        Port: Twisted Pair

        PHYAD: 1

        Transceiver: internal

        Auto-negotiation: on

        MDI-X: on (auto)

Cannot get wake-on-lan settings: Operation not permitted

        Current message level: 0x00000007 (7)

                               drv probe link

        Link detected: yes

$ ifconfig -a

enp0s25: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

        inet 192.168.1.239  netmask 255.255.255.0  broadcast 192.168.1.255

        inet6 XXXXXX  prefixlen 64  scopeid 0x20<link>

        ether XXXXXX  txqueuelen 1000  (Ethernet)

        RX packets 248392094  bytes 300585770434 (279.9 GiB)

        RX errors 0  dropped 0  overruns 0  frame 0

        TX packets 135507076  bytes 29767207081 (27.7 GiB)

        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

        device interrupt 20  memory 0xefc00000-efc20000  

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

        inet 127.0.0.1  netmask 255.0.0.0

        inet6 ::1  prefixlen 128  scopeid 0x10<host>

        loop  txqueuelen 1000  (Local Loopback)

        RX packets 74341785  bytes 5937779001 (5.5 GiB)

        RX errors 0  dropped 0  overruns 0  frame 0

        TX packets 74341785  bytes 5937779001 (5.5 GiB)

        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

sit0: flags=128<NOARP>  mtu 1480

        sit  txqueuelen 1000  (IPv6-in-IPv4)

        RX packets 0  bytes 0 (0.0 B)

        RX errors 0  dropped 0  overruns 0  frame 0

        TX packets 0  bytes 0 (0.0 B)

        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

```

I have cycled through the slots in the router and upgraded dhcpcd to 9.4.0 without resolving the issue. The kernel is quite out of date as updating is a pain. @world was last updated a few months ago, I checked recently for anything that looked like it particularly needed updating and it seemed fine.

"ip a" and "ip route show" were pretty much unchanged during an outage (I've lost track of which output in my terminal was during the outage or not). Pulling (and restoring) the cable during an outage causes "ip route show" to give no output and ping gave "Network is unreachable" for 26 seconds before returning to "Destination Host Unreachable".

As noted above, I don't have a /etc/conf.d/net file. No dhcpcd logs I could find, I'll have to enable logging. Wiggling the cable connections does not cause any network problems.

I did see something interesting about the timing of these messages in dmesg:

```

[4250615.210839] e1000e: enp0s25 NIC Link is Down

[4250618.342852] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[4250730.770904] e1000e: enp0s25 NIC Link is Down

[4250733.734920] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[4250736.450033] e1000e: enp0s25 NIC Link is Down

[4250739.386080] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[6482964.638479] e1000e: enp0s25 NIC Link is Down

[6482967.591513] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[6483082.219650] e1000e: enp0s25 NIC Link is Down

[6483085.181642] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[6483087.165772] e1000e: enp0s25 NIC Link is Down

[6483090.127801] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[7467309.019115] e1000e: enp0s25 NIC Link is Down

[7467313.957213] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[7467428.862331] e1000e: enp0s25 NIC Link is Down

[7467431.820360] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[7467433.776456] e1000e: enp0s25 NIC Link is Down

[7467436.738482] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[7930109.220327] e1000e: enp0s25 NIC Link is Down

[7930113.658373] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[7930228.785480] e1000e: enp0s25 NIC Link is Down

[7930231.824426] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[7930233.890540] e1000e: enp0s25 NIC Link is Down

[7930236.944594] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[8104968.649715] e1000e: enp0s25 NIC Link is Down

[8104972.309130] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

[8105090.273666] e1000e: enp0s25 NIC Link is Down

[8105093.954350] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

```

I believe none of those correspond to the cable being physically removed. The timing is very odd; in each case the link goes down for 3 seconds, comes back for 115 seconds, goes down for 3 seconds, comes back for 2 seconds, and then goes down for 3 seconds. (Except the last incident is missing the last cycle.) There are several weeks between each of these, although network outages happen more than once a day. I guess this points towards some kind of cable problem, although it is hard to imagine a cable problem that has such regular timing.
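
The intervals can also be computed rather than read off by eye. A sketch using awk on the dmesg lines; `link_intervals` is a name made up here:

```sh
# Read dmesg lines on stdin and, for each e1000e link event, print how
# long the previous state (down or up) lasted, in whole seconds.
link_intervals() {
    awk '
        /NIC Link is (Down|Up)/ {
            # Take the first decimal number on the line as the timestamp.
            if (!match($0, /[0-9]+\.[0-9]+/)) next
            ts = substr($0, RSTART, RLENGTH)
            state = ($0 ~ /Link is Down/) ? "down" : "up"
            if (seen) printf "%s for %.0f s\n", prev_state, ts - prev_ts
            prev_ts = ts
            prev_state = state
            seen = 1
        }
    '
}
```

Run as `dmesg | link_intervals`. For the first incident above this gives "down for 3 s" followed by "up for 112 s"; the 115 seconds is roughly the spacing between the first two "Down" events.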

I'll assume it is a cable problem unless someone has another line of attack to suggest. Not sure how long it'll be before I get a replacement to test.

----------

## C5ace

If you have or can borrow a second PC, buy or make up a short crossover network cable, connect both PCs and run ping.

Alternatively get one of those $10.00 Ethernet hubs and two normal short patch cables to connect the two PCs.

If either works, it's the cable or your router.

Now remove one of the connectors of your long cable and crimp on a new one wired for crossover and try again. If this works, your cable is OK and your router is bad.

Note: some cable modems and old commercial routers require crossover cables between the PCs and the modem or router. Telstra in Australia used to provide such cable modems to their customers.

----------

## DespLock

Hi pente,

given your latest logs, a hardware defect or a bad cable is most UNlikely. In that case you would also see errors in the RX errors/TX errors fields (besides the link messages in dmesg):

 *Quote:*   

> 
> 
> enp0s25: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> 
>         inet 192.168.1.239  netmask 255.255.255.0  broadcast 192.168.1.255
> ...

 

There seems to be a known bug with the e1000e driver in earlier kernel versions. You can work around it (see below, you need ethtool) or install a newer kernel (as gengreen said, a newer LTS kernel 4.19.xxx seems to work).

If you want to stick with your kernel and test the solution run as root:

```

ethtool -K enp0s25 gso off gro off tso off

```

or in case it doesn't work:

```

ethtool -K enp0s25 gso off gro off tso off tx off rx off 

```

and watch the logs. 
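
Besides the logs, the kernel's per-interface counters can be watched while testing. A small sketch that reads them from sysfs; `nic_counters` is a made-up name, and the optional second argument exists only so the function can be pointed at a different directory:

```sh
# Print error counters and the number of carrier (link) changes for an
# interface. $1 = interface name, $2 = base directory (default sysfs).
nic_counters() {
    iface=$1
    base=${2:-/sys/class/net}
    for f in statistics/rx_errors statistics/tx_errors \
             statistics/rx_crc_errors carrier_changes; do
        path="$base/$iface/$f"
        # Skip counters that this kernel or driver does not expose.
        if [ -r "$path" ]; then
            printf '%s=%s\n' "${f##*/}" "$(cat "$path")"
        fi
    done
}
```

Run `nic_counters enp0s25` before and after an outage: a growing `carrier_changes` means the link itself dropped, while growing error counters would hint at the cable or the hardware.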

If it works, the command needs to be run each time after the interface is brought up. This can be done via /etc/conf.d/net:

```
postup() {
  # Called after an interface has been brought up.
  # case is used instead of [[ ]] so this also works when /bin/sh is not bash.
  case "$IFACE" in
    *enp0s25*)
      # Disable segmentation and receive offloads on this interface only.
      ethtool -K enp0s25 gso off gro off tso off
      ;;
  esac

  return 0
}
```

Add this function to the file and restart the service.
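
To confirm that the settings are actually in effect after a restart, the current state can be read back with `ethtool -k` (lowercase k). A sketch that filters its output down to the three offloads changed above; `offload_state` is a name invented for this example:

```sh
# Read `ethtool -k IFACE` output on stdin and print the state of the
# TSO, GSO and GRO offloads only.
offload_state() {
    awk -F': *' '
        $1 ~ /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload)$/ {
            print $1 "=" $2
        }
    '
}
```

For example `ethtool -k enp0s25 | offload_state`; all three should report `off` once the postup function has run.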

----------

## pente

Unfortunately the suggested ethtool commands did not yield any change in behavior. It's unclear to me if I should expect upgrading the kernel to resolve the issue.

Since I figured I am due for a kernel upgrade regardless, I just got through the kernel configuration marathon and am about to give it a try. Along the way I noticed the config_generic_phy option which sounded important, but about which I could find very little information, all of which suggested it was related to ethernet in some way. It is disabled in my current kernel. Probably a red herring, but I thought I'd ask if there is a chance this or some other kernel option might be relevant to my issue?

Some new information: the home network is apparently configured as some kind of mesh network. I am unclear about how those work, so I'm not sure of the implication of losing ping to other devices on the network. As I believe other devices on the network are connected (wirelessly) to the device the desktop has a wired connection to, I think it is unlikely the problem lies with the mesh.

Also, on further observation I now believe the connection failures are correlated with network usage (less frequent on weekends), although I'm not sure if it has more to do with the desktop's usage or others'.

----------

## pente

I believe I have applied all of the suggestions mentioned above, except the part about a crossover cable that I didn't understand how to do or what the goal was. I have upgraded the kernel and @world, and replaced the cable, in addition to the other steps I've mentioned in my previous comments.

There has been no change in the network interruptions I have been experiencing. Any suggestions on how to diagnose the problem would be appreciated, thanks.

----------

## Ralphred

 *pente wrote:*   

> although it is hard to imagine a cable problem that has such regular timing.

 

Not really, the environmental* factors that are pushing a cable from "almost failing" to "failing" can be unpredictable, but the way the protocol handles the timeout for a broken link or negotiation for reconnection are predictable.

Termination points are the things to check first, something like 

```
ping -f [ip_addr]
```

 gives nice visual feedback while you run around testing infrastructure, and it is the first tool I use if I suspect cable failure.
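
Flood ping generally needs root. For an unattended test, the loss figure from ping's closing summary is often enough. A minimal sketch, assuming the usual iputils summary line ("... N% packet loss ..."); `ping_loss` is a made-up helper name:

```sh
# Read ping output on stdin and print the packet loss percentage
# from the summary line (without the % sign).
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For example `ping -c 100 -i 0.2 [ip_addr] | ping_loss`, repeated after each cable or port change, gives a number to compare.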

 *pente wrote:*   

> connection failures are correlated with network usage (less frequent on weekends), although I'm not sure if it has more to do with the desktop's usage or others'.

 

Can you get into the router and see what its load is? I had a client with similar issues due to the sheer number of apple/android devices slurping data and phoning home; a new device from the ISP worked for a couple of months, then the same issue came back. I replaced it with separate modem, router and AP devices** and he's never had issues since.

*These can be anything from temperature changes causing physical stress, to electrical items causing undue interference in unshielded/damp cables etc.

**Though a decent all in one would have probably worked

----------

## gentoo_ram

I saw similar behavior with this driver when my computer was plugged in to a cable modem once.  Intermittent drops of signal.  Once I got the cable modem replaced, the issue went away.  Whatever your computer is plugged into may be having problems.

----------

