# Intermittent ICMP destination host unreachable

## Irayo

Hi,

I'm using Gentoo on a router/gateway box.  I have the latest stable gentoo-sources and pretty standard network settings for a router (DHCP on the WAN-side IP, static LAN-side IP, forwarding enabled).

Mostly, everything works as it should.  But every so often on my client machines (all of them, including various OSes: Linux Mint, Gentoo, OS X, Windows ...), I receive a batch of ICMP "destination host unreachable" responses from the router machine for all types/ports/destinations of traffic.  Each batch of ICMP errors seems to correspond to the preceding 5-10 seconds of packets that my client machines have attempted to send.

This happens at seemingly random intervals, ranging from 30 seconds to 1500 seconds apart, which has made it very difficult to diagnose so far.

Every time I try to search for this problem I receive reports of people having "destination unreachable" errors 100% of the time, but I've yet to find anyone who has this intermittently-occurring disconnect.  It's almost as if my routing table is resetting and repopulating itself every so often, and "forgets" how to route while it does so -- but of course I can't catch anything happening because it happens intermittently, and there are no log messages in system logs or dmesg.

Right now I'm testing whether the router machine itself ever gets "destination unreachable" errors from itself/upstream, or if it is only reporting these errors to clients (this might tell me whether it's a routing issue, hardware issue, or upstream issue depending on the results).  But so far, I have not encountered this issue on the router machine.

Does anyone have an idea what might be causing this issue?

----------

## smerf

Have you ruled out a hardware problem (cable, NIC, link negotiation)?

----------

## szatox

When the destination host is unreachable, the error should also show the address of the host reporting the failure. For example, if you are 10.0.0.100, the router is 10.0.0.1, and you ping Google: a failure reported by 10.0.0.100 means you lost the connection to the router, a failure reported by 10.0.0.1 means the router has no path available, and a failure reported by something further away means someone in your ISP's facility tripped over a wire. This is a better way to find where the problem is than pinging separately from every device.
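The rule of thumb above can be written down as a small Python sketch. The function name `locate_failure` and the wording of the returned messages are made up for illustration; the addresses are the example ones from the post.

```python
import ipaddress


def locate_failure(reporter: str, client: str, router: str) -> str:
    """Guess where the path broke from the address that reported the ICMP error."""
    rep = ipaddress.ip_address(reporter)
    if rep == ipaddress.ip_address(client):
        # The client itself gave up: it could not resolve or reach the router.
        return "client cannot reach the router"
    if rep == ipaddress.ip_address(router):
        # The router answered, so the LAN side works; the router lacks a next hop.
        return "router has no usable path upstream"
    # Any other reporter sits somewhere past the router.
    return "failure is beyond the router (ISP or further)"


if __name__ == "__main__":
    print(locate_failure("10.0.0.1", client="10.0.0.100", router="10.0.0.1"))
```

The reporter address is the one shown in the `From ...` part of a ping error line, or the source address of the ICMP packet in a capture.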

I've run into this kind of issue when I assigned two NICs IPs from the very same subnet, so the machine sometimes tried to send data over the wrong connection. Yeah, I know, stupid mistake.

----------

## Irayo

 *smerf wrote:*   

> Have you ruled out a hardware problem (cable, NIC, link negotiation)?

 

Not yet.  I don't think there are any "hard" disconnects as those would be printed to the kernel log ("link down", "link is not ready"), and the devices are all properly negotiated at 1000 Mbps full-duplex, but there could still be issues with the NIC/cables.  I've been waiting for a good opportunity to shut down.  I don't have a replacement NIC to test, but my plan is to swap the interfaces (WAN<->LAN) and use new cables.  If the problem remains the same, it seems likely to be a configuration or upstream problem; if the problem changes (if I start getting LAN connection issues or something) then it's probably a NIC issue and I'll get a replacement.

----------

## Irayo

 *szatox wrote:*   

> When the destination host is unreachable, the error should also show the address of the host reporting the failure. For example, if you are 10.0.0.100, the router is 10.0.0.1, and you ping Google: a failure reported by 10.0.0.100 means you lost the connection to the router, a failure reported by 10.0.0.1 means the router has no path available, and a failure reported by something further away means someone in your ISP's facility tripped over a wire.

 

Yeah.  The clients see the failure as occurring at my router system (192.168.0.1), so I'm pretty sure I'm losing link/connection/route to the ISP.

From looking at network traffic dumps, it looks like around 2-3 seconds before the host-unreachables are sent, my router sends an ARP request to try to find the ISP's upstream router.  This ARP request goes unanswered.  Either this is a symptom (meaning I've already lost connection to the ISP and that is why the ARP is sent and why I get no response) or it is actually the problem (some misconfigured system doesn't respond to my ARP request as it should so I lose ability to route packets).

After all the destination-host-unreachable messages are sent, another ARP request is attempted and receives a valid response.

As a test, I've inserted a static ARP table entry to see if the problem goes away.  If I haven't seen any problems in a few hours, I'll assume that did the trick and try to figure out why.
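To confirm the state of the upstream router's ARP entry before and after such a test, one option is to read `/proc/net/arp` on the Linux router. The following is a minimal sketch: the function name `arp_entry_state`, the sample table, and the address `203.0.113.1` are all invented for illustration. The flag values are the kernel's `ATF_COM` and `ATF_PERM` bits.

```python
ATF_COM = 0x02   # entry is complete (MAC address known)
ATF_PERM = 0x04  # entry is permanent (static)


def arp_entry_state(arp_table: str, ip: str) -> str:
    """Return 'static', 'resolved', 'incomplete', or 'missing' for the given IP."""
    for line in arp_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or fields[0] != ip:
            continue
        flags = int(fields[2], 16)  # third column holds the flags, e.g. "0x6"
        if flags & ATF_PERM:
            return "static"
        if flags & ATF_COM:
            return "resolved"
        return "incomplete"
    return "missing"


# Made-up sample in the layout of /proc/net/arp.
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
203.0.113.1      0x1         0x6         00:11:22:33:44:55     *        eth0
192.168.0.50     0x1         0x2         66:77:88:99:aa:bb     *        eth1
192.168.0.51     0x1         0x0         00:00:00:00:00:00     *        eth1
"""

if __name__ == "__main__":
    print(arp_entry_state(SAMPLE, "203.0.113.1"))
```

On the router itself, the table text would come from `open("/proc/net/arp").read()`. An entry that flips between `resolved` and `incomplete` around the time of the errors would match the unanswered ARP requests seen in the traffic dumps.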

----------

## smerf

 *Irayo wrote:*   

> I don't think there are any "hard" disconnects as those would be printed to the kernel log ("link down", "link is not ready"), and the devices are all properly negotiated at 1000 Mbps full-duplex, but there could still be issues with the NIC/cables.

 

Renegotiating the link speed does not make the link go down, and even a faulty cable does not always mean hard disconnects.

I once had a situation where a faulty NIC caused a 1000/100 negotiation/degradation cycle every few minutes, and I detected it by continuously monitoring the interface speed with ethtool. It is uncommon but theoretically possible.
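A rough Python sketch of that kind of monitoring follows. Instead of calling ethtool, it reads the negotiated speed from `/sys/class/net/<iface>/speed`, which is a substitution on my part. The function names and the interface name `eth0` are placeholders.

```python
import time
from pathlib import Path


def read_speed(iface: str) -> int:
    """Read the negotiated speed in Mb/s from sysfs; -1 if it cannot be read."""
    try:
        return int(Path(f"/sys/class/net/{iface}/speed").read_text().strip())
    except (OSError, ValueError):
        return -1


def speed_transitions(samples):
    """Return (index, old, new) for every change between consecutive samples."""
    pairs = zip(samples, samples[1:])
    return [(i, old, new) for i, (old, new) in enumerate(pairs, start=1) if old != new]


def watch(iface: str, interval: float = 1.0) -> None:
    """Poll the interface forever and print a line whenever the speed changes."""
    last = read_speed(iface)
    while True:
        time.sleep(interval)
        now = read_speed(iface)
        if now != last:
            print(f"{time.strftime('%H:%M:%S')} {iface}: {last} -> {now} Mb/s")
            last = now


if __name__ == "__main__":
    watch("eth0")  # placeholder interface name
```

Leaving this running in a terminal and comparing its timestamps with the times of the unreachable errors would show whether the two line up.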

----------

## Irayo

So my test of inserting a static ARP table entry for the upstream ISP router seems to have done the trick.  At least until yesterday, when I rebooted... then I started having issues again until I put the static ARP entry back this evening.  Now the problem has gone away again.  So that's almost definitely the issue.

Any ideas why the upstream router is (sometimes) not responding to my ARP queries?  There isn't any packet loss as far as I can tell...

Is there a way to increase ARP request frequency or retries or something? I'm not sure what would help in this situation.
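For reference, Linux exposes per-interface ARP (neighbour) tunables under `/proc/sys/net/ipv4/neigh/<iface>/`, including `mcast_solicit` and `ucast_solicit` (probe counts) and `retrans_time_ms` (interval between probes). The sketch below only reads the current values. The helper name `read_neigh_settings` and the interface name `eth0` are placeholders, and whether changing any of these values helps with this particular problem is untested.

```python
from pathlib import Path

# Neighbour-table tunables relevant to ARP retries and timing.
NEIGH_KEYS = (
    "mcast_solicit",
    "ucast_solicit",
    "retrans_time_ms",
    "base_reachable_time_ms",
)


def read_neigh_settings(iface: str, root: str = "/proc/sys/net/ipv4/neigh") -> dict:
    """Return the current value of each tunable for iface, or None if unreadable."""
    settings = {}
    for key in NEIGH_KEYS:
        path = Path(root) / iface / key
        try:
            settings[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            settings[key] = None
    return settings


if __name__ == "__main__":
    for name, value in read_neigh_settings("eth0").items():  # placeholder interface
        print(f"{name} = {value}")
```

The same values can be changed with `sysctl -w net.ipv4.neigh.<iface>.<key>=<value>`.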

----------

## smerf

Are you directly connected to this router, or are there other devices (like a switch) in between?

If so, then the problem might be on this hypothetical switch. Maybe it is not passing Layer 2 broadcasts correctly?

----------

## smerf

Maybe you are experiencing some form of this issue?

----------

