# [Solved] Very odd problem at datacenter, connection keeps...

## yarug

Hi all,

I'm turning to the forum for a very odd problem I'm experiencing with my colocated servers. I'm by no means a networking expert and as such I have not been able to solve this issue so far. I hope your input can put this to rest.

We have recently changed our colocation provider. Note that we did not have any (technical) problems with our previous colocation provider, connectivity was fine. However, with our current colocation provider, our connection simply stops working after some time (sometimes a day, sometimes sooner - not sure if it stops after a specific amount of data or if it is just random).

The setup is as follows: the provider has a switch with 15 or so ports available. We use one of these ports to connect to our Gentoo router/gateway (eth0); this is the WAN side. That same router/gateway is then connected on the LAN side (eth2) to our own router, and we have 4 more servers connected to that router, forming our network.

The first time we connected to the provider's switch we used port 1. That worked for a day or so until the connection stopped functioning. I could not access our box remotely anymore and when I went on site, I could not connect to the Internet from that box either. However, when I switched to another port (port 6) the connection came back up again. This worked for a day or so again and then the same thing happened: no more connectivity. An engineer from the colocation provider then switched the port again (port 7) and the connection was back up. Same thing once more: port 7 stopped working, so we changed to port 8 (what I'm using now, not sure how long it will last) and I can connect to that box again.

Note that I do not need to restart my server or restart the eth0 for the connection to get back up, simply switching to another port gets the connection back.

Today I was told by the support department of the colocation provider that they cannot find anything wrong with the switch or any of the ports. Other customers are using the switch without problems and they can connect to any of the ports that were blocked before (1, 6, 7).

My server is a Sun Fire X2100 with an extra network card that is based on the Intel e1000. It is configured for 100 Mbps full duplex:

```
alias eth0 e1000
options e1000 Speed=100 Duplex=2
```

Has anyone experienced this before or does anyone have any ideas how I can tackle this problem? Any help or input is appreciated as I'm at a dead end.

Many thanks,

Nes

Last edited by yarug on Tue Apr 10, 2007 12:25 pm; edited 1 time in total

----------

## HeissFuss

Are there any messages in dmesg about the interface when it goes down?  Also did your colocation provider mention if the switch logs said anything about the interface when connection was lost?

----------

## Aurisor

Sorry to say it man, but this sounds like a hardware issue.  I'd take the following steps to isolate the problem.

1) Eliminate software as a factor.  

a - This means first checking dmesg, various logs, and so forth.  Especially concentrate on whatever servers you have running around the time the system goes down. 

b - Try running the system off of a livecd.  You can probably get apache running under a livecd environment.  Depending on what this server is doing, though, you may or may not be able to afford running the thing like this for the amount of time it'll take to go down.

c - Reinstall everything?

2) Assume it's a hardware problem

a - Try replacing hardware in this order: NIC, Network Cable, Switch, physical network connection.  

b - If that fails you have a truly mysterious problem.  

Keep us posted!

----------

## yarug

Nothing in dmesg that indicates the loss of connection, but I'll post messages related to e1000 here:

```
e1000: 0000:05:00.0: e1000_check_copper_options: Forcing to 100 Mbps Full Duplex
e1000: 0000:05:00.0: e1000_check_copper_options: Speed, AutoNeg and MDI-X specifications are incompatible. Setting MDI-X to a compatible value.
e1000: 0000:05:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x4) 00:15:17:0b:3d:2c
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: 0000:05:00.1: e1000_probe: (PCI Express:2.5Gb/s:Width x4) 00:15:17:0b:3d:2d
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
```

Could it have something to do with the second line "Speed, AutoNeg and MDI-X specifications are incompatible"?
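For anyone who wants to check their own box for the same messages, here is a minimal Python sketch that pulls the `e1000_check_copper_options` lines out of saved dmesg text and groups them by PCI device. The function name `copper_option_messages` is made up for illustration, and the pattern only assumes the message wording quoted above; other driver versions may word these lines differently.

```python
import re

# Matches lines such as:
#   e1000: 0000:05:00.0: e1000_check_copper_options: Forcing to 100 Mbps Full Duplex
# The wording is taken from the dmesg output quoted above (an assumption for other versions).
COPPER_MSG = re.compile(
    r"e1000: (?P<dev>[0-9a-f:.]+): e1000_check_copper_options: (?P<msg>.+)"
)

def copper_option_messages(dmesg_text):
    """Return {pci_device: [message, ...]} for e1000 copper option messages."""
    by_dev = {}
    for line in dmesg_text.splitlines():
        m = COPPER_MSG.search(line)
        if m:
            by_dev.setdefault(m.group("dev"), []).append(m.group("msg").strip())
    return by_dev

sample_dmesg = """\
e1000: 0000:05:00.0: e1000_check_copper_options: Forcing to 100 Mbps Full Duplex
e1000: 0000:05:00.0: e1000_check_copper_options: Speed, AutoNeg and MDI-X specifications are incompatible. Setting MDI-X to a compatible value.
e1000: 0000:05:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x4) 00:15:17:0b:3d:2c
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
"""

for dev, msgs in copper_option_messages(sample_dmesg).items():
    for msg in msgs:
        print(dev, "->", msg)
```

To run it against a live system, the output of `dmesg` could be saved to a file and passed in as text.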

The thing that I find so odd is that switching to another port makes the connection work. To be honest, I have not tried just removing the network cable and then putting it back when the connection was lost. I always switched to a different port. It could be that re-plugging resets the state somehow and things start working again. Maybe the e1000_watchdog_task is responsible for that (I assume this watchdog is watching for unplug/plug events)?

This is in /var/log/everything/current and was logged at the time we switched from port 7 to port 8:

```
Mar 30 11:23:43 [kernel] e1000: eth0: e1000_watchdog_task: NIC Link is Down
Mar 30 11:23:48 [kernel] e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
```
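As a rough way to see how often and for how long the link actually drops, the watchdog lines can be paired up with a few lines of Python. This is only a sketch: the names `link_events` and `outages` are hypothetical, the pattern assumes the exact log format shown above, month names are assumed to be in English, and because the log lines carry no year, the `year` parameter (defaulting to 2007 here) is an assumption too.

```python
import re
from datetime import datetime

# Matches the e1000 watchdog lines as they appear in the log above.
LINK_EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \[kernel\] e1000: "
    r"(?P<iface>\w+): e1000_watchdog_task: NIC Link is (?P<state>Up|Down)"
)

def link_events(lines, year=2007):
    """Return a list of (timestamp, interface, 'Up' or 'Down') tuples."""
    events = []
    for line in lines:
        m = LINK_EVENT.match(line)
        if m:
            # The log has no year, so one is supplied by the caller (assumption).
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group("iface"), m.group("state")))
    return events

def outages(events):
    """Pair each Down with the next Up on the same interface.

    Returns a list of (interface, seconds_down) tuples.
    """
    down_at = {}
    result = []
    for ts, iface, state in events:
        if state == "Down":
            down_at[iface] = ts
        elif iface in down_at:
            result.append((iface, (ts - down_at.pop(iface)).total_seconds()))
    return result

sample_log = [
    "Mar 30 11:23:43 [kernel] e1000: eth0: e1000_watchdog_task: NIC Link is Down",
    "Mar 30 11:23:48 [kernel] e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex",
]

print(outages(link_events(sample_log)))  # [('eth0', 5.0)]
```

Note that this only reports physical link transitions. In the case described here the link stayed up while traffic stopped, so an empty result does not mean connectivity was fine.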

The next time it happens, I'll try just replugging the network cable and see if the connection is reset using the same port.

@HeissFuss: The provider did not say anything about their switch logs.

Thanks,

Nes

----------

## yarug

Just an update

After the connection failed again over the weekend, I had the cable unplugged and then re-plugged, but that did not bring the connection back up. Even a server restart does not help! The only thing that brings the connection back up is switching to another port... I'm now almost ready to just replace the network card and see if that solves it.

----------

## yarug

Another update. It seems that the issue is solved! After contacting support of our hardware supplier, they pointed me to the fact that this line in dmesg may indicate the cause of the problem:

```
e1000: 0000:05:00.0: e1000_check_copper_options: Speed, AutoNeg and MDI-X specifications are incompatible. Setting MDI-X to a compatible value.
```

After removing the following line from /etc/modules.d/e1000:

```
options e1000 Speed=100 Duplex=2
```

and restarting the server, everything started working. Thankfully the network card is still configured with 100Mbps/Full Duplex even without those settings in /etc/modules.d/e1000.

According to the hardware supplier, the switch was reconfiguring the switchport (swapping the transmit/receive pairs, which is what MDI-X refers to) for some reason, and the network card tried to follow suit but couldn't because of the explicit options for Speed and Duplex being set. At least something like that... I'm no expert and not sure if that explanation makes sense.
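To check whether a box still has forced link settings like the ones removed above, a small Python sketch along these lines could scan the module options file. The helper name `forced_link_options` is hypothetical, and it only looks at `options` lines in the format quoted in this thread; it does not know about settings applied by other means.

```python
# Option names that force link settings instead of leaving them to autonegotiation.
FORCED = ("Speed", "Duplex")

def forced_link_options(config_text, module="e1000"):
    """Return {option: value} for forced Speed/Duplex options of the given module."""
    found = {}
    for raw in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) < 3 or parts[0] != "options" or parts[1] != module:
            continue
        for item in parts[2:]:
            key, sep, value = item.partition("=")
            if sep and key in FORCED:
                found[key] = value
    return found

sample_config = """\
alias eth0 e1000
options e1000 Speed=100 Duplex=2
"""

print(forced_link_options(sample_config))  # {'Speed': '100', 'Duplex': '2'}
```

On the system described here, the text would come from reading `/etc/modules.d/e1000`. An empty result after the fix would suggest the driver is no longer being told to force speed and duplex.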

Hopefully this will help anyone having the same problems, and thanks to Aurisor and HeissFuss for their replies.

----------

