# Xen DomU Networking Stops Working Under Load

## koan

Hello,

I have a 2.6.21 xen DomU running pv under a 2.6.21 Dom0.

In general everything works well, no problems.  However, if I load the network card, it (almost) stops sending or receiving packets.

Everything is normally pretty low access - I am running a couple of database servers and snmp server and a few other things in the domu, but the traffic frequency is extremely low - one or two users max.

If I start something like bittorrent, it will kill the networking almost immediately.  I was running a Samba server on the domu for a while, but whenever I would save a file of significant size, it would kill the network.

If I xm console in, the network card appears fine, and there are a small number of bytes ticking up on transmit and receive.  Nothing in messages or dmesg.
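One way to tell a stalled interface from a merely quiet one is to sample the byte counters twice and compare them. This is only a minimal sketch, assuming the usual `/proc/net/dev` layout (receive bytes in the first column after the interface name, transmit bytes in the ninth); `iface_bytes` is a made-up helper name.

```sh
#!/bin/sh
# Print "rx_bytes tx_bytes" for one interface, reading /proc/net/dev-style
# text on stdin.
# $1 = interface name (e.g. eth0)
iface_bytes() {
    # Turn "eth0:1234" into "eth0 1234" so the name is always its own field,
    # then pick receive bytes (field 2) and transmit bytes (field 10).
    sed 's/:/ /' | awk -v ifc="$1" '$1 == ifc { print $2, $10 }'
}

# Example (run inside the domU): sample twice, five seconds apart.
# iface_bytes eth0 < /proc/net/dev; sleep 5; iface_bytes eth0 < /proc/net/dev
```

If both numbers barely change while a large transfer is supposedly running, the interface is up but traffic is not getting through.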

The dom0 network is fine, and I am using a single physical nic to bridge to.  I have three other vms (all hvm) running on the same Xen box and they don't suffer any network issues.

I am not sure how to proceed with diagnosing this...

Cheers,

Paul

----------

## koan

The domU can ping itself incidentally, and restarting the domU interface doesn't help.

Sometimes it seems to right itself, sometimes it needs a reboot.  tcpdump on the dom0 doesn't show anything.

Also, I use this domU for asterisk, and when the network is working, there are no problems with calls.  So the issue seems to be more about taxing the virtual nic than packet frequency.

----------

## koan

Well, ok.

I have changed the kernel in the domU from 2.6.21 to 2.6.25 - same problem.  In the dom0 I have changed the network card without change.

The domU doesn't recognise anything is wrong in the messages or dmesg - it just cannot connect to anything.  If I shut it down, it hangs but isn't connectible via the xm console.

If I xm destroy, it destroys.  If I then try to xm create the domain again, I get

```
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.
```

Nothing appears in the xen-hotplug.log.

xend.log gives:

```
...
[2008-08-31 19:14:11 5531] DEBUG (DevController:595) hotplugStatusCallback /local/domain/0/backend/vif/11/0/hotplug-status.
[2008-08-31 19:15:51 5531] DEBUG (XendDomainInfo:1897) XendDomainInfo.destroy: domid=11
...
```

So it tries for a while and then destroys the VM.

All the other VMs are working fine at this point, but if I shutdown and attempt to restart any, they will fail to get the vif too.

Adding interfaces to the bridge seems to work fine, so I guess the problem must be in the creation of the vif interface.  Or at least, the vif breaks, and then xen is no longer able to create a new one.

I am not sure at what stage this takes place - prior to vif-script by the look of it...
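The xend.log excerpt shows xend waiting on a xenstore key, so reading that key directly while `xm create` is hanging may show whether the hotplug script ever reported back. A rough sketch, assuming the key layout from the log line above and that `xenstore-read` from the Xen tools is installed; `vif_hotplug_key` is a hypothetical helper name.

```sh
#!/bin/sh
# Build the xenstore key that xend polls for a vif backend's hotplug status.
# $1 = domain id, $2 = vif device index (defaults to 0)
vif_hotplug_key() {
    echo "/local/domain/0/backend/vif/$1/${2:-0}/hotplug-status"
}

# Example (run in dom0 as root while the domain is being created):
# xenstore-read "$(vif_hotplug_key 11)"
```

If the key never appears, the vif hotplug script probably did not run to completion, which would fit the empty xen-hotplug.log.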

----------

## bbgermany

How do you create the xenbr interface? I'm doing it this way:

/etc/conf.d/net

```
config_eth0=( "null" )
config_eth1=( "null" )
bridge_xenbr0="eth0 eth1"
config_xenbr0=( "192.168.23.252 netmask 255.255.255.0" )
routes_xenbr0=( "default via 192.168.23.1" )
dns_servers=( "192.168.23.20" )
dns_domain="xxx.xxx"
dns_search="xxx.xxx xxx.yyy"
brctl_xenbr0=(
        "setfd 0"
        "sethello 0"
        "stp off"
)
```

/etc/xen/xend-config.sxp

```
(network-script network-dummy)
```

This solved the script issues while creating the bridge.

bb

----------

## koan

I am running the network-bridge script for the bridge create - with a slight mod as I have multiple addresses on my physical nic, and these were not getting set up correctly on the bridge.

Your script sets up the bridge normally, but then also does this:

```
        "setfd 0"
        "sethello 0"
        "stp off"
```

My bridge has forward delay set to zero, and stp off.  So the only difference is that you have the hello time set to zero, whereas mine is 2 seconds.

With stp off, I am not sure that hello time does anything - but googling it I have found a number of occasions where setting hello time to something other than zero fixed some Xen networking issues (high numbers of interrupts).

Can you remember why you have it set to zero?
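For comparing bridge settings between machines, the values can be read back from sysfs instead of from the config file. A minimal sketch, assuming the bridge exposes `forward_delay`, `hello_time` and `stp_state` under its sysfs `bridge` directory and that the timers are stored in hundredths of a second; `bridge_timers` is a made-up name.

```sh
#!/bin/sh
# Print a bridge's forward delay, hello time and STP state.
# $1 = path to the bridge's sysfs directory,
#      e.g. /sys/class/net/xenbr0/bridge
# Assumption: timer files hold hundredths of a second (200 means 2 s).
bridge_timers() {
    dir=$1
    for name in forward_delay hello_time; do
        raw=$(cat "$dir/$name")
        # Integer arithmetic only: whole seconds, then hundredths.
        printf '%s=%d.%02ds\n' "$name" $((raw / 100)) $((raw % 100))
    done
    printf 'stp_state=%s\n' "$(cat "$dir/stp_state")"
}

# Example: bridge_timers /sys/class/net/xenbr0/bridge
```

`brctl showstp xenbr0` should report the same values in a more verbose form.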

----------

## bbgermany

IIRC, I used the Gentoo wiki entry to configure my Xen, and this setting was in there. I use multiple addresses on the bridge as well; iproute2 did the trick for me.

bb

----------

## koan

The forward delay and STP settings are the defaults in current Gentoo Xen installs.  The hello interval sets how often BPDUs are sent, so it is unlikely to have any bearing on my issue.

I'll give it a test at some point, because right now I have exhausted the leads available to me - at least, the ones I can think of.  Well I do have another, and that is to build Xen with a stock kernel from another distribution, to see if it helps.  But that represents a whole new set of difficulties, as I couldn't find a stock kernel that did everything I wanted, which is why I came back to gentoo...

----------

## maslo64

Hello Koan,

I have exactly the same issue with Xen. When starting the domU I noticed this message:

```
Bringing up eth0
 *     dhcp
 *       Running dhcpcd ...err, eth0: Failed to lookup hostname via DNS: Name or service not known
                                               [ ok ]
 *       eth0 received address 192.168.1.122/24
```

After logging in to the system everything is fine, but when I do something like "emerge -eauDN world", some packages are transferred to the domU and after a while it looks like the bridge is down. Then, after another while, the network interface is working again.

Below is my bonding setup for eth0 and eth1 in dom0:

```
config_eth0=( "null" )
config_eth1=( "null" )
RC_NEED_bond0=("net.eth0 net.eth1")
slaves_bond0="eth0 eth1"
config_bond0=( "null" )
RC_NEED_xenbr0="net.bond0"
bridge_xenbr0="bond0"
config_xenbr0=("dhcp")
brctl_xenbr0=(
        "setfd 0"
        "sethello 0"
        "stp off"
)
```

I am now trying to test whether it's caused by bonding or IPv6.

Any help will be appreciated.

----------

## koan

Hi,

I am not using bonded nics, or IPv6.

Xen 3.3 came into unstable a couple of days ago, so I upgraded, but the problem still remains.

Someone on the Xensource mailing list suggested lowering the NIC rate so that the domU never transfers at a speed that breaks networking, but last time I tried to test, the break happened at 3.6Mbs.  That is pretty slow!

So the changes I have made are:

1) Change domU kernel (2.6.21, 2.6.24, 2.6.25)

2) Change domU userland (gentoo, ubuntu)

3) Change dom0 physical nic + driver (Realtek 8169 -> 8168)

4) Change Xen version (3.2.1 -> 3.3)

The only thing I haven't changed is the dom0 kernel.  I am using a stock 2.6.21 gentoo kernel, so it would be great if anyone watching this that has pv domUs working under a 2.6.21 kernel would post their .config so I can compare it to mine.

Paul

----------

## maslo64

Hmm, so I switched back to the eth0 -> xenbr0 configuration and disabled IPv6, and everything is OK now.

I am going to try different bonding modes. And if the problem persists I think I will have to try NAT  :Sad:

----------

## bbgermany

What kind of bond do you use? Maybe your switch doesn't support the mode and so packets get lost in transfer.

bb

----------

## maslo64

I was using mode=1, but when I tested plugging and unplugging cables, connections were not restored.

Then I tried mode=0, which worked fine from dom0, but the issue with the domU appeared.

----------

## bbgermany

Did you try mode 5 (balance-tlb) or 6 (balance-alb) as well? Mode 0 is round-robin and 1 is active-backup. If your switch supports Link Aggregation Control Protocol (LACP), you should consider mode 4 (802.3ad).
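For reference, the numeric `mode=` values of the Linux bonding driver map to names as in this small lookup. It is just a sketch of the mapping; `bond_mode_name` is a made-up helper, not part of any tool.

```sh
#!/bin/sh
# Translate a bonding driver mode number into its name.
# $1 = mode number as passed to "bonding mode=N"
bond_mode_name() {
    case "$1" in
        0) echo balance-rr ;;      # round-robin
        1) echo active-backup ;;
        2) echo balance-xor ;;
        3) echo broadcast ;;
        4) echo 802.3ad ;;         # LACP, needs switch support
        5) echo balance-tlb ;;
        6) echo balance-alb ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Example: bond_mode_name 5
```

The mode the driver actually ended up using should be visible in `/proc/net/bonding/bond0`, which is worth checking after each change.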

bb

----------

## maslo64

I am going to test this today in the evening, because I don't have console access to the server and have to try again and again and again   :Cool:  .

----------

## maslo64

Still no luck with bonding. I tried 6 bonding modes, but still no progress. I am going to downgrade the kernels for dom0 and domU from 2.6.21 -> 2.6.18-r12 and check if it helps. I also set "sethello 2", since from what I found the zero hello time is a known problem with Xen, so you were right about this.

----------

## koan

Ok,

It looks like the bonding issue isn't related to the original issue reported on this thread, but that's OK, we can share  :Wink: 

Anyway, in an effort to eliminate the nic as the source of the problem I used another with different drivers, but the problem remained.  They were both realtek however, and it was pointed out that this would not necessarily discount a driver issue.

I am testing with a 10/100 tulip card and it is looking promising.  No lockups yet and I have shifted a couple of gigs across the link.

----------

## maslo64

I am sorry if I messed up your thread with my own problem  :Smile: 

Anyway, it looks like I have reached a solution. My configuration is an HP DL 380G5 and the network cards are:

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)

05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)

After I recompiled the kernel with the driver for the network card as a module, it reported a "Call trace" in the dmesg log.

I tried to compile and install the drivers from the Broadcom website, but I couldn't work out how to enable ZLIB_INFLATE in the kernel (all my attempts were unsuccessful), so I can't load that module.

Everything looks fine when I compile the driver as a kernel module through 'make menuconfig' and set mode=5.

So these are my configs:

```
master ~ # cat /etc/conf.d/net
config_eth0=( "null" )
config_eth1=( "null" )
RC_NEED_bond0=("net.eth0 net.eth1")
slaves_bond0="eth0 eth1"
config_bond0=( "null" )
RC_NEED_xenbr0="net.bond0"
bridge_xenbr0="bond0"
config_xenbr0=("dhcp")
brctl_xenbr0=(
        "setfd 0"
        "sethello 2"
        "stp off"
)
```

```
master ~ # cat /etc/modules.autoload.d/kernel-2.6
bnx2
bonding miimon=100 mode=5
loop max_loop=256

master ~ # uname -a
Linux 2.6.18-xen-r12 #9 SMP Sat Sep 6 14:41:34 CEST 2008 x86_64 Intel(R) Xeon(R) CPU E5420 @ 2.50GHz GenuineIntel GNU/Linux

master ~ # cat /xen/reference/gentoo.xen.cfg
kernel = "/boot/vmlinuz-2.6.21-xenU"
memory = 1024
name = "reference"
vif = [ 'mac=00:16:3E:6A:49:54, bridge=xenbr0'  ]
dhcp = "dhcp"
disk = ['file:/xen/reference/gentoo64.img,sda1,w', \
        'file:/xen/reference/gentoo64_lvm.img,sdc1,w', \
        'file:/xen/reference/swap_disk.img,sds1,w', ]
root = "/dev/sda1 ro"
extra = "gentoo=nodevfsi"
master ~ #
```

app-emulation/xen-3.3.0

app-emulation/xen-tools-3.3.0

To me it looks like something in /usr/src/linux/drivers/net/bonding/* isn't quite right when using bnx2.

----------

## Hibbelharry

You might try disabling checksum offloading to hardware by using ethtool. This solved some network dying problems with Xen for me.
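As a rough illustration, the helper below only prints the `ethtool -K` commands for turning checksum offload off, so they can be reviewed before being run. The interface name and which directions to disable are assumptions to adapt; `offload_off_cmds` is a made-up name.

```sh
#!/bin/sh
# Print (do not run) the ethtool commands that turn checksum offload off.
# $1 = interface name; remaining args = directions (rx, tx), default both.
offload_off_cmds() {
    ifc=$1
    shift
    [ $# -gt 0 ] || set -- rx tx
    for direction in "$@"; do
        echo "ethtool -K $ifc $direction off"
    done
}

# Example: review the output, then pipe it to a root shell to apply it.
# offload_off_cmds eth0
# offload_off_cmds eth0 tx | sh
```

The setting does not survive a reboot, so it would need to go into a local start script once it is known to help.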

Greetz

Hibbelharry

----------

## maslo64

Hello Hibbelharry,

My problem is solved, as far as I can tell now. I can't add [solved] to the topic as this isn't my thread, and I messed up koan's thread  :Smile: 

By the way, koan, did switching to different drivers help?

----------

## bbgermany

 *Hibbelharry wrote:*   

> You might try disabling checksum offloading to hardware by using ethtool. This solved some network dying problems with Xen for me.
> 
> Greetz
> 
> Hibbelharry

 

I have checksum offload disabled for tx but not rx. Did you disable both?

```
zeus ~ # ethtool -k eth1
Offload parameters for eth1:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
zeus ~ #
```
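To compare machines quickly, the offloads that are still enabled can be filtered out of that output. A small sketch, assuming the `name: on|off` line format shown above (newer ethtool versions may print it differently); `offloads_on` is a made-up name.

```sh
#!/bin/sh
# List the offload features reported as "on" in `ethtool -k` output (stdin).
offloads_on() {
    # Split each line at ": " and keep the names whose value is exactly "on".
    awk -F': ' '$2 == "on" { print $1 }'
}

# Example: ethtool -k eth1 | offloads_on
```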

bb

----------

## BlackEye

Is there a solution for this problem?

I have exactly the same problem as the original poster!

Instead of Samba, I discovered the problem by using NFS. When copying large files (several MBs) over NFS, the virtual network of Xen dies unrecoverably. I need to restart the whole dom0 to be able to restart the domU and use the network again.

By reducing the rsize and wsize of NFS I observed that this problem may not appear again. However, this could be because the transmission is slower with these changes, so maybe the bug just isn't triggered. The strange thing is that I could copy large files using netcat between domU and dom0 without any problems.

I'm afraid that this is a security issue. I could lower the rsize and wsize, but what happens if someone is able to send a large packet through the pipe to crash my connection?

Is there any real solution for this problem? New kernels? New bugfixes? Or any other hints?

I use xen 3.3 with 2.6.21-xen kernel sources (dom0 and domU).

Any help would be really appreciated!

Greetings,

Martin

----------

## koan

I am currently running with a non-Realtek based 10/100 card, and haven't experienced any issues with the network failing even at max.

However, I do want this to be a gigabit connection, so I have a dlink gig card waiting to test, and I'll report back if I get good results (or not).

My other plan is to use the SUSE Xen patchset against the Gentoo 2.6.25 kernel to see if that helps.  I have the kernel built, but it remains to be seen if it even boots - other people have working Gentoo installs with this mix of kernel, but I don't know whether this will address the networking problem.

What network card are you using?

----------

## BlackEye

 *koan wrote:*   

> What network card are you using?

 

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)

The Realtek Ethernet controller seems to have some issues with Xen. I read something about this on the net. Unfortunately I can't change the NIC because this is a root server which I don't have any direct access to.

About the kernel source I use, maybe this link is interesting for you too -> https://forums.gentoo.org/viewtopic-t-709908.html

There you can get a new vanilla kernel with Xen patches. This is the kernel I currently use on my dom0.

If I use NFS with this kernel without setting rsize and wsize I get horrible transfer rates. If I manually set rsize and wsize to 8192 I get a vastly better result (you can see my post in the other thread about this).
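For anyone wanting to try the same thing, the sizes are passed as NFS mount options. A minimal sketch; the server name and paths in the example are placeholders, and `nfs_size_opts` is a made-up helper.

```sh
#!/bin/sh
# Build the NFS mount option string for explicit transfer sizes.
# $1 = rsize in bytes, $2 = wsize in bytes (defaults to rsize)
nfs_size_opts() {
    echo "rsize=$1,wsize=${2:-$1}"
}

# Example (placeholder server and paths, run as root):
# mount -t nfs -o "$(nfs_size_opts 8192)" server:/export /mnt/export
```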

However, I don't know if this is the real solution for this problem or not.

About the NIC: although I found some reports of issues with Realtek in conjunction with Xen, I really don't know why this should have anything to do with it, because the whole transfer between dom0 and the domUs goes over the virtual devices.

----------

