[ovs-discuss] Intra-Bridge Performance issue

Thu Aug 30 02:43:35 UTC 2012

On Aug 29, 2012, at 9:27 PM, Jesse Gross <jesse at nicira.com> wrote:

> On Wed, Aug 29, 2012 at 6:19 PM, Michael A. Collins
> <mike.a.collins at ark-net.org> wrote:
>> I have several xensource servers running lots of PV-On-HVM Windows
>> DomUs and I have a pretty weird problem.  Here are my details:
>> Kernel: 3.5.0-rc2
>> OpenvSwitch module: Built-in from upstream (i.e. did not install the
>> kernel module when building OpenvSwitch)
>> OpenvSwitch userland tools: version 1.4.0
>>
>> I have a single Bridge with two fake-bridges.
>> I have configured a LACP bond with 4 physical nics that connects to a
>> PortChannel on a 6509.
>> I set up the native vlan on the 6509 to be 102.
>> I have configured the bond with vlan_mode=native-untagged and tag=102.
>> All my vms are added to the fake-bridge associated with vlan 102.
>> I have four servers configured this way, all connected to the same
>> 6509: ServerA, ServerB, ServerC, and ServerD.
>>
>> I have no problem sending and receiving traffic to any VM on any of the
>> four servers; in other words, all my VMs get IPs from a DHCP server and
>> can ping each other.
>> I have decent performance moving files over SMB2 between VMs that are
>> on different servers, e.g. VM1 on ServerA copies a file to VM2 on
>> ServerB.
>>
>> I have horrible performance when moving files over SMB2 between VMs
>> that are on the same server, e.g. VM1 on ServerA copies a file to VM2
>> on ServerA.  I am not an expert on how OpenvSwitch works, and I can't
>> discount that my own stupidity may be behind this, but I am at a loss
>> for what to do to troubleshoot this.
>>
>> I have captured packets for a reproducible session of network traffic:
>> logging into a VM with the same account, which has a roaming profile
>> configured.  This pulls down about 50MB of data, and when logging into
>> a VM that is on a different server than the file server that hosts the
>> profile it takes about 11 seconds.  When logging into a VM that is on
>> the same server as the file server it takes well over 20 minutes and
>> never really succeeds.
>>
>> What differs between the two packet captures is the number of
>> retransmits, Duplicate ACKs and Out-of-Order packets, which is insane
>> when going from VM to VM on the same server.
>>
>> It seems to me after looking at the traffic in the capture that
>> everything is trucking along until we get to a large file, say 5MB,
>> then it just falls apart.  On the VM that is on a different server, I
>> can get the file moved across in only 458 packets, with only 36 TCP
>> ACKed lost segment packets flagged.
>> On the VM that is on the same server, I can't get the file moved across
>> even after 6800+ packets, with 5500 Dup ACKs, Out-of-Order or
>> retransmission packets flagged.
>>
>> There has to be something going on that could explain this, but I am
>> at a loss!  Any help would be greatly appreciated!!
>
> The fact that it only happens when you start to see large packets
> likely means that it is related to TCP segmentation offload.  I know
> that some versions of the Windows PV drivers on Xen had bugs in this
> area so I would look to see if there is a newer version that you can
> upgrade to.  I don't know which versions are affected though.
>
Wouldn't TCP seg offload affect all the traffic, not just the traffic
that stays on the bridge?  I will go grab the newest version of the PV
drivers and let you know.
Mike
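
One way to test the offload theory before swapping drivers is to look at what `ethtool -k` reports for the backend interface of an affected VM. Traffic between two VMs on the same host is never segmented by a physical NIC, so it may be the only path where oversized TSO frames reach the guest driver intact. Below is a minimal Python sketch that parses that output, assuming the usual `feature: on|off` line layout; the sample text and the interface name `vif1.0` are illustrative and were not captured from the servers in this thread.

```python
def parse_offload_features(ethtool_output):
    """Parse `ethtool -k`-style text into a {feature_name: enabled} dict."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        state = value.split()
        # Header lines such as "Features for vif1.0:" have no value; skip them.
        if not state or state[0] not in ("on", "off"):
            continue
        features[name.strip()] = (state[0] == "on")
    return features


# Illustrative sample only (assumed layout, hypothetical interface name).
SAMPLE = """\
Features for vif1.0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: off [fixed]
"""

features = parse_offload_features(SAMPLE)
# Keep only the segmentation-related settings relevant to this thread.
segmentation = {k: v for k, v in features.items() if "segmentation" in k}
print(segmentation)
```

If segmentation offload shows as enabled, temporarily turning it off on that interface with `ethtool -K <iface> tso off` and repeating the same-server copy would show whether the retransmits go away.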