[ovs-dev] OVN testsuite failure on Travis

Ben Pfaff blp at ovn.org
Thu Jun 23 17:48:12 UTC 2016


On Sun, May 08, 2016 at 10:54:52AM -0700, Ben Pfaff wrote:
> On Sun, May 08, 2016 at 12:43:28PM -0500, Ryan Moats wrote:
> > Ben Pfaff <blp at ovn.org> wrote on 05/08/2016 11:31:18 AM:
> > > The most common races I see in the OVN tests would be addressed by the
> > > idea I proposed here:
> > >         http://openvswitch.org/pipermail/dev/2016-April/070041.html
> > > (please see the remainder of the thread for refinements)
> > >
> > > I think that Ryan Moats (CCed) is planning to work on that.
> > 
> > I've asked my colleague Amitabha Biswas (also CCed) to work on this
> > particular issue so that I can focus on SFC...
> 
> That's good to know, thanks!

To make it clear that this is not just a problem for tests: in a
conversation at a conference on Monday, an operator identified two
real-world situations in which it is important to make sure that the
network as seen by the hypervisors has caught up with the central
database before allowing operations to proceed:

1. Consider a Zookeeper cluster.  Before adding a new member to the
   cluster, it is necessary that the new member be able to see all of
   the other members, and that all of the other members be able to see
   the new member.  Otherwise, adding the member to the cluster may
   succeed but with partial connectivity (and even connectivity from A
   to B but not back from B to A in some cases).  Thus, all of the
   hypervisors that host a VM in the cluster (including the new VM) must
   be up-to-date with the central database before adding the new VM to
   the cluster.

2. Consider adding a new load balancer to a cluster of load balancers.
   The load balancer has a collection of newly added servers behind it
   as well.  Before the load balancer can reasonably be added, all the
   HVs hosting the load balancer or any of its backends must be caught
   up with the new state of the network; otherwise, requests that arrive
   at the new load balancer might be dropped or delayed due to partial
   connectivity.

I think that the sequence number protocol I proposed handles these
cases.  It even allows for waiting for a subset of HVs (the ones
involved in the scenarios above) to catch up rather than for the whole
cloud.
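
To make the idea a bit more concrete, here is a minimal sketch of how
such a sequence-number handshake could look.  This is only an
illustration, not the proposal itself (see the thread linked above for
that): the helper names (bump_seqno, wait_for_hvs) and the assumption
that the CMS bumps a counter in the central database while each
hypervisor reports the highest value it has applied are made up for the
sake of the example.

    # Sketch of a sequence-number wait.  Helper names and bookkeeping
    # here are hypothetical, not part of the actual proposal.
    import time

    def bump_seqno(central_db):
        """CMS side: record that the desired network state changed."""
        central_db["nb_seqno"] += 1
        return central_db["nb_seqno"]

    def wait_for_hvs(central_db, hv_names, target_seqno, timeout=30.0):
        """Block until each named hypervisor reports that it has applied
        configuration at least as new as target_seqno.  Waiting only on
        hv_names (the HVs hosting the affected VMs) avoids stalling on
        the whole cloud."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            reported = central_db["hv_seqno"]  # {hv name: last applied seqno}
            if all(reported.get(hv, 0) >= target_seqno for hv in hv_names):
                return
            time.sleep(0.1)
        raise TimeoutError("HVs did not catch up to seqno %d" % target_seqno)

    # Scenario 1 above: adding a member to a Zookeeper cluster.
    # central_db = {"nb_seqno": 7, "hv_seqno": {"hv1": 7, "hv2": 7, "hv3": 6}}
    # seqno = bump_seqno(central_db)   # after creating the new VM's port
    # wait_for_hvs(central_db, ["hv1", "hv2", "hv3"], seqno)
    # ...only now join the new member to the cluster...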
