[ovs-dev] OVN testsuite failure on Travis
Ben Pfaff
blp at ovn.org
Thu Jun 23 17:48:12 UTC 2016
On Sun, May 08, 2016 at 10:54:52AM -0700, Ben Pfaff wrote:
> On Sun, May 08, 2016 at 12:43:28PM -0500, Ryan Moats wrote:
> > Ben Pfaff <blp at ovn.org> wrote on 05/08/2016 11:31:18 AM:
> > > The most common races I see in the OVN tests would be addressed by the
> > > idea I proposed here:
> > > http://openvswitch.org/pipermail/dev/2016-April/070041.html
> > > (please see the remainder of the thread for refinements)
> > >
> > > I think that Ryan Moats (CCed) is planning to work on that.
> >
> > I've asked my colleague Amitabha Biswas (also CCed) to work on this
> > particular issue so that I can focus on SFC...
>
> That's good to know, thanks!

To make it clear that this is not just a problem for tests, I had a
conversation at a conference on Monday where an operator identified two
real-world situations where it's important to make sure that the network
as seen by the hypervisors has caught up with the central database
before allowing operations to proceed:

1. Consider a Zookeeper cluster. Before adding a new member to the
   cluster, it is necessary that the new member be able to see all of
   the other members, and that all of the other members be able to see
   the new member. Otherwise, adding the member to the cluster may
   succeed but with partial connectivity (and even connectivity from A
   to B but not back from B to A in some cases). Thus, all of the
   hypervisors that host a VM in the cluster (including the new VM) must
   be up-to-date with the central database before adding the new VM to
   the cluster.

2. Consider adding a new load balancer to a cluster of load balancers.
   The load balancer has a collection of newly added servers behind it
   as well. Before the load balancer can reasonably be added, all the
   HVs hosting the load balancer or any of its backends must be caught
   up with the new state of the network, otherwise requests that arrive
   at the new load balancer might be dropped or delayed due to partial
   connectivity.

I think that the sequence number protocol I proposed handles these
cases. It even allows for waiting for a subset of HVs (the ones
involved in the scenarios above) to catch up rather than for the whole
cloud.
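
For illustration only, here is a rough Python sketch of the kind of
check this implies. It is not the protocol from the linked thread, whose
details are not repeated here. It assumes that each change to the
central database gets an increasing sequence number and that each
hypervisor reports the highest number it has fully applied. The names
CentralDB, commit_change, report, and lagging are all made up for this
example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CentralDB:
    """Hypothetical stand-in for the central database (assumption)."""

    seqno: int = 0
    # Highest sequence number each hypervisor reports having applied.
    hv_seqno: Dict[str, int] = field(default_factory=dict)

    def commit_change(self) -> int:
        """Record one configuration change and return its sequence number."""
        self.seqno += 1
        return self.seqno

    def report(self, hv: str, seqno: int) -> None:
        """Note that hypervisor `hv` has applied everything up to `seqno`."""
        self.hv_seqno[hv] = max(seqno, self.hv_seqno.get(hv, 0))

    def lagging(self, target: int, hvs: List[str]) -> List[str]:
        """Return the members of `hvs` that have not yet reached `target`."""
        return [hv for hv in hvs if self.hv_seqno.get(hv, 0) < target]


db = CentralDB()
target = db.commit_change()      # e.g. the new VM's port is added

# Only the hypervisors hosting cluster members matter here; "hv3" hosts
# nothing relevant, so it is never consulted.
involved = ["hv1", "hv2"]

db.report("hv1", target)
still_waiting = db.lagging(target, involved)   # "hv2" has not caught up

db.report("hv2", target)
ready = not db.lagging(target, involved)       # safe to add the member
```

Passing only the involved hypervisors to lagging() is what lets the
caller wait for a subset rather than for the whole cloud.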