[ovs-dev] [PATCH] doc: Document proposed OVN Gateway HA design.

Ben Pfaff blp at nicira.com
Wed Jul 15 15:49:09 UTC 2015


On Thu, Jul 09, 2015 at 06:12:53PM -0700, Ethan Jackson wrote:
> High availability for gateways in network virtualization deployments
> is fairly difficult to get right.  There are a ton of options, most of
> which are too complicated or perform badly.  To help solve this
> problem, this patch proposes an HA design based on some of the lessons
> learned building similar systems.  The hope is that it can be used as
> a starting point for design discussions and an eventual
> implementation.
> 
> Signed-off-by: Ethan Jackson <ethan at nicira.com>

Thank you for writing this up!  

This had encoding "y", which made it challenging to apply ;-)

Can we put it in the ovn directory?

When a logical network contains a gateway, both sides of the gateway are
part of the logical network, and thus of "logical space".  So while I agree with
the diagram at the very beginning that shows a gateway between an
external network and an OVN virtual network, I think it's a bit
misleading to say:

    The OVN gateway is responsible for shuffling traffic between logical space
    (governed by ovn-northd), and the legacy physical network.

since both sides of the gateway are in logical space.  I think it would
be more accurate to use some variant of "virtual" here, maybe:

    The OVN gateway is responsible for shuffling traffic between VMs
    (governed by ovn-northd), and the legacy physical network.

In the second paragraph, I am not sure why HA is critical to
performance:

    An HA solution is both critical to the performance and manageability of the
    system, and extremely difficult to get right.

The second paragraph of "Basic Architecture" starts:

    Since the broader internet is managed outside of the OVN network
    domain, all traffic between logical space and the WAN must travel
    through this gateway.

Is that the reason?  The reasons that come to my mind are different
(or maybe just more specific?).  First, the gateway is the machine that
has a connection to the external network of interest; it might be in a
remote location such as a branch office away from the bulk of the
hypervisors in an OVN deployment.  Second, supposing that in fact the
gateway isn't in that kind of remote location, we want to have a central
point of entry into the virtual part of an OVN network because otherwise
we don't know which of N hypervisors should bring the packet into the
virtual network.

Under "Naive active-backup", do you mean OpenFlow echo requests here
(a "hello" message is only sent at the very beginning of an OpenFlow
session, to negotiate the OpenFlow version):

    ovn-northd monitors this
    gateway via OpenFlow hello messages (or some equivalent),
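
For reference, the two message types are easy to tell apart on the wire: a
hello (OFPT_HELLO, type 0) is sent exactly once, at session setup, while
liveness is probed with periodic echo requests (OFPT_ECHO_REQUEST, type 2)
that the peer must answer.  A minimal sketch of such a probe, assuming the
standard 8-byte OpenFlow header layout (the xid and payload here are
arbitrary):

    import struct

    OFP_VERSION_1_3 = 0x04
    OFPT_HELLO = 0         # sent once, at setup, to negotiate the version
    OFPT_ECHO_REQUEST = 2  # sent periodically, to check connection liveness

    def ofp_message(msg_type, xid, payload=b""):
        """Build an OpenFlow message: 8-byte header plus optional payload."""
        return struct.pack("!BBHI", OFP_VERSION_1_3, msg_type,
                           8 + len(payload), xid) + payload

    # A healthy peer echoes the payload back in an OFPT_ECHO_REPLY with
    # the same xid; a missed reply marks the connection as dead.
    probe = ofp_message(OFPT_ECHO_REQUEST, xid=42, payload=b"keepalive")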

Under "Controller Independent Active-backup", I am not sure that I buy
the argument here, because currently ovn-northd doesn't care about the
layout of the physical network.  The other argument rings true for me of
course:

    This can significantly increase downtime in the event of a failover
    as the (often already busy) ovn-northd controller has to recompute
    state for the new leader.

Here are some spelling fixes as a patch.  This also replaces the fancy
Unicode U+2014 em dashes by the more common (in OVS, anyway) ASCII "--".

Thanks again for writing this!

diff --git a/OVN-GW-HA.md b/OVN-GW-HA.md
index ea598b2..e0d5c9f 100644
--- a/OVN-GW-HA.md
+++ b/OVN-GW-HA.md
@@ -30,8 +30,8 @@ The OVN gateway is responsible for shuffling traffic between logical space
 implementation, the gateway is a single x86 server, or hardware VTEP.  For most
 deployments, a single system has enough forwarding capacity to service the
 entire virtualized network, however, it introduces a single point of failure.
-If this system dies, the entire OVN deployment becomes unavailable.  To mitgate
-this risk, an HA solution is critical — by spreading responsibilty across
+If this system dies, the entire OVN deployment becomes unavailable.  To mitigate
+this risk, an HA solution is critical -- by spreading responsibility across
 multiple systems, no single server failure can take down the network.
 
 An HA solution is both critical to the performance and manageability of the
@@ -51,7 +51,7 @@ OVN controlled tunnel traffic, to raw physical network traffic.
 
 Since the broader internet is managed outside of the OVN network domain, all
 traffic between logical space and the WAN must travel through this gateway.
-This makes it a critical single point of failure — if the gateway dies,
+This makes it a critical single point of failure -- if the gateway dies,
 communication with the WAN ceases for all systems in logical space.
 
 To mitigate this risk, multiple gateways should be run in a "High Availability
@@ -128,15 +128,15 @@ absolute simplest way to achive this is what we'll call "naive-active-backup".
 Naive Active Backup HA Implementation
 ```
 
-In a naive active-bakup, one of the Gateways is choosen (arbitrarily) as a
+In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
 leader.  All logical routers (A, B, C in the figure), are scheduled on this
 leader gateway and all traffic flows through it.  ovn-northd monitors this
 gateway via OpenFlow hello messages (or some equivalent), and if the gateway
 dies, it recreates the routers on one of the backups.
 
 This approach basically works in most cases and should likely be the starting
-point for OVN — it's strictly better than no HA solution and is a good
-foundation for more sophisticated solutions.  That said, it's not without it's
+point for OVN -- it's strictly better than no HA solution and is a good
+foundation for more sophisticated solutions.  That said, it's not without its
 limitations. Specifically, this approach doesn't coordinate with the physical
 network to minimize disruption during failures, and it tightly couples failover
 to ovn-northd (we'll discuss why this is bad in a bit), and wastes resources by
@@ -167,7 +167,7 @@ ethernet source address of the RARP is that of the logical router it
 corresponds to, and its destination is the broadcast address.  This causes the
 RARP to travel to every L2 switch in the broadcast domain, updating forwarding
 tables accordingly.  This strategy is recommended in all failover mechanisms
-discussed in this document — when a router newly boots on a new leader, it
+discussed in this document -- when a router newly boots on a new leader, it
 should RARP its MAC address.
 
 ### Controller Independent Active-backup
@@ -188,7 +188,7 @@ Controller Independent Active-Backup Implementation
 ```
 
 The fundamental problem with naive active-backup, is it tightly couples the
-failover solution to ovn-northd.  This can signifcantly increase downtime in
+failover solution to ovn-northd.  This can significantly increase downtime in
 the event of a failover as the (often already busy) ovn-northd controller has
 to recompute state for the new leader. Worse, if ovn-northd goes down, we
 can't perform gateway failover at all.  This violates the principle that
@@ -207,7 +207,7 @@ priority to each node it controls.  Nodes use the leadership priority to
 determine which gateway in the cluster is the active leader by using a simple
 metric: the leader is the gateway that is healthy, with the highest priority.
 If that gateway goes down, leadership falls to the next highest priority, and
-conversley, if a new gateway comes up with a higher priority, it takes over
+conversely, if a new gateway comes up with a higher priority, it takes over
 leadership.
 
 Thus, in this model, leadership of the HA cluster is determined simply by the
@@ -221,7 +221,7 @@ of member gateways, a key problem is how do we communicate this information to
 the relevant transport nodes.  Luckily, we can do this fairly cheaply using
 tunnel monitoring protocols like BFD.
 
-The basic idea is pretty straight forward.  Each transport node maintains a
+The basic idea is pretty straightforward.  Each transport node maintains a
 tunnel to every gateway in the HA cluster (not just the leader).  These
 tunnels are monitored using the BFD protocol to see which are alive.  Given
 this information, hypervisors can trivially compute the highest priority live
@@ -277,7 +277,7 @@ even though its tunnels are still healthy.
 Router Specific Active-Backup
 ```
 Controller independent active-backup is a great advance over naive
-active-backup, but it still has one glaring problem — it under-utilizes the
+active-backup, but it still has one glaring problem -- it under-utilizes the
 backup gateways.  In ideal scenario, all traffic would split evenly among the
 live set of gateways.  Getting all the way there is somewhat tricky, but as a
 step in the direction, one could use the "Router Specific Active-Backup"
@@ -286,7 +286,7 @@ router basis, with one twist.  It chooses a different active Gateway for each
 logical router.  Thus, in situations where there are several logical routers,
 all with somewhat balanced load, this algorithm performs better.
 
-Implementation of this strategy is quite straight forward if built on top of
+Implementation of this strategy is quite straightforward if built on top of
 basic controller independent active-backup.  On a per logical router basis, the
 algorithm is the same, leadership is determined by the liveness of the
 gateways.  The key difference here is that the gateways must have a different
@@ -295,7 +295,7 @@ be computed by ovn-northd just as they had been in the controller independent
 active-backup model.
 
 Once we have these per logical router priorities, they simply need be
-comminucated to the members of the gateway cluster and the hypervisors.  The
+communicated to the members of the gateway cluster and the hypervisors.  The
 hypervisors in particular, need simply have an active-backup bundle action (or
 group action) per logical router listing the gateways in priority order for
 *that router*, rather than having a single bundle action shared for all the
@@ -327,7 +327,7 @@ undesirable.
 The controller can optionally avoid preemption by cleverly tweaking the
 leadership priorities.  For each router, new gateways should be assigned
 priorities that put them second in line or later when they eventually come up.
-Furthermore, if a gateway goes down for a significant period of time, it's old
+Furthermore, if a gateway goes down for a significant period of time, its old
 leadership priorities should be revoked and new ones should be assigned as if
 it's a brand new gateway.  Note that this should only happen if a gateway has
 been down for a while (several minutes), otherwise a flapping gateway could
@@ -368,7 +368,7 @@ gateways end up implementing an overly conservative "when in doubt drop all
 traffic" policy, or they implement something like MLAG.
 
 MLAG has multiple gateways work together to pretend to be a single L2 switch
-with a large LACP bond.  In principle, it's the right right solution to the
+with a large LACP bond.  In principle, it's the right solution to the
 problem as it solves the broadcast storm problem, and has been deployed
 successfully in other contexts.  That said, it's difficult to get right and not
 recommended.
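
As an aside on the failover mechanics above: the gratuitous RARP the
document recommends is cheap to construct.  A sketch, assuming a plain
Ethernet II frame carrying the standard 28-byte RARP payload (the MAC
address below is made up):

    import struct

    BCAST = b"\xff" * 6
    ETH_RARP = 0x8035
    RARP_REQUEST = 3

    def gratuitous_rarp(router_mac):
        """Build the broadcast RARP frame a new leader sends for a router.

        Sourcing the frame from the logical router's MAC and flooding it
        to the broadcast address makes every L2 switch in the domain
        relearn which port the router now lives behind.
        """
        eth = BCAST + router_mac + struct.pack("!H", ETH_RARP)
        arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, RARP_REQUEST)
        arp += router_mac + b"\x00" * 4   # sender MAC, sender IP (unknown)
        arp += router_mac + b"\x00" * 4   # target MAC, target IP (unknown)
        return eth + arp

    frame = gratuitous_rarp(bytes.fromhex("0a0000000001"))
    assert len(frame) == 42               # 14-byte Ethernet + 28-byte RARP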

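Likewise, the leadership metric under "Controller Independent
Active-backup" is simple enough to sketch.  The data layout here is
hypothetical -- the point is only that each transport node can compute the
leader locally from static priorities plus per-tunnel BFD liveness,
without involving the controller:

    def choose_leader(gateways):
        """Return the highest-priority gateway whose BFD session is up.

        `gateways` maps a gateway name to (priority, bfd_up), stand-ins
        for state a transport node already tracks for each tunnel.
        Returns None when no gateway is live.
        """
        live = [(prio, name) for name, (prio, up) in gateways.items() if up]
        return max(live)[1] if live else None

    # gw-a (priority 300) is down, so leadership falls to gw-b.
    cluster = {"gw-a": (300, False), "gw-b": (200, True), "gw-c": (100, True)}
    assert choose_leader(cluster) == "gw-b"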

