[ovs-dev] [RFC PATCH ovn 4/4] ovn-controller: Don't flood fill local datapaths beyond DGP boundary.

Han Zhou hzhou@ovn.org
Fri Jul 30 07:21:41 UTC 2021


For a fully distributed virtual network dataplane, ovn-controller
flood-fills datapaths that are connected through patch ports. This
creates scale problems in ovn-controller when there are too many
connected datapaths.

In particular, when distributed gateway ports (DGPs) are used to
connect logical routers to logical switches, and there is no need for
distributed processing of those gateway ports (e.g. no dnat_and_snat
configured), the datapaths on the other side of the gateway ports are
not needed locally on the current chassis. This patch avoids pulling
those datapaths into the local datapaths in such scenarios.
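
As a quick sketch of how the boundary is detected (port names below are
hypothetical), the DGP's own Port_Binding (a patch port) carries an
options:chassis-redirect-port naming its chassisredirect port, and
ovn-northd now sets options:always-redirect=true on that chassisredirect
port when the router has no distributed NAT:

    Port_Binding lrp-lr-ls-node1 (patch):
        options:chassis-redirect-port=cr-lrp-lr-ls-node1
    Port_Binding cr-lrp-lr-ls-node1 (chassisredirect):
        options:always-redirect=true

An ovn-controller on a chassis outside the DGP's gateway (HA) chassis
group can then stop the flood fill at that patch port.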

There are two scenarios that can greatly benefit from this optimization.

1) When there are multiple tenants, each with its own logical topology
   but sharing the same external/provider networks, connected to their
   own logical routers with DGPs. Without this optimization, each
   ovn-controller would process the logical topologies of all tenants
   and program flows for all of them, even if only a small number of
   tenants have workloads on the node where the ovn-controller is
   running, because the shared external network connects all tenants.
   With this change, only the logical topologies relevant to the node
   are processed and programmed on the node.

2) In some deployments, such as ovn-kubernetes, logical switches are
   bound to chassis instead of being distributed, because each chassis
   is assigned dedicated subnets. With the current implementation,
   ovn-controller on each node processes all logical switches and all
   ports on them, without knowing that they are not distributed at all.
   At large scale with N nodes (N = hundreds or even more), roughly N
   times the necessary processing power is wasted on the logical
   connectivity related flows. With this change, those deployments can
   use DGPs to connect the node-level logical switches to the
   distributed router(s), with the gateway chassis (or an HA chassis
   group without real HA) of each DGP set to the chassis where the
   logical switch is bound. This inherently tells OVN the mapping
   between logical switches and chassis, and ovn-controller can then
   avoid processing the topologies of other nodes' logical switches,
   greatly reducing the compute cost of each ovn-controller. (See the
   configuration sketch after this list.)
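
   As a configuration sketch for 2) (all names here are hypothetical),
   each node-level logical switch is attached to the distributed router
   through a DGP whose gateway chassis is set to the node the switch is
   bound to:

       # Attach ls-node1 to the distributed router lr through a DGP
       # pinned to chassis node1.
       ovn-nbctl lrp-add lr lrp-lr-ls-node1 00:00:00:00:aa:01 10.0.1.1/24
       ovn-nbctl lrp-set-gateway-chassis lrp-lr-ls-node1 node1 1
       ovn-nbctl lsp-add ls-node1 lsp-ls-node1-lr \
           -- lsp-set-type lsp-ls-node1-lr router \
           -- lsp-set-options lsp-ls-node1-lr router-port=lrp-lr-ls-node1 \
           -- lsp-set-addresses lsp-ls-node1-lr router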

For 2), test results for an ovn-kubernetes-like deployment show
significant ovn-controller improvements in both CPU (>90% reduction)
and memory.

Topology:

- 1000 nodes, 1 LS with 10 LSPs per node, connected to a distributed
  router.

- 2 large port-groups PG1 and PG2, each with 2000 LSPs

- 10 stateful ACLs: 5 from PG1 to PG2, 5 from PG2 to PG1

- 1 GR per node, connected to the distributed router through a join
  switch. Each GR also connects to an external logical switch per node.
  (This part keeps the test environment close to a real ovn-kubernetes
   setup but shouldn't make much difference for the comparison.)

==== Before the change ====
OVS flows per node: 297408
ovn-controller memory: 772696 KB
ovn-controller recompute: 13s
ovn-controller restart (recompute + reinstall OVS flows): 63s

==== After the change (also use DGP to connect node level LSes) ====
OVS flows per node: 81139 (~70% reduced)
ovn-controller memory: 163464 KB (~80% reduced)
ovn-controller recompute: 0.86s (>90% reduced)
ovn-controller restart (recompute + reinstall OVS flows): 5s (>90% reduced)
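
(For reference, a per-node flow count like the above can be collected
with something like "ovs-ofctl dump-flows br-int | wc -l" on each node;
this is one way to reproduce the measurement, not necessarily how these
numbers were taken.)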

Signed-off-by: Han Zhou <hzhou@ovn.org>
---
TODO:
- DDlog
- Investigate why the optimization cannot be used when redirect-type is
  bridged
---
 controller/binding.c   | 100 ++++++++++++++++++++++++++++++++++++-----
 northd/ovn-northd.c    |  24 ++++++++++
 ovn-architecture.7.xml |  26 +++++++++++
 ovn-nb.xml             |   6 +++
 tests/ovn.at           |  67 +++++++++++++++++++++++++++
 5 files changed, 212 insertions(+), 11 deletions(-)

diff --git a/controller/binding.c b/controller/binding.c
index 376dc039d..55951701e 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -86,11 +86,57 @@ static void tracked_binding_datapath_lport_add(
 static void update_lport_tracking(const struct sbrec_port_binding *pb,
                                   struct hmap *tracked_dp_bindings);
 
+/* Checks whether 'pb' is a patch port whose peer datapath should be added
+ * to the local datapaths. */
+static bool
+need_add_patch_peer_to_local(
+    struct ovsdb_idl_index *sbrec_port_binding_by_name,
+    const struct sbrec_port_binding *pb,
+    const struct sbrec_chassis *chassis)
+{
+    /* If it is not a patch port, no peer to add. */
+    if (strcmp(pb->type, "patch")) {
+        return false;
+    }
+
+    /* If it is a regular patch port, it is fully distributed; add the peer. */
+    const char *crp = smap_get(&pb->options, "chassis-redirect-port");
+    if (!crp) {
+        return true;
+    }
+
+    /* A DGP, find its chassis-redirect-port pb. */
+    const struct sbrec_port_binding *pb_crp =
+        lport_lookup_by_name(sbrec_port_binding_by_name, crp);
+    if (!pb_crp) {
+        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
+        VLOG_WARN_RL(&rl, "Chassis-redirect-port %s for DGP %s is not found.",
+                     crp, pb->logical_port);
+        return false;
+    }
+
+    /* Check if it is configured as "always-redirect". If not, then we will
+     * need to add the peer to local for distributed processing. */
+    if (!smap_get_bool(&pb_crp->options, "always-redirect", false)) {
+        return true;
+    }
+
+    /* Check if the chassis-redirect-port is local. If so, we still need to
+     * add the peer to local: the peer could be the localnet network, which
+     * has no other chance to be added to the local datapaths if there are
+     * no VIF bindings. */
+    if (pb_crp->ha_chassis_group) {
+        return ha_chassis_group_contains(pb_crp->ha_chassis_group, chassis);
+    }
+    return false;
+}
+
 static void
 add_local_datapath__(struct ovsdb_idl_index *sbrec_datapath_binding_by_key,
                      struct ovsdb_idl_index *sbrec_port_binding_by_datapath,
                      struct ovsdb_idl_index *sbrec_port_binding_by_name,
                      const struct sbrec_datapath_binding *datapath,
+                     const struct sbrec_chassis *chassis,
                      bool has_local_l3gateway, int depth,
                      struct hmap *local_datapaths,
                      struct hmap *tracked_datapaths)
@@ -149,7 +195,8 @@ add_local_datapath__(struct ovsdb_idl_index *sbrec_datapath_binding_by_key,
                                             peer_name);
 
                 if (peer && peer->datapath) {
-                    if (!strcmp(pb->type, "patch")) {
+                    if (need_add_patch_peer_to_local(
+                            sbrec_port_binding_by_name, pb, chassis)) {
                         /* Add the datapath to local datapath only for patch
                          * ports. For l3gateway ports, since gateway router
                          * resides on one chassis, we don't need to add.
@@ -158,7 +205,7 @@ add_local_datapath__(struct ovsdb_idl_index *sbrec_datapath_binding_by_key,
                         add_local_datapath__(sbrec_datapath_binding_by_key,
                                              sbrec_port_binding_by_datapath,
                                              sbrec_port_binding_by_name,
-                                             peer->datapath, false,
+                                             peer->datapath, chassis, false,
                                              depth + 1, local_datapaths,
                                              tracked_datapaths);
                     }
@@ -183,14 +230,15 @@ add_local_datapath(struct ovsdb_idl_index *sbrec_datapath_binding_by_key,
                    struct ovsdb_idl_index *sbrec_port_binding_by_datapath,
                    struct ovsdb_idl_index *sbrec_port_binding_by_name,
                    const struct sbrec_datapath_binding *datapath,
+                   const struct sbrec_chassis *chassis,
                    bool has_local_l3gateway, struct hmap *local_datapaths,
                    struct hmap *tracked_datapaths)
 {
     add_local_datapath__(sbrec_datapath_binding_by_key,
                          sbrec_port_binding_by_datapath,
                          sbrec_port_binding_by_name,
-                         datapath, has_local_l3gateway, 0, local_datapaths,
-                         tracked_datapaths);
+                         datapath, chassis, has_local_l3gateway, 0,
+                         local_datapaths, tracked_datapaths);
 }
 
 static void
@@ -1267,7 +1315,7 @@ consider_vif_lport_(const struct sbrec_port_binding *pb,
             add_local_datapath(b_ctx_in->sbrec_datapath_binding_by_key,
                                b_ctx_in->sbrec_port_binding_by_datapath,
                                b_ctx_in->sbrec_port_binding_by_name,
-                               pb->datapath, false,
+                               pb->datapath, b_ctx_in->chassis_rec, false,
                                b_ctx_out->local_datapaths,
                                b_ctx_out->tracked_dp_bindings);
             update_related_lport(pb, b_ctx_out);
@@ -1482,7 +1530,8 @@ consider_nonvif_lport_(const struct sbrec_port_binding *pb,
         add_local_datapath(b_ctx_in->sbrec_datapath_binding_by_key,
                            b_ctx_in->sbrec_port_binding_by_datapath,
                            b_ctx_in->sbrec_port_binding_by_name,
-                           pb->datapath, has_local_l3gateway,
+                           pb->datapath, b_ctx_in->chassis_rec,
+                           has_local_l3gateway,
                            b_ctx_out->local_datapaths,
                            b_ctx_out->tracked_dp_bindings);
 
@@ -1565,7 +1614,7 @@ consider_ha_lport(const struct sbrec_port_binding *pb,
         add_local_datapath(b_ctx_in->sbrec_datapath_binding_by_key,
                            b_ctx_in->sbrec_port_binding_by_datapath,
                            b_ctx_in->sbrec_port_binding_by_name,
-                           pb->datapath, false,
+                           pb->datapath, b_ctx_in->chassis_rec, false,
                            b_ctx_out->local_datapaths,
                            b_ctx_out->tracked_dp_bindings);
         update_related_lport(pb, b_ctx_out);
@@ -1907,7 +1956,7 @@ add_local_datapath_peer_port(const struct sbrec_port_binding *pb,
         add_local_datapath__(b_ctx_in->sbrec_datapath_binding_by_key,
                              b_ctx_in->sbrec_port_binding_by_datapath,
                              b_ctx_in->sbrec_port_binding_by_name,
-                             peer->datapath, false,
+                             peer->datapath, b_ctx_in->chassis_rec, false,
                              1, b_ctx_out->local_datapaths,
                              b_ctx_out->tracked_dp_bindings);
         return;
@@ -2450,12 +2499,14 @@ consider_patch_port_for_local_datapaths(const struct sbrec_port_binding *pb,
                 get_local_datapath(b_ctx_out->local_datapaths,
                                    peer->datapath->tunnel_key);
         }
-        if (peer_ld) {
+        if (peer_ld && need_add_patch_peer_to_local(
+                b_ctx_in->sbrec_port_binding_by_name, peer,
+                b_ctx_in->chassis_rec)) {
             add_local_datapath(
                 b_ctx_in->sbrec_datapath_binding_by_key,
                 b_ctx_in->sbrec_port_binding_by_datapath,
                 b_ctx_in->sbrec_port_binding_by_name,
-                pb->datapath, false,
+                pb->datapath, b_ctx_in->chassis_rec, false,
                 b_ctx_out->local_datapaths,
                 b_ctx_out->tracked_dp_bindings);
         }
@@ -2463,7 +2514,11 @@ consider_patch_port_for_local_datapaths(const struct sbrec_port_binding *pb,
         /* Add the peer datapath to the local datapaths if it's
          * not present yet.
          */
-        add_local_datapath_peer_port(pb, b_ctx_in, b_ctx_out, ld);
+        if (need_add_patch_peer_to_local(
+                b_ctx_in->sbrec_port_binding_by_name, pb,
+                b_ctx_in->chassis_rec)) {
+            add_local_datapath_peer_port(pb, b_ctx_in, b_ctx_out, ld);
+        }
     }
 }
 
@@ -2647,6 +2702,29 @@ delete_done:
 
         case LP_CHASSISREDIRECT:
             handled = consider_cr_lport(pb, b_ctx_in, b_ctx_out);
+            if (!handled) {
+                break;
+            }
+            const char *distributed_port = smap_get(&pb->options,
+                                                    "distributed-port");
+            if (!distributed_port) {
+                static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
+                VLOG_WARN_RL(&rl, "No distributed-port option set for "
+                             "chassisredirect port %s", pb->logical_port);
+                break;
+            }
+            const struct sbrec_port_binding *distributed_pb
+                = lport_lookup_by_name(b_ctx_in->sbrec_port_binding_by_name,
+                                       distributed_port);
+            if (!distributed_pb) {
+                static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
+                VLOG_WARN_RL(&rl, "No port binding record for distributed "
+                             "port %s referred by chassisredirect port %s",
+                             distributed_port, pb->logical_port);
+                break;
+            }
+            consider_patch_port_for_local_datapaths(distributed_pb, b_ctx_in,
+                                                    b_ctx_out);
             break;
 
         case LP_EXTERNAL:
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 2bc187d39..7f6f9d34d 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -664,6 +664,7 @@ struct ovn_datapath {
 
     /* NAT entries configured on the router. */
     struct ovn_nat *nat_entries;
+    bool has_distributed_nat;
 
     /* Set of nat external ips on the router. */
     struct sset external_ips;
@@ -839,6 +840,11 @@ init_nat_entries(struct ovn_datapath *od)
                             nat_entry);
             }
         }
+
+        if (!strcmp(nat->type, "dnat_and_snat")
+            && nat->logical_port && nat->external_mac) {
+            od->has_distributed_nat = true;
+        }
     }
 }
 
@@ -3079,12 +3085,30 @@ ovn_port_update_sbrec(struct northd_context *ctx,
                 sbrec_port_binding_set_gateway_chassis(op->sb, NULL, 0);
             }
             smap_add(&new, "distributed-port", op->nbrp->name);
+
+            bool always_redirect = !op->od->has_distributed_nat;
             if (redirect_type) {
                 smap_add(&new, "redirect-type", redirect_type);
+                /* XXX Why can't we enable always-redirect when redirect-type
+                 * is bridged? */
+                if (!strcmp(redirect_type, "bridged")) {
+                    always_redirect = false;
+                }
+            }
+
+            if (always_redirect) {
+                smap_add(&new, "always-redirect", "true");
             }
         } else {
             if (op->peer) {
                 smap_add(&new, "peer", op->peer->key);
+                if (op->nbrp->ha_chassis_group ||
+                    op->nbrp->n_gateway_chassis) {
+                    char *redirect_name =
+                        ovn_chassis_redirect_name(op->nbrp->name);
+                    smap_add(&new, "chassis-redirect-port", redirect_name);
+                    free(redirect_name);
+                }
             }
             if (chassis_name) {
                 smap_add(&new, "l3gateway-chassis", chassis_name);
diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
index 3598b5073..a65ddd9d5 100644
--- a/ovn-architecture.7.xml
+++ b/ovn-architecture.7.xml
@@ -832,6 +832,32 @@
     Set the <code>redirect-type</code> option on a distributed gateway port.
   </p>
 
+  <h4>Using Distributed Gateway Ports For Scalability</h4>
+
+  <p>
+    Although the primary goal of distributed gateway ports is to provide
+    connectivity to external networks, there is a special use case for
+    scalability.
+  </p>
+
+  <p>
+    In some deployments, such as the ones using <code>ovn-kubernetes</code>,
+    logical switches are bound to individual chassis and are connected by a
+    distributed logical router. In such deployments, the chassis-level logical
+    switches are centralized on their chassis instead of distributed, which
+    means the <code>ovn-controller</code> on each chassis doesn't need to
+    process flows and ports of logical switches on other chassis. However,
+    without any specific hint, <code>ovn-controller</code> would still process
+    all the logical switches as if they were fully distributed. In this case,
+    distributed gateway ports can be very useful. The chassis-level logical
+    switches can be connected to the distributed router using distributed
+    gateway ports, with the gateway chassis (or an HA chassis group with only
+    a single chassis in it) of each port set to the chassis that the logical
+    switch is bound to. <code>ovn-controller</code> would then skip processing
+    the logical switches on all the other chassis, greatly improving
+    scalability, especially when there are a large number of chassis.
+  </p>
+
   <h2>Life Cycle of a VIF</h2>
 
   <p>
diff --git a/ovn-nb.xml b/ovn-nb.xml
index ec51b5608..4b27c6027 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -2457,6 +2457,12 @@
         connection to another OVN deployment.
       </p>
 
+      <p>
+        As mentioned in the OVN architecture guide, distributed gateway ports
+        can also be used for scalability in deployments where logical switches
+        are dedicated to individual chassis rather than distributed.
+      </p>
+
       <p>
         The preferred way to configure a gateway is <ref
         column="ha_chassis_group"/>, but <ref
diff --git a/tests/ovn.at b/tests/ovn.at
index 13f2a4990..a5fa0acaf 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -27658,3 +27658,70 @@ OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv5/vif-north3-tx.pcap], [vif-north3.expecte
 
 AT_CLEANUP
 ])
+
+# This test ensures that an ovn-controller doesn't program OVS flows for
+# datapaths that are beyond a distributed gateway port.
+#
+#    lsp1 at hv1 - ls1 - lrp_lr_ls1 (DGP at hv1)
+#                                         \
+#                                         lr
+#                                         /
+#    lsp2 at hv2 - ls2 - lrp_lr_ls2 (DGP at hv2)
+#
+# In this topology, hv1 shouldn't have flows of ls2, and hv2 shouldn't have
+# flows of ls1.
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([Dont flood fill local datapaths beyond distributed gateway port])
+ovn_start
+
+net_add n1
+check ovn-nbctl lr-add lr
+for i in 1 2; do
+    sim_add hv$i
+    as hv$i
+    check ovs-vsctl add-br br-phys
+    ovn_attach n1 br-phys 192.168.0.$i
+
+    check ovs-vsctl add-port br-int vif$i -- set Interface vif$i external-ids:iface-id=lsp$i
+
+    check ovn-nbctl ls-add ls$i
+    check ovn-nbctl lsp-add ls$i lsp$i \
+        -- lsp-set-addresses lsp$i "f0:00:00:00:00:0$i 10.0.$i.2"
+
+    lrp=lrp_lr_ls$i
+    lsp_lr=lsp_ls${i}_lr
+    check ovn-nbctl lrp-add lr $lrp f0:00:00:00:aa:0$i 10.0.$i.1/24 \
+        -- lsp-add ls$i $lsp_lr \
+        -- lsp-set-type $lsp_lr router \
+        -- lsp-set-options $lsp_lr router-port=$lrp \
+        -- lsp-set-addresses $lsp_lr router \
+        -- lrp-set-gateway-chassis $lrp hv$i 1
+done
+
+check ovn-nbctl --wait=hv sync
+
+# hv1 should see flows for lsp1 but not lsp2
+AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=24 | grep 10.0.1.2], [0], [ignore])
+AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=24 | grep 10.0.2.2], [1])
+# hv2 should see flows for lsp2 but not lsp1
+AT_CHECK([as hv2 ovs-ofctl dump-flows br-int table=24 | grep 10.0.2.2], [0], [ignore])
+AT_CHECK([as hv2 ovs-ofctl dump-flows br-int table=24 | grep 10.0.1.2], [1])
+
+# Change lrp_lr_ls1 to a regular lrp; hv2 should now see flows for lsp1
+check ovn-nbctl --wait=hv lrp-del-gateway-chassis lrp_lr_ls1 hv1
+AT_CHECK([as hv2 ovs-ofctl dump-flows br-int table=24 | grep 10.0.1.2], [0], [ignore])
+
+# Change it back, and trigger a recompute to make sure the extra flows are
+# removed from hv2. (A recompute is needed because currently I-P adds local
+# datapaths but doesn't remove them.)
+check ovn-nbctl --wait=hv lrp-set-gateway-chassis lrp_lr_ls1 hv1 1
+as hv2 check ovn-appctl -t ovn-controller recompute
+AT_CHECK([as hv2 ovs-ofctl dump-flows br-int table=24 | grep 10.0.1.2], [1])
+
+# Enable dnat_and_snat on lr, and now hv2 should see flows for lsp1.
+AT_CHECK([ovn-nbctl --wait=hv lr-nat-add lr dnat_and_snat 192.168.0.1 10.0.1.3 lsp1 f0:00:00:00:00:03])
+AT_CHECK([as hv2 ovs-ofctl dump-flows br-int table=24 | grep 10.0.1.2], [0], [ignore])
+
+OVN_CLEANUP([hv1],[hv2])
+AT_CLEANUP
+])
-- 
2.30.2


