[ovs-dev] [PATCH v2] ovn-controller: Fix chassisredirect port flapping when ovs-vswitchd crashes

nusiddiq at redhat.com nusiddiq at redhat.com
Mon Feb 4 10:39:22 UTC 2019


From: Numan Siddique <nusiddiq at redhat.com>

On a chassis when ovs-vswitchd crashes for some reason, the BFD status doesn't
get updated in the ovs db. ovn-controller will be reading the old BFD status
even though ovs-vswitchd is crashed. This results in the chassiredirect port
claim flapping between the master chassis and the chasiss with the next higher
priority if ovs-vswitchd crashes in the master chassis.

All the other chassis notices the BFD status down with the master chassis
and hence the next higher priority claims the port. But according to
the master chassis, the BFD status is fine and it again claims back the
chassisredirect port. And this results in flapping. The issue gets resolved
when ovs-vswitchd comes back but until then it leads to lot of SB DB
transactions and high CPU usage in ovn-controller's.

This patch fixes the issue by checking the OF connection status of the
ovn-controller with ovs-vswitchd and calculates the active bfd tunnels
only if it's connected.

Signed-off-by: Numan Siddique <nusiddiq at redhat.com>
---

v1 -> v2
-----
 * Deleted unnecessary prints in the test case which were added during
   debugging.

 ovn/controller/ofctrl.c         |  6 ++++++
 ovn/controller/ofctrl.h         |  3 +++
 ovn/controller/ovn-controller.c | 13 ++++++++++++-
 tests/ovn.at                    | 26 +++++++++++++++++++++++++-
 4 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/ovn/controller/ofctrl.c b/ovn/controller/ofctrl.c
index 218612787..95b95b607 100644
--- a/ovn/controller/ofctrl.c
+++ b/ovn/controller/ofctrl.c
@@ -1265,3 +1265,9 @@ ofctrl_inject_pkt(const struct ovsrec_bridge *br_int, const char *flow_s,
 
     return NULL;
 }
+
+bool
+ofctrl_is_connected(void)
+{
+    return rconn_is_connected(swconn);
+}
diff --git a/ovn/controller/ofctrl.h b/ovn/controller/ofctrl.h
index ae0cfa513..f7521801b 100644
--- a/ovn/controller/ofctrl.h
+++ b/ovn/controller/ofctrl.h
@@ -60,4 +60,7 @@ void ofctrl_check_and_add_flow(struct hmap *desired_flows, uint8_t table_id,
                                const struct match *,
                                const struct ofpbuf *ofpacts,
                                bool log_duplicate_flow);
+
+bool ofctrl_is_connected(void);
+
 #endif /* ovn/ofctrl.h */
diff --git a/ovn/controller/ovn-controller.c b/ovn/controller/ovn-controller.c
index 4e9a5865f..2098f280c 100644
--- a/ovn/controller/ovn-controller.c
+++ b/ovn/controller/ovn-controller.c
@@ -689,7 +689,18 @@ main(int argc, char *argv[])
                     ovsrec_bridge_table_get(ovs_idl_loop.idl), br_int,
                     sbrec_chassis_table_get(ovnsb_idl_loop.idl), chassis_id,
                     sbrec_sb_global_first(ovnsb_idl_loop.idl));
-                bfd_calculate_active_tunnels(br_int, &active_tunnels);
+
+                if (ofctrl_is_connected()) {
+                    /* Calculate the active tunnels only if have an an active
+                     * OpenFlow connection to br-int.
+                     * If we don't have a connection to br-int, it could mean
+                     * ovs-vswitchd is down for some reason and the BFD status
+                     * in the Interface rows could be stale. So its better to
+                     * consider 'active_tunnels' set to be empty if it's not
+                     * connected. */
+                    bfd_calculate_active_tunnels(br_int, &active_tunnels);
+                }
+
                 binding_run(ovnsb_idl_txn, ovs_idl_txn, sbrec_chassis_by_name,
                             sbrec_datapath_binding_by_key,
                             sbrec_port_binding_by_datapath,
diff --git a/tests/ovn.at b/tests/ovn.at
index f54f24c74..fd558cb98 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -8055,7 +8055,31 @@ ovn-nbctl --timeout=3 --wait=hv \
 
 test_ip_packet gw2 gw1
 
-OVN_CLEANUP([hv1],[gw1],[gw2],[ext1])
+# Get the claim count of both gw1 and gw2.
+gw1_claim_ct=`grep "cr-alice: Claiming" gw1/ovn-controller.log | wc -l`
+gw2_claim_ct=`grep "cr-alice: Claiming" gw2/ovn-controller.log | wc -l`
+
+# Kill ovs-vswitchd in gw2. gw1 should claim the gateway port.
+kill `cat gw2/ovs-vswitchd.pid`
+
+# gw1 should claim the cr-alice and the claim count of gw1 should be
+# incremented by 1.
+gw1_claim_ct=$((gw1_claim_ct+1))
+
+OVS_WAIT_UNTIL([test $gw1_claim_ct = `cat gw1/ovn-controller.log \
+| grep -c "cr-alice: Claiming"`])
+
+AT_CHECK([test $gw2_claim_ct = `cat gw2/ovn-controller.log | \
+grep -c "cr-alice: Claiming"`])
+
+test_ip_packet gw1 gw2
+
+as gw2
+ovs-appctl -t ovn-controller exit
+ovs-appctl -t ovs-vswitchd exit
+
+OVN_CLEANUP([hv1],[gw1],[ext1])
+
 AT_CLEANUP
 
 AT_SETUP([ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port])
-- 
2.20.1



More information about the dev mailing list