[ovs-dev] race condition in "ovn: l3ha ensure no master bouncing when ovn-controller is restarted" test?

Miguel Angel Ajo Pelayo majopela at redhat.com
Fri Oct 27 08:02:03 UTC 2017


Hi Ben,

   Thank you for reporting this, I will investigate/think what can make it
fail on
high concurrency to fix the issue.

   The test simulates what happens when you restart ovn-controller on an
specific chassis, the chassis entry goes away for a small interval of time,
the issue for us was that ovn-northd detected the SBDB chassis gone,
and triggered the re-creation of Gateway_Chassis entires in the SBDB,
breaking havoc the existing gateways on the same A/P gateway_chassis
cluster.

   One enhancement to all this could be that we made ovn-northd be a little
bit more smart (russellb suggested that at some point) , and instead of
recreating
the whole set of SBDB gateway_chassis associated to a Port_Binding, we
only removed/created the specific ones which had changes. That'd be
a nice iterative enhancement to this.

   Also, I have a pending change to accept hostnames as chassis names too,
instead of the randomly generated names, which are harder to trace.


   Best regards,
 I'll keep you updated.


On Fri, Oct 27, 2017 at 8:26 AM, Ben Pfaff <blp at ovn.org> wrote:

> Hi, I've been running the OVN unit tests in a loop with -j10 (to run
> faster and cause load) today, and one failure that I see is the
> following:
>
>     2364. ovn.at:8454: testing ovn -- ensure one gw controller restart in
> HA doesn't bounce the master ...
>     creating ovn-sb database
>     creating ovn-nb database
>     starting ovn-northd
>     starting backup ovn-northd
>     adding simulator 'main'
>     adding simulator 'gw1'
>     adding simulator 'gw2'
>     adding simulator 'hv1'
>     OK
>     OK
>     OK
>     OK
>     OK
>     OK
>     1b7f7753-aea0-4484-a8db-2971b9fb444f
>     552e4549-6ba5-4509-99a2-7968cb960a15
>     a7ea2de8-558d-413a-90df-187f58c4f239
>     ../../tests/ovn.at:8533: grep "Releasing lport" gw1/ovn-controller.log
>     --- /dev/null   2017-07-26 15:46:07.674034656 -0700
>     +++ /home/blp/nicira/ovs/_build/tests/testsuite.dir/at-groups/2364/stdout
>      2017-10-26 17:11:07.261985117 -0700
>     @@ -0,0 +1 @@
>     +2017-10-27T00:11:07.242Z|00024|binding|INFO|Releasing lport
> cr-outside from this chassis.
>     ../../tests/ovn.at:8533: exit code was 0, expected 1
>     2364. ovn.at:8454: 2364. ovn -- ensure one gw controller restart in
> HA doesn't bounce the master (ovn.at:8454): FAILED (ovn.at:8533)
>
> I don't really understand this test, which was introduced by the
> following commit.  Do you guys have any hints as to why this would race
> under high load?
>
> --8<--------------------------cut here-------------------------->8--
>
> From: "majopela at redhat.com" <majopela at redhat.com>
> Date: Thu, 13 Jul 2017 17:02:41 +0000
> Subject: [PATCH] ovn: l3ha ensure no master bouncing when ovn-controller is
>  restarted
>
> When ovn-controller is restarted, ovn-controller removes the old
> Chassis entry from the SBDB and a new one is inserted.
>
> This cleared the Gateway_Chassis chassis column in the SBDB and then
> ovn-northd removed the empty-column Gateway_Chassis entry.
> Such event made the other (non-restarted and master gateway chassis)
> believe that he was a single (non-HA) gateway, turning off BFD and
> releasing the port for a tiny time frame causing unnecesary downtime.
>
> Signed-off-by: Miguel Angel Ajo <majopela at redhat.com>
> Signed-off-by: Russell Bryant <russell at ovn.org>
> ---
>  ovn/northd/ovn-northd.c | 34 +++++++++++---------
>  tests/ovn.at            | 83 ++++++++++++++++++++++++++++++
> +++++++++++++++++++
>  2 files changed, 103 insertions(+), 14 deletions(-)
>
> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> index cfe765f42c9d..a3f138d448ce 100644
> --- a/ovn/northd/ovn-northd.c
> +++ b/ovn/northd/ovn-northd.c
> @@ -1684,11 +1684,22 @@ gateway_chassis_equal(const struct
> nbrec_gateway_chassis *nb_gwc,
>                        const struct sbrec_chassis *nb_gwc_c,
>                        const struct sbrec_gateway_chassis *sb_gwc)
>  {
> -    return !strcmp(nb_gwc->name, sb_gwc->name)
> -           && !strcmp(nb_gwc_c->name, sb_gwc->chassis->name)
> -           && nb_gwc->priority == sb_gwc->priority
> -           && smap_equal(&nb_gwc->options, &sb_gwc->options)
> -           && smap_equal(&nb_gwc->external_ids, &sb_gwc->external_ids);
> +    bool equal = !strcmp(nb_gwc->name, sb_gwc->name)
> +                 && nb_gwc->priority == sb_gwc->priority
> +                 && smap_equal(&nb_gwc->options, &sb_gwc->options)
> +                 && smap_equal(&nb_gwc->external_ids,
> &sb_gwc->external_ids);
> +
> +    if (!equal) {
> +        return false;
> +    }
> +
> +    /* If everything else matched and we were unable to find the SBDB
> +     * Chassis entry at this time, assume a match and return true.
> +     * This happens when an ovn-controller is restarting and the Chassis
> +     * entry is gone away momentarily */
> +    return !nb_gwc_c
> +           || (sb_gwc->chassis && !strcmp(nb_gwc_c->name,
> +                                          sb_gwc->chassis->name));
>  }
>
>  static bool
> @@ -1723,11 +1734,10 @@ sbpb_gw_chassis_needs_update(
>              chassis_lookup_by_name(chassis_index,
>                                     lrp->gateway_chassis[n]->
> chassis_name);
>
> -        if (chassis) {
> -            lrp_gwc_c[lrp_n_gateway_chassis] = chassis;
> -            lrp_gwc[lrp_n_gateway_chassis] = lrp->gateway_chassis[n];
> -            lrp_n_gateway_chassis++;
> -        } else {
> +        lrp_gwc_c[lrp_n_gateway_chassis] = chassis;
> +        lrp_gwc[lrp_n_gateway_chassis] = lrp->gateway_chassis[n];
> +        lrp_n_gateway_chassis++;
> +        if (!chassis) {
>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>              VLOG_WARN_RL(
>                  &rl, "Chassis name %s referenced in NBDB via
> Gateway_Chassis "
> @@ -1805,10 +1815,6 @@ copy_gw_chassis_from_nbrp_to_sbpb(
>          const struct sbrec_chassis *chassis =
>              chassis_lookup_by_name(chassis_index, lrp_gwc->chassis_name);
>
> -        if (!chassis) {
> -            continue;
> -        }
> -
>          gw_chassis = xrealloc(gw_chassis, (n_gwc + 1) * sizeof
> *gw_chassis);
>
>          struct sbrec_gateway_chassis *pb_gwc =
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 126d4a67141b..03a9ff465143 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -8080,3 +8080,86 @@ AT_CHECK([grep $garp hv2_br_phys_tx | sort], [0],
> [])
>  OVN_CLEANUP([hv1],[hv2],[hv3])
>
>  AT_CLEANUP
> +
> +AT_SETUP([ovn -- ensure one gw controller restart in HA doesn't bounce
> the master])
> +AT_SKIP_IF([test $HAVE_PYTHON = no])
> +ovn_start
> +
> +net_add n1
> +
> +# create two gateways with external network connectivity
> +for i in 1 2; do
> +    sim_add gw$i
> +    as gw$i
> +    ovs-vsctl add-br br-phys
> +    ovn_attach n1 br-phys 192.168.0.$i
> +    ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> +done
> +
> +ovn-nbctl ls-add inside
> +ovn-nbctl ls-add outside
> +
> +# create one hypervisors with a vif port the internal network
> +sim_add hv1
> +as hv1
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.11
> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> +    set interface hv1-vif1 external-ids:iface-id=inside1 \
> +    options:tx_pcap=hv1/vif1-tx.pcap \
> +    options:rxq_pcap=hv1/vif1-rx.pcap \
> +    ofport-request=1
> +
> +ovn-nbctl lsp-add inside inside1 \
> +        -- lsp-set-addresses inside1 "f0:00:00:01:22:01 192.168.1.101"
> +
> +
> +ovn_populate_arp
> +
> +ovn-nbctl create Logical_Router name=R1
> +
> +# Connect inside to R1
> +ovn-nbctl lrp-add R1 inside 00:00:01:01:02:03 192.168.1.1/24
> +ovn-nbctl lsp-add inside rp-inside -- set Logical_Switch_Port rp-inside \
> +    type=router options:router-port=inside \
> +    -- lsp-set-addresses rp-inside router
> +
> +# Connect outside to R1 as distributed router gateway port on gw1+gw2
> +ovn-nbctl lrp-add R1 outside 00:00:02:01:02:04 192.168.0.101/24
> +
> +ovn-nbctl --id=@gc0 create Gateway_Chassis \
> +                    name=outside_gw1 chassis_name=gw1 priority=20 -- \
> +          --id=@gc1 create Gateway_Chassis \
> +                    name=outside_gw2 chassis_name=gw2 priority=10 -- \
> +          set Logical_Router_Port outside 'gateway_chassis=[@gc0, at gc1]'
> +
> +ovn-nbctl lsp-add outside rp-outside -- set Logical_Switch_Port
> rp-outside \
> +    type=router options:router-port=outside \
> +    -- lsp-set-addresses rp-outside router
> +
> +# Create localnet port in outside
> +ovn-nbctl lsp-add outside ln-outside
> +ovn-nbctl lsp-set-addresses ln-outside unknown
> +ovn-nbctl lsp-set-type ln-outside localnet
> +ovn-nbctl lsp-set-options ln-outside network_name=phys
> +
> +# Allow some time for ovn-northd and ovn-controller to catch up.
> +ovn-nbctl --wait=hv sync
> +
> +# currently when ovn-controller is restarted, the old entry is deleted
> +# and a new one is created, which leaves the Gateway_Chassis with
> +# an empty chassis for a while. NOTE: restarting ovn-controller in tests
> +# doesn't have the same effect because "name" is conserved, and the
> +# Chassis entry is not replaced.
> +
> +gw2_chassis=$(ovn-sbctl --bare --columns=_uuid find Chassis name=gw2)
> +ovn-sbctl destroy Chassis $gw2_chassis
> +
> +# Ensure ovn-controller has processed latest sbdb update
> +# ovn-nbctl --wait=hv sync
> +
> +AT_CHECK([grep "Releasing lport" gw1/ovn-controller.log], [1], [])
> +
> +OVN_CLEANUP([gw1],[gw2],[hv1])
> +
> +AT_CLEANUP
> --
> 2.10.2
>
>


More information about the dev mailing list