[ovs-discuss] active_backup failover issue

Numan Siddique nusiddiq at redhat.com
Tue Apr 27 22:27:45 UTC 2021


On Tue, Apr 27, 2021 at 6:00 PM Francois <rigault.francois at gmail.com> wrote:
>
> On Tue, 27 Apr 2021 at 23:08, Numan Siddique <numans at ovn.org> wrote:
> >
> > On Tue, Apr 27, 2021 at 4:58 PM Francois <rigault.francois at gmail.com> wrote:
> > >
> > > On Tue, 27 Apr 2021 at 22:20, Numan Siddique <numans at ovn.org> wrote:
> > > >
> > > > On Tue, Apr 27, 2021 at 9:11 AM Francois <rigault.francois at gmail.com> wrote:
> > > > >
> > >
> > > > The ovn-controller running on chassis-1 will not detect the BFD failover.
> > >
> > > Thanks for your answer! Ok for chassis-1.
> > >
> > > What I don't understand is why chassis-2, which is aware that chassis-1
> > > is down, is not able to act as a gateway for its own ports.
> >
> > I see what's going on.  So ovn-controller on chassis-2 detects the failover
> > and claims the cr-<gateway_port>. But ovn-controller on chassis-1 which has
> > higher priority claims it back because according to it, BFD is fine.
> >
> > You can probably monitor the ovn-controller logs on both chassis, and you
> > might notice claim/release logs.
> >
> > Or you can do "tail -f ovnsb_db.db" and see that there are constant updates
> > to the cr-<gateway_port>.
> >
> > Having 3 chassis will not result in this split brain scenario which you have
> > probably observed.
>
> I am going to do a bit more research and see what happens on some
> real OpenStack installation, maybe I messed up somewhere.
>
> There is nothing logged in the ovn-controller, and nothing flooding
> the DB (apart from one line saying port_binding is down). My understanding was
> that the move of gateway (as it happens for chassis-3) happens
> without the involvement of the control plane, in other words in case
> the first gateway fails, the flows to move to the second gateway are
> already installed and can be used straight away.
>
> I am puzzled because if I trace the packet from chassis-2 before and
> after chassis-1 dies, it always ends up in this flow:
>
> 37. reg15=0x3,metadata=0x4, priority 100, cookie 0x7a15360f
>     set_field:0x4/0xffffff->tun_id
>     set_field:0x3->tun_metadata0
>     move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
>      -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
>     bundle(eth_src,0,active_backup,ofport,members:7)
>
> The only difference is that, when chassis-1 is up, the trace adds
>      -> output to kernel tunnel
>
> It seems that there is no backup flow for packets not going through a
> tunnel, straight to external.

I think it is expected, because ovn-controller of chassis-1 has claimed
the gateway port (i.e. cr-<gw_port>), and hence ovn-controller on chassis-2
has the above flow you mentioned.  If you run "ovn-sbctl show" you would
see chassis-1 claiming the gateway chassis port. (I am talking about
your 2 chassis scenario here).
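
To illustrate why the trace loses its "output to kernel tunnel" line once
chassis-1 is gone, here is a minimal sketch of the active_backup selection
used by the bundle action in that flow: it picks the first member that is
live, and outputs nowhere if none is. The function name and the plain set of
live ports are made up for illustration; how OVS actually decides that a
port is live (I am assuming the BFD state feeds into it) is not modelled.

```python
from typing import Optional, Sequence, Set


def active_backup_select(members: Sequence[int], live_ports: Set[int]) -> Optional[int]:
    """Return the first member ofport that is live, or None if none is.

    Hypothetical helper: it only mirrors the documented behaviour of
    bundle(..., active_backup, ofport, members:...), not the OVS code.
    """
    for ofport in members:
        if ofport in live_ports:
            return ofport
    return None


# The flow in the trace has a single member, ofport 7 (the tunnel
# towards chassis-1), so there is nothing to fall back to.
members = [7]

print("chassis-1 up:  ", active_backup_select(members, {7}))
print("chassis-1 down:", active_backup_select(members, set()))
```

With only one member in the bundle, a dead tunnel leaves the packet with no
output at all, which matches what the trace shows.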

Along with killing ovs-vswitchd, if you also kill ovn-controller, you
should not see the above tunnel flow. Instead, ovn-controller on chassis-2
would claim the gateway chassis port (confirm by running "ovn-sbctl show")
and also remove the above table 37 flow.
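
As a rough way to picture the two cases, here is a toy model of the claim
decision as described in this thread: each running ovn-controller wants the
gateway port if its own chassis has the highest priority among the chassis
it believes are alive. This is only a sketch under that assumption, not the
ovn-controller implementation; the function names and priority values are
invented for the example.

```python
from typing import Dict, Optional, Set


def preferred_gateway(priorities: Dict[str, int], believed_alive: Set[str]) -> Optional[str]:
    """Highest-priority chassis among those this controller believes alive."""
    candidates = [c for c in priorities if c in believed_alive]
    if not candidates:
        return None
    return max(candidates, key=lambda c: priorities[c])


def claimants(priorities: Dict[str, int], views: Dict[str, Set[str]]) -> Set[str]:
    """Chassis whose ovn-controller would try to claim the gateway port.

    `views` maps each chassis with a *running* ovn-controller to the set
    of chassis it currently believes to be alive (itself included).
    """
    return {c for c, alive in views.items()
            if preferred_gateway(priorities, alive) == c}


# Hypothetical priorities: chassis-1 is the preferred gateway.
priorities = {"chassis-1": 20, "chassis-2": 10}

# Only ovs-vswitchd killed on chassis-1: its ovn-controller still runs
# and, according to it, BFD is fine, while chassis-2 sees chassis-1 down.
only_vswitchd_killed = {
    "chassis-1": {"chassis-1", "chassis-2"},
    "chassis-2": {"chassis-2"},
}
print(sorted(claimants(priorities, only_vswitchd_killed)))

# ovn-controller killed as well: chassis-1 no longer takes part.
controller_killed_too = {"chassis-2": {"chassis-2"}}
print(sorted(claimants(priorities, controller_killed_too)))
```

In the first case both chassis want the port, which is the contention
described earlier in the thread; in the second only chassis-2 is left to
claim it.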

Thanks
Numan

>
> Before tackling the tricky cases, I would like to make it work when
> it fails "as documented" :), just one chassis dying but traffic being
> quickly dispatched somewhere else.
>
> Thanks
>
> > Thanks
> > Numan
> >
> >
> > >
> > > Francois
> > > _______________________________________________
> > > discuss mailing list
> > > discuss at openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> > >


