[ovs-discuss] active_backup failover issue

Tue Apr 27 21:59:41 UTC 2021

On Tue, 27 Apr 2021 at 23:08, Numan Siddique <numans at ovn.org> wrote:
>
> On Tue, Apr 27, 2021 at 4:58 PM Francois <rigault.francois at gmail.com> wrote:
> >
> > On Tue, 27 Apr 2021 at 22:20, Numan Siddique <numans at ovn.org> wrote:
> > >
> > > On Tue, Apr 27, 2021 at 9:11 AM Francois <rigault.francois at gmail.com> wrote:
> > > >
> >
> > > The ovn-controller running on chassis-1 will not detect the BFD failover.
> >
> > Thanks for your answer! Ok for chassis-1.
> >
> > What I don't understand is why chassis-2, who is aware that chassis-1
> > is down, is not able to act as a gateway for its own ports.
>
> I see what's going on.  So ovn-controller on chassis-2 detects the failover
> and claims the cr-<gateway_port>. But ovn-controller on chassis-1 which has
> higher priority claims it back because according to it, BFD is fine.
>
> You can probably monitor the ovn-controller logs on both chassis, and you
> might notice claim/release logs.
>
> Or you can do "tail -f ovnsb_db.db" and see that there are constant updates
> to the cr-<gateway_port>.
>
> Having 3 chassis will not result in this split brain scenario which you have
> probably observed.

I am going to do a bit more research and see what happens on a real
OpenStack installation; maybe I messed up somewhere.

Nothing is logged by ovn-controller, and nothing is flooding the DB
(apart from one line saying the port_binding is down). My understanding
was that the gateway move (as it happens for chassis-3) takes place
without involving the control plane: if the first gateway fails, the
flows that redirect traffic to the second gateway are already installed
and can be used straight away.

I am puzzled because if I trace the packet from chassis-2 before and
after chassis-1 dies, it always ends up in this flow:

37. reg15=0x3,metadata=0x4, priority 100, cookie 0x7a15360f
    set_field:0x4/0xffffff->tun_id
    set_field:0x3->tun_metadata0
    move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
     -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
    bundle(eth_src,0,active_backup,ofport,members:7)

The only difference is that, when chassis-1 is up, the trace adds:
     -> output to kernel tunnel

It seems that there is no backup flow that would let packets skip the
tunnel and go straight to the external network.
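
To make my reading of the bundle action concrete, here is a minimal
Python sketch of how I understand active_backup selection: take the
first member, in the listed order, that is live, and output nothing if
none is. The function name and the set of live ports are hypothetical
and only for illustration; this is not OVS code, and ofport 8 is an
invented second member that does not appear in my trace.

```python
from typing import Iterable, Optional, Set


def active_backup_select(members: Iterable[int],
                         live_ports: Set[int]) -> Optional[int]:
    """Return the first listed ofport that is live, or None if none is."""
    for ofport in members:
        if ofport in live_ports:
            return ofport
    return None  # no live member: the packet is not output anywhere


# Flow from the trace above: the bundle lists only member 7.
print(active_backup_select([7], {7}))     # tunnel to chassis-1 is up -> 7
print(active_backup_select([7], set()))   # tunnel is down -> None

# Hypothetical bundle with a second member (ofport 8 is an assumption).
print(active_backup_select([7, 8], {8}))  # falls back to 8
```

If that reading is right, a bundle with a single member has nothing to
fall back to once that member is considered down, which would match
the trace losing its "output to kernel tunnel" line.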

Before tackling the tricky cases, I would like to make it work when
it fails "as documented" :), just one chassis dying but traffic being
quickly dispatched somewhere else.

Thanks