[ovs-discuss] OVN: MAC_Binding entries not getting updated leads to unreachable destinations

Daniel Alvarez Sanchez dalvarez at redhat.com
Fri Nov 23 16:28:56 UTC 2018


On Wed, Nov 21, 2018 at 9:04 PM Han Zhou <zhouhan at gmail.com> wrote:
>
>
>
> On Tue, Nov 20, 2018 at 5:21 AM Mark Michelson <mmichels at redhat.com> wrote:
> >
> > Hi Daniel,
> >
> > I agree with Numan that this seems like a good approach to take.
> >
> > On 11/16/2018 12:41 PM, Daniel Alvarez Sanchez wrote:
> > >
> > > On Sat, Nov 10, 2018 at 12:21 AM Ben Pfaff <blp at ovn.org> wrote:
> > >  >
> > >  > On Mon, Oct 29, 2018 at 05:21:13PM +0530, Numan Siddique wrote:
> > >  > > On Mon, Oct 29, 2018 at 5:00 PM Daniel Alvarez Sanchez
> > >  > > <dalvarez at redhat.com> wrote:
> > >  > >
> > >  > > > Hi,
> > >  > > >
> > >  > > > After digging further, the problem seems to come down to reusing
> > >  > > > an old gateway IP address for a dnat_and_snat entry.
> > >  > > > When a gateway port is bound to a chassis, an entry for it shows
> > >  > > > up in the MAC_Binding table (at least when that Logical Switch is
> > >  > > > connected to more than one Logical Router). After deleting the
> > >  > > > Logical Router and all its ports, this entry remains there. If a
> > >  > > > new Logical Router is then created and a Floating IP
> > >  > > > (dnat_and_snat) using the old gateway IP address is assigned to a
> > >  > > > VM, that Floating IP becomes unreachable.
> > >  > > >
> > >  > > > A workaround for now in networking-ovn (the OpenStack
> > >  > > > integration) is to delete the MAC_Binding entries for that IP
> > >  > > > address upon FIP creation. I think this should however be handled
> > >  > > > within OVN itself; what do you folks think?
> > >  > > >
> > >  > > >
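
For reference, the stale row can also be spotted and removed by hand with
ovn-sbctl's generic database commands, which is roughly what the
networking-ovn workaround automates. Something along these lines (untested,
using the IP from the example below):

  ovn-sbctl find MAC_Binding ip="172.24.4.221"
  ovn-sbctl destroy MAC_Binding <uuid_from_the_find_output>
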
> > >  > > Agree. Since the MAC_Binding table row is created by ovn-controller,
> > >  > > it should be handled properly within OVN.
> > >  >
> > >  > I see that this has been sitting here for a while.  The solution seems
> > >  > reasonable to me.  Are either of you working on it?
> > >
> > > I started working on it. I came up with a solution (see patch below)
> > > which works, but I wanted to give you a bit more context and get your
> > > feedback:
> > >
> > >
> > >                             ^ localnet
> > >                             |
> > >                         +---+---+
> > >                         |       |
> > >                  +------+  pub  +------+
> > >                  |      |       |      |
> > >                  |      +-------+      |
> > >                  | 172.24.4.0/24       |
> > >                  |                     |
> > >     172.24.4.220 |                     | 172.24.4.221
> > >              +---+---+             +---+---+
> > >              |       |             |       |
> > >              |  LR0  |             |  LR1  |
> > >              |       |             |       |
> > >              +---+---+             +---+---+
> > >       10.0.0.254 |                     | 20.0.0.254
> > >                  |                     |
> > >              +---+---+             +---+---+
> > >              |       |             |       |
> > >  10.0.0.0/24 |  SW0  |             |  SW1  | 20.0.0.0/24
> > >              |       |             |       |
> > >              +---+---+             +---+---+
> > >                  |                     |
> > >                  |                     |
> > >              +---+---+             +---+---+
> > >              |       |             |       |
> > >              |  VM0  |             |  VM1  |
> > >              |       |             |       |
> > >              +-------+             +-------+
> > >              10.0.0.10             20.0.0.10
> > >            172.24.4.100           172.24.4.200
> > >
> > >
> > > When I ping VM1's floating IP from the external network, a new entry for
> > > 172.24.4.221 appears in the MAC_Binding table for the LR0 datapath:
> > >
> > > _uuid               : 85e30e87-3c59-423e-8681-ec4cfd9205f9
> > > datapath            : ac5984b9-0fea-485f-84d4-031bdeced29b
> > > ip                  : "172.24.4.221"
> > > logical_port        : "lrp02"
> > > mac                 : "00:00:02:01:02:04"
> > >
> > >
> > > Now, if LR1 gets removed, the old gateway IP (172.24.4.221) is reused
> > > for VM2's FIP with a different MAC, and a new gateway IP is created (for
> > > example 172.24.4.222 / 00:00:02:01:02:99), VM2's FIP becomes unreachable
> > > from VM1 until the old MAC_Binding entry gets deleted, since pinging
> > > 172.24.4.221 resolves to the stale MAC ("00:00:02:01:02:04").
> > >
> > > With the patch below, removing LR1 deletes, in every datapath, all the
> > > MAC_Binding entries where '172.24.4.221' appears in the 'ip' column, so
> > > the problem goes away.
> > >
> > > Another solution would be to implement some kind of 'aging' for
> > > MAC_Binding entries, but that is probably more complex.
> > > Looking forward to your comments :)
> > >
> > >
> > > diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> > > index 58bef7d..a86733e 100644
> > > --- a/ovn/northd/ovn-northd.c
> > > +++ b/ovn/northd/ovn-northd.c
> > > @@ -2324,6 +2324,18 @@ cleanup_mac_bindings(struct northd_context *ctx, struct hmap *ports)
> > >       }
> > >   }
> > >
> > > +static void
> > > +delete_mac_binding_by_ip(struct northd_context *ctx, const char *ip)
> > > +{
> > > +    const struct sbrec_mac_binding *b, *n;
> > > +    SBREC_MAC_BINDING_FOR_EACH_SAFE (b, n, ctx->ovnsb_idl) {
> > > +        if (strstr(ip, b->ip)) {
> > > +            sbrec_mac_binding_delete(b);
> > > +        }
> > > +    }
> > > +}
> > > +
> > > +
> > >   /* Updates the southbound Port_Binding table so that it contains the logical
> > >    * switch ports specified by the northbound database.
> > >    *
> > > @@ -2383,6 +2395,15 @@ build_ports(struct northd_context *ctx,
> > >       /* Delete southbound records without northbound matches. */
> > >       LIST_FOR_EACH_SAFE(op, next, list, &sb_only) {
> > >           ovs_list_remove(&op->list);
> > > +
> > > +        /* Delete all MAC_Binding entries which match the IP addresses of the
> > > +         * deleted logical router port (ie. port with a peer). */
> > > +        const char *peer = smap_get(&op->sb->options, "peer");
> > > +        if (peer) {
> > > +            for (int i = 0; i < op->sb->n_mac; i++) {
> > > +                delete_mac_binding_by_ip(ctx, op->sb->mac[i]);
> > > +            }
> > > +        }
> > >           sbrec_port_binding_delete(op->sb);
> > >           ovn_port_destroy(ports, op);
> > >       }
> > >
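
A note on the patch above: op->sb->mac[i] holds the router port's addresses
as a "MAC IP1 IP2 ..." string from the Port_Binding table, which is why
strstr() is used with b->ip as the needle rather than a plain strcmp().
With the patch applied, after deleting LR1 a quick sanity check such as

  ovn-sbctl find MAC_Binding ip="172.24.4.221"

should come back empty for every datapath (again, just an illustrative,
untested check).
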
>
> Hi,
>
> Sorry that I didn't notice this discussion until now. I encountered similar problems before. It was not in a floating IP scenario, but with external IPs - ports on the same networks but not known to OVN. When an IP relocates from one MAC to another, the previous mac-binding entry does not get updated and the relocated IP therefore becomes unreachable.
>
> This happens for external router IPs on the localnet network behind the gateways (which hosts the 172.24.4.221 port in Daniel's example). It also happens for nested workloads running inside a VM - the VM port is known to OVN, but the internal workloads (e.g. containers) run on the same subnets and rely on mac-binding to communicate.
>
> For both of my use cases, the problem has been solved by this patch (merged): https://github.com/openvswitch/ovs/commit/b068454082f5d76727ffde34542ff19fed20e178
>
> The idea is that the mac-binding entry should be updated when the IP is announced in a new location via GARP/ARP request/ARP response. So I think the best way to solve the problem for floating IPs is similar: we just need to generate a GARP when a new FIP is attached. I was under the impression that OVN already sends GARPs when a new NAT entry is added, but if the problem is still there it means something is wrong (or the GARP feature is not there yet for the NAT case), and I need to check the code.

I think we're only sending the GARPs for distributed floating IPs
(nat_addresses field in the Port_Binding table) [0].
Even with that, though, I'm not sure the MAC_Binding table would get
updated: I think the first time I hit this issue it was on a DVR
environment (i.e. distributed FIPs, dnat_and_snat entries with a
logical_port and external_mac).

[0] https://github.com/openvswitch/ovs/blob/master/ovn/controller/pinctrl.c#L2497
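
Just to make that concrete, the kind of NAT entry I mean is created with the
optional logical_port/external_mac arguments to lr-nat-add, e.g. something
like this (the router/port names and the MAC are made up; the addresses are
the ones from the diagram above):

  ovn-nbctl lr-nat-add LR0 dnat_and_snat 172.24.4.100 10.0.0.10 \
      vm0-port 00:00:02:01:02:aa
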
>
> For the patch proposed in this discussion, I think there are two problems.
>
> Firstly, I think it doesn't solve the problem completely. It only deletes mac-bindings when a logical router port is deleted. However, in any of the above use cases (including FIP), IP relocation can happen without deleting the router port. Or did I misunderstand something here?
>
> Secondly, northd just reconciles the current state with the desired state for the SB DB - it is declarative. We should avoid relying on northd's cleanup logic to trigger important operations. I think the design principle of northd should be to make sure the desired state is reached, without caring about how it is reached. For example, it can be reached by deleting extra records one by one, but it would also be correct to delete everything and recreate the desired entries - this is just an example, it may be inefficient, but it could be reasonable in some scenarios. Adding logic in northd that relies on *how* the desired state is computed would make it unreliable and hard to maintain. I think it would also create challenges for the DDlog implementation.
>
> For the mac-binding aging mechanism mentioned by Daniel, I agree. It is required for fault scenarios when the SB DB is temporarily down: since we rely on the SB DB to store the ARP cache/neighbor table for the virtual routers, ARP updates that happen while the DB is down are lost. However, the aging mechanism seems tricky once scale is considered. Only idle entries should be timed out, but it is costly to update state whenever a mac-binding entry is hit. I haven't thought of any clever way to achieve this without sacrificing scalability. Any thoughts here? A workaround is to resend GARPs periodically (e.g. every minute).
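
Agreed that the aging part is the tricky bit. Just to sketch the idea, the
sweep side could look roughly like the snippet below, reusing the iteration
idiom from my patch above. It assumes a hypothetical "timestamp" column in
MAC_Binding recording when the entry was last hit (no such column exists in
the SB schema today), and refreshing that column on every hit is exactly the
costly part you mention:

  /* Illustrative sketch only: b->timestamp is a hypothetical column that
   * would record when the binding was last used. */
  static void
  sweep_idle_mac_bindings(struct northd_context *ctx, int64_t now_msec,
                          int64_t max_idle_msec)
  {
      const struct sbrec_mac_binding *b, *n;
      SBREC_MAC_BINDING_FOR_EACH_SAFE (b, n, ctx->ovnsb_idl) {
          if (now_msec - b->timestamp > max_idle_msec) {
              sbrec_mac_binding_delete(b);
          }
      }
  }
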
>
> Thanks,
> Han

