[ovs-dev] [PATCH ovn v3 2/2] northd: Remove "reachable" functions and users of them.

Dumitru Ceara dceara at redhat.com
Mon Mar 29 07:08:47 UTC 2021

On 3/28/21 3:01 AM, Mark Michelson wrote:
> On 3/26/21 10:12 AM, Dumitru Ceara wrote:
>> On 3/26/21 2:52 PM, Mark Michelson wrote:
>>> On 3/26/21 7:35 AM, Dumitru Ceara wrote:
>>>> On 3/26/21 12:32 PM, Dumitru Ceara wrote:
>>>>> On 3/24/21 12:44 AM, Mark Michelson wrote:
>>>>>> On 3/23/21 6:55 PM, Dumitru Ceara wrote:
>>>>>>> On 3/23/21 10:02 PM, Mark Michelson wrote:
>>>>>>>> On 3/23/21 12:13 PM, Dumitru Ceara wrote:
>>>>>>>>> On 3/23/21 5:05 PM, Numan Siddique wrote:
>>>>>>>>>> On Fri, Mar 19, 2021 at 2:20 AM Mark Michelson
>>>>>>>>>> <mmichels at redhat.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Self-originated ARPs are intended to be sent to the "owning"
>>>>>>>>>>> router for
>>>>>>>>>>> a given IP address, as well as flooded to non-router ports on a
>>>>>>>>>>> logical
>>>>>>>>>>> switch.
>>>>>>>>>>> When trying to determine the owning router for an IP address, we
>>>>>>>>>>> would
>>>>>>>>>>> iterate through load balancer and NAT addresses, and check if
>>>>>>>>>>> these IP
>>>>>>>>>>> addresses were "reachable" on this particular router. Reachable
>>>>>>>>>>> here
>>>>>>>>>>> means that the NAT external IP or load balancer VIP falls within
>>>>>>>>>>> any of
>>>>>>>>>>> the networks served by this router. If an IP address were
>>>>>>>>>>> determined
>>>>>>>>>>> not
>>>>>>>>>>> to be reachable, then we would not try to send ARPs for that
>>>>>>>>>>> particular
>>>>>>>>>>> address to the router.
>>>>>>>>>>> However, it is possible (and sometimes desired) to configure NAT
>>>>>>>>>>> floating
>>>>>>>>>>> IPs on a router that are not in any subnet handled by that
>>>>>>>>>>> router.
>>>>>>>>>>> In this case, OVN refuses to send ARP requests to the router on
>>>>>>>>>>> which the
>>>>>>>>>>> floating IP has been configured. The result is that
>>>>>>>>>>> internally-generated
>>>>>>>>>>> traffic that targets the floating IP cannot reach its
>>>>>>>>>>> destination,
>>>>>>>>>>> since the router on which the floating IP is configured never
>>>>>>>>>>> receives ARPs
>>>>>>>>>>> for the floating IP.
>>>>>>>>>>> This patch fixes the issue by removing the reachability checks
>>>>>>>>>>> altogether. If a router has a NAT external IP or load
>>>>>>>>>>> balancer VIP
>>>>>>>>>>> that
>>>>>>>>>>> is outside the range of any of its configured subnets, we still
>>>>>>>>>>> should
>>>>>>>>>>> send ARPs to that router for those requested addresses.
>>>>>>>>>>> Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1929901
>>>>>>>>>>> Signed-off-by: Mark Michelson <mmichels at redhat.com>
>>>>>>>>>> Thanks for addressing this.
>>>>>>>>>> Acked-by: Numan Siddique <numans at ovn.org>
>>>>>>>>>> @Dumitru - Since you had added the code to limit ARPs.  Can you
>>>>>>>>>> please
>>>>>>>>>> also take a look at this patch ?
>>>>>>>>> Hi Mark, Numan,
>>>>>>>>> I've been thinking about this for a while.  I think this needs at
>>>>>>>>> least:
>>>>>>>>> Fixes: 1e07781310d8 ("ovn-northd: Fix logical flows to limit
>>>>>>>>> ARP/NS
>>>>>>>>> broadcast domain.")
>>>>>>>> OK, I can either just add that to the commit when I merge, or if I
>>>>>>>> need
>>>>>>>> to put a new version up for review, I can add it to the new version
>>>>>>>> then.
>>>>>>> Sounds good.
>>>>> I investigated some more and the problem is actually because of the
>>>>> combination of:
>>>>> 1e07781310d8 ("ovn-northd: Fix logical flows to limit ARP/NS
>>>>> broadcast domain.")
>>>>> and
>>>>> 8c6a5bc21847 ("ovn-northd: Limit self originated ARP/ND broadcast
>>>>> domain.")
>>>>> Without 8c6a5bc21847, the "reachable" functions just optimize ARP
>>>>> request flooding but when the flows using them are not hit we always
>>>>> fall back to flooding in the MC_FLOOD group on LS-PUB.
>>>>> With 8c6a5bc21847, we instead flood all logical router self originated
>>>>> ARP requests in the MC_FLOOD_L2 group.
>>>>>>>>> With the latest changes in ovn-kubernetes I think that the
>>>>>>>>> above was
>>>>>>>>> not
>>>>>>>>> needed anyway.  Mark, do you have more details about this by any
>>>>>>>>> chance?
>>>>>>>> I'm not sure which changes in ovn-kubernetes you're referring to
>>>>>>>> here.
>>>>>>> I was thinking of this:
>>>>>>> https://github.com/ovn-org/ovn-kubernetes/pull/2055
>>>>>>> Specifically:
>>>>>>> https://github.com/ovn-org/ovn-kubernetes/commit/ce2d312c7266c8d55173c9660002c67715a5444b
>>>>>>> However, after a closer look, I'm not sure if dnat_and_snat usage is
>>>>>>> completely removed in ovn-kubernetes.  This makes me wonder if
>>>>>>> reverting
>>>>>>> 1e07781310d8 ("ovn-northd: Fix logical flows to limit ARP/NS
>>>>>>> broadcast
>>>>>>> domain.") will not cause a regression in the ovn-kubernetes use
>>>>>>> case.
>>>>>>>>> Initial bug report was here:
>>>>>>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2020-June/050287.html
>>>>>>>> To quote from that e-mail:
>>>>>>>> "Question though is why any Pod on the logical switch would send
>>>>>>>> an ARP
>>>>>>>> for an IP that is not in its subnet. A packet from a Pod towards a
>>>>>>>> non
>>>>>>>> subnet IP should ARP only for the default gateway IP."
>>>>>>>> The above holds true for the scenario I'm working here. The key is
>>>>>>>> that
>>>>>>>> this is a self-originated ARP from OVN, not an ARP from a pod/VM.
>>>>>>>> The use case I'm solving here is OpenStack floating IPs. It's not
>>>>>>>> common, but it is expected that a user is able to assign a
>>>>>>>> floating IP
>>>>>>>> to a router that does not service that floating IP's subnet. They
>>>>>>>> also
>>>>>>>> expect all other VMs in the cluster to be able to reach the
>>>>>>>> target VM
>>>>>>>> via its floating IP.
>>>>>>>> So let's say you have a setup like:
>>>>>>>>                          EXT-ROUTER
>>>>>>>>                              ^
>>>>>>>>                              |
>>>>>>>>                              v
>>>>>>>> VM1 <-> LS1 <-> LR1 <-> LS-PUB <-> LR2 <-> LS2 <-> VM2
>>>>>>>> In this case, LR1 and LR2 both have distributed gateway router
>>>>>>>> ports on
>>>>>>>> them (the ports going to LS-PUB). They also have default static
>>>>>>>> routes
>>>>>>>> that point to the EXT-ROUTER. LR2's gateway router port IP
>>>>>>>> address is
>>>>>>>> An OpenStack user configures a floating IP for
>>>>>>>> VM2 of
>>>>>>>> This translates to a dnat_and_snat entry on LR2 for
>>>>>>>> that
>>>>>>>> address, which is NOT in the subnet configured on that router. Now
>>>>>>>> VM1
>>>>>>>> wants to ping In this case, when the ping reaches
>>>>>>>> LR1, LR1
>>>>>>>> needs to ARP to find where is. The ARP needs to
>>>>>>>> target LR2
>>>>>>>> since that's where the floating IP is configured. If the ARP only
>>>>>>>> goes
>>>>>>>> to EXT-ROUTER, EXT-ROUTER won't be able to respond to the ARP
>>>>>>>> properly,
>>>>>>>> and connectivity cannot be established.
>>>>>>> I might be wrong but I'm still not sure how valid it is to define a
>>>>>>> floating IP ( on a router that doesn't have network
>>>>>>> that
>>>>>>> includes
>>>>>> Unfortunately, this battle has been fought a couple of times already,
>>>>>> and in the case of OpenStack, this is something they expect to
>>>>>> work and
>>>>>> currently does work with ml2-ovs.
>>>>>>> Would it work if neutron-ovn configured a network
>>>>>>> on the
>>>>>>> LR2-to-LS-PUB router port?
>>>>>> Yes, that would make the router appear to be "reachable" with the
>>>>>> current OVN code.
>>>>>>> Otherwise, I think we need to clarify if ovn-kubernetes would still
>>>>>>> have
>>>>>>> a scale issue or not.
>>>>>>> A less pretty alternative is to allow such floating IPs (without a
>>>>>>> network that includes that IP) only if a new config knob is set.
>>>>>> I think there are a couple of valid options here:
>>>>>> 1) A configuration option. The option could be configured on the NAT
>>>>>> itself to allow for that address to be considered "reachable" on the
>>>>>> router even if it lies outside the router subnet.
>>>>>> 2) Automatically allow this behavior for "floating IP" NATs
>>>>>> (dnat_and_snat, external_mac set, and port set).
>>>>>> (2) Has the advantage of "just working" for the OpenStack FIP case
>>>>>> while
>>>>>> also not causing a flow explosion for ovn-kubernetes. However, (1) is
>>>>>> more flexible, allowing for the option to be set on other DNATs as
>>>>>> well.
>>>>> Maybe an option (3) that would make all CMSs happy could be:
>>>>> For all dnat_and_snat logical IPs that are not part of router subnets
>>>>> and a single logical flow in table in_l2_lkup on LS-PUB flooding ARP
>>>> s/and a single/add a single/
>>>>> requests that target them in the MC_FLOOD group.  This shouldn't
>>>>> create
>>>>> a scale issue because it's a single flow for all such IPs. If this
>>>>> flow has higher priority than the flow restricting router self
>>>>> originated ARPs to MC_FLOOD_L2 then it should also work fine.
>>> OK, the idea here is that you put the known "unreachable" IP
>>> destinations in a higher priority than the known "reachable" ones, so
>>> that if one of the unreachable ones is targeted, we flood out all ports
>>> instead of just L2. That should work well for self-originated ARPs or
>>> ARPs from outside.
>>> I think this would work. But also see my comments below.
>>>>> The downside is that we *might* hit the OVS resubmit limit for ARPs
>>>>> targeting such FIPs on scaled setups but, I guess, that's a different
>>>>> story.
>>> One thing Terry brought up with me yesterday is that it's odd we have to
>>> self-originate ARPs at all for NAT addresses that are configured on the
>>> logical network. We should be able to put add MAC_Binding entries for
>>> those addresses without actually sending an ARP. You told me yesterday
>>> in chat that in pinctrl, we have send_garp_locally(), which is supposed
>>> to do this. In my setup (and presumably with OpenStack), this is not
>>> happening, hence the need for ARPs.
>> That's true, we should figure out why the MAC_Binding entries are not
>> automatically inserted by ovn-controller in this case.
> I've looked into this a bit more, and I can answer about my particular
> setup, but I can't comment about OpenStack since I don't have a setup I
> can easily inspect at the moment.
> In my case, there were two reasons why I was not seeing gARPs for the
> NAT addresses.
> * I had not set options:nat-addresses on the ls-pub switch.
> * Even if I had set options:nat-addresses on the ls-pub switch, ls-pub
> did not have any ports of type "localnet"
> If we go back to a time before the existence of send_garp_locally(),
> this makes good sense. You only want to send gARPs for NAT addresses if
> there is an external network, and you want to pick and choose which
> addresses you publicize. For "internal" traffic wishing to reach a
> pod/VM via a NAT address, we send ARPs to learn the MAC binding as needed.
> I think that we should send local gARPs (i.e. create MAC_Bindings) for
> NAT addresses even if the connected switch does not have localnet ports.
> What's nice is that we already have an option that so if someone doesn't
> want to fill their MAC_Binding table unnecessarily, they can set
> always_learn_from_arp_request=false and they'll still use self-generated
> ARPs to learn the addresses as needed.
> What do you think?

As long as we respect the always_learn_from_arp_request=false semantics,
I think it makes complete sense to create the mac_bindings automatically
for all logical switches.  We should probably also rename
send_garp_locally() to something more appropriate, it was a bad name to
start with, sorry about that.

>>> If we could get the NAT addresses added to MAC_Binding without sending
>>> any ARPs, I think this would be for the best. And for deployments
>>> worried about the size of the MAC_Binding table, Han added the
>>> "always_learn_from_arp_request" option to force the need for an ARP in
>>> those cases. And in those cases, your suggestion about adding the new
>>> flow in ls_in_l2_lkup should work.
>> But I think we need the new flow regardless of the
>> always_learn_from_arp_request value.  That's because an operator/the CMS
>> can clear the MAC_Binding table externally and ovn-controller will not
>> repopulate it automatically.  IIRC, the number of gARPs sent by
>> ovn-controller is limited.
> Yes, we'd still need to add the new flow, both for the reason you
> mentioned, as well as to work with the
> always_learn_from_arp_request=false case.


>>>>>> I'm just bracing for when someone wants the same behavior for load
>>>>>> balancers...
>>>>> Also, I think the system test you're adding needs a small change as it
>>>>> currently passes both with and without your patch.  We need to remove:
>>>>> ovn-nbctl lr-route-add lr1 lr1-ls-pub
>>>>> What do you think?
>>> In this particular case, I can remove that static route. The only reason
>>> I can remove it in this case is because the floating IP happens to be in
>>> lr1's subnet. Otherwise, lr_in_ip_routing would drop the ping request.
>>> We would not even reach lr_in_arp_request. In the generalized case where
>>> a FIP could be outside the subnet of EITHER of the logical routers, then
>>> the static route would be necessary.
>> Makes sense.
>>> In the reference network I used from Terry, a difference was that there
>>> was an external router connected to ls-pub via a localnet port. lr1 and
>>> lr2 had default static routes added. Essentially:
>>> ovn-nbctl lr-route-add lr1 <ext_router_ip> lr1-ls-pub
>>> ovn-nbctl lr-route-add lr2 <ext_router_ip> lr2-ls-pub
>>> This way, the pings (both request and response) would always make it
>>> past lr_in_ip_routing.
>>> In my test, I was trying for a more minimal setup, with no external
>>> router. In order for the test to work, I needed to add static routes.
>>> I think I need to just add a fake "external router" VM to the test and
>>> include the same default routes on lr1 and lr2 in order to simulate the
>>> OpenStack scenario better. Having no external router means I have to add
>>> static routes on each router (for the general case), which trivializes
>>> the scenario. I also will change the test so that FIP so that it is not
>>> in either logical router's subnet.
>> I think the current test is fine as long as we remove the route on lr1.
>> We're actually testing the way lr2 replies to ARPs for the FIP so it
>> should be enough like this IMO.
>>>>> Regards,
>>>>> Dumitru
>> Regards,
>> Dumitru

More information about the dev mailing list