[ovs-dev] [PATCH ovn v3 2/2] northd: Remove "reachable" functions and users of them.

Fri Mar 26 14:12:40 UTC 2021

On 3/26/21 2:52 PM, Mark Michelson wrote:
> On 3/26/21 7:35 AM, Dumitru Ceara wrote:
>> On 3/26/21 12:32 PM, Dumitru Ceara wrote:
>>> On 3/24/21 12:44 AM, Mark Michelson wrote:
>>>> On 3/23/21 6:55 PM, Dumitru Ceara wrote:
>>>>> On 3/23/21 10:02 PM, Mark Michelson wrote:
>>>>>> On 3/23/21 12:13 PM, Dumitru Ceara wrote:
>>>>>>> On 3/23/21 5:05 PM, Numan Siddique wrote:
>>>>>>>> On Fri, Mar 19, 2021 at 2:20 AM Mark Michelson
>>>>>>>> <mmichels at redhat.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Self-originated ARPs are intended to be sent to the "owning"
>>>>>>>>> router for
>>>>>>>>> a given IP address, as well as flooded to non-router ports on a
>>>>>>>>> logical
>>>>>>>>> switch.
>>>>>>>>>
>>>>>>>>> When trying to determine the owning router for an IP address, we
>>>>>>>>> would
>>>>>>>>> iterate through load balancer and NAT addresses, and check if
>>>>>>>>> these IP
>>>>>>>>> addresses were "reachable" on this particular router. Reachable
>>>>>>>>> here
>>>>>>>>> means that the NAT external IP or load balancer VIP falls within
>>>>>>>>> any of
>>>>>>>>> the networks served by this router. If an IP address were
>>>>>>>>> determined
>>>>>>>>> not
>>>>>>>>> to be reachable, then we would not try to send ARPs for that
>>>>>>>>> particular
>>>>>>>>> address to the router.
>>>>>>>>>
>>>>>>>>> However, it is possible (and sometimes desired) to configure NAT
>>>>>>>>> floating
>>>>>>>>> IPs on a router that are not in any subnet handled by that router.
>>>>>>>>> In this case, OVN refuses to send ARP requests to the router on
>>>>>>>>> which the
>>>>>>>>> floating IP has been configured. The result is that
>>>>>>>>> internally-generated
>>>>>>>>> traffic that targets the floating IP cannot reach its destination,
>>>>>>>>> since the router on which the floating IP is configured never
>>>>>>>>> receives ARPs
>>>>>>>>> for the floating IP.
>>>>>>>>>
>>>>>>>>> This patch fixes the issue by removing the reachability checks
>>>>>>>>> altogether. If a router has a NAT external IP or load balancer VIP
>>>>>>>>> that
>>>>>>>>> is outside the range of any of its configured subnets, we still
>>>>>>>>> should
>>>>>>>>> send ARPs to that router for those requested addresses.
>>>>>>>>>
>>>>>>>>> Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1929901
>>>>>>>>>
>>>>>>>>> Signed-off-by: Mark Michelson <mmichels at redhat.com>
>>>>>>>>
>>>>>>>> Thanks for addressing this.
>>>>>>>>
>>>>>>>> Acked-by: Numan Siddique <numans at ovn.org>
>>>>>>>>
>>>>>>>> @Dumitru - Since you had added the code to limit ARPs.  Can you
>>>>>>>> please
>>>>>>>> also take a look at this patch ?
>>>>>>>>
>>>>>>>
>>>>>>> Hi Mark, Numan,
>>>>>>>
>>>>>>> I've been thinking about this for a while.  I think this needs at
>>>>>>> least:
>>>>>>>
>>>>>>> Fixes: 1e07781310d8 ("ovn-northd: Fix logical flows to limit ARP/NS
>>>>>>> broadcast domain.")
>>>>>>
>>>>>> OK, I can either just add that to the commit when I merge, or if I
>>>>>> need
>>>>>> to put a new version up for review, I can add it to the new version
>>>>>> then.
>>>>>
>>>>> Sounds good.
>>>
>>> I investigated some more and the problem is actually because of the
>>> combination of:
>>>
>>> 1e07781310d8 ("ovn-northd: Fix logical flows to limit ARP/NS
>>> broadcast domain.")
>>>
>>> and
>>>
>>> 8c6a5bc21847 ("ovn-northd: Limit self originated ARP/ND broadcast
>>> domain.")
>>>
>>> Without 8c6a5bc21847, the "reachable" functions just optimize ARP
>>> request flooding but when the flows using them are not hit we always
>>> fall back to flooding in the MC_FLOOD group on LS-PUB.
>>> With 8c6a5bc21847, we instead flood all logical router self originated
>>> ARP requests in the MC_FLOOD_L2 group.
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> With the latest changes in ovn-kubernetes I think that the above was
>>>>>>> not
>>>>>>> needed anyway.  Mark, do you have more details about this by any
>>>>>>> chance?
>>>>>>
>>>>>> I'm not sure which changes in ovn-kubernetes you're referring to
>>>>>> here.
>>>>>
>>>>> I was thinking of this:
>>>>>
>>>>> https://github.com/ovn-org/ovn-kubernetes/pull/2055
>>>>>
>>>>> Specifically:
>>>>>
>>>>> https://github.com/ovn-org/ovn-kubernetes/commit/ce2d312c7266c8d55173c9660002c67715a5444b
>>>>>
>>>>>
>>>>>
>>>>> However, after a closer look, I'm not sure if dnat_and_snat usage is
>>>>> completely removed in ovn-kubernetes.  This makes me wonder if
>>>>> reverting
>>>>> 1e07781310d8 ("ovn-northd: Fix logical flows to limit ARP/NS broadcast
>>>>> domain.") will not cause a regression in the ovn-kubernetes use case.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Initial bug report was here:
>>>>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2020-June/050287.html
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> To quote from that e-mail:
>>>>>>
>>>>>> "Question though is why any Pod on the logical switch would send
>>>>>> an ARP
>>>>>> for an IP that is not in its subnet. A packet from a Pod towards a
>>>>>> non
>>>>>> subnet IP should ARP only for the default gateway IP."
>>>>>>
>>>>>> The above holds true for the scenario I'm working here. The key is
>>>>>> that
>>>>>> this is a self-originated ARP from OVN, not an ARP from a pod/VM.
>>>>>>
>>>>>> The use case I'm solving here is OpenStack floating IPs. It's not
>>>>>> common, but it is expected that a user is able to assign a
>>>>>> floating IP
>>>>>> to a router that does not service that floating IP's subnet. They
>>>>>> also
>>>>>> expect all other VMs in the cluster to be able to reach the target VM
>>>>>> via its floating IP.
>>>>>>
>>>>>> So let's say you have a setup like:
>>>>>>
>>>>>>
>>>>>>                         EXT-ROUTER
>>>>>>                             ^
>>>>>>                             |
>>>>>>                             v
>>>>>> VM1 <-> LS1 <-> LR1 <-> LS-PUB <-> LR2 <-> LS2 <-> VM2
>>>>>>
>>>>>> In this case, LR1 and LR2 both have distributed gateway router
>>>>>> ports on
>>>>>> them (the ports going to LS-PUB). They also have default static
>>>>>> routes
>>>>>> that point to the EXT-ROUTER. LR2's gateway router port IP address is
>>>>>> 172.18.1.1/24. An OpenStack user configures a floating IP for VM2 of
>>>>>> 172.18.2.100. This translates to a dnat_and_snat entry on LR2 for
>>>>>> that
>>>>>> address, which is NOT in the subnet configured on that router. Now
>>>>>> VM1
>>>>>> wants to ping 172.18.2.100. In this case, when the ping reaches
>>>>>> LR1, LR1
>>>>>> needs to ARP to find where 172.18.2.100 is. The ARP needs to
>>>>>> target LR2
>>>>>> since that's where the floating IP is configured. If the ARP only
>>>>>> goes
>>>>>> to EXT-ROUTER, EXT-ROUTER won't be able to respond to the ARP
>>>>>> properly,
>>>>>> and connectivity cannot be established.
>>>>>
>>>>> I might be wrong but I'm still not sure how valid it is to define a
>>>>> floating IP (172.18.2.100) on a router that doesn't have network that
>>>>> includes 172.18.2.100.
>>>>
>>>> Unfortunately, this battle has been fought a couple of times already,
>>>> and in the case of OpenStack, this is something they expect to work and
>>>> currently does work with ml2-ovs.
>>>>
>>>>>
>>>>> Would it work if neutron-ovn configured a 172.18.2.100/32 network
>>>>> on the
>>>>> LR2-to-LS-PUB router port?
>>>>
>>>> Yes, that would make the router appear to be "reachable" with the
>>>> current OVN code.
>>>>
>>>>>
>>>>> Otherwise, I think we need to clarify if ovn-kubernetes would still
>>>>> have
>>>>> a scale issue or not.
>>>>>
>>>>> A less pretty alternative is to allow such floating IPs (without a
>>>>> network that includes that IP) only if a new config knob is set.
>>>>
>>>> I think there are a couple of valid options here:
>>>>
>>>> 1) A configuration option. The option could be configured on the NAT
>>>> itself to allow for that address to be considered "reachable" on the
>>>> router even if it lies outside the router subnet.
>>>>
>>>> 2) Automatically allow this behavior for "floating IP" NATs
>>>> (dnat_and_snat, external_mac set, and port set).
>>>>
>>>>
>>>> (2) Has the advantage of "just working" for the OpenStack FIP case
>>>> while
>>>> also not causing a flow explosion for ovn-kubernetes. However, (1) is
>>>> more flexible, allowing for the option to be set on other DNATs as
>>>> well.
>>>
>>> Maybe an option (3) that would make all CMSs happy could be:
>>>
>>> For all dnat_and_snat logical IPs that are not part of router subnets
>>> and a single logical flow in table in_l2_lkup on LS-PUB flooding ARP
>>
>> s/and a single/add a single/
>>
>>> requests that target them in the MC_FLOOD group.  This shouldn't create
>>> a scale issue because it's a single flow for all such IPs. If this
>>> flow has higher priority than the flow restricting router self
>>> originated ARPs to MC_FLOOD_L2 then it should also work fine.
> 
> OK, the idea here is that you put the known "unreachable" IP
> destinations in a higher priority than the known "reachable" ones, so
> that if one of the unreachable ones is targeted, we flood out all ports
> instead of just L2. That should work well for self-originated ARPs or
> ARPs from outside.
> 
> I think this would work. But also see my comments below.
> 
>>>
>>> The downside is that we *might* hit the OVS resubmit limit for ARPs
>>> targeting such FIPs on scaled setups but, I guess, that's a different
>>> story.
> 
> One thing Terry brought up with me yesterday is that it's odd we have to
> self-originate ARPs at all for NAT addresses that are configured on the
> logical network. We should be able to put add MAC_Binding entries for
> those addresses without actually sending an ARP. You told me yesterday
> in chat that in pinctrl, we have send_garp_locally(), which is supposed
> to do this. In my setup (and presumably with OpenStack), this is not
> happening, hence the need for ARPs.

That's true, we should figure out why the MAC_Binding entries are not
automatically inserted by ovn-controller in this case.

> 
> If we could get the NAT addresses added to MAC_Binding without sending
> any ARPs, I think this would be for the best. And for deployments
> worried about the size of the MAC_Binding table, Han added the
> "always_learn_from_arp_request" option to force the need for an ARP in
> those cases. And in those cases, your suggestion about adding the new
> flow in ls_in_l2_lkup should work.

But I think we need the new flow regardless of the
always_learn_from_arp_request value.  That's because an operator/the CMS
can clear the MAC_Binding table externally and ovn-controller will not
repopulate it automatically.  IIRC, the number of gARPs sent by
ovn-controller is limited.

> 
>>>
>>>>
>>>> I'm just bracing for when someone wants the same behavior for load
>>>> balancers...
>>>>
>>>
>>> Also, I think the system test you're adding needs a small change as it
>>> currently passes both with and without your patch.  We need to remove:
>>>
>>> ovn-nbctl lr-route-add lr1 172.18.2.12 172.18.1.173 lr1-ls-pub
>>>
>>> What do you think?
> 
> In this particular case, I can remove that static route. The only reason
> I can remove it in this case is because the floating IP happens to be in
> lr1's subnet. Otherwise, lr_in_ip_routing would drop the ping request.
> We would not even reach lr_in_arp_request. In the generalized case where
> a FIP could be outside the subnet of EITHER of the logical routers, then
> the static route would be necessary.

Makes sense.

> 
> In the reference network I used from Terry, a difference was that there
> was an external router connected to ls-pub via a localnet port. lr1 and
> lr2 had default static routes added. Essentially:
> 
> ovn-nbctl lr-route-add lr1 0.0.0.0/0 <ext_router_ip> lr1-ls-pub
> ovn-nbctl lr-route-add lr2 0.0.0.0/0 <ext_router_ip> lr2-ls-pub
> 
> This way, the pings (both request and response) would always make it
> past lr_in_ip_routing.
> 
> In my test, I was trying for a more minimal setup, with no external
> router. In order for the test to work, I needed to add static routes.
> 
> I think I need to just add a fake "external router" VM to the test and
> include the same default routes on lr1 and lr2 in order to simulate the
> OpenStack scenario better. Having no external router means I have to add
> static routes on each router (for the general case), which trivializes
> the scenario. I also will change the test so that FIP so that it is not
> in either logical router's subnet.

I think the current test is fine as long as we remove the route on lr1.
We're actually testing the way lr2 replies to ARPs for the FIP so it
should be enough like this IMO.

> 
>>>
>>> Regards,
>>> Dumitru
>>>
>>
> 

Regards,
Dumitru