[ovs-dev] [PATCH 00/15] Netdev vxlan-decap offload

Eli Britstein elibr at nvidia.com
Tue Feb 9 15:11:34 UTC 2021


On 2/8/2021 6:21 PM, Sriharsha Basavapatna wrote:
> On Mon, Feb 8, 2021 at 7:33 PM Eli Britstein <elibr at nvidia.com> wrote:
>>
>> On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
>>> On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr at nvidia.com> wrote:
>>>> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
>>>>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
>>>>> <sriharsha.basavapatna at broadcom.com> wrote:
>>>>>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr at nvidia.com> wrote:
>>>>>>> VXLAN decap in OVS-DPDK configuration consists of two flows:
>>>>>>> F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
>>>>>>> F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0
>>>>>>>
>>>>>>> F1 is a classification flow. It matches on the outer headers,
>>>>>>> classifies the packet as a VXLAN packet, and via the tnl_pop action
>>>>>>> the packet continues processing in F2.
>>>>>>> F2 is a flow that matches on the tunnel metadata as well as on the
>>>>>>> inner packet headers (like any other flow).
>>>>>>>
>>>>>>> In order to fully offload the VXLAN decap path, both F1 and F2 should
>>>>>>> be offloaded. As there is more than one flow in HW, it is possible that
>>>>>>> F1 is handled by HW but F2 is not. In that case the packet is received
>>>>>>> by SW and should be processed starting from F2, as F1 was already done
>>>>>>> by HW.
>>>>>>> Rte_flows are applicable only to physical port IDs, so vport flows
>>>>>>> (e.g. F2) are applied on the uplink ports attached to OVS.
>>>>>>>
>>>>>>> This patch-set makes use of [1], introduced in DPDK 20.11, which adds
>>>>>>> an API for tunnel offloads.
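
For readers less familiar with [1], this is roughly how F1 and F2 map onto
the DPDK 20.11 tunnel offload calls. It is a condensed sketch with
simplified attributes (the group numbers are illustrative) and with the
real patterns and actions left out; it is not the code of this series:

/* Sketch only: expressing F1 and F2 with the DPDK 20.11 tunnel offload API.
 * app_actions/inner_pattern are the application's own entries, excluding
 * their END terminators; error handling is reduced to a NULL return.
 * (pmd_actions/pmd_items must later be released with the corresponding
 * rte_flow_tunnel_*_release() calls, omitted here.) */
#include <rte_common.h>
#include <rte_flow.h>

/* F1: outer-header classification rule on the physical port.  The PMD
 * supplies the actions that implement tnl_pop in HW; the application
 * appends its own actions (typically a JUMP to the group holding F2). */
static struct rte_flow *
offload_f1(uint16_t phy_port, struct rte_flow_tunnel *tunnel,
           const struct rte_flow_item outer_pattern[],
           const struct rte_flow_action app_actions[], uint32_t n_app)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_action actions[16];
    struct rte_flow_action *pmd_actions;
    struct rte_flow_error error;
    uint32_t n_pmd, i, n = 0;

    if (rte_flow_tunnel_decap_set(phy_port, tunnel, &pmd_actions,
                                  &n_pmd, &error)
        || n_pmd + n_app + 1 > RTE_DIM(actions)) {
        return NULL;
    }
    for (i = 0; i < n_pmd; i++) {
        actions[n++] = pmd_actions[i];
    }
    for (i = 0; i < n_app; i++) {
        actions[n++] = app_actions[i];
    }
    actions[n] = (struct rte_flow_action) { .type = RTE_FLOW_ACTION_TYPE_END };
    return rte_flow_create(phy_port, &attr, outer_pattern, actions, &error);
}

/* F2: tunnel-metadata plus inner-header rule.  Since rte_flow rules exist
 * only on physical port IDs, this rule is also installed on the physical
 * port; the PMD supplies items meaning "decapped from this tunnel". */
static struct rte_flow *
offload_f2(uint16_t phy_port, struct rte_flow_tunnel *tunnel,
           const struct rte_flow_item inner_pattern[], uint32_t n_inner,
           const struct rte_flow_action actions[])
{
    struct rte_flow_attr attr = { .ingress = 1, .group = 1 };
    struct rte_flow_item pattern[16];
    struct rte_flow_item *pmd_items;
    struct rte_flow_error error;
    uint32_t n_pmd, i, n = 0;

    if (rte_flow_tunnel_match(phy_port, tunnel, &pmd_items,
                              &n_pmd, &error)
        || n_pmd + n_inner + 1 > RTE_DIM(pattern)) {
        return NULL;
    }
    for (i = 0; i < n_pmd; i++) {
        pattern[n++] = pmd_items[i];
    }
    for (i = 0; i < n_inner; i++) {
        pattern[n++] = inner_pattern[i];
    }
    pattern[n] = (struct rte_flow_item) { .type = RTE_FLOW_ITEM_TYPE_END };
    return rte_flow_create(phy_port, &attr, pattern, actions, &error);
}
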
>>>>>>>
>>>>>>> Travis:
>>>>>>> v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552
>>>>>>>
>>>>>>> GitHub Actions:
>>>>>>> v1: https://github.com/elibritstein/OVS/actions/runs/515334647
>>>>>>>
>>>>>>> [1] https://mails.dpdk.org/archives/dev/2020-October/187314.html
>>>>>>>
>>>>>>> Eli Britstein (13):
>>>>>>>      netdev-offload: Add HW miss packet state recover API
>>>>>>>      netdev-dpdk: Introduce DPDK tunnel APIs
>>>>>>>      netdev-offload-dpdk: Implement flow dump create/destroy APIs
>>>>>>>      netdev-dpdk: Add flow_api support for netdev vxlan vports
>>>>>>>      netdev-offload-dpdk: Implement HW miss packet recover for vport
>>>>>>>      dpif-netdev: Add HW miss packet state recover logic
>>>>>>>      netdev-offload-dpdk: Change log rate limits
>>>>>>>      netdev-offload-dpdk: Support tunnel pop action
>>>>>>>      netdev-offload-dpdk: Refactor offload rule creation
>>>>>>>      netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
>>>>>>>        port
>>>>>>>      netdev-offload-dpdk: Map netdev and ufid to offload objects
>>>>>>>      netdev-offload-dpdk: Support vports flows offload
>>>>>>>      netdev-dpdk-offload: Add vxlan pattern matching function
>>>>>>>
>>>>>>> Ilya Maximets (2):
>>>>>>>      netdev-offload: Allow offloading to netdev without ifindex.
>>>>>>>      netdev-offload: Disallow offloading to unrelated tunneling vports.
>>>>>>>
>>>>>>>     Documentation/howto/dpdk.rst  |   1 +
>>>>>>>     NEWS                          |   2 +
>>>>>>>     lib/dpif-netdev.c             |  49 +-
>>>>>>>     lib/netdev-dpdk.c             | 135 ++++++
>>>>>>>     lib/netdev-dpdk.h             | 104 ++++-
>>>>>>>     lib/netdev-offload-dpdk.c     | 851 +++++++++++++++++++++++++++++-----
>>>>>>>     lib/netdev-offload-provider.h |   5 +
>>>>>>>     lib/netdev-offload-tc.c       |   8 +
>>>>>>>     lib/netdev-offload.c          |  29 +-
>>>>>>>     lib/netdev-offload.h          |   1 +
>>>>>>>     10 files changed, 1033 insertions(+), 152 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.28.0.546.g385c171
>>>>>>>
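The "HW miss packet state recover" patches in the list above handle the
partial-offload case described in the cover letter: F1 hits in HW, F2
misses, and the packet reaches SW already classified (and possibly
decapsulated). At the DPDK level this rests on rte_flow_get_restore_info();
a minimal sketch, with the translation into OVS tunnel metadata left to the
caller:

/* Sketch only: recover the tunnel state of a packet whose F1 rule was
 * handled by HW while F2 missed.  Returns 0 if tunnel metadata was found
 * and processing should resume from F2. */
#include <rte_flow.h>
#include <rte_mbuf.h>

static int
recover_hw_miss_state(uint16_t phy_port, struct rte_mbuf *m,
                      struct rte_flow_restore_info *info)
{
    struct rte_flow_error error;

    if (rte_flow_get_restore_info(phy_port, m, info, &error)) {
        /* Not a partially offloaded packet; normal SW processing. */
        return -1;
    }
    if (!(info->flags & RTE_FLOW_RESTORE_INFO_TUNNEL)) {
        return -1;
    }
    /* info->tunnel describes the outer headers that HW already matched;
     * RTE_FLOW_RESTORE_INFO_ENCAPSULATED tells whether they are still
     * present in the packet.  The caller converts this into the
     * datapath's tunnel metadata so processing continues from F2. */
    return 0;
}
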
>>>>>> Hi Eli,
>>>>>>
>>>>>> Thanks for posting this new patchset to support tunnel decap action offload.
>>>>>>
>>>>>> I haven't looked at the entire patchset yet, but I focused on the
>>>>>> patches that introduce a 1-to-many mapping between an OVS flow (F2)
>>>>>> and HW-offloaded flows.
>>>>>>
>>>>>> Here is a representation of the design proposed in this patchset. A
>>>>>> flow F2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
>>>>>> the underlying uplink/physical port is P0, gets offloaded not only to
>>>>>> P0, but also to other physical ports P1, P2, ... and so on.
>>>>>>
>>>>>>        P0 <----> VxLAN-vPort <----> VFRep-1
>>>>>>
>>>>>>        P1
>>>>>>        P2
>>>>>>        ...
>>>>>>        Pn
>>>>>>
>>>>>> IMO, the problems with this design are:
>>>>>>
>>>>>> - Offloading a flow to an unrelated physical device that has nothing
>>>>>> to do with that flow (invalid device for the flow).
>>>>>> - Offloading to not just one, but several such invalid physical devices.
>>>>>> - Consuming HW resources for a flow that is never seen or intended to
>>>>>> be processed by those physical devices.
>>>>>> - Impacts flow scale on other physical devices, since it would consume
>>>>>> their HW resources with a large number of such invalid flows.
>>>>>> - The indirect list used to track these multiple mappings complicates
>>>>>> the offload layer implementation.
>>>>>> - The addition of flow_dump_create() to the offload APIs, just to
>>>>>> parse and get a list of user datapath netdevs, is confusing and not
>>>>>> needed.
>>>>>>
>>>>>> I have been exploring an alternate design to address this problem of
>>>>>> figuring out the right physical device for a given tunnel inner-flow.
>>>>>> I will send a patch, please take a look so we can continue the discussion.
>>>>> I just posted this patch; please see the link below. It is currently
>>>>> based on the decap offload patchset (but it can be rebased without the
>>>>> decap patchset too, if needed). The patch provides changes to pass
>>>>> physical port (orig_in_port) information to the offload layer in the
>>>>> context of flow F2. Note that additional changes would be needed in
>>>>> the decap patchset to utilize this, if we agree on this approach.
>>>>>
>>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapatna@broadcom.com/
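
To make the contrast concrete: the datapath already knows the physical
port the outer packet arrived on, so it can hand that port to the offload
layer together with the vport flow. A conceptual sketch with illustrative
names (not necessarily those used in the patch):

/* Conceptual sketch (illustrative names): carry the originating physical
 * port with the offload request of a vport flow, so the offload layer can
 * install the rte_flow rule on exactly one physical device instead of on
 * every physical port attached to the datapath. */
#include "netdev-offload.h"
#include "openvswitch/types.h"

struct vport_offload_info {
    odp_port_t orig_in_port;    /* Physical port that received the outer
                                 * packet; filled in by dpif-netdev. */
};

static struct netdev *
vport_flow_physdev(const struct vport_offload_info *info,
                   const char *dpif_type)
{
    /* Resolve the single physical netdev the tunnel flow belongs to,
     * using the datapath port number passed down as metadata. */
    return netdev_ports_get(info->orig_in_port, dpif_type);
}
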
>>>> Thanks for that.
>>>>
>>>> However, we cannot say packets on that flow *cannot* arrive from other
>>>> PFs. For example, consider ECMP. Packets can arrive from either port,
>>>> depending on external routing decisions.
>>> Let us say P0's IP is 192.168.1.x and P1's IP is 172.16.1.x. How can
>>> packets for P0 be received over P1? With ECMP, packets could take a
>>> different route in the network, but the destination remains the same.
>> Suppose both P0 and P1 are under br-phy, and the IP belongs to br-phy.
> How can you have multiple uplink ports connected to the same bridge,
> unless you have them in some kind of bonding/link-aggregation
> configuration? Also, for br-phy you need to provide the MAC address
> of the underlying physical interface to the bridge.
> Take a look at step 5 in this document:
> https://docs.openvswitch.org/en/latest/howto/userspace-tunneling/
> If you have two uplink ports, which MAC address will you specify?
>
>> Another example is that each is on a different bridge, and the IP is
>> moved from one to the other.
> Here you are changing the tunnel configuration; this has to go through
> some kind of reconfiguration or flow modification, since you are
> changing the IP/MAC of br-phy.
>
>>>> Thus, if we want to support it, we should keep the 1-to-many
>>>> indirection, and not assume that packets always arrive from the same
>>>> port as the first packet that created the flow.
>>> The rationale behind offloading to multiple ports in your patches is
>>> not to support ECMP, but that there is no way to figure out the uplink
>>> port for the flow in the context of a vport. I don't see ECMP mentioned
>>> anywhere in your patchset?
>> ECMP is just an example. The point is we cannot guarantee the "correct"
>> in_port of a vport.
> The patch that I sent out provides the correct in_port for a given
> flow [because we already know the port on which it was received; it
> just needs to be passed as metadata!].
>
>>>> Also, please note that the HW rule is applied on PFs only, so in your
>>>> example P1,...,Pn, "n" is expected to be quite a small number.
>>> You can't predict or control the number of physical devices that can
>>> be configured in the system (e.g., multiple physical bridges each
>>> connecting to a different network). So you cannot assume that "n" is a
>>> small number. The point here is that with the current proposal, we end
>>> up offloading a flow incorrectly to unrelated physical devices.
>> "n" is limited by the number of physical cards there are in the system
>> (and used by OVS), unlike VFs (or SFs) that can be much more, so though
>> I cannot predict it, the point is that the number of "unrelated" rules
>> is probably not high.
> There are cards with multiple physical ports and there could be
> multiple such physical cards in the system. And since we are talking
> about vport flows (F2), it could be thousands of such flows
> incorrectly offloaded on each of those other ports, on each card.
>
>> Another point is that some of the "unrelated" rules may not be offloaded
>> after all, as they are rejected by PMDs, for example if PF->Px forwarding
>> cannot be supported. On the other hand, drop is probably supported by
>> all PMDs.
>>
>> Right, we may end up with more flows, as we cannot be sure what is a
>> "related" and what is an "unrelated" physical device.
> Thanks for agreeing that we will end up with more flows. This is
> precisely what we want to avoid: ending up with more flows offloaded
> on devices that are not in the picture at all with respect to the flows
> in question.
>
>> BTW, note that the Linux kernel also uses the same approach, via
>> flow_indr_dev_register(), which is used by the Broadcom, mlx5 and
>> Netronome drivers.
> I've seen that, and it is also complicated (independent of how many
> drivers use it); I'm sure it can be simplified too.
>
>> I agree that your proposal works for the simple use cases. Since there
>> must be some kind of indirection anyway (after all, we must apply the
>> rule on a physical device that is not the vport itself), I think the
>> cost of this approach is worth it to cover more use cases.
> Can we go ahead with this approach for now and build on it for more
> use cases as needed later?

OK. Let's start with supporting the simple use cases. I will take your 
patch as a reference and re-spin the series.

We may enhance it in the future.

Thanks.

>
>>>>> Thanks,
>>>>> -Harsha
>>>>>> Thanks,
>>>>>> -Harsha

