[ovs-dev] [PATCH 00/15] Netdev vxlan-decap offload

Eli Britstein elibr at nvidia.com
Mon Feb 8 14:03:28 UTC 2021


On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
> On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr at nvidia.com> wrote:
>>
>> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
>>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
>>> <sriharsha.basavapatna at broadcom.com> wrote:
>>>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr at nvidia.com> wrote:
>>>>> VXLAN decap in OVS-DPDK configuration consists of two flows:
>>>>> F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
>>>>> F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0
>>>>>
>>>>> F1 is a classification flow. It matches on the outer headers and
>>>>> classifies the packet as a VXLAN packet; via the tnl_pop action, the
>>>>> packet continues processing in F2.
>>>>> F2 is a flow that matches on tunnel metadata as well as on the inner
>>>>> packet headers (like any other flow).
>>>>>
>>>>> In order to fully offload the VXLAN decap path, both F1 and F2 should be
>>>>> offloaded. As there is more than one flow in HW, it is possible that
>>>>> F1 is done by HW but F2 is not. The packet is then received by SW and
>>>>> should be processed starting from F2, as F1 was already done by HW.
>>>>> Rte_flows are applicable only to physical port IDs, so vport flows (e.g. F2)
>>>>> are applied on the uplink ports attached to OVS.
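
(Annotating my own cover letter here for new readers: below is a minimal
sketch of where this recovery step sits in the SW datapath, assuming the
helper ends up named after the "HW miss packet state recover" patches listed
further down; it is not the series' code, and includes and placement are
simplified.)

    #include "dp-packet.h"
    #include "netdev-offload.h"

    /* Hedged sketch; names and placement assumed, not taken from the series. */
    static void
    recover_then_classify(struct netdev *netdev, struct dp_packet *packet)
    {
        int err = netdev_hw_miss_packet_recover(netdev, packet);

        if (!err) {
            /* HW already executed F1 (the tnl_pop flow): the tunnel metadata
             * was restored into 'packet', so the SW flow lookup effectively
             * starts at F2. */
        } else {
            /* Nothing to recover (or unsupported): process from F1 as usual. */
        }
        /* ... normal miniflow extraction and flow table lookup follow ... */
    }
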
>>>>>
>>>>> This patch set makes use of [1], introduced in DPDK 20.11, which adds an
>>>>> API for tunnel offloads.
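
(Again as an annotation: roughly, the DPDK 20.11 calls from [1] that the
series builds on look like the hedged sketch below, with error handling
trimmed; this is not code from the patches.)

    #include <rte_flow.h>
    #include <rte_mbuf.h>

    static void
    vxlan_decap_offload_sketch(uint16_t port_id, struct rte_mbuf *mbuf)
    {
        struct rte_flow_tunnel tunnel = { .type = RTE_FLOW_ITEM_TYPE_VXLAN };
        struct rte_flow_action *pmd_actions = NULL;
        struct rte_flow_item *pmd_items = NULL;
        uint32_t n_actions = 0, n_items = 0;
        struct rte_flow_error err;

        /* F1: PMD-specific actions that implement the tnl_pop step. */
        rte_flow_tunnel_decap_set(port_id, &tunnel, &pmd_actions,
                                  &n_actions, &err);

        /* F2: PMD-specific items that match "packet came from this tunnel". */
        rte_flow_tunnel_match(port_id, &tunnel, &pmd_items, &n_items, &err);

        /* Miss path: if HW executed F1 but not F2, recover the tunnel
         * metadata from the mbuf so SW can continue processing from F2. */
        struct rte_flow_restore_info info;
        if (!rte_flow_get_restore_info(port_id, mbuf, &info, &err)) {
            /* info.tunnel now holds the recovered outer/tunnel state. */
        }

        /* Release the PMD-allocated actions/items when no longer needed. */
        rte_flow_tunnel_action_decap_release(port_id, pmd_actions,
                                             n_actions, &err);
        rte_flow_tunnel_item_release(port_id, pmd_items, n_items, &err);
    }
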
>>>>>
>>>>> Travis:
>>>>> v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552
>>>>>
>>>>> GitHub Actions:
>>>>> v1: https://github.com/elibritstein/OVS/actions/runs/515334647
>>>>>
>>>>> [1] https://mails.dpdk.org/archives/dev/2020-October/187314.html
>>>>>
>>>>> Eli Britstein (13):
>>>>>     netdev-offload: Add HW miss packet state recover API
>>>>>     netdev-dpdk: Introduce DPDK tunnel APIs
>>>>>     netdev-offload-dpdk: Implement flow dump create/destroy APIs
>>>>>     netdev-dpdk: Add flow_api support for netdev vxlan vports
>>>>>     netdev-offload-dpdk: Implement HW miss packet recover for vport
>>>>>     dpif-netdev: Add HW miss packet state recover logic
>>>>>     netdev-offload-dpdk: Change log rate limits
>>>>>     netdev-offload-dpdk: Support tunnel pop action
>>>>>     netdev-offload-dpdk: Refactor offload rule creation
>>>>>     netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
>>>>>       port
>>>>>     netdev-offload-dpdk: Map netdev and ufid to offload objects
>>>>>     netdev-offload-dpdk: Support vports flows offload
>>>>>     netdev-dpdk-offload: Add vxlan pattern matching function
>>>>>
>>>>> Ilya Maximets (2):
>>>>>     netdev-offload: Allow offloading to netdev without ifindex.
>>>>>     netdev-offload: Disallow offloading to unrelated tunneling vports.
>>>>>
>>>>>    Documentation/howto/dpdk.rst  |   1 +
>>>>>    NEWS                          |   2 +
>>>>>    lib/dpif-netdev.c             |  49 +-
>>>>>    lib/netdev-dpdk.c             | 135 ++++++
>>>>>    lib/netdev-dpdk.h             | 104 ++++-
>>>>>    lib/netdev-offload-dpdk.c     | 851 +++++++++++++++++++++++++++++-----
>>>>>    lib/netdev-offload-provider.h |   5 +
>>>>>    lib/netdev-offload-tc.c       |   8 +
>>>>>    lib/netdev-offload.c          |  29 +-
>>>>>    lib/netdev-offload.h          |   1 +
>>>>>    10 files changed, 1033 insertions(+), 152 deletions(-)
>>>>>
>>>>> --
>>>>> 2.28.0.546.g385c171
>>>>>
>>>> Hi Eli,
>>>>
>>>> Thanks for posting this new patchset to support tunnel decap action offload.
>>>>
>>>> I haven't looked at the entire patchset yet. But I focused on the
>>>> patches that introduce 1-to-many mapping between an OVS flow (f2) and
>>>> HW offloaded flows.
>>>>
>>>> Here is a representation of the design proposed in this patchset. A
>>>> flow f2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
>>>> the underlying uplink/physical port is P0, gets offloaded not only to
>>>> P0, but also to other physical ports P1, P2, ... and so on.
>>>>
>>>>       P0 <----> VxLAN-vPort <----> VFRep-1
>>>>
>>>>       P1
>>>>       P2
>>>>       ...
>>>>       Pn
>>>>
>>>> IMO, the problems with this design are:
>>>>
>>>> - Offloading a flow to an unrelated physical device that has nothing
>>>> to do with that flow (invalid device for the flow).
>>>> - Offloading to not just one, but several such invalid physical devices.
>>>> - Consuming HW resources for a flow that is never seen or intended to
>>>> be processed by those physical devices.
>>>> - Impacts flow scale on other physical devices, since it would consume
>>>> their HW resources with a large number of such invalid flows.
>>>> - The indirect list used to track these multiple mappings complicates
>>>> the offload layer implementation.
>>>> - The addition of flow_dump_create() to the offload APIs, just to parse
>>>> and get a list of user datapath netdevs, is confusing and not needed.
>>>>
>>>> I have been exploring an alternate design to address this problem of
>>>> figuring out the right physical device for a given tunnel inner-flow.
>>>> I will send a patch, please take a look so we can continue the discussion.
>>> I just posted this patch; please see the link below. It is currently
>>> based on the decap offload patchset (but it can be rebased if needed,
>>> without the decap patchset too). The patch provides changes to pass
>>> physical port (orig_in_port) information to the offload layer in the
>>> context of flow F2. Note that additional changes would be needed in
>>> the decap patchset to utilize this, if we agree on this approach.
>>>
>>> https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapatna@broadcom.com/
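
To spell out my understanding of that idea (a rough sketch only; the function
name and the "netdev" dpif-type string are my assumptions, not code from the
linked patch): the datapath records the physical port that received the outer
packet and hands that orig_in_port to the offload layer along with F2, so a
single uplink netdev is resolved instead of applying the rule to every
physical device.

    #include <errno.h>
    #include "netdev.h"
    #include "netdev-offload.h"

    /* Hedged sketch, not the linked patch's code. */
    static int
    offload_vport_flow_sketch(odp_port_t orig_in_port)
    {
        /* Resolve the one uplink that actually received the outer packet. */
        struct netdev *uplink = netdev_ports_get(orig_in_port, "netdev");

        if (!uplink) {
            /* Unknown physical port: do not offload to unrelated devices. */
            return EOPNOTSUPP;
        }
        /* ... create the rte_flow for F2 on 'uplink' only ... */
        netdev_close(uplink);
        return 0;
    }
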
>> Thanks for that.
>>
>> However, we cannot say packets on that flow *cannot* arrive from other
>> PFs. For example, consider ECMP. Packets can arrive from either port,
>> depending on external routing decisions.
> Let us say P0's IP is 192.168.1.x and P1's IP is 172.16.1.x. How can
> packets for P0 be received over P1? With ECMP, packets could take a
> different route in the network. But the destination remains the same.

Suppose both P0 and P1 are under br-phy, and the IP is assigned to br-phy.

Another example: each port is on a different bridge, and the IP is moved
from one bridge to the other.

>> Thus, if we want to support it, we should keep the 1-to-many
>> indirection, and not assume that packets always arrive from the
>> same port as the first packet that created this flow.
> The rationale behind offloading to multiple ports, as per your patches,
> is not to support ECMP, but that there's no way to figure out the
> uplink port for the flow in the context of a vport. I don't see ECMP
> mentioned anywhere in your patchset?
ECMP is just an example. The point is that we cannot guarantee the "correct"
in_port of a vport.
>> Also, please note that the HW rule is applied on PFs only, so in your
>> example P1,...,Pn, "n" is expected to be quite a small number.
> You can't predict or control the number of physical devices that can
> be configured in the system (e.g., multiple physical bridges each
> connecting to a different network). So you cannot assume that "n" is a
> small number. The point here is that with the current proposal, we end
> up offloading a flow incorrectly to unrelated physical devices.

"n" is limited by the number of physical cards there are in the system 
(and used by OVS), unlike VFs (or SFs) that can be much more, so though 
I cannot predict it, the point is that the number of "unrelated" rules 
is probably not high.

Another point is that some of the "unrelated" rules may not be offloaded
after all, as they will be rejected by the PMDs, for example if PF->Px
forwarding cannot be supported. On the other hand, drop is probably
supported by all PMDs.

Right, we may end up with more flows, as we cannot be sure which physical
device is "related" and which is "unrelated".

BTW, note that the Linux kernel also takes the same approach, via
flow_indr_dev_register(), which is used by the Broadcom, mlx5 and Netronome
drivers.

I agree that your proposal works for the simple use cases. Since there must
be some kind of indirection anyway, because after all we must apply the rule
on a physical device that is not the vport itself, I think the cost of this
approach to cover more use cases is worth it.

>>> Thanks,
>>> -Harsha
>>>> Thanks,
>>>> -Harsha

