[ovs-dev] [PATCH 00/15] Netdev vxlan-decap offload

Sriharsha Basavapatna sriharsha.basavapatna at broadcom.com
Mon Feb 8 16:21:47 UTC 2021


On Mon, Feb 8, 2021 at 7:33 PM Eli Britstein <elibr at nvidia.com> wrote:
>
>
> On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
> > On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr at nvidia.com> wrote:
> >>
> >> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
> >>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
> >>> <sriharsha.basavapatna at broadcom.com> wrote:
> >>>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr at nvidia.com> wrote:
> >>>>> VXLAN decap in OVS-DPDK configuration consists of two flows:
> >>>>> F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
> >>>>> F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0
> >>>>>
> >>>>> F1 is a classification flow. It matches on the outer headers,
> >>>>> classifies the packet as a VXLAN packet, and via the tnl_pop action
> >>>>> the packet continues processing in F2.
> >>>>> F2 is a flow that matches on tunnel metadata as well as on the inner
> >>>>> packet headers (like any other flow).
> >>>>>
> >>>>> In order to fully offload the VXLAN decap path, both F1 and F2 should
> >>>>> be offloaded. As there is more than one flow in HW, it is possible that
> >>>>> F1 is done by HW but F2 is not. The packet is then received by SW and
> >>>>> should be processed starting from F2, as F1 was already done by HW.
> >>>>> Rte_flows are applicable only to physical port IDs. Vport flows (e.g.
> >>>>> F2) are applied on the uplink ports attached to OVS.
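
[As a rough illustration only, not the literal code in this series:
recovering the HW-processed state of such a packet might look like the
sketch below, using DPDK's rte_flow_get_restore_info(); the helper name
and surrounding plumbing are hypothetical.]

    #include <rte_flow.h>
    #include <rte_mbuf.h>

    /* Hypothetical helper: query the PMD for the tunnel state of a
     * packet that HW already matched against F1, so SW can resume
     * processing at F2 instead of starting over at F1. */
    static int
    recover_hw_miss_state(uint16_t port_id, struct rte_mbuf *m,
                          struct rte_flow_tunnel *tnl_out)
    {
        struct rte_flow_restore_info info;
        struct rte_flow_error error;

        if (rte_flow_get_restore_info(port_id, m, &info, &error)) {
            return -1;              /* No HW state to recover. */
        }
        if (info.flags & RTE_FLOW_RESTORE_INFO_TUNNEL) {
            *tnl_out = info.tunnel; /* Outer headers as seen by HW. */
        }
        return 0;
    }
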
> >>>>>
> >>>>> This patch-set makes use of [1], introduced in DPDK 20.11, which adds
> >>>>> an API for tunnel offloads.
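
[Again as a non-authoritative sketch of how that API is driven, assuming
the DPDK 20.11 calls rte_flow_tunnel_decap_set() and
rte_flow_tunnel_match(); error handling and the actual rule creation are
omitted.]

    #include <rte_flow.h>

    /* Sketch: obtain the PMD-specific actions/items that implement
     * tnl_pop for F1 and the tunnel match for F2. */
    static void
    build_vxlan_tunnel_offload(uint16_t port_id)
    {
        struct rte_flow_tunnel tunnel = {
            .type = RTE_FLOW_ITEM_TYPE_VXLAN,
        };
        struct rte_flow_action *pmd_actions; /* prepended to F1 actions */
        struct rte_flow_item *pmd_items;     /* prepended to F2 pattern */
        uint32_t n_actions, n_items;
        struct rte_flow_error error;

        rte_flow_tunnel_decap_set(port_id, &tunnel, &pmd_actions,
                                  &n_actions, &error);
        rte_flow_tunnel_match(port_id, &tunnel, &pmd_items,
                              &n_items, &error);

        /* ... create the F1/F2 rte_flow rules here, then release the
         * PMD objects with rte_flow_tunnel_action_decap_release() and
         * rte_flow_tunnel_item_release(). */
    }
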
> >>>>>
> >>>>> Travis:
> >>>>> v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552
> >>>>>
> >>>>> GitHub Actions:
> >>>>> v1: https://github.com/elibritstein/OVS/actions/runs/515334647
> >>>>>
> >>>>> [1] https://mails.dpdk.org/archives/dev/2020-October/187314.html
> >>>>>
> >>>>> Eli Britstein (13):
> >>>>>     netdev-offload: Add HW miss packet state recover API
> >>>>>     netdev-dpdk: Introduce DPDK tunnel APIs
> >>>>>     netdev-offload-dpdk: Implement flow dump create/destroy APIs
> >>>>>     netdev-dpdk: Add flow_api support for netdev vxlan vports
> >>>>>     netdev-offload-dpdk: Implement HW miss packet recover for vport
> >>>>>     dpif-netdev: Add HW miss packet state recover logic
> >>>>>     netdev-offload-dpdk: Change log rate limits
> >>>>>     netdev-offload-dpdk: Support tunnel pop action
> >>>>>     netdev-offload-dpdk: Refactor offload rule creation
> >>>>>     netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
> >>>>>       port
> >>>>>     netdev-offload-dpdk: Map netdev and ufid to offload objects
> >>>>>     netdev-offload-dpdk: Support vports flows offload
> >>>>>     netdev-dpdk-offload: Add vxlan pattern matching function
> >>>>>
> >>>>> Ilya Maximets (2):
> >>>>>     netdev-offload: Allow offloading to netdev without ifindex.
> >>>>>     netdev-offload: Disallow offloading to unrelated tunneling vports.
> >>>>>
> >>>>>    Documentation/howto/dpdk.rst  |   1 +
> >>>>>    NEWS                          |   2 +
> >>>>>    lib/dpif-netdev.c             |  49 +-
> >>>>>    lib/netdev-dpdk.c             | 135 ++++++
> >>>>>    lib/netdev-dpdk.h             | 104 ++++-
> >>>>>    lib/netdev-offload-dpdk.c     | 851 +++++++++++++++++++++++++++++-----
> >>>>>    lib/netdev-offload-provider.h |   5 +
> >>>>>    lib/netdev-offload-tc.c       |   8 +
> >>>>>    lib/netdev-offload.c          |  29 +-
> >>>>>    lib/netdev-offload.h          |   1 +
> >>>>>    10 files changed, 1033 insertions(+), 152 deletions(-)
> >>>>>
> >>>>> --
> >>>>> 2.28.0.546.g385c171
> >>>>>
> >>>> Hi Eli,
> >>>>
> >>>> Thanks for posting this new patchset to support tunnel decap action offload.
> >>>>
> >>>> I haven't looked at the entire patchset yet, but I focused on the
> >>>> patches that introduce a 1-to-many mapping between an OVS flow (F2) and
> >>>> HW-offloaded flows.
> >>>>
> >>>> Here is a representation of the design proposed in this patchset. A
> >>>> flow F2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
> >>>> the underlying uplink/physical port is P0, gets offloaded not only to
> >>>> P0, but also to other physical ports P1, P2, ... and so on.
> >>>>
> >>>>       P0 <----> VxLAN-vPort <----> VFRep-1
> >>>>
> >>>>       P1
> >>>>       P2
> >>>>       ...
> >>>>       Pn
> >>>>
> >>>> IMO, the problems with this design are:
> >>>>
> >>>> - Offloading a flow to an unrelated physical device that has nothing
> >>>> to do with that flow (invalid device for the flow).
> >>>> - Offloading to not just one, but several such invalid physical devices.
> >>>> - Consuming HW resources for a flow that is never seen or intended to
> >>>> be processed by those physical devices.
> >>>> - Impacts flow scale on other physical devices, since it would consume
> >>>> their HW resources with a large number of such invalid flows.
> >>>> - The indirect list used to track these multiple mappings complicates
> >>>> the offload layer implementation.
> >>>> - The addition of flow_dump_create() to the offload APIs, just to parse
> >>>> and get a list of user datapath netdevs, is confusing and unnecessary.
> >>>>
> >>>> I have been exploring an alternate design to address this problem of
> >>>> figuring out the right physical device for a given tunnel inner-flow.
> >>>> I will send a patch, please take a look so we can continue the discussion.
> >>> I just posted this patch, please see the link below; it is currently
> >>> based on the decap offload patchset (though it can be rebased without
> >>> the decap patchset if needed). The patch provides changes to pass
> >>> physical port (orig_in_port) information to the offload layer in the
> >>> context of flow F2. Note that additional changes would be needed in
> >>> the decap patchset to utilize this, if we agree on this approach.
> >>>
> >>> https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapatna@broadcom.com/
> >> Thanks for that.
> >>
> >> However, we cannot say packets on that flow *cannot* arrive from other
> >> PFs. For example, consider ECMP. Packets can arrive from either port,
> >> depending on external routing decisions.
> > Let us say P0's IP is 192.168.1.x and P1's IP is 172.16.1.x. How can
> > packets for P0 be received over P1? With ECMP, packets could take a
> > different route in the network. But the destination remains the same.
>
> Suppose both P0 and P1 are under br-phy, and the IP is of br-phy.
How can you have multiple uplink ports connected to the same bridge,
unless you have them in some kind of bonding/link-aggregation
configuration? Also, for br-phy you need to provide the MAC address
of the underlying physical interface to the bridge.
Take a look at step 5 in this document:
https://docs.openvswitch.org/en/latest/howto/userspace-tunneling/
If you have two uplink ports, which MAC address will you specify?

>
> Another example is that each is on a different bridge, and the IP is
> moved from one to the other.
Here you are changing the tunnel configuration; this has to go through
some kind of reconfiguration or flow modification, since you are
changing the IP/MAC of br-phy.

>
> >> Thus, if we want to support it, we should keep the 1-to-many
> >> indirection, and not assume that packets always arrive from the
> >> same port as the first packet that created this flow.
> > The rationale behind offloading to multiple ports, as per your patches,
> > is not to support ECMP, but that there's no way to figure out the
> > uplink port for the flow in the context of a vport. I don't see ECMP
> > mentioned anywhere in your patchset?
> ECMP is just an example. The point is we cannot guarantee the "correct"
> in_port of a vport.
The patch that I sent out provides the correct in_port for a given
flow [because we already know the port on which it was received; it
just needs to be passed as metadata!].
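
Conceptually it amounts to something like the sketch below (names are
illustrative placeholders, not the literal patch):

    #include <stdint.h>

    typedef uint32_t odp_port_t;   /* stand-in for the OVS strong typedef */

    /* Placeholder for the metadata the datapath already has when it
     * installs flow F2: the physical port the packet arrived on before
     * tnl_pop switched in_port to the vxlan vport. */
    struct flow_offload_md {
        odp_port_t in_port;        /* vxlan_sys_4789 (the vport)       */
        odp_port_t orig_in_port;   /* P0, recorded at F1/tnl_pop time  */
    };

    /* The offload layer then programs the rte_flow rule only on the
     * uplink identified by orig_in_port, rather than on P0..Pn. */
    static odp_port_t
    vport_flow_target_port(const struct flow_offload_md *md)
    {
        return md->orig_in_port;
    }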

> >> Also, please note that the HW rule is applied on PFs only, so in your
> >> example P1,...,Pn, "n" is expected to be quite a small number.
> > You can't predict or control the number of physical devices that can
> > be configured in the system (e.g., multiple physical bridges each
> > connecting to a different network). So you cannot assume that "n" is a
> > small number. The point here is that with the current proposal, we end
> > up offloading a flow incorrectly to unrelated physical devices.
>
> "n" is limited by the number of physical cards there are in the system
> (and used by OVS), unlike VFs (or SFs) that can be much more, so though
> I cannot predict it, the point is that the number of "unrelated" rules
> is probably not high.
There are cards with multiple physical ports and there could be
multiple such physical cards in the system. And since we are talking
about vport flows (F2), there could be thousands of such flows
incorrectly offloaded on each of those other ports, on each card.

>
> Another point regarding this is that some of the "unrelated" rules may
> not be offloaded after all, as they are rejected by the PMDs, for
> example if PF->Px is not supported. On the other hand, drop is probably
> supported by all PMDs.
>
> Right, we may end up with more flows, as we cannot be sure what is a
> "related" and what is an "unrelated" physical device.
Thanks for agreeing that we will end up with more flows. This is
precisely what we want to avoid: ending up with more flows offloaded
on devices that are not in the picture at all w.r.t. the flows in
question.

>
> BTW, note that the Linux kernel also uses the same approach, via
> flow_indr_dev_register(), which is used by the Broadcom, mlx5 and
> Netronome drivers.
I've seen that, and it is also complicated (independent of how many
drivers use it); I'm sure it can be simplified too.

>
> I agree that your proposal works for the simple use cases. As there must
> be some kind of indirection, because after all we must apply the rule
> on a physical device that is not the vport itself, I think the cost of
> this approach to cover more use cases is worthwhile.
Can we go ahead with this approach for now and build on it for more
use cases as needed later?

>
> >>> Thanks,
> >>> -Harsha
> >>>> Thanks,
> >>>> -Harsha
