[ovs-dev] [PATCH 00/25] netdev datapath vxlan offload

Hemal Shah hemal.shah at broadcom.com
Mon Feb 10 21:16:59 UTC 2020


Eli,

There are some fundamental architecture issues (multiple HW tables vs. a
single HW table, the use of the rte_flow JUMP action to map the OVS SW
tnl_pop action, etc.) that we need to discuss before we can get into the
details of the patchset.
I'm inlining my comments on the discussion between you and Harsha below.

Hemal


On Wed, Feb 5, 2020 at 6:31 AM Eli Britstein <elibr at mellanox.com> wrote:

>
> On 2/3/2020 7:05 PM, Sriharsha Basavapatna wrote:
> > Hi Eli,
> >
> > Thanks for sending this patchset. I have some questions about the
> > design, please see my comments below.
> >
> > Thanks,
> > -Harsha
> >
> > On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein <elibr at mellanox.com>
> wrote:
> >> In the netdev datapath, packets arriving over a tunnel are processed by
> >> more than one flow. For example, a packet that should be decapsulated and
> >> forwarded to another port is processed by two flows:
> >> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
> >>     actions:tnl_pop(vxlan_sys_4789).
> >> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
> >>     inner-header matches, actions: vm1.
> >>
> >> In order to offload such a multi-flow processing path, a multi-table HW
> >> model is used.
> > Did you explore other ways to avoid this multi-flow processing model?
> > For example, maybe support some additional logic to merge the two
> > datapath flow-rules for decap into a single flow-rule in the offload
> > layer?
> Yes, we did explore this approach, and our research led us to conclude
> that following the OVS multi-table SW model in HW is the more robust
> architecture.
>
[Hemal] It would be good and prudent to share the details of that
research, so the community can understand how you arrived at the
conclusion above. Can you share it?

> A single-table architecture for tunnel offloads, for example, has the
> following issues:
> 1. Statistics – flow #1 counts bytes including the outer header. If the
> outer header is IPv6, it can carry extension headers, which vary per
> packet and must be counted. Flow #2 counts the bytes of the inner packet
> only.
>
[Hemal] For fixed-format tunnels, you can calculate the byte counts of
flow #1 from the flow #2 counts.
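
For a fixed-format IPv4/UDP/VXLAN tunnel, a minimal sketch of that
calculation (the helper name and constant below are illustrative, not
taken from the patchset):

    #include <stdint.h>

    /* Fixed per-packet encapsulation overhead for IPv4/UDP/VXLAN:
     * outer Ethernet (14) + IPv4 (20) + UDP (8) + VXLAN (8) = 50 bytes. */
    #define VXLAN_IPV4_OVERHEAD 50

    /* Derive flow #1 (outer) byte counts from flow #2 (inner) counters. */
    static uint64_t
    outer_bytes(uint64_t inner_bytes, uint64_t inner_packets)
    {
        return inner_bytes + inner_packets * VXLAN_IPV4_OVERHEAD;
    }

Of course, this only holds while the per-packet overhead is truly fixed;
the IPv6 extension-header case raised above is exactly where it is not.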

> 2. Update rate – collapsing the multi-table model into a single table
> means taking the Cartesian product of all the flows. Any change to a
> single flow in one table then requires many removals and insertions of
> rules in HW.
>
[Hemal] Can you quantify "a lot" with specific examples? Flow #2 can't be
offloaded without flow #1 being offloaded, and flow #1 is unlikely to
change as frequently as flow #2. If flow #1 is updated (a tunnel change)
and flow #2 uses fields from the tunnel headers that changed in flow #1,
then you have to update flow #2 anyway. This is no different from the
software case when a tunnel is changed.
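
To put illustrative (entirely hypothetical) numbers on the update-rate
concern:

    table 1 (tunnel/outer rules):  N = 100
    table 2 (inner rules):         M = 1,000
    merged single table:           up to N * M = 100,000 rules
    one table-1 rule updated:      up to M = 1,000 removals plus
                                   1,000 insertions in HW

Note that only the merged rules derived from the changed tunnel rule are
affected, not the whole merged table.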

> The single-table model also breaks when offloading more complex actions,
> such as CT, dp_hash, and others.
>
[Hemal] Can you be more specific about what it breaks?

> >
> >> Flow #1 matches are offloaded as is, but tnl_pop is
> >> translated to MARK and JUMP to a tnl-port table actions.
> > The response to the tnl_pop action is now different between the
> > non-offload and offload datapaths, since the offload layer changes it
> > into entirely different actions. The user would expect a consistent
> > response to the tnl_pop action, especially if you treat this flow rule
> > as an independent rule by itself (i.e., assuming there's no flow #2
> > subsequently).
> >
> > Regarding the JUMP action:
> > IMO, the offload framework should avoid anything that is too
> > vendor-specific and make use of simple/common abstractions. I'm not
> > sure the JUMP action is really needed to support offload of the
> > tnl_pop action. Also, it may not map to the real use case for which
> > the JUMP action seems to have been defined, which is to support
> > flow-grouping (see rte_flow.h).
> The rte_flow API does not offer an action equivalent to moving the outer
> headers into some metadata buffer, as the SW tnl_pop action does. The
> JUMP action is equivalent to recirculation on the vport, and the MARK
> action makes sure the user gets a consistent response in SW/HW (again,
> sticking to the SW model): in case of a miss, the packet is recovered
> and continues SW processing normally.
>
[Hemal] This seems like a gap in the rte_flow API's mapping of the OVS SW
tnl_pop action. Should we look into extending the rte_flow API to cover
this case rather than repurposing the JUMP action?
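
For reference, a minimal sketch of the MARK + JUMP mapping under
discussion, in rte_flow terms (the MISS_CTX_ID and TNL_PORT_GROUP values
are illustrative, not identifiers from the patchset):

    #include <rte_flow.h>

    #define MISS_CTX_ID    1  /* illustrative: SW miss-recovery context */
    #define TNL_PORT_GROUP 2  /* illustrative: tnl-port table/group id */

    /* tnl_pop(vxlan_sys_4789) mapped to MARK + JUMP: MARK carries a
     * miss-recovery context so SW can resume processing on a HW miss;
     * JUMP "recirculates" the packet into the tnl-port flow group. */
    struct rte_flow_action_mark mark = { .id = MISS_CTX_ID };
    struct rte_flow_action_jump jump = { .group = TNL_PORT_GROUP };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_JUMP, .conf = &jump },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

An rte_flow action that preserved the outer headers as metadata, the way
the SW tnl_pop does, would remove the need for this mapping.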

> >
> >> Flow #2 is
> >> offloaded to that tnl-port table, its tunnel matches are translated to
> >> outer-header matches, followed by the inner-header ones. The actions
> >> start with an implicit DECAP action followed by the flow's actions.
> >> The DECAP is done in flow #2. If it were done in flow #1, the outer
> >> headers of the packets would have been lost, and they could not have
> >> been matched in flow #2.
> > The only missing match field that would allow decap on flow #1 itself
> > seems to be the tunnel-id. If that is provided in flow #1, then we may
> > not need to match it again in flow #2, and flow #2 could just match on
> > the inner packet headers. This would also make handling a flow-miss
> > simple, since you wouldn't need the recovery logic (the offload layer
> > popping the header, etc.). The packet would be in the right state for
> > the next stage of processing in SW if there's a flow-miss after the
> > first stage in HW. It also means tnl_pop can be done in flow #1,
> > making the behavior consistent with non-offload processing.
> >
> > The other advantage of providing the tunnel-id in flow #1 is that,
> > then we wouldn't need to match on the entire packet including the
> > outer-headers in the second stage and only the inner header fields
> > would need to be matched in flow #2.
> Even if we could somehow move the matches from flow #2 into flow #1
> (leading back to the single-flow model…) and do the DECAP in flow #1, we
> would then have to change the SW matches of flow #2 for the case where
> it is processed in SW, as the packet would no longer have the outer
> header/tunnel.
>
[Hemal] Flow matches are defined by some policy. How is changing the
match fields of flow #2 an issue? Can you elaborate? Aren't these matches
defined by a user or an agent?
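
To make the concern concrete, here is an illustrative flow #2 in the
datapath flow format (all values hypothetical), matching tunnel metadata
that the SW tnl_pop populates from the outer header:

    tunnel(tun_id=0x64,src=1.1.1.1,dst=1.1.1.2),in_port(vxlan_sys_4789),
    eth(dst=aa:bb:cc:dd:ee:01),eth_type(0x0800),ipv4(dst=10.0.0.1),
    actions:vm1

As I read Eli's point, if HW already did the DECAP in flow #1, a packet
missing in HW after that stage would reach SW without the outer header,
so the tunnel() fields above could not be populated and flow #2's SW
matches would have to be rewritten.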

> >
> >> Furthermore, once the HW processing path is composed of multiple
> >> tables, there is a need to handle the case where processing starts in
> >> HW but is not completed there. For example, only flow #1 is offloaded
> >> while flow #2 doesn't exist. For that, the MARK value set in the
> >> tnl_pop action is received in SW, the packet is recovered via the
> >> mapped vport, and processing continues in SW.
> > If it is not possible to provide the tunnel-id in flow #1, then the
> > other alternative might be to defer offloading until flow #2 exists
> > and offload only flow #2. You are in any case matching almost the
> > entire outer header and the inner header in flow #2, and the decap
> > action is also added implicitly to flow #2. Why not avoid offloading
> > flow #1 entirely? That way, none of the miss-handling code is
> > required.
> See the first argument about following the SW model.
> >
> >> As vports are virtual and cannot have rte_flows, offloading a vport
> >> generates rte_flows for every PF in the system, as we cannot know
> >> from which one the packets arrive.
> > Is it possible to track the "original in-port" through this two-stage
> > flow processing and generate the rte_flow only on the specific PF that
> > received the packet?
> We could “track” it only if we tried to merge both flow #1 and flow #2.
> See the first argument about following the SW model.
> >
> > Thanks,
> > -Harsha
> >> This series adds support for IPv6, tunnel push actions, and the
> >> multi-table model for the decap processing.
> >>
> >> v1 Travis passed:
> https://travis-ci.org/elibritstein/OVS/builds/639484233
> >>
> >> Eli Britstein (14):
> >>    dpif-netdev: Add mega ufid in flow add log
> >>    netdev-offload-dpdk: Remove pre-validate of patterns function
> >>    netdev-offload-dpdk: Add IPv6 pattern matching
> >>    netdev-offload-dpdk: Support offload of clone tnl_push/output actions
> >>    netdev-offload-dpdk: Support tnl/push using vxlan encap attribute
> >>    netdev-offload: Don't use zero flow mark
> >>    netdev-offload-dpdk: Introduce map APIs for table id and miss context
> >>    netdev-offload-dpdk: Implement HW miss packet recover for vport
> >>    netdev-offload-dpdk: Refactor disassociate ufid function
> >>    netdev-offload-dpdk: Support tunnel pop action
> >>    netdev-offload-dpdk: Implement flow dump create/destroy APIs
> >>    netdev-dpdk: Getter function for dpdk devargs API
> >>    netdev-dpdk: Introduce get netdev by devargs function
> >>    netdev-dpdk-offload: Add vxlan pattern matching function
> >>
> >> Ilya Maximets (4):
> >>    netdev: Allow storing dpif type into netdev structure.
> >>    netdev-offload: Use dpif type instead of class.
> >>    netdev-offload: Allow offloading to netdev without ifindex.
> >>    netdev-offload: Disallow offloading to unrelated tunneling vports.
> >>
> >> Ophir Munk (6):
> >>    dpif-netdev: Make mark allocation public API
> >>    dpif-netdev: Add HW miss packet state recover logic
> >>    netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
> >>      port
> >>    netdev-dpdk-offload: Infrastructure for multiple rte_flows per UFID
> >>    netdev-dpdk: Add flow_api support for netdev vxlan vports
> >>    netdev-offload-dpdk: Support offload for vxlan vport
> >>
> >> Oz Shlomo (1):
> >>    netdev-offload: Add HW miss packet state recover API
> >>
> >>   Documentation/howto/dpdk.rst  |    5 +-
> >>   NEWS                          |    6 +-
> >>   lib/dpif-netdev.c             |   70 +--
> >>   lib/dpif-netlink.c            |   23 +-
> >>   lib/dpif.c                    |   21 +-
> >>   lib/netdev-dpdk.c             |   68 +++
> >>   lib/netdev-dpdk.h             |    6 +
> >>   lib/netdev-offload-dpdk.c     | 1275 +++++++++++++++++++++++++++++++++++------
> >>   lib/netdev-offload-provider.h |    6 +
> >>   lib/netdev-offload-tc.c       |   11 +-
> >>   lib/netdev-offload.c          |  115 ++--
> >>   lib/netdev-offload.h          |   23 +-
> >>   lib/netdev-provider.h         |    3 +-
> >>   lib/netdev.c                  |   16 +
> >>   lib/netdev.h                  |    2 +
> >>   ofproto/ofproto-dpif-upcall.c |    5 +-
> >>   tests/dpif-netdev.at          |   14 +-
> >>   tests/ofproto-macros.at       |    3 +-
> >>   18 files changed, 1394 insertions(+), 278 deletions(-)
> >>
> >> --
> >> 2.14.5
> >>

